irods / irods_capability_automated_ingest


166.m - implement character remapping #176

Closed d-w-moore closed 2 years ago

d-w-moore commented 2 years ago

A keen observer might notice that we're being very inefficient by always encoding the original string with SHA1 to produce the "unique-making suffix", and they'd be right. An object path may contain at most MAX_PATH_ALLOWED (= 1024) characters (defined in rodsDef.h), and we don't want to frivolously overstep that limit. The suffix could potentially be much shorter if it encoded only the diff between the original and the transformed string. That is, we could use something like:

import base64

ascii_encode = base64.b32encode  # a decent default if as many as 32 alphanumerics "map to themselves"
# keep only the original characters that were remapped
diff = [x[0] for x in zip(orig_objname, transformed_objname) if x[0] != x[1]]
suffix = ascii_encode(b"".join(u.encode('utf8') for u in diff))
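As a self-contained sketch of the idea (the sample names below are hypothetical, purely for illustration), the diff-based suffix comes out far shorter than the 40-character SHA1 hex digest the scanner currently appends:

```python
import base64
import hashlib

# Hypothetical names for illustration (not from the actual codebase):
orig_objname = "a_b c.txt"         # original leaf name
transformed_objname = "a_b_c.txt"  # after remapping ' ' to '_'

ascii_encode = base64.b32encode

# Suffix built only from the characters that were remapped:
diff = [x[0] for x in zip(orig_objname, transformed_objname) if x[0] != x[1]]
suffix = ascii_encode(b"".join(u.encode('utf8') for u in diff))

# For comparison, hashing the whole original name:
sha1_suffix = hashlib.sha1(orig_objname.encode('utf8')).hexdigest()

# suffix is 8 characters here vs. 40 for the SHA1 hex digest
print(len(suffix), len(sha1_suffix))
```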

Although to generate guaranteed-unique (i.e., collision-free) paths, we'd need to factor in the position of each substitution, likely using an enumerate() inside the zip() - that is, something like:

diffs_withpos = [x[0] for x in zip(enumerate(orig), xfrm) if x[0][1] != x[1]]

with an extra stage of processing to encode position inserted between:

diff_bytes = [bytes((x[0],)) + x[1].encode('utf8') for x in diffs_withpos]
suffix = ascii_encode(b"".join(diff_bytes))
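Put together, a minimal sketch of the position-aware variant (the function name diff_suffix and the sample names are hypothetical). Note the motivating collision: two different originals can remap to the same transformed name with identical replaced characters, leaving only the substitution positions to tell them apart:

```python
import base64

def diff_suffix(orig, xfrm):
    # Record (position, original character) for each remapped character.
    # Positions are stored as single bytes, so this sketch assumes object
    # names shorter than 256 characters.
    diffs_withpos = [x[0] for x in zip(enumerate(orig), xfrm) if x[0][1] != x[1]]
    diff_bytes = [bytes((x[0],)) + x[1].encode('utf8') for x in diffs_withpos]
    return base64.b32encode(b"".join(diff_bytes))

# Why position matters: both originals remap (' ' -> '_') to "a_b_c.txt",
# and the replaced character is the same (' ') in each, so only the
# substitution position distinguishes them.
print(diff_suffix("a_b c.txt", "a_b_c.txt"))  # substitution at index 3
print(diff_suffix("a b_c.txt", "a_b_c.txt"))  # substitution at index 1
```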

Even then, SHA1 might become the fallback for cases where:

For this reason, we'll wait until after the prototype is done to pursue the more sensible / robust version.

d-w-moore commented 2 years ago

I am considering whether to remove (or deprecate for later removal?) the feature introduced in this commit: ed8d794 . ( The original issue solved by that commit is here: https://github.com/irods/irods_capability_automated_ingest/issues/40#issue-354053528 .)

It may make sense to remove the feature immediately, since it seems no longer necessary, and in any case we are not certain how the character remapping (the subject of the discussion within this pull request) should interact with it, were we to leave it in....

My rationale:

In the #40 problem description, it is mentioned that the presence of certain characters in paths generated by the directory scanner can cause a UnicodeEncodeError when they form a string outside the valid UTF-8 encoding space. An example would be a file created with open(b"name" + bytes([B]), 'w') where 128 <= B < 256.

On modern OSes and Python versions, however, os.scandir uses the 'surrogateescape' error handler when translating such filenames, so a filename with the binary encoding b'name\x80' becomes u'name\udc80' when decoded as UTF-8 (and iRODS will thus register it without incident). This theory is corroborated by the fact that the UnicodeEncodeError now seems impossible to reproduce on any OS or Python version supported by the automated ingest.
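A quick sketch of that decoding behavior, using pure string operations (no filesystem involved):

```python
# b'name\x80' is not valid UTF-8, but the 'surrogateescape' error handler
# (the one os.scandir / os.fsdecode use on POSIX) smuggles the stray byte
# through as the lone surrogate U+DC80, and the round trip is lossless.
raw = b'name\x80'
decoded = raw.decode('utf8', 'surrogateescape')
assert decoded == 'name\udc80'
assert decoded.encode('utf8', 'surrogateescape') == raw
```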

As for troublesome characters outside the whole issue of the UnicodeEncodeError (#40's sole domain and concern), the char_mapping feature introduced by this PR (or even the use of the QUASI_XML parser in the PRC) can be used to obviate that problem.

Yes, I realize that other scanners, such as S3's, will have to be tested independently to make sure they don't blow up on certain filenames.

Thoughts @alanking @korydraughn @trel @SwooshyCueb ?

d-w-moore commented 2 years ago

To lay a little more solid foundation for the discussion just above, run the script below and observe the output. Notice that all of those weird final characters (bytes 128-255) are translated to the Unicode surrogate codepoints \udc80-\udcff, as shown in the script's output.

#!/usr/bin/python3
import os
import tempfile

# Create a scratch directory, addressing it by a bytes path so that we can
# later create filenames containing arbitrary (non-UTF-8) bytes.
xdir = tempfile.mkdtemp(dir=b'.')
unicode_xdir = xdir.decode('utf8')

# Bytes 126 and 127 decode normally; bytes 128-255 are not valid single-byte
# UTF-8 and come back as the surrogate codepoints \udc80-\udcff.
for b in (bytes([x]) for x in range(126, 256)):
    with tempfile.NamedTemporaryFile(dir=xdir, suffix=b'_' + b):
        _file = next(os.scandir(unicode_xdir))
        print(repr(_file.name))

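As a follow-on sketch: the surrogate-bearing names that os.scandir reports can be mapped back to the exact on-disk bytes with os.fsencode (this assumes the usual 'utf-8' filesystem encoding of a modern POSIX system; the name below is hypothetical):

```python
import os

# Hypothetical name as os.scandir would report it for a file whose on-disk
# name ends in the raw byte 0x80 (assumes sys.getfilesystemencoding()
# returns 'utf-8'):
name = 'tmp1234_\udc80'
assert os.fsencode(name) == b'tmp1234_\x80'   # back to the on-disk bytes
assert os.fsdecode(b'tmp1234_\x80') == name   # and the reverse direction
```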
d-w-moore commented 2 years ago

What I'm assuming, and what I suppose needs to be thoroughly tested, is that all modern filesystems and their sub-objects expect to be queried as Unicode (UTF-8 encoded) entities, and that when we defeat that expectation by directly constructing filenames as arbitrary byte strings, the offending bytes are given surrogate values wherever the byte strings do not fit the UTF-8 encoding.

Oh, and also, that iRODS will always do the right thing even when handed those Unicode surrogate values. I think that part's a no-brainer, but it should of course be part of the test code we eventually write to satisfy all the coverage requirements.

trel commented 2 years ago

one indentation comment left

and then resolve the ones that are done. looking pretty good.

d-w-moore commented 2 years ago

Putting it through one more test before merge.

trel commented 2 years ago

Click resolve above on two comments and squash when ready. No # yet.

d-w-moore commented 2 years ago

Squashed with no #

trel commented 2 years ago

and squash again when ready. still no #.

d-w-moore commented 2 years ago

Ok, done.

trel commented 2 years ago

alright dan, please # em.

d-w-moore commented 2 years ago

done!