irods / irods_capability_automated_ingest

add character_map as event_handler method #166

Closed trel closed 1 year ago

trel commented 2 years ago

Some scanned filesystems/sources may have undesirable characters in their directory names and/or filenames.

A character_map method defined in an event_handler could manipulate a set of characters into another set of characters.

This can be a one-to-one mapping or, expressed more efficiently, a many-to-one mapping.

This would replace any defined characters in the source filename when creating the target data object name.

An example could be:

"John O'Hare's M&Ms @ home.txt" -> "John_O_Hare_s_M-Ms_-_home.txt"

It would also base64-encode the original absolute path and store it as an AVU on the target data object:

A: irods::automated_ingest::character_map
V: base64-encoded string of original abspath
U: python3.base64.b64encode(full_path_of_original_file)

This is similar to how we already automatically handle UnicodeEncodeError (see https://github.com/irods/irods_capability_automated_ingest/commit/ed8d794ec5b1d2d592e6ab88953674df70187b6a).

Note... if two scanned filenames map into identical target data object names... first one wins? I think so, same as today.
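A minimal sketch of the idea so far, assuming a flat one-to-one map. The names CHARACTER_MAP, map_name and original_path_avu, and the /data path, are hypothetical and only illustrate the example above; nothing here is the ingest tool's actual code.

```python
import base64

# Hypothetical flat mapping, chosen to reproduce the example in this issue.
CHARACTER_MAP = {" ": "_", "'": "_", "&": "-", "@": "-"}


def map_name(name, character_map=CHARACTER_MAP):
    """Replace every mapped character in one file or directory name."""
    return name.translate(str.maketrans(character_map))


def original_path_avu(abspath):
    """Build the (attribute, value, unit) triple that records the original path."""
    encoded = base64.b64encode(abspath.encode("utf8")).decode("ascii")
    return (
        "irods::automated_ingest::character_map",
        encoded,
        "python3.base64.b64encode(full_path_of_original_file)",
    )


source = "/data/John O'Hare's M&Ms @ home.txt"  # hypothetical scanned path
print(map_name("John O'Hare's M&Ms @ home.txt"))
print(original_path_avu(source))
```

The original path can be recovered later with base64.b64decode on the AVU value.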

multidict.py:

original_map = {
    ('#',"'",'^',"\\",'~'): '_',
    (chr(65535),chr(65534),'\n'): '-',
    '|': 'x',
    map(chr, range(65520, 65529)): 'y'
}
print(original_map)

expanded_map = {}
for k, v in original_map.items():
    if type(k) is str:
        k = [k]
    for key in k:
        expanded_map[key] = v

print(expanded_map)

produces...

$ python3 multidict.py 
{('\uffff', '\ufffe', '\n'): '-', <map object at 0x7f4ef27bd3c8>: 'y', '|': 'x', ('#', "'", '^', '\\', '~'): '_'}
{'\ufffe': '-', '\ufff3': 'y', '|': 'x', '\\': '_', '\ufff6': 'y', '\ufff8': 'y', '\ufff5': 'y', '#': '_', '\ufff2': 'y', "'": '_', '\ufff1': 'y', '\ufff4': 'y', '^': '_', '\uffff': '-', '\ufff7': 'y', '~': '_', '\n': '-', '\ufff0': 'y'}

so... we can do the expansion automatically and only document/expose the efficient syntax....

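from irods_capability_automated_ingest.core import Core

class event_handler(Core):
    # event handler methods are static; the keys use the compact many-to-one syntax
    @staticmethod
    def character_map(session, meta, **options):
        return {
            ('#', "'", '^', "\\", '~'): '_',
            (chr(65535), chr(65534), '\n'): '-',
            '|': 'x',
            map(chr, range(65520, 65529)): 'y'
        }

trel commented 2 years ago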

another consideration is when a found/ingested symlink points to a file that gets character_map'd.

either:
a) do nothing (b/c these are independent scanning workers, up to later process to figure out what happened)
b) update the data object's AVU to point to the new one instead (expensive, b/c query)
c) update the data object's AVUs to point to the new one in addition to the original one (expensive, b/c query)

case  symlink    target     desired effect
1     no-change  no-change  normal behavior
2     no-change  mapped     a or b or c (leaning a)
3     mapped     no-change  would be fine, no additional work
4     mapped     mapped     a or b or c (leaning a)

if a) is chosen... a later process could query for symlinks in the system, and potentially use the same character_map to derive any updated/mapped targets

1) xyz -> xyz ... symlink data object would have 'symlink AVU' pointing to the original target pathname. (already works)

2) xyz -> x^y ... symlink data object would have 'symlink AVU' pointing to the original target pathname. arguably the target symlink path should be safely encoded as well, rather than put in plaintext into the catalog (a separate enhancement)... https://github.com/irods/irods_capability_automated_ingest/blob/c60e2cc94f3ccb182c88db2c5bf7dff7b344ed9a/irods_capability_automated_ingest/sync_irods.py#L91-L93

3) x^y -> xyz .... symlink data object would have its name changed in the catalog, but have an AVU that includes its original name. it would have another AVU pointing to the target path (xyz)

4) x^y -> x^y .... symlink data object would have its name changed in the catalog, but have an AVU that includes its original name. it would have another AVU pointing to the original target pathname.. but this should also do the same as 2) above... and safely encode the target as well.
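If a) is chosen, the later process mentioned above might look roughly like this. It is only a sketch: derive_mapped_target is a hypothetical helper, it assumes the map is a flat one-to-one dict that never remaps '/', and it assumes every path component was mapped the same way at ingest time.

```python
def derive_mapped_target(original_target, character_map):
    """Re-apply the character map to each component of a symlink's original target path."""
    table = str.maketrans(character_map)
    # Map component by component so the path separators are left alone.
    return "/".join(part.translate(table) for part in original_target.split("/"))


character_map = {"^": "_"}  # hypothetical map
print(derive_mapped_target("/data/xyz", character_map))  # target was not mapped
print(derive_mapped_target("/data/x^y", character_map))  # target was mapped
```

When the derived path differs from the path stored in the symlink AVU, the target was mapped (cases 2 and 4) and the later process would know where to look.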

trel commented 2 years ago

We've established that meta['target'] is manipulatable in the 'pre' --event_handler methods - so there should be no additional moving parts necessary to get this implemented.

    # demonstrates changing/setting the target logical_path to uppercase
    def pre_data_obj_create(hdlr_mod, logger, session, meta, **options):
        import os
        logical_path = meta['target']
        logger.info('pre_data_obj_create:['+logical_path+']')
        new_logical_path = os.path.join(os.path.dirname(logical_path), os.path.basename(logical_path).upper())
        logger.info('pre_data_obj_create new:['+new_logical_path+']')
        meta['target'] = new_logical_path

It is also worth thinking about defining the character_map as what to keep, rather than what to change (e.g. keep alphanumerics, underscores, and hyphens, and convert everything else to an underscore).
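A quick sketch of the "what to keep" variant, assuming a regular expression is acceptable here. NOT_ALLOWED and keep_only_allowed are hypothetical names, and the allowlist itself is just the one described above.

```python
import re

# Everything outside the allowlist (alphanumerics, underscore, hyphen) gets replaced.
NOT_ALLOWED = re.compile(r"[^A-Za-z0-9_-]")


def keep_only_allowed(name, replacement="_"):
    """Replace each character that is not on the allowlist."""
    return NOT_ALLOWED.sub(replacement, name)


print(keep_only_allowed("John O'Hare's M&Ms @ home.txt"))
```

Note that with this exact allowlist the '.' before the extension is replaced too, so a real allowlist would probably want to include it.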

trel commented 2 years ago

One more consideration - if character_map is a separate event handler method... does it just always apply to pre_data_obj_create and pre_coll_create? Is that too magical? Should there be another knob/setting in the character_map definition to define when it applies? Or an annotation in the other methods to 'use' the character_map? Don't want to create two knobs that have to move together to work correctly.

trel commented 2 years ago

We need to decide whether the character_map fires BEFORE or AFTER any other defined pre_data_obj_create, i.e. whether meta['target'] will already have been changed by the character_map when that method runs. We are leaning towards character_map happening first.

And then... the AVU holding the character_map information would be inserted BEFORE any post_data_obj_create, as well?

And... the character_map could include function definitions and/or regular expressions as well as the literal syntax defined/shown above.
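One way the mixed syntax could be interpreted, as a sketch only. apply_character_map is a hypothetical helper: keys may be a single character, an iterable of characters, or a compiled regular expression, and values may be a string or a function that receives the matched text. Rules run in dict order, so a later rule can see the output of an earlier one.

```python
import re


def _replacement(value, matched):
    """Resolve a rule value: call it if it is a function, otherwise use it as-is."""
    return value(matched) if callable(value) else value


def apply_character_map(name, character_map):
    """Apply each rule in insertion order to a single name."""
    for key, value in character_map.items():
        if isinstance(key, re.Pattern):
            # A function as the replacement avoids backslash handling in plain strings.
            name = key.sub(lambda m, v=value: _replacement(v, m.group(0)), name)
        else:
            chars = [key] if isinstance(key, str) else list(key)
            for ch in chars:
                name = name.replace(ch, _replacement(value, ch))
    return name


example_map = {
    ("#", "'", "^"): "_",                          # literal many-to-one rule
    re.compile(r"\s+"): "-",                       # regular expression rule
    "|": lambda ch: "u{:04x}".format(ord(ch)),     # function rule
}
print(apply_character_map("a#b  c|d", example_map))
```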

trel commented 2 years ago

Also need to consider whether the character_map should fire BEFORE or AFTER the internal checking for UnicodeEncodeError.

We are leaning towards 'AFTER' - which, if bad characters are found, would make character_map 'moot'... which is fine b/c UnicodeEncodeError is worse and should be handled first.
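The ordering being leaned towards (UnicodeEncodeError check, then character_map, then any user-defined pre_data_obj_create) could be sketched like this. resolve_target_name and the placeholder name are hypothetical; the real UnicodeEncodeError handling lives in the commit linked earlier and is not reproduced here.

```python
def resolve_target_name(name, character_map, pre_hook=None):
    """Return (final_name, steps) for one scanned name, in the order discussed above."""
    steps = []
    try:
        name.encode("utf8")
    except UnicodeEncodeError:
        # Stand-in for the existing handling; the character_map is moot in this case.
        steps.append("unicode_error_handling")
        return "PLACEHOLDER_FOR_UNENCODABLE_NAME", steps
    mapped = name.translate(str.maketrans(character_map))
    if mapped != name:
        steps.append("character_map")
    if pre_hook is not None:
        # The user-defined hook sees the already-mapped name.
        mapped = pre_hook(mapped)
        steps.append("pre_data_obj_create")
    return mapped, steps


print(resolve_target_name("x^y", {"^": "_"}, pre_hook=str.upper))
print(resolve_target_name("\udce1\udc80", {"^": "_"}))
```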

d-w-moore commented 2 years ago

OK, actually, this type of filename (a UTF-8 encoding of '\u1000' truncated to two bytes) should raise the UnicodeEncodeError when scanned. Will try in the morning with 4.2.11, since 4.3.0 is possibly getting in the way of seeing it happen right now. If I'm right, we should keep the UnicodeEncodeError handling, even for Python 3.7+, based on this.

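#!/bin/bash
# - invoke UnicodeEncodeError for a scanned file
export FN="/tmp/pid_$$"
mkdir "$FN"
# the file name is the first two bytes (E1 80) of the three-byte UTF-8 encoding of U+1000
touch "$FN"/$'\341\200'
python3 -c 'import os
fname_itr = os.scandir(os.environ["FN"])
fname1 = next(fname_itr)
print(repr(fname1))
# the undecodable bytes were read as lone surrogates, so this encode raises UnicodeEncodeError
print(fname1.name.encode("utf8"))'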
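
The same condition can be shown without touching the filesystem. This is a sketch that assumes the name was decoded with the surrogateescape error handler, which is what os.scandir does for undecodable names on POSIX systems.

```python
raw = b"\xe1\x80"  # first two bytes of the UTF-8 encoding of "\u1000"

# Each undecodable byte becomes a lone surrogate code point.
name = raw.decode("utf8", "surrogateescape")

try:
    name.encode("utf8")
    raised = False
except UnicodeEncodeError:
    # Lone surrogates cannot be encoded as strict UTF-8.
    raised = True

print(repr(name), raised)

# The original bytes are still recoverable with the same error handler.
round_trip = name.encode("utf8", "surrogateescape")
print(round_trip == raw)
```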

alanking commented 1 year ago

Please close if complete, or let's leave an update on the next steps. Thanks!

d-w-moore commented 1 year ago

Ah... apparently I do not have privileges to close.

trel commented 1 year ago

I think this is complete - until we learn of a different set of requested requirements.