irods / irods_capability_automated_ingest

exclude of files and directories not working #200

Closed rmoreas closed 1 year ago

rmoreas commented 1 year ago

The option exclude_directory_name, which excludes directories from the sync using regex patterns, does not seem to work. For example:

exclude_directory_name = [ ".*/DA_.*_\+\.M" ]

Results in the following error:

[2023-06-13 17:21:30,661: WARNING/ForkPoolWorker-4] {"event": "calling [on_data_obj_create] in event handler: args = (<module 'event_handleragilent-lcms75de18ffd89a4d2ea2e084992430e316' from '/tmp/event_handleragilent-lcms75de18ffd89a4d2ea2e084992430e316.py'>, <BoundLoggerLazyProxy(logger=<Logger irods_sync//DEBUG (DEBUG)>, wrapper_class=None, processors=[<function filter_by_level at 0x7f07c1787430>, <function add_logger_name at 0x7f07c1787f70>, <function add_log_level at 0x7f07c17f1040>, <function timestamper at 0x7f07c1882b80>, <structlog.processors.JSONRenderer object at 0x7f07c17e5040>], context_class=None, initial_values={}, logger_factory_args=())>, <irods.session.iRODSSession object at 0x7f07c1463f10>, {'restart_queue': 'restart', 'path_queue': 'path', 'file_queue': 'file', 'src_path': '/data/Agilent_LCMS', 'target': '/set/home/set_pilot039/ingress/Agilent_LCMS/TEST-01/DA_FLUSH40_20MIN_+.M', 'interval': None, 'job_name': 'agilent-lcms', 'append_json': None, 'ignore_cache': False, 'initial_ingest': False, 'event_handler': '/home/mango/handlers/agilent_lcms.py', 'config': {'log': {'filename': None, 'when': None, 'interval': None, 'level': 'DEBUG'}, 'profile': {'filename': None, 'when': None, 'interval': None, 'level': None}, 'redis': {'host': 'mango-ingest-redis-master', 'port': '6379', 'db': '0'}}, 'synchronous': True, 'progress': False, 'profile': False, 'files_per_task': 50, 'exclude_file_type': [], 'exclude_file_name': ['.*/_INGEST_MANIFEST_.*\\\\.txt', '.*/test\\\\.hidden'], 'exclude_directory_name': ['.*/DA_.*_\\\\+\\\\.M', '.*/NV--.*\\\\.D'], 'idle_disconnect_seconds': 60, 's3_endpoint_domain': 's3.amazonaws.com', 's3_region_name': 'us-east-1', 's3_keypair': None, 's3_proxy_url': None, 's3_secure_connection': True, 'root': '/data/Agilent_LCMS', 'path': '/data/Agilent_LCMS/TEST-01/DA_FLUSH40_20MIN_+.M', 'event_handler_key': 'custom_event_handler:/agilent-lcms::75de18ffd89a4d2ea2e084992430e316', 'task': 'sync_file', 'queue_name': 'file', 'mtime': None, 
'ctime': None, 'chunk': {'/data/Agilent_LCMS/TEST-01/DA_FLUSH40_20MIN_+.M': {}, '/data/Agilent_LCMS/TEST-01/NV--0501.D': {}, '/data/Agilent_LCMS/TEST-01/STD_SEQ.LOG': {'is_link': False, 'is_socket': False, 'mtime': 1686662877.4023976, 'ctime': 1686662877.4023976, 'size': 78}, '/data/Agilent_LCMS/TEST-01/STD_SEQ.TXT': {'is_link': False, 'is_socket': False, 'mtime': 1686662412.8858666, 'ctime': 1686662412.8858666, 'size': 0}, '/data/Agilent_LCMS/TEST-01/test.hidden': {}, '/data/Agilent_LCMS/TEST-01/_INGEST_MANIFEST_IRODS.txt': {}, '/data/Agilent_LCMS/TEST-01/_INGEST_MANIFEST_LOCAL.txt': {}}, 'is_empty_dir': None, 'is_link': None, 'is_socket': None, 'size': None}), options = {}", "logger": "irods_sync//DEBUG", "level": "debug", "@timestamp": "2023-06-13T17:21:30.661376+00:00"}
[2023-06-13 17:21:30,663: WARNING/ForkPoolWorker-4] {"event": "uploading object /set/home/set_pilot039/ingress/Agilent_LCMS/TEST-01/DA_FLUSH40_20MIN_+.M, options = {}", "logger": "irods_sync//DEBUG", "level": "info", "@timestamp": "2023-06-13T17:21:30.662814+00:00"}
[2023-06-13 17:21:30,674: WARNING/ForkPoolWorker-4] {"task": "sync_dir", "path": "/data/Agilent_LCMS/TEST-01", "job_name": "agilent-lcms", "task_id": "a48a56b8-0a0e-11ee-becb-a6b286d33cd6", "exc": "IsADirectoryError(21, 'Is a directory')", "einfo": "<ExceptionInfo: Retry(Retry(...), IsADirectoryError(21, 'Is a directory'), 5)>", "traceback": ["<FrameSummary file /usr/local/lib/python3.9/site-packages/irods_capability_automated_ingest/sync_task.py, line 312 in sync_entry>", "<FrameSummary file /usr/local/lib/python3.9/site-packages/irods_capability_automated_ingest/sync_irods.py, line 413 in sync_data_from_file>", "<FrameSummary file /usr/local/lib/python3.9/site-packages/irods_capability_automated_ingest/custom_event_handler.py, line 61 in call>", "<FrameSummary file /usr/local/lib/python3.9/site-packages/irods_capability_automated_ingest/core.py, line 7 in on_data_obj_create>", "<FrameSummary file /usr/local/lib/python3.9/site-packages/irods_capability_automated_ingest/sync_irods.py, line 144 in upload_file>", "<FrameSummary file /usr/local/lib/python3.9/site-packages/irods/manager/data_object_manager.py, line 146 in put>"], "event": "retry_task", "logger": "irods_sync//DEBUG", "level": "warning", "@timestamp": "2023-06-13T17:21:30.674595+00:00"}
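As a quick sanity check that the patterns themselves are not at fault, the snippet below tests them against paths taken from the chunk in the log above. It is only a sketch: the helper matches_any is hypothetical, and it assumes the patterns are applied with re.match against the full path, which may differ from what the ingest tool does internally.

```python
import re

# Patterns as they appear (unescaped) in the job configuration.
exclude_directory_name = [r".*/DA_.*_\+\.M", r".*/NV--.*\.D"]

# Paths copied from the 'chunk' entry in the log output.
chunk_paths = [
    "/data/Agilent_LCMS/TEST-01/DA_FLUSH40_20MIN_+.M",
    "/data/Agilent_LCMS/TEST-01/NV--0501.D",
    "/data/Agilent_LCMS/TEST-01/STD_SEQ.LOG",
]


def matches_any(path, patterns):
    """Hypothetical helper: True if any pattern matches from the start of path.

    Assumption: matching is done with re.match on the full path.
    """
    return any(re.match(pattern, path) for pattern in patterns)


results = {path: matches_any(path, exclude_directory_name) for path in chunk_paths}
for path, excluded in results.items():
    print(path, "-> excluded" if excluded else "-> kept")
```

Under that assumption both directories match their patterns, so the regexes look fine and the problem is more likely in how the match result is used.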

The root cause seems to be at line 104 of irods_capability_automated_ingest/scanner.py. Even when the directory matches an exclude_directory_name regex pattern (self.exclude_file_type returns true), the itr function still returns the directory's path, and that path is added to the chunk passed in the next call to enqueue_task(sync_files, sync_files_meta).

Files that match an exclude_file_name pattern are likewise passed to enqueue_task(sync_files, sync_files_meta).

I think that instead of returning the path of the excluded directory/file, filesystem_scanner.itr should raise ContinueException, like this:

        try:
            if self.exclude_file_type(dir_regex, file_regex, full_path, logger, mode):
                raise ContinueException
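
To illustrate the idea outside the project's code, here is a minimal standalone sketch of the skip-by-exception pattern. Everything in it is hypothetical (the ContinueException class defined here, is_excluded, iter_included); it is not the scanner's actual implementation, and it again assumes re.match is used for pattern matching.

```python
import re


class ContinueException(Exception):
    """Hypothetical signal meaning 'skip this entry and move on'."""


def is_excluded(path, patterns):
    # Assumption: patterns are matched from the start of the full path.
    return any(re.match(pattern, path) for pattern in patterns)


def iter_included(paths, exclude_patterns):
    """Yield only the paths that match none of the exclude patterns."""
    for path in paths:
        try:
            if is_excluded(path, exclude_patterns):
                # Raise instead of returning the path, so the excluded
                # entry never reaches the caller's chunk.
                raise ContinueException
            yield path
        except ContinueException:
            continue


paths = [
    "/data/Agilent_LCMS/TEST-01/DA_FLUSH40_20MIN_+.M",
    "/data/Agilent_LCMS/TEST-01/NV--0501.D",
    "/data/Agilent_LCMS/TEST-01/STD_SEQ.LOG",
    "/data/Agilent_LCMS/TEST-01/STD_SEQ.TXT",
]
included = list(iter_included(paths, [r".*/DA_.*_\+\.M", r".*/NV--.*\.D"]))
print(included)
```

With this shape, only the two STD_SEQ files would end up in the chunk handed to the next task.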

rmoreas commented 1 year ago

Okay, now I see there is already a PR for this: #182

alanking commented 1 year ago

Leaving this open as I just realized I neglected to add a test for --exclude_file_type. Not a blocker for release, but just don't want to forget.

alanking commented 1 year ago

I only added a test for link files. Should we close this and open a new issue for the other types?

trel commented 1 year ago

let's open a different one for the other types.

alanking commented 1 year ago

Created https://github.com/irods/irods_capability_automated_ingest/issues/204 for the other file types. Closing.