irods / irods_capability_automated_ingest


[#207] multi read and write from S3 to iRODS for put, putsync #209

Closed. avkrishnamurthy closed this 1 year ago.

avkrishnamurthy commented 1 year ago

This addresses #207. The PUT and PUT_SYNC bugs for S3 objects were resolved in #206, but those operations still run very slowly because the fix uses a single-stream read and write from an S3 object into iRODS. For larger files this is a problem. This PR speeds up the transfer by using multithreading to read from S3 and write into iRODS in parallel.
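For readers unfamiliar with the technique, here is a minimal sketch of the idea, not the PR's actual code: each worker performs a ranged GET against S3 with boto3 and writes its chunk at the matching offset of the iRODS data object via the python-irodsclient (PRC). The chunk size, session arguments, and helper names are illustrative, and concurrent writes to a single data object may additionally require replica tokens on newer iRODS servers.

```python
# Illustrative sketch only -- not the code in this PR.
import concurrent.futures

import boto3
from irods.session import iRODSSession

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB ranged reads (illustrative)

def _copy_chunk(sess_args, bucket, key, logical_path, offset, length):
    # A ranged GET pulls only this worker's slice of the S3 object.
    body = boto3.client("s3").get_object(
        Bucket=bucket, Key=key,
        Range=f"bytes={offset}-{offset + length - 1}")["Body"]
    # One session and one open handle per worker.
    with iRODSSession(**sess_args) as session:
        with session.data_objects.open(logical_path, "r+") as f:
            f.seek(offset)
            f.write(body.read())

def parallel_put_from_s3(sess_args, bucket, key, logical_path, max_workers=4):
    size = boto3.client("s3").head_object(
        Bucket=bucket, Key=key)["ContentLength"]
    # Pre-create the target data object so workers can open it "r+".
    with iRODSSession(**sess_args) as session:
        session.data_objects.create(logical_path)
    ranges = [(off, min(CHUNK_SIZE, size - off))
              for off in range(0, size, CHUNK_SIZE)]
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        for fut in [pool.submit(_copy_chunk, sess_args, bucket, key,
                                logical_path, off, ln) for off, ln in ranges]:
            fut.result()  # surface any worker exception
```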

alanking commented 1 year ago

A fine start.

Consider breaking out some of this stuff into a new file and separating the concerns a bit. We created the concept of a generic "scanner" with implementations for a filesystem and an S3 bucket. Perhaps the S3 scanner can call these new "upload" functions and we won't need as much if/else in here.

Adding @d-w-moore as a reviewer for expert review of the use of the parallel transfer mechanisms of PRC.

avkrishnamurthy commented 1 year ago

You are referring to scanner.py, correct?

alanking commented 1 year ago

Yes, scanner.py. We may be able to have the scanner call these functions directly instead of routing things through upload_file and (re-)determining whether it's an S3 path or a filesystem path. I'm open to whatever way makes sense, though. A rough sketch of that direction follows.
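As a hypothetical illustration of the suggested separation of concerns (the class and method names here are not the repo's actual layout in scanner.py), each scanner implementation could own its upload path, so the shared sync code never branches on S3 vs. filesystem:

```python
# Hypothetical sketch of the suggested design; the real scanner classes differ.
class Scanner:
    def upload(self, session, source, logical_path):
        raise NotImplementedError

class FilesystemScanner(Scanner):
    def upload(self, session, source, logical_path):
        # Local files can use a plain PRC put.
        session.data_objects.put(source, logical_path)

class S3Scanner(Scanner):
    def upload(self, session, source, logical_path):
        # S3 sources dispatch straight to the new parallel transfer,
        # e.g. the parallel_put_from_s3() sketch above, so callers never
        # ask "is this S3 or a filesystem?" again.
        ...
```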

avkrishnamurthy commented 1 year ago

Added changes to have the scanner do the uploading/syncing of objects. I tested it with register, put, put_sync, and put_append for both S3 files and non-S3 files, parallelized and non-parallelized, and it worked. If this is too big of a change for this PR/issue, I understand and can create a separate issue and PR for refactoring scanner.py, sync_task.py, and sync_irods.py if needed.
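For context, the operation being exercised in such tests is selected by the ingest tool's event handler. A minimal handler pinning the operation to PUT_SYNC looks roughly like this, following the event-handler pattern in the project README:

```python
# Minimal event handler selecting the PUT_SYNC operation.
from irods_capability_automated_ingest.core import Core
from irods_capability_automated_ingest.utils import Operation

class event_handler(Core):
    @staticmethod
    def operation(session, meta, **options):
        # Swap in Operation.REGISTER_SYNC, Operation.PUT, or
        # Operation.PUT_APPEND to exercise the other code paths.
        return Operation.PUT_SYNC
```

The scan itself is then started with the tool's `irods_sync start` command pointing at the source and target, with `--event_handler` naming the module above.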

d-w-moore commented 1 year ago

Will take a look today.

avkrishnamurthy commented 1 year ago

Pounds added

korydraughn commented 1 year ago

Excellent work!