Closed: avkrishnamurthy closed this 1 year ago
A fine start.
Consider breaking out some of this stuff into a new file and separating the concerns a bit. We created the concept of a generic "scanner" with implementations for a filesystem and an S3 bucket. Perhaps the S3 scanner can call these new "upload" functions and we won't need as much if/else in here.
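The scanner split suggested here could be sketched roughly as below. This is a hypothetical illustration, not the project's actual API: the class names, the `upload` method, and the path-based dispatch are all assumptions made only to show how each scanner could own its transfer path and drop the if/else from `upload_file`.

```python
from abc import ABC, abstractmethod

class Scanner(ABC):
    """Generic scanner: each backend knows how to upload its own entries."""

    @abstractmethod
    def upload(self, path: str) -> str:
        """Transfer one entry to iRODS and return a status string."""

class FilesystemScanner(Scanner):
    def upload(self, path: str) -> str:
        # A real implementation would call the filesystem put/register logic here.
        return f"fs-put:{path}"

class S3Scanner(Scanner):
    def upload(self, path: str) -> str:
        # A real implementation would call the new S3 upload helpers directly,
        # so the S3-vs-filesystem decision is made once, not on every file.
        return f"s3-put:{path}"

def pick_scanner(path: str) -> Scanner:
    # Illustrative dispatch: choose the backend at scanner construction time.
    return S3Scanner() if path.startswith("/bucket/") else FilesystemScanner()
```

With this shape, callers ask the scanner to upload and never re-inspect the path themselves.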
Adding @d-w-moore as a reviewer for expert review of the use of the parallel transfer mechanisms of PRC.
You are referring to `scanner.py`, correct?
Yes, `scanner.py`. We may be able to have the scanner call these functions directly instead of routing things through `upload_file` and (re-)determining whether it's an S3 path or a filesystem path. I'm open to whatever way makes sense, though.
I've added changes so that the scanner does the uploading/syncing of objects. I tested it with register, put, put_sync, and put_append for both S3 and non-S3 files, parallelized and non-parallelized, and everything worked. If this is too big a change for this PR/issue, I understand and can open a separate issue and PR for refactoring `scanner.py`, `sync_task.py`, and `sync_irods.py` if needed.
Will look at this today.
Excellent work!
This addresses #207. The `PUT` and `PUT_SYNC` bugs for S3 objects were resolved in #206, but those operations still run very slowly because the fix uses a single-stream read and write from an S3 object to an iRODS collection, which is a problem for larger files. This PR speeds up the process by using multithreading to read from S3 and write into iRODS in parallel.