irods / irods_capability_automated_ingest


[#129] single stream put from S3 #206

Closed · avkrishnamurthy closed 1 year ago

avkrishnamurthy commented 1 year ago

This fixes the "File not found" error that occurs when running a PUT or PUT_SYNC for files from an S3 resource. I am using MinIO to get the file from the S3 bucket. Parallelization cannot currently be used for the put of S3 objects; that still needs to be worked on, and I am not sure what to do about it at this point. I have also included a checksum comparison against the ETag, for both regular files in an S3 bucket and multipart uploads. I am not sure whether this is the correct way to do the checksum, so feedback there would be good. The default chunk size for multipart uploads to Amazon S3 is 8 MB, but I have read that it may not always be the same, so I have added a command line argument to the ingest job for specifying a different chunk size.
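The ETag comparison described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: for a single-part upload, S3's ETag is the plain MD5 of the object; for a multipart upload, it is the MD5 of the concatenated per-part MD5 digests with a `-<part count>` suffix. The function name `s3_etag` and the size-based part/single-part decision are assumptions for the sketch; in reality, whether an object has a multipart ETag depends on how it was uploaded, and the chunk size must match the uploader's part size (hence the command line argument).

```python
import hashlib


def s3_etag(data: bytes, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Compute the ETag S3 would report for `data` (hypothetical helper).

    Single-part upload: plain MD5 hex digest of the object.
    Multipart upload: MD5 of the concatenated raw per-part MD5 digests,
    suffixed with "-<number of parts>".
    """
    if len(data) <= chunk_size:
        return hashlib.md5(data).hexdigest()
    part_digests = [
        hashlib.md5(data[i:i + chunk_size]).digest()
        for i in range(0, len(data), chunk_size)
    ]
    return hashlib.md5(b"".join(part_digests)).hexdigest() + f"-{len(part_digests)}"
```

An ingest job would compare this value against the ETag reported by the bucket, bearing in mind that ETags are not content MD5s for encrypted or multipart objects unless the part size is known.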

korydraughn commented 1 year ago

In the future, please leave off the pounds. We only add them once the changes have been reviewed and approved.

The reason for this is that if more changes are needed due to the review, pushing more commits causes the issue to be flooded with info about those new commits. So leaving the pounds off guards against that.

trel commented 1 year ago

if this PR is based on top of #134 then we should just add new commits, to preserve credit.

d-w-moore commented 1 year ago

Other than reaffirming @trel's statement above, this otherwise seems fine. I gather that it is a temporary substitute for the "complete" PUT from S3 solution we discussed yesterday, which will ultimately be multi-read/multi-write (NxN threads) from S3 to iRODS ?

avkrishnamurthy commented 1 year ago

> Other than reaffirming @trel's statement above, this otherwise seems fine. I gather that it is a temporary substitute for the "complete" PUT from S3 solution we discussed yesterday, which will ultimately be multi-read/multi-write (NxN threads) from S3 to iRODS ?

Yes, this is just the initial naive solution using a single stream-to-stream transfer.
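The single stream-to-stream transfer can be sketched as a plain chunked copy between two file-like objects. This is an illustration under assumed names, not the PR's code: `stream_put` is hypothetical, the in-memory buffers stand in for the real endpoints, and in the ingest job `source` would be the readable response returned by minio's `get_object` while `dest` would be an open iRODS data object.

```python
import io
import shutil


def stream_put(source, dest, buffer_size: int = 4 * 1024 * 1024) -> None:
    """Copy a readable file-like object into a writable one in fixed-size
    chunks, never holding the whole object in memory at once."""
    shutil.copyfileobj(source, dest, length=buffer_size)


# Stand-in demo with in-memory buffers; real endpoints would be the S3
# response stream and an iRODS data object handle.
src = io.BytesIO(b"object bytes from S3")
dst = io.BytesIO()
stream_put(src, dst)
```

The NxN-threaded version discussed above would replace this with parallel ranged reads and offset writes, which is why it is being deferred to a follow-on issue.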

trel commented 1 year ago

and i'm happy to get this merged as stream-to-stream, and then we can optimize in a new issue.

alanking commented 1 year ago

We are working on getting a branch which preserves the original authorship and builds on top of it. Stand by...

trel commented 1 year ago

oooh, git school is in.

alanking commented 1 year ago

Force-push seems to have been a success.
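The authorship-preserving rebuild mentioned above typically relies on `git cherry-pick`, which copies commits while keeping each commit's Author field (only the Committer changes), followed by a force-push of the rebuilt branch. The branch names and identities below are illustrative, not taken from this repository:

```shell
# Hypothetical sketch of rebuilding a branch while preserving authorship.
set -e
cd "$(mktemp -d)" && git init -q demo && cd demo

# Base commit by the maintainer.
git -c user.name=Maintainer -c user.email=m@example.org \
    commit -q --allow-empty -m "base"

# Contributor's work on their own branch.
git checkout -q -b contributor
git -c user.name=Contributor -c user.email=c@example.org \
    commit -q --allow-empty -m "single stream put from S3"

# Rebuild a fresh branch from base; cherry-pick keeps the Author field
# even though the maintainer creates the new commit.
git checkout -q -b rebuild HEAD~1
git -c user.name=Maintainer -c user.email=m@example.org \
    cherry-pick --allow-empty contributor >/dev/null

git log -1 --format="author: %an"
# The rebuilt branch would then replace the PR branch with something like:
#   git push --force-with-lease origin rebuild
```

`--force-with-lease` is the safer form of force-push, refusing to overwrite commits that are not already known locally.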

alanking commented 1 year ago

@avkrishnamurthy - As you address review comments, please click "Resolve conversation" so that we don't have to think about them anymore. If something was not satisfactorily addressed, we will start a new conversation. Thanks!

avkrishnamurthy commented 1 year ago

> Getting closer...

Added changes

JustinKyleJames commented 1 year ago

This looks fine to me as long as it has been tested and performs as expected.

avkrishnamurthy commented 1 year ago

> Everything looks good to me!
>
> Perhaps we squash this down to 2-3 commits (omitting #s for now) and have one last look.

Squashed.

avkrishnamurthy commented 1 year ago

> Looks good to me. Let's add the #s

I added the pounds to the 2 commits and to the PR title. Let me know if I need to change anything else.