VannTen closed this issue 1 year ago.
/cc @harshad16 /sig devsecops
At the very least, if we can't do that, we should do a set difference (src - dest) to avoid checking for existence each time.
I'll also need to check whether the aws CLI can take several documents at once in that case.
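For the set-difference idea, a minimal boto3 sketch could look like this (bucket names and prefixes are placeholders, not the actual configuration):

```python
import boto3

s3 = boto3.client("s3")


def list_keys(bucket: str, prefix: str = "") -> set:
    """Return all object keys under a prefix, following pagination."""
    keys = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys


# src - dest: only these documents actually need to be uploaded,
# so we skip the per-document existence check entirely.
to_sync = list_keys("src-bucket") - list_keys("dest-bucket")
```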
/kind bug
/triage needs-information @harshad16
Thanks @VannTen for looking into this. From the sync definition, it seems to be a good fit for this use case. We can use that as you noted.
Great!
I see two paths we can take:
Third path (impractical as a solution for now, but might be considered in the future): implementing sync functionality in upstream boto3 (see https://github.com/boto/boto3/issues/358#issuecomment-1156665630 and previous discussion)
Wdyt?
Great option, great work. I would suggest starting with 1.
Will start on that :+1:
/assign /lifecycle active /triage accepted
(The PR has been done for some time; this is just cleaning up issues.)
- Use `aws s3 sync` via subprocess directly:
  - straightforward
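A minimal sketch of that first path, with placeholder bucket URIs:

```python
import subprocess

# One sync invocation for the whole prefix instead of two subprocesses
# per document; sync is meant to copy only what is missing or changed.
subprocess.run(
    ["aws", "s3", "sync", "s3://src-bucket/prefix", "s3://dest-bucket/prefix"],
    check=True,
)
```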
So, after some local testing, it's not that straightforward. I can't seem to make it sync two S3 URIs (I need to go through a local folder), and it does not avoid re-uploading the same files (I found some issues on GitHub regarding this).
So, I'm going to try another approach: checking in-process which objects are already synced. And maybe replace the `aws s3 cp` calls with the boto3 equivalent, if it exists, to avoid the subprocess overhead.
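A rough sketch of what that could look like with boto3's managed copy (names are illustrative, not the actual app.py code; the key set would come from the src - dest difference above):

```python
import boto3

s3 = boto3.client("s3")


def copy_missing(src_bucket: str, dest_bucket: str, keys: set) -> None:
    """Server-side copy of documents missing from dest, without a subprocess."""
    for key in keys:
        # Rough boto3 equivalent of a per-object `aws s3 cp`.
        s3.copy({"Bucket": src_bucket, "Key": key}, dest_bucket, key)
```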
This was prompted by https://github.com/thoth-station/thoth-application/issues/2604
We currently spawn two subprocesses per document to sync, if I read app.py correctly. Could we not use `aws s3 sync s3://src s3://dest`? Or is that too coarse-grained?