thoth-station / document-sync-job

Sync Thoth documents to an S3 API compatible remote
GNU General Public License v3.0

Efficiency concerns #32

Closed · VannTen closed 1 year ago

VannTen commented 1 year ago

This was prompted by https://github.com/thoth-station/thoth-application/issues/2604

We currently spawn two subprocesses per document to sync, if I read app.py correctly. Could we not use `aws s3 sync s3://src s3://dest` instead? Or is that too coarse-grained?
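For reference, the coarse-grained variant would be a single subprocess call for the whole prefix instead of two calls per document; a minimal sketch, assuming the `aws` CLI is available in the image (bucket names and prefixes are placeholders, not the actual Thoth ones):

```python
import subprocess

# One sync of the whole prefix, instead of spawning two subprocesses
# per document. Bucket/prefix names here are illustrative only.
subprocess.run(
    ["aws", "s3", "sync", "s3://src-bucket/thoth/", "s3://dest-bucket/thoth/"],
    check=True,
)
```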

VannTen commented 1 year ago

/cc @harshad16
/sig devsecops

VannTen commented 1 year ago

At the very least, if we can't do that, we should compute a set difference (src - dest) so we avoid checking each document for existence individually.

I'll also need to check whether the aws CLI can take several documents at once in that case.
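A rough sketch of that set difference with boto3, using one paginated listing per bucket (bucket names and the prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def list_keys(bucket: str, prefix: str = "") -> set:
    """Collect all object keys under a prefix with a single paginated listing."""
    paginator = s3.get_paginator("list_objects_v2")
    keys = set()
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.add(obj["Key"])
    return keys

# Only the keys present in src but missing from dest need to be copied,
# so we skip one existence check per document.
to_sync = list_keys("src-bucket") - list_keys("dest-bucket")
```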

VannTen commented 1 year ago

/kind bug

VannTen commented 1 year ago

/triage needs-information
@harshad16

harshad16 commented 1 year ago

Thanks, @VannTen, for looking into this. From the sync definition, it seems to be a good fit for this use case. We can use it as you noted.

VannTen commented 1 year ago

Great!

I see two paths we can take:

  1. Use `aws s3 sync` via subprocess directly:
    • straightforward
  2. Discard document-sync-job completely and use the aws CLI in the image:
    • less code to own
    • we would probably lose the metrics sent to the Pushgateway (see the sketch below), so it might not be ideal.

Third path (impractical as a solution for now, but it might be considered in the future): implement sync functionality upstream in boto3 (see https://github.com/boto/boto3/issues/358#issuecomment-1156665630 and the preceding discussion).
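To make the metrics concern in option 2 concrete, here is roughly what an end-of-run push looks like with prometheus_client; the metric name and gateway address are hypothetical, not the actual Thoth ones:

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
synced = Gauge(
    "document_sync_job_documents_synced",  # hypothetical metric name
    "Number of documents synced in this run",
    registry=registry,
)
synced.set(42)

# A plain `aws s3 sync` container would have no natural place for this push.
push_to_gateway(
    "pushgateway.example:9091", job="document-sync-job", registry=registry
)
```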

Wdyt?

harshad16 commented 1 year ago

Great options, great work. I would suggest starting with option 1.

VannTen commented 1 year ago

Will start on that :+1:

VannTen commented 1 year ago

/assign
/lifecycle active
/triage accepted

(The PR has been done for some time; this is just cleaning up issues.)

VannTen commented 1 year ago
> 1. Use `aws s3 sync` via subprocess directly:
>   • straightforward

So, after some local testing: not that straightforward. I can't seem to make it sync between two S3 URIs (it needs to go through a local folder), and it does not avoid re-uploading the same files (I found some issues on GitHub regarding this).

So I'm going to try another approach: checking in-process which objects are already synced. And maybe replace the `aws s3 cp` calls with the boto3 equivalent, if it exists, to avoid the subprocess overhead.
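The boto3 equivalent does exist: a server-side copy needs no subprocess and no local download/re-upload. A minimal sketch building on the set difference above (names are placeholders; note `copy_object` is limited to objects up to 5 GB, which should be plenty for Thoth documents):

```python
import boto3

s3 = boto3.client("s3")

def sync_missing(src_bucket: str, dest_bucket: str, missing_keys: set) -> None:
    """Copy each missing key bucket-to-bucket, entirely server-side on S3."""
    for key in sorted(missing_keys):
        s3.copy_object(
            CopySource={"Bucket": src_bucket, "Key": key},
            Bucket=dest_bucket,
            Key=key,
        )
```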