thoth-station / document-sync-job

Sync Thoth documents to an S3 API compatible remote
GNU General Public License v3.0

Calculate set difference to avoid unnecessary copy to S3 #37

Closed VannTen closed 1 year ago

VannTen commented 1 year ago

Intent: We list objects from the source and the destination and compute the set difference. Both queries use list_objects_v2 from boto3, since we're talking to an S3 store. The keys are guaranteed (by the S3 API definition) to be sorted, so we can compute the set difference lazily by walking each generator (see sorted_set_diff). -> This avoids checking object existence in the destination one by one.
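The PR's actual sorted_set_diff isn't shown in this thread, but the idea can be sketched as a single-pass merge over two sorted key iterators (the names and signature here are illustrative, not the PR's code):

```python
from typing import Iterable, Iterator


def sorted_set_diff(source: Iterable[str], dest: Iterable[str]) -> Iterator[str]:
    """Yield keys present in source but missing from dest.

    Both iterables must be sorted ascending, as list_objects_v2 guarantees,
    so the difference can be computed lazily in one pass without loading
    either full listing into memory.
    """
    dest_iter = iter(dest)
    d = next(dest_iter, None)
    for s in source:
        # Skip destination keys that sort before the current source key.
        while d is not None and d < s:
            d = next(dest_iter, None)
        if d is None or s < d:
            yield s  # key not present in the destination
        # if s == d, the object already exists remotely: nothing to do


print(list(sorted_set_diff(["a", "b", "d"], ["b", "c"])))  # -> ['a', 'd']
```

Because both inputs are generators, this walks each paged listing exactly once, which is what avoids the per-object existence checks.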

Also, we use boto3 directly for upload and download, which avoids the fork-exec and the need for a temporary file.

NOTE TO REVIEWERS: I've arranged the commits into logical chunks, so they should be mostly individually reviewable.

EDIT: fix #32 fix thoth-station/thoth-application#2604

sesheta commented 1 year ago

Skipping CI for Draft Pull Request. If you want CI signal for your change, please convert it to an actual PR. You can still manually trigger a test run with /test all

VannTen commented 1 year ago

Intends to fix #32

VannTen commented 1 year ago

Should be ready for review o/

@harshad16: do we have a test suite / test cluster to check the functionality? From what I've seen while coding, we probably had other problems besides excessive memory use or timeouts (related to storages and handling solver prefixes).

VannTen commented 1 year ago

Also, this might be a candidate for upstreaming (into boto3 / botocore, maybe). There is a feature request, https://github.com/boto/boto3/issues/358, which has been open for some time. @goern

goern commented 1 year ago

Also, this might be a candidate for upstreaming (into boto3 / botocore maybe). There is a feature request boto/boto3#358 which is open for some time. @goern

Feel free to contribute it back upstream, if we can do it in a reasonable amount of time...

VannTen commented 1 year ago

I'll go probe them and see how that goes.

harshad16 commented 1 year ago

Should be ready for review o/

Can you please rebase your commits into one? As there is only one feature we are working on here.

do we have a test suite / test cluster to check the functionnality ? From what I've seen while coding, we probably had other problems than excessive memory or timeout (-> related to storages and handling solver prefixes)

We can test the functionality between stage and semi-prod. I will update some documentation and share it soon.

VannTen commented 1 year ago

Can do. The diff is pretty big, though, so I'll push another branch with the current commits, work_avoidance_with_generators_splitted, on my fork, if anyone wants to have a look at that.

VannTen commented 1 year ago

we can test the functionality between the stage and semi-prod. i will update some documentation and share it soon.

Any news on that? @harshad16

harshad16 commented 1 year ago

I have updated the documentation for the access here: https://github.com/thoth-station/thoth-application/pull/2647. With this, we can run this test job with the stage and semi-prod details.

harshad16 commented 1 year ago

Will include the test step in this repo as well, in a different PR, shortly.

harshad16 commented 1 year ago

Docs for testing: https://github.com/thoth-station/document-sync-job/pull/41

VannTen commented 1 year ago

While testing I encountered some annoyances with packaging / having separate files.

Would it be a problem to pull a version of #40 into this PR? This implies modifying the way the script is executed, but we can modify the manifest in thoth-application, right? @harshad16

harshad16 commented 1 year ago

Would it be a problem to pull in this PR a version of #40 ?

we should be able to do it.

This implies to modify the way the script is executed, but we can modify the manifest in thoth-application right ? @harshad16

Yes, we can update it in thoth-application.

VannTen commented 1 year ago

The date filtering does not seem to work with the older version (i.e., using get_document_listing for listing object keys) when I test it on the data on upshift ("data/ocp4-test").

Makes sense, since there is data not following what I can discern of the "path schema":

2022-09-26 07:25:45,643 1651 DEBUG    thoth.document_sync:157: Syncing source key (bucket='thoth') 'data/ocp4-test/analysis-by-digest/sha256:4892016fc0cab7eafc8c8eabe49169ac4d8aa1e9a6985b0713e8c9cf3f582245' to dest (bucket=thoth) key '/test/pr/37/analysis-by-digest/sha256:4892016fc0cab7eafc8c8eabe49169ac4d8aa1e9a6985b0713e8c9cf3f582245'
2022-09-26 07:25:45,641 1651 DEBUG    thoth.document_sync:157: Syncing source key (bucket='thoth') 'data/ocp4-test/analysis-by-digest/sha256:2e4bbb2be6e7aff711ddc93f0b07e49c93d41e4c2ffc8ea75f804ad6fe25564e' to dest (bucket=thoth) key '/test/pr/37/analysis-by-digest/sha256:2e4bbb2be6e7aff711ddc93f0b07e49c93d41e4c2ffc8ea75f804ad6fe25564e'
2022-09-26 07:25:45,642 1651 DEBUG    thoth.document_sync:157: Syncing source key (bucket='thoth') 'data/ocp4-test/analysis-by-digest/sha256:31ccb79b1b2c2d6eff1bee0db23d5b8ab598eafd6238417d9813f1346f717c11' to dest (bucket=thoth) key '/test/pr/37/analysis-by-digest/sha256:31ccb79b1b2c2d6eff1bee0db23d5b8ab598eafd6238417d9813f1346f717c11'

...
2022-09-26 07:25:46,533 1651 DEBUG    thoth.document_sync:157: Syncing source key (bucket='thoth') 'data/ocp4-test/analysis/package-extract-210216202423-73c76b4c118c34f4' to dest (bucket=thoth) key '/test/pr/37/analysis/package-extract-210216202423-73c76b4c118c34f4'
2022-09-26 07:25:46,533 1651 DEBUG    thoth.document_sync:157: Syncing source key (bucket='thoth') 'data/ocp4-test/analysis/package-extract-210217000154-5b861347e1675df2' to dest (bucket=thoth) key '/test/pr/37/analysis/package-extract-210217000154-5b861347e1675df2'
2022-09-26 07:25:46,534 1651 DEBUG    thoth.document_sync:157: Syncing source key (bucket='thoth') 'data/ocp4-test/analysis/package-extract-210220000034-6d6173ecc5b7ca9c' to dest (bucket=thoth) key '/test/pr/37/analysis/package-extract-210220000034-6d6173ecc5b7ca9c'

(On the last two it should be analysis-210220000034, as far as I can tell.) For analysis-by-digest, since there is no date at all, it obviously can't work.

Am I testing this wrong?

(I checked by running get_document_listing; that does return the keys. But when I add the start_date argument, I get an empty result.)


The underlying question is: do we need the prefix filtering on dates, or could we drop it?

@harshad16

VannTen commented 1 year ago

So, first benchmarks. I got this kind of result (using time):

Full sync (the skipped entries are ".request" objects)

2022-09-27 15:11:35,493 12266 INFO     thoth.document_sync:242: Summary of the sync operation:
2022-09-27 15:11:35,493 12266 INFO     thoth.document_sync:244: analysis-by-digest: 38 success, 0 skipped, 0 error
2022-09-27 15:11:35,494 12266 INFO     thoth.document_sync:244: analysis: 947 success, 637 skipped, 0 error
2022-09-27 15:11:35,494 12266 INFO     thoth.document_sync:244: solver: 1156 success, 0 skipped, 0 error

real    2m46.263s
user    0m26.379s
sys 0m4.593s

No op (everything already synced)

2022-09-27 15:11:58,504 16651 INFO     thoth.document_sync:242: Summary of the sync operation:
2022-09-27 15:11:58,504 16651 INFO     thoth.document_sync:244: analysis-by-digest: 0 success, 38 skipped, 0 error
2022-09-27 15:11:58,505 16651 INFO     thoth.document_sync:244: analysis: 0 success, 1584 skipped, 0 error
2022-09-27 15:11:58,505 16651 INFO     thoth.document_sync:244: solver: 0 success, 1156 skipped, 0 error

real    0m6.791s
user    0m1.191s
sys 0m0.094s

I'll try to compare with the previous method later.

harshad16 commented 1 year ago

Based on the previous question: we would only be copying the data-ingestion data most of the time (solvers, package-extract, etc.), so we would like to follow the prefix pattern component-datetime-wfhash, like package-extract-210220000034-6d6173ecc5b7ca9c; we would like to support such a pattern. Thanks for all the testing and the rewrite, this looks very promising.
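For illustration, the component-datetime-wfhash pattern can be parsed with a small regex. This is a sketch, not the job's actual code, and it assumes the datetime segment is always 12 digits (yymmddHHMMSS):

```python
import re
from datetime import datetime
from typing import Optional

# <component>-<yymmddHHMMSS>-<workflow hash>, e.g.
# package-extract-210220000034-6d6173ecc5b7ca9c
KEY_RE = re.compile(r"^(?P<component>.+)-(?P<dt>\d{12})-(?P<wfhash>[0-9a-f]+)$")


def key_date(key: str) -> Optional[datetime]:
    """Extract the embedded timestamp, or None for keys that don't match
    (e.g. the 'sha256:...' analysis-by-digest keys discussed above)."""
    match = KEY_RE.match(key)
    if match is None:
        return None
    return datetime.strptime(match.group("dt"), "%y%m%d%H%M%S")


print(key_date("package-extract-210220000034-6d6173ecc5b7ca9c"))
# -> 2021-02-20 00:00:34
```

The sha256-based keys carry no date at all, which is why any date-based prefix filtering necessarily returns None (or nothing) for them.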

VannTen commented 1 year ago

On Tue, Sep 27, 2022 at 12:43:20PM -0700, Harshad Reddy Nalla wrote:

Based on the previous question, we would be only copying the data ingestion data most of the time, like solvers, package-extract extra, so we would like to follow the prefix as component-datetime-wfhash like package-extract-210220000034-6d6173ecc5b7ca9c we would like to work such pattern.

Thing is, the prefix is not super consistent. For analysis, analysis-by-digest, and solver we have these keys:

analysis: data/ocp4-test/analysis/package-extract-220928000012-7643569fa3a0f8b6, data/ocp4-test/analysis/package-extract-220928010100-aae17edb771ef3e -> this is "root_prefix/adapter.RESULT_TYPE/<component>-<datetime>-<hash>"

analysis-by-digest: data/ocp4-test/analysis/sha256:f8db086dcbf028d9a1f23f3cf9173e1de087d3d8e72a592c92c9ab510e0352f3 data/ocp4-test/analysis/sha256:fa95fe99ae271240513a91c17fd2869936286eae5c18eded1149a1005f8db383 -> this is "root_prefix/adapter.RESULT_TYPE/sha256:<digest>"

solver: data/ocp4-test/analysis/solver-rhel-8-py36-220919010145-3f978877a8795a80 data/ocp4-test/analysis/solver-rhel-8-py36-220920010531-f033c4208b47163e -> this is "root_prefix/adapter.RESULT_TYPE/<component>-<datetime>-<hash>"

This was not what I expected from the code in storages (and indeed, get_document_listing does not seem to work when used with the start_date argument, as far as I can tell, at least for analysis/analysis-by-digest).

Solver is workable, but I don't know how to go from Analysis.RESULT_TYPE to "package-extract", for example.

we would be only copying the data ingestion data most of the time

If I follow correctly, this means only copying new data, right? The current code would do that; filtering on date is an optimization at the start of the pipeline.

Current workflow (of this PR) is:

1. source = listing of the source keys
2. dest = listing of the destination keys
3. to_copy = source - dest (aka, set difference), so if source == dest => to_copy == []
4. copy_from_source_to_dest(to_copy)
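The workflow can be sketched in plain Python, with in-memory lists standing in for the S3 listings (plan_sync and copy_from_source_to_dest here are illustrative names, not the PR's functions):

```python
def plan_sync(source_keys, dest_keys):
    """Steps 1-3: take both listings and keep only keys missing from dest."""
    return sorted(set(source_keys) - set(dest_keys))


def copy_from_source_to_dest(to_copy, copy_one):
    """Step 4: transfer only the missing keys."""
    for key in to_copy:
        copy_one(key)


copied = []
copy_from_source_to_dest(plan_sync(["a", "b", "c"], ["b"]), copied.append)
print(copied)  # -> ['a', 'c']
print(plan_sync(["a"], ["a"]))  # source == dest -> [] (nothing to copy)
```

Date filtering would only shrink the inputs to the first two steps; the set difference alone already guarantees that an up-to-date destination results in zero copies.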

Date prefixing only concerns the first step (setting source and dest); basically, the second case with no copying at all in my benchmarks posted yesterday. Comparing with empty source and destination:

$ time THOTH_DEPLOYMENT_NAME=rubbish THOTH_DOCUMENT_SYNC_DST=s3://thoth/test/pr/notapr/ document-sync-job --dry-run
...
2022-09-28 08:25:31,192 3373 INFO thoth.document_sync:242: Summary of the sync operation:
2022-09-28 08:25:31,192 3373 INFO thoth.document_sync:244: analysis-by-digest: 0 success, 0 skipped, 0 error
2022-09-28 08:25:31,192 3373 INFO thoth.document_sync:244: analysis: 0 success, 0 skipped, 0 error
2022-09-28 08:25:31,193 3373 INFO thoth.document_sync:244: solver: 0 success, 0 skipped, 0 error

real    0m2.829s
user    0m0.798s
sys 0m0.083s

vs

$ time document-sync-job --dry-run
2022-09-28 08:25:25,609 3358 INFO thoth.document_sync:242: Summary of the sync operation:
2022-09-28 08:25:25,609 3358 INFO thoth.document_sync:244: analysis-by-digest: 0 success, 38 skipped, 0 error
2022-09-28 08:25:25,609 3358 INFO thoth.document_sync:244: analysis: 2 success, 1585 skipped, 0 error
2022-09-28 08:25:25,610 3358 INFO thoth.document_sync:244: solver: 1 success, 1156 skipped, 0 error

real    0m5.728s
user    0m1.225s
sys 0m0.058s

So we have roughly 3 seconds of overhead for ~2700 keys, give or take. That would require more testing to see how it scales, though; I'm not sure how much of it is in the listing requests exactly (which are paged, 1000 keys at a time).
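For a rough feel of where that overhead goes: with list_objects_v2 capped at 1,000 keys per response, each side's listing is only a handful of sequential round-trips. A back-of-the-envelope sketch (not measured data):

```python
import math

PAGE_SIZE = 1000  # list_objects_v2 returns at most 1,000 keys per response


def list_requests(total_keys: int) -> int:
    """Number of paged listing calls needed to enumerate total_keys keys."""
    return max(1, math.ceil(total_keys / PAGE_SIZE))


# ~2700 keys on each side -> 3 requests per listing, so the ~3 s overhead
# is dominated by a few HTTP round-trips plus response parsing.
print(list_requests(2700))  # -> 3
```

Since the request count grows linearly at one call per 1,000 keys, the listing overhead should stay small even for much larger buckets.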

VannTen commented 1 year ago

That's where having #2691 would be cool (date as first-class metadata, not part of the key name itself).

VannTen commented 1 year ago

Results of benchmarks on current master with commit 75079167caac966fd0d362654d524cc174584c96 on top (for the URL envvar):

(The output format is different because it used the time binary instead of the bash builtin; apparently prefixing the command with environment parameter assignments does that.)

If I remember correctly, elapsed maps to real; the rest is the same.

No-op

$ time python app.py --debug
...
510.59user 160.82system 3:44.24elapsed 745%CPU (0avgtext+0avgdata 145836maxresident)k
88inputs+0outputs (33major+30686525minor)pagefaults 0swaps

Full sync

$ time python app.py --debug
...
3076.99user 351.99system 8:36.97elapsed 663%CPU (0avgtext+0avgdata 259352maxresident)k
0inputs+0outputs (69major+66830397minor)pagefaults 0swaps


If my math is correct, that's about 32 times faster in the no-op case and 3 times faster in the full-sync case (comparing elapsed/real; I don't think we care much about the other figures, because we're mainly doing IO).
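A quick arithmetic check of those ratios, using the elapsed times quoted in the benchmark comments above:

```python
# Elapsed ("real") times in seconds, taken from the benchmark output above.
old_noop = 3 * 60 + 44.24    # master, no-op: 3:44.24elapsed
new_noop = 6.791             # this PR, no-op: real 0m6.791s
old_full = 8 * 60 + 36.97    # master, full sync: 8:36.97elapsed
new_full = 2 * 60 + 46.263   # this PR, full sync: real 2m46.263s

print(round(old_noop / new_noop, 1))  # -> 33.0 (~32-33x faster, no-op)
print(round(old_full / new_full, 1))  # -> 3.1 (~3x faster, full sync)
```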

I'll see tomorrow how the date stuff performs (if it works in master; not sure about that).

VannTen commented 1 year ago
[max@work-laptop-max ~/redhat/thoth-station/document_sync_job] (aws_url_endpoint =)$ time THOTH_DOCUMENT_SYNC_DST=s3://thoth/test/pr/37 python app.py --days 1 --debug
2022-09-29 09:11:20,863 11951 WARNING  thoth.common:340: Logging to a Sentry instance is turned off
2022-09-29 09:11:20,864 11951 DEBUG    thoth.document_sync:176: Debug mode is on.
2022-09-29 09:11:20,864 11951 INFO     thoth.document_sync:178: Running document syncing job in version '0.3.1+storages0.73.4'
2022-09-29 09:11:20,864 11951 INFO     thoth.document_sync:183: Listing documents created since '2022-09-28'
2022-09-29 09:11:20,864 11951 INFO     thoth.document_sync:187: Listing 'analysis-by-digest' documents
2022-09-29 09:11:21,744 11951 INFO     thoth.document_sync:187: Listing 'analysis' documents
2022-09-29 09:11:22,522 11951 INFO     thoth.document_sync:204: Listing documents for solver 'solver-rhel-8-py36'
2022-09-29 09:11:23,233 11951 INFO     thoth.document_sync:204: Listing documents for solver 'solver-rhel-8-py38'
2022-09-29 09:11:23,590 11951 INFO     thoth.document_sync:225: Document sync job has finished successfully

real    0m3.481s
user    0m0.909s
sys 0m0.070s

So when using --days n, the current master version does not find any documents, it seems. (If I'm missing something, please point it out; this is my env:)

Env:

```
AWS_S3_ENDPOINT_URL=https://s3-openshift-storage.apps.smaug.na.operate-first.cloud
THOTH_CEPH_BUCKET=thoth
THOTH_CEPH_BUCKET_PREFIX=data
THOTH_CEPH_SECRET_KEY=pdcEGFERILlkRDGrCSxdIMaZVtNCOKvYP4Gf2b2x
PIPENV_ACTIVE=1
THOTH_S3_ENDPOINT_URL=https://s3.upshift.redhat.com/
THOTH_DOCUMENT_SYNC_DST=s3://thoth/test/pr/37/
EDITOR=vim
PWD=/home/max/redhat/thoth-station/document_sync_job
LOGNAME=max
SYSTEMD_EXEC_PID=815
THOTH_CEPH_KEY_ID=
THOTH_DEPLOYMENT_NAME=ocp4-test
PIP_PYTHON_PATH=/usr/bin/python
HOME=/home/max
LANG=en_US.UTF-8
VIRTUAL_ENV=/home/max/.local/share/virtualenvs/document_sync_job-ykvVh148
AWS_SECRET_ACCESS_KEY=
THOTH_DOCUMENT_SYNC_CONFIGURED_SOLVERS=solver-rhel-8-py36 solver-rhel-8-py38
PIP_DISABLE_PIP_VERSION_CHECK=1
AWS_ACCESS_KEY_ID=
PYTHONDONTWRITEBYTECODE=1
PATH=/home/max/.local/share/virtualenvs/document_sync_job-ykvVh148/bin:/usr/local/bin:/usr/bin
```

Given this and the benchmarks, is it OK to drop the --days option for now?

VannTen commented 1 year ago

This should be good for review.

The last commit removes the --days parameter. I put that in a commit on its own so it can easily be reverted if needed.

VannTen commented 1 year ago

Good point on using thoth/document-sync :+1:

goern commented 1 year ago

/lgtm
/assign @harshad16
/unassign @goern

sesheta commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: harshad16

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/thoth-station/document-sync-job/blob/master/OWNERS)~~ [harshad16]

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.