Closed VannTen closed 1 year ago
Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all
Intends to fix #32
Should be ready for review o/
@harshad16: do we have a test suite / test cluster to check the functionality? From what I've seen while coding, we probably had other problems besides excessive memory or timeouts (related to storages and the handling of solver prefixes).
Also, this might be a candidate for upstreaming (into boto3 / botocore maybe). There is a feature request https://github.com/boto/boto3/issues/358 which is open for some time. @goern
feel free to contribute back to upstream, if we can do it in a reasonable amount of time...
I'll go probe them and see how that goes.
Should be ready for review o/
Can you please rebase your commits into one? There is only one feature we are working on here.
do we have a test suite / test cluster to check the functionality? From what I've seen while coding, we probably had other problems besides excessive memory or timeouts (related to storages and the handling of solver prefixes)
we can test the functionality between the stage and semi-prod. i will update some documentation and share it soon.
Can do. The diff is pretty big, though, so I'll push another branch with the current commits to work_avoidance_with_generators_splitted on my fork, if anyone wants to have a look at that.
we can test the functionality between the stage and semi-prod. i will update some documentation and share it soon.
Any news on that? @harshad16
I have updated the documentation for the access here: https://github.com/thoth-station/thoth-application/pull/2647. With this, we can run this test job with stage and semi-prod details.
I will include the test step in this repo as well, in a different PR shortly.
docs for testing https://github.com/thoth-station/document-sync-job/pull/41
While testing I encountered some annoyances with packaging / having separate files.
Would it be a problem to pull a version of #40 into this PR? This implies modifying the way the script is executed, but we can modify the manifest in thoth-application, right? @harshad16
Would it be a problem to pull a version of #40 into this PR?
we should be able to do it.
This implies modifying the way the script is executed, but we can modify the manifest in thoth-application, right? @harshad16
Yes, we can update it in thoth-application.
The date filtering does not seem to work with the older version (i.e., using get_document_listing for listing object keys) when I test it on data on upshift ("data/ocp4-test").
That makes sense, since there is data not following what I can discern of the "path schema":
2022-09-26 07:25:45,643 1651 DEBUG thoth.document_sync:157: Syncing source key (bucket='thoth') 'data/ocp4-test/analysis-by-digest/sha256:4892016fc0cab7eafc8c8eabe49169ac4d8aa1e9a6985b0713e8c9cf3f582245' to dest (bucket=thoth) key '/test/pr/37/analysis-by-digest/sha256:4892016fc0cab7eafc8c8eabe49169ac4d8aa1e9a6985b0713e8c9cf3f582245'
2022-09-26 07:25:45,641 1651 DEBUG thoth.document_sync:157: Syncing source key (bucket='thoth') 'data/ocp4-test/analysis-by-digest/sha256:2e4bbb2be6e7aff711ddc93f0b07e49c93d41e4c2ffc8ea75f804ad6fe25564e' to dest (bucket=thoth) key '/test/pr/37/analysis-by-digest/sha256:2e4bbb2be6e7aff711ddc93f0b07e49c93d41e4c2ffc8ea75f804ad6fe25564e'
2022-09-26 07:25:45,642 1651 DEBUG thoth.document_sync:157: Syncing source key (bucket='thoth') 'data/ocp4-test/analysis-by-digest/sha256:31ccb79b1b2c2d6eff1bee0db23d5b8ab598eafd6238417d9813f1346f717c11' to dest (bucket=thoth) key '/test/pr/37/analysis-by-digest/sha256:31ccb79b1b2c2d6eff1bee0db23d5b8ab598eafd6238417d9813f1346f717c11'
...
2022-09-26 07:25:46,533 1651 DEBUG thoth.document_sync:157: Syncing source key (bucket='thoth') 'data/ocp4-test/analysis/package-extract-210216202423-73c76b4c118c34f4' to dest (bucket=thoth) key '/test/pr/37/analysis/package-extract-210216202423-73c76b4c118c34f4'
2022-09-26 07:25:46,533 1651 DEBUG thoth.document_sync:157: Syncing source key (bucket='thoth') 'data/ocp4-test/analysis/package-extract-210217000154-5b861347e1675df2' to dest (bucket=thoth) key '/test/pr/37/analysis/package-extract-210217000154-5b861347e1675df2'
2022-09-26 07:25:46,534 1651 DEBUG thoth.document_sync:157: Syncing source key (bucket='thoth') 'data/ocp4-test/analysis/package-extract-210220000034-6d6173ecc5b7ca9c' to dest (bucket=thoth) key '/test/pr/37/analysis/package-extract-210220000034-6d6173ecc5b7ca9c'
(on the last two it should be analysis-210220000034, as far as I can tell)
For analysis-by-digest, since there is no date at all, it obviously can't work.
Am I testing this wrong?
(I checked by running get_document_listing, and then I do get the keys. But when I add the start_date argument, I get an empty result.)
The underlying question is: do we need the prefix filtering on dates, or could we drop it? @harshad16
So, for the first benchmarks, I got this kind of result (using time):
Full sync (the skipped entries are ".request" objects):
2022-09-27 15:11:35,493 12266 INFO thoth.document_sync:242: Summary of the sync operation:
2022-09-27 15:11:35,493 12266 INFO thoth.document_sync:244: analysis-by-digest: 38 success, 0 skipped, 0 error
2022-09-27 15:11:35,494 12266 INFO thoth.document_sync:244: analysis: 947 success, 637 skipped, 0 error
2022-09-27 15:11:35,494 12266 INFO thoth.document_sync:244: solver: 1156 success, 0 skipped, 0 error
real 2m46.263s
user 0m26.379s
sys 0m4.593s
No op (everything already synced)
2022-09-27 15:11:58,504 16651 INFO thoth.document_sync:242: Summary of the sync operation:
2022-09-27 15:11:58,504 16651 INFO thoth.document_sync:244: analysis-by-digest: 0 success, 38 skipped, 0 error
2022-09-27 15:11:58,505 16651 INFO thoth.document_sync:244: analysis: 0 success, 1584 skipped, 0 error
2022-09-27 15:11:58,505 16651 INFO thoth.document_sync:244: solver: 0 success, 1156 skipped, 0 error
real 0m6.791s
user 0m1.191s
sys 0m0.094s
I'll try to compare with the previous method later.
Based on the previous question, we would only be copying the data-ingestion data most of the time (solvers, package-extract, etc.), so we would like to follow the prefix pattern component-datetime-wfhash,
like package-extract-210220000034-6d6173ecc5b7ca9c.
We would like to work with such a pattern.
Thanks for all the testing and the rewrite; this looks very promising.
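For illustration, the component-datetime-wfhash pattern described above could be parsed with a small helper like the following. This is a sketch, not code from this PR; the helper name and regex are hypothetical, and the timestamp format (YYMMDDHHMMSS) is inferred from the example key:

```python
import re
from datetime import datetime
from typing import Optional

# Keys like "package-extract-210220000034-6d6173ecc5b7ca9c" follow the
# pattern <component>-<YYMMDDHHMMSS>-<workflow hash>; analysis-by-digest
# keys ("sha256:...") carry no date at all.
_KEY_RE = re.compile(r"^(?P<component>.+)-(?P<dt>\d{12})-(?P<wfhash>[0-9a-f]+)$")


def key_datetime(key: str) -> Optional[datetime]:
    """Return the datetime encoded in an object key, or None if absent."""
    match = _KEY_RE.match(key.rsplit("/", 1)[-1])  # look at the basename only
    if not match:
        return None  # e.g. analysis-by-digest keys have no embedded date
    return datetime.strptime(match.group("dt"), "%y%m%d%H%M%S")
```

The greedy component group also handles components with internal hyphens, such as solver-rhel-8-py36.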
On Tue, Sep 27, 2022 at 12:43:20PM -0700, Harshad Reddy Nalla wrote:
Based on the previous question, we would be only copying the data ingestion data most of the time, like solvers, package-extract extra, so we would like to follow the prefix as
component-datetime-wfhash
like package-extract-210220000034-6d6173ecc5b7ca9c
we would like to work such pattern.
The thing is, the prefix is not super consistent. For analysis, analysis-by-digest, and solver, we have these keys:
analysis:
data/ocp4-test/analysis/package-extract-220928000012-7643569fa3a0f8b6
data/ocp4-test/analysis/package-extract-220928010100-aae17edb771ef3e
-> this is "root_prefix/adapter.RESULT_TYPE/
analysis-by-digest:
data/ocp4-test/analysis/sha256:f8db086dcbf028d9a1f23f3cf9173e1de087d3d8e72a592c92c9ab510e0352f3
data/ocp4-test/analysis/sha256:fa95fe99ae271240513a91c17fd2869936286eae5c18eded1149a1005f8db383
-> this is "root_prefix/adapter.RESULT_TYPE/
solver:
data/ocp4-test/analysis/solver-rhel-8-py36-220919010145-3f978877a8795a80
data/ocp4-test/analysis/solver-rhel-8-py36-220920010531-f033c4208b47163e
-> this is "root_prefix/adapter.RESULT_TYPE/
This was not what I expected from the code in storages (and indeed, get_document_listing does not seem to work when used with the start_date argument, as far as I can tell, at least for analysis/analysis-by-digest).
Solver is workable, but I don't know how to go from Analysis.RESULT_TYPE to "package-extract" for example.
we would be only copying the data ingestion data most of the time
If I follow correctly, this means only copying new data, right? The current code would do that; filtering on date is an optimization at the start of the pipeline.
The current workflow (of this PR) is:
1. list the keys in the source and in the destination
2. compute to_copy as the source keys missing from the destination (lazy set difference)
3. copy_from_source_to_dest(to_copy)
Date prefixing only concerns the first step (setting source and dest); basically, that's the second case, with no copying at all, in the benchmarks I posted yesterday. Comparing with empty source and destination:
$ time THOTH_DEPLOYMENT_NAME=rubbish THOTH_DOCUMENT_SYNC_DST=s3://thoth/test/pr/notapr/ document-sync-job --dry-run
...
2022-09-28 08:25:31,192 3373 INFO thoth.document_sync:242: Summary of the sync operation:
2022-09-28 08:25:31,192 3373 INFO thoth.document_sync:244: analysis-by-digest: 0 success, 0 skipped, 0 error
2022-09-28 08:25:31,192 3373 INFO thoth.document_sync:244: analysis: 0 success, 0 skipped, 0 error
2022-09-28 08:25:31,193 3373 INFO thoth.document_sync:244: solver: 0 success, 0 skipped, 0 error
real 0m2.829s
user 0m0.798s
sys 0m0.083s
vs
time document-sync-job --dry-run
2022-09-28 08:25:25,609 3358 INFO thoth.document_sync:242: Summary of the sync operation:
2022-09-28 08:25:25,609 3358 INFO thoth.document_sync:244: analysis-by-digest: 0 success, 38 skipped, 0 error
2022-09-28 08:25:25,609 3358 INFO thoth.document_sync:244: analysis: 2 success, 1585 skipped, 0 error
2022-09-28 08:25:25,610 3358 INFO thoth.document_sync:244: solver: 1 success, 1156 skipped, 0 error
real 0m5.728s
user 0m1.225s
sys 0m0.058s
So we have roughly 3 seconds of overhead for 2700 keys, give or take. That would require more testing to see how it scales, though; I'm not sure how much of it is spent in the listing requests exactly (which are paged, 1000 keys at a time).
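For context, the paged listing can be consumed lazily via a boto3 paginator, so memory stays bounded no matter how many keys the bucket holds. This is a hypothetical helper sketch, not the PR's actual code; it assumes a configured boto3 S3 client:

```python
def iter_object_keys(client, bucket, prefix=""):
    """Lazily yield every key under ``prefix``.

    ``client`` is assumed to be a boto3 S3 client. list_objects_v2 returns at
    most 1000 keys per request; the paginator fetches pages on demand, so the
    full listing is never held in memory at once.
    """
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):  # page may be empty
            yield obj["Key"]


# Usage (assuming credentials and endpoint are configured):
#   import boto3
#   keys = iter_object_keys(boto3.client("s3"), "thoth", "data/ocp4-test/")
```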
That's where having #2691 would be nice (date as first-class metadata, not part of the key name itself).
Results of benchmarks on current master with commit 75079167caac966fd0d362654d524cc174584c96 on top (for the URL envvar):
(The output format is different because it used the time binary instead of the bash builtin; apparently prefixing the command with environment variable assignments does that.)
If I remember correctly, elapsed maps to real; the rest is the same.
time python app.py --debug
...
510.59user 160.82system 3:44.24elapsed 745%CPU (0avgtext+0avgdata 145836maxresident)k
88inputs+0outputs (33major+30686525minor)pagefaults 0swaps

time python app.py --debug
...
3076.99user 351.99system 8:36.97elapsed 663%CPU (0avgtext+0avgdata 259352maxresident)k
0inputs+0outputs (69major+66830397minor)pagefaults 0swaps
If my math is correct, that's 32 times faster in the no-op case and 3 times faster in the full-op case (counting elapsed/real; I don't think we care much about the other numbers, because we're mainly doing IO).
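As a sanity check, the ratios can be recomputed from the elapsed/real times quoted in the two benchmark comments above; they come out around 33x and 3.1x, consistent with the rough figures:

```python
# Elapsed/real times taken from the benchmark outputs posted above.
master_noop = 3 * 60 + 44.24   # 3:44.24 elapsed, master, no-op run
master_full = 8 * 60 + 36.97   # 8:36.97 elapsed, master, full sync
pr_noop = 6.791                # 0m6.791s real, this PR, no-op run
pr_full = 2 * 60 + 46.263      # 2m46.263s real, this PR, full sync

print(f"no-op speedup: {master_noop / pr_noop:.1f}x")  # ~33x
print(f"full speedup:  {master_full / pr_full:.1f}x")  # ~3.1x
```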
I'll see tomorrow how the dates stuff performs (if it works in master; not sure about that).
[max@work-laptop-max ~/redhat/thoth-station/document_sync_job] (aws_url_endpoint =)$ time THOTH_DOCUMENT_SYNC_DST=s3://thoth/test/pr/37 python app.py --days 1 --debug
2022-09-29 09:11:20,863 11951 WARNING thoth.common:340: Logging to a Sentry instance is turned off
2022-09-29 09:11:20,864 11951 DEBUG thoth.document_sync:176: Debug mode is on.
2022-09-29 09:11:20,864 11951 INFO thoth.document_sync:178: Running document syncing job in version '0.3.1+storages0.73.4'
2022-09-29 09:11:20,864 11951 INFO thoth.document_sync:183: Listing documents created since '2022-09-28'
2022-09-29 09:11:20,864 11951 INFO thoth.document_sync:187: Listing 'analysis-by-digest' documents
2022-09-29 09:11:21,744 11951 INFO thoth.document_sync:187: Listing 'analysis' documents
2022-09-29 09:11:22,522 11951 INFO thoth.document_sync:204: Listing documents for solver 'solver-rhel-8-py36'
2022-09-29 09:11:23,233 11951 INFO thoth.document_sync:204: Listing documents for solver 'solver-rhel-8-py38'
2022-09-29 09:11:23,590 11951 INFO thoth.document_sync:225: Document sync job has finished successfully
real 0m3.481s
user 0m0.909s
sys 0m0.070s
So when using --days n, the current master version does not seem to find any documents. (If I'm missing something, please point it out; this is my env:
Given this and the benchmarks, is it OK to drop the --days option for now?
This should be good for review.
The last commit removes the --days parameter. I put that in a commit of its own to make it easy to revert if needed.
Good point on using thoth/document-sync :+1:
/lgtm
/assign @harshad16
/unassign @goern
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: harshad16
The full list of commands accepted by this bot can be found here.
The pull request process is described here
Intent: We list objects from the source and the destination and calculate the set difference. Both queries use list_objects_v2 from boto3, since we're talking to an S3 store. The keys are guaranteed (by the S3 API definition) to be sorted, so we can compute the set difference lazily by walking each generator (see sorted_set_diff). -> This avoids checking object existence in the destination one by one. Also, we use boto3 directly for upload and download, which avoids the fork-exec and the need for a temporary file.
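The lazy set difference can be sketched as a merge-style walk over two sorted iterators. This is a simplified illustration of the idea behind sorted_set_diff, not the PR's exact code:

```python
def sorted_set_diff(source, dest):
    """Yield items of ``source`` that are absent from ``dest``.

    Both iterables must be sorted ascending (S3's list_objects_v2 returns
    keys in lexicographic order), so one merge-style pass suffices and
    neither listing is ever materialized fully in memory.
    """
    dest = iter(dest)
    d = next(dest, None)
    for s in source:
        # advance the destination cursor past keys smaller than s
        while d is not None and d < s:
            d = next(dest, None)
        if d is None or s < d:
            yield s  # s is missing from dest -> needs to be copied
        # if s == d, the object already exists in dest: skip it
```

Because both inputs are generators, this composes directly with paged S3 listings without buffering either side.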
NOTE TO REVIEWERS: I've arranged the commits into logical chunks, so they should be mostly individually reviewable.
EDIT: fix #32 fix thoth-station/thoth-application#2604
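The "boto3 directly, no temporary file" part can be illustrated by streaming a get_object body straight into upload_fileobj; the StreamingBody is file-like, so no intermediate file or subprocess is needed. A hedged sketch, not the PR's actual code:

```python
def stream_copy(src_client, dst_client, src_bucket, src_key, dst_bucket, dst_key):
    """Copy one object between (possibly different) S3 endpoints.

    Both clients are assumed to be boto3 S3 clients. The body returned by
    get_object exposes ``read()``, so upload_fileobj can consume it in
    chunks directly, avoiding a fork-exec and a temporary file on disk.
    """
    body = src_client.get_object(Bucket=src_bucket, Key=src_key)["Body"]
    dst_client.upload_fileobj(body, dst_bucket, dst_key)
```

When source and destination live on the same endpoint, a server-side copy (client.copy) would avoid moving the bytes through the job at all; streaming is the general fallback.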