harvard-lil / perma

Indelible links
408 stars 72 forks source link

Validate the WACZ produced during the conversion #3536

Closed rebeccacremona closed 4 weeks ago

rebeccacremona commented 4 weeks ago

This is a very basic check to make sure that the WACZ came out okay, during our test/benchmarking WARC -> WACZ conversion experiment.

Check the hash of the original WARC against the hash of the WARC inside the WACZ. This won't tell us whether the archive plays back, but will at least ensure that we didn't lose any valuable data.

To find the remote hash and calculate the new one, I tweaked the logic of Ben's awesome check_s3_hashes task, written back when to help us migrate from local filesystem storage to S3 ❤️.

See ENG-929.

codecov[bot] commented 4 weeks ago

Codecov Report

Attention: Patch coverage is 8.00000% with 23 lines in your changes missing coverage. Please review.

Project coverage is 69.23%. Comparing base (ab146f3) to head (913eeba). Report is 14 commits behind head on develop.

Files Patch % Lines
perma_web/perma/celery_tasks.py 8.00% 23 Missing :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## develop #3536 +/- ## =========================================== - Coverage 69.50% 69.23% -0.28% =========================================== Files 48 48 Lines 6788 6816 +28 =========================================== + Hits 4718 4719 +1 - Misses 2070 2097 +27 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

rebeccacremona commented 4 weeks ago

Thanks for the review, @cmsetzer! Suggested tweaks forthcoming!

rebeccacremona commented 4 weeks ago

Huh, why does flake8 in CI think that is a syntax error.... https://github.com/harvard-lil/perma/actions/runs/9422297235/job/25958258797?pr=3536

rebeccacremona commented 4 weeks ago

I wonder which version of python the linting step is running. It must be an old one...

rebeccacremona commented 4 weeks ago

Yep.

Ubuntu 20.04, which is running Python 3.8.10 https://github.com/actions/runner-images/blob/ubuntu20/20240603.1/images/ubuntu/Ubuntu2004-Readme.md

This style was added in 3.10.

cmsetzer commented 4 weeks ago

My bad! I thought we'd be solid since the app itself is running on our 3.11 image. Curious that the tests seem to run fine on Ubuntu 20.04, but maybe the new task code is just not getting imported.

rebeccacremona commented 4 weeks ago

Yeah I didn't realize that the linting workflow uses the system python, rather than the python in the container! I think no big deal to upgrade! I might try upgrading the test workflow to a newer unbuntu some other time, since that one is so much more complex.

sentry-io[bot] commented 2 weeks ago

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

Did you find this useful? React with a 👍 or 👎