harvard-lil / perma

Indelible links

WARC to WACZ conversion #3488

Closed teovin closed 3 months ago

teovin commented 3 months ago

This is the first draft of the WARC to WACZ conversion experiment.

We have a sample of 1,000 WARC files, which I uploaded to my local MinIO instance using Ben's command here.

The task reads all the WARC names from the CSV file, fetches each corresponding WARC from storage, converts it, and uploads the resulting WACZ file to storage. It also logs the conversion duration, status, file size, and error log (if any) to a CSV file.

Optionally, the task can accept a single WARC argument and process only that file.

Sample invocation:

docker compose exec web invoke dev.benchmark-wacz-conversion --source-csv='perma/wacz_experiment/1000-a-guids.csv' --benchmark-log='perma/wacz_experiment/benchmark.csv'

or

docker compose exec web invoke dev.benchmark-wacz-conversion --single-warc='A276-A9A4.warc.gz' --benchmark-log='perma/wacz_experiment/benchmark.csv'
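For readers skimming the thread, here is a rough sketch of the loop described above. It is not the code in this PR: `fetch`, `convert`, and `upload` are hypothetical callables standing in for the storage and conversion steps, the log column names are made up, and it assumes the source CSV lists one WARC name in its first column.

```python
import csv
import time
from pathlib import Path
from typing import Callable, Iterable, Optional

# Hypothetical column names for the benchmark log.
LOG_FIELDS = ["warc", "status", "duration_s", "warc_bytes", "wacz_bytes", "error"]


def wacz_name(warc_name: str) -> str:
    """Derive an output name, e.g. 'A276-A9A4.warc.gz' -> 'A276-A9A4.wacz'."""
    for suffix in (".warc.gz", ".warc"):
        if warc_name.endswith(suffix):
            return warc_name[: -len(suffix)] + ".wacz"
    return warc_name + ".wacz"


def resolve_warc_names(source_csv: Optional[str] = None,
                       single_warc: Optional[str] = None) -> list[str]:
    """Return one WARC name, or every name in the CSV's first column (assumed layout)."""
    if single_warc:
        return [single_warc]
    if not source_csv:
        raise ValueError("provide either source_csv or single_warc")
    with open(source_csv, newline="") as f:
        return [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]


def benchmark_conversion(names: Iterable[str],
                         fetch: Callable[[str], bytes],
                         convert: Callable[[bytes], bytes],
                         upload: Callable[[str, bytes], None],
                         benchmark_log: str) -> list[dict]:
    """Convert each WARC and append one row per file to the benchmark CSV."""
    log_path = Path(benchmark_log)
    write_header = not log_path.exists() or log_path.stat().st_size == 0
    rows = []
    with log_path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=LOG_FIELDS)
        if write_header:
            writer.writeheader()
        for name in names:
            row = {"warc": name, "status": "success", "duration_s": "",
                   "warc_bytes": "", "wacz_bytes": "", "error": ""}
            try:
                warc = fetch(name)
                row["warc_bytes"] = len(warc)
                start = time.perf_counter()  # time only the conversion step
                try:
                    wacz = convert(warc)
                finally:
                    row["duration_s"] = round(time.perf_counter() - start, 3)
                row["wacz_bytes"] = len(wacz)
                upload(wacz_name(name), wacz)
            except Exception as exc:  # record the failure and move on to the next file
                row["status"] = "failure"
                row["error"] = repr(exc)
            writer.writerow(row)
            rows.append(row)
    return rows
```

Passing the storage and conversion steps in as callables keeps the sketch independent of any particular storage backend or WACZ tool. A failure on one file is written to the log instead of stopping the run, which is the behavior you would likely want across a 1,000-file sample.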

I also replayed a few of them using Becky's replay changes here.

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 13.25301%, with 72 lines in your changes missing coverage. Please review.

Project coverage is 70.48%. Comparing base (efa78c0) to head (fbe925c). Report is 7 commits behind head on develop.

| Files | Patch % | Lines |
|---|---|---|
| perma_web/perma/celery_tasks.py | 17.85% | 46 Missing :warning: |
| perma_web/tasks/dev.py | 0.00% | 25 Missing :warning: |
| perma_web/perma/models.py | 50.00% | 1 Missing :warning: |
Additional details and impacted files

```diff
@@             Coverage Diff             @@
##           develop    #3488      +/-   ##
===========================================
- Coverage    71.22%   70.48%   -0.74%
===========================================
  Files           48       48
  Lines         6512     6594      +82
===========================================
+ Hits          4638     4648      +10
- Misses        1874     1946      +72
```


matteocargnelutti commented 3 months ago

Really cool, @teovin 🏄 !!

teovin commented 3 months ago

image