harvard-lil / perma

Indelible links

Upload files to existing "daily" digest IA items, Part I #3238

rebeccacremona closed 1 year ago

rebeccacremona commented 1 year ago

Context

We are reorganizing Perma's Internet Archive collection. In the past, we would add an "Item" with a single "File" to the collection for each new Perma Link; going forward, we've been asked to instead create one digest-like Item per day, with Files for each of the Perma Links created on that day.

See this internal document for a complete description of the project; see also its project board.

This PR

There are some 162,082 Perma Links whose WARCs we would have expected to find in existing "daily" Internet Archive items, but which are not present. For example, the WARC for https://perma.cc/06N5Qy9rxNE is not included in https://archive.org/download/daily_perma_cc_2013-11-13.

This PR adds Celery tasks for uploading those links to the appropriate daily item.

It does NOT create new items in IA if the appropriate daily item does not already exist.
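
As a rough illustration of that rule (not the code in this PR), the sketch below uploads a link only when its daily item is already present. The identifier pattern is inferred from the example URL above; `item_exists` and `upload` are hypothetical callables standing in for whatever IA client calls the real tasks use.

```python
from datetime import date
from typing import Callable


def daily_item_identifier(created: date) -> str:
    """Return the assumed identifier of the daily item for a given creation date."""
    # Pattern inferred from https://archive.org/download/daily_perma_cc_2013-11-13
    return f"daily_perma_cc_{created.isoformat()}"


def upload_if_daily_item_exists(
    guid: str,
    created: date,
    item_exists: Callable[[str], bool],   # hypothetical: checks IA for the item
    upload: Callable[[str, str], None],   # hypothetical: uploads the link's WARC
) -> bool:
    """Upload the link to its daily item; never create a missing item."""
    identifier = daily_item_identifier(created)
    if not item_exists(identifier):
        # No daily item for this date: do nothing rather than create one.
        return False
    upload(identifier, guid)
    return True
```

Passing the existence check and the upload in as callables keeps the sketch testable without touching IA.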

This is intended as a gentle way to try out our upload code and to test parallelization and the handling of rate-limiting. It will likely require tweaking once we see how it works under real conditions.

Deploying

There's nothing special required for deployment, though we may want to have another look at all the IA-related settings and make sure we like them. There are now a number of configurable retry limits that may need tweaking.
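
To make the idea of a configurable retry limit concrete, here is a minimal sketch of retrying with exponential backoff when a service signals rate-limiting. It is not the PR's implementation: `RateLimited`, `call_with_retries`, `max_retries`, and `base_delay` are all hypothetical names, and in practice Celery's own task retry options would likely be the natural place for this.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical signal that the remote service asked us to slow down."""


def call_with_retries(
    func: Callable[[], T],
    max_retries: int = 3,       # assumed setting: retries allowed after the first try
    base_delay: float = 1.0,    # assumed setting: seconds before the first retry
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on RateLimited with exponentially growing delays."""
    attempt = 0
    while True:
        try:
            return func()
        except RateLimited:
            if attempt >= max_retries:
                # Retry budget exhausted: let the caller see the failure.
                raise
            sleep(base_delay * 2 ** attempt)
            attempt += 1
```

With the defaults, a persistently rate-limited call is tried four times in total, waiting 1, 2, and 4 seconds between attempts.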

Before running against any sizeable number of files, we probably want to let IA know that we are resuming uploads, experimentally.

To queue uploads, in the Django shell, run:

from perma.tasks import upload_missing_files_to_internet_archive
upload_missing_files_to_internet_archive(limit)

where limit is the maximum number of links you want to enqueue.
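
For readers unfamiliar with the pattern, a cap like `limit` can be sketched as below. This only illustrates the idea and is not the body of `upload_missing_files_to_internet_archive`; `enqueue_up_to`, `candidates`, and `enqueue` are hypothetical names.

```python
from itertools import islice
from typing import Callable, Iterable, Optional


def enqueue_up_to(
    candidates: Iterable[str],        # hypothetical: identifiers of links missing from IA
    enqueue: Callable[[str], None],   # hypothetical: queues one upload task
    limit: Optional[int],
) -> int:
    """Queue at most `limit` candidates and return how many were queued."""
    queued = 0
    # islice stops after `limit` items; a limit of None means no cap.
    for guid in islice(candidates, limit):
        enqueue(guid)
        queued += 1
    return queued
```

Starting with a small `limit` fits the experimental intent described above: a handful of uploads can be observed before a larger batch is queued.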

codecov[bot] commented 1 year ago

Codecov Report

Merging #3238 (e1e075f) into develop (c632594) will decrease coverage by 2.69%. The diff coverage is 9.54%.

@@             Coverage Diff             @@
##           develop    #3238      +/-   ##
===========================================
- Coverage    81.91%   79.22%   -2.70%     
===========================================
  Files           52       53       +1     
  Lines         5918     6143     +225     
===========================================
+ Hits          4848     4867      +19     
- Misses        1070     1276     +206     
| Impacted Files | Coverage Δ |
|---|---|
| perma_web/perma/tasks.py | 53.28% <6.70%> (-7.17%) ↓ |
| perma_web/perma/utils.py | 66.33% <10.44%> (-10.93%) ↓ |
| perma_web/perma/models.py | 90.91% <33.33%> (+<0.01%) ↑ |
| perma_web/perma/views/user_management.py | 95.18% <50.00%> (+0.69%) ↑ |
| perma_web/perma/admin.py | 88.10% <60.00%> (-0.42%) ↓ |
| perma_web/urls.py | 100.00% <0.00%> (ø) |