scientist-softserv / ams

Archival Management System to support the American Archive of Public Broadcasting
GNU General Public License v3.0

Spike: Bulkrax vs Hyrax Batch Ingest Estimate #69

Closed ShanaLMoore closed 1 year ago

ShanaLMoore commented 1 year ago

Summary

This ticket is to explore the current state of things.

ref convo/thread: https://assaydepot.slack.com/archives/C0313NKG2DA/p1692113998641919

Additional Information

https://assaydepot.slack.com/archives/C0313NK6HB6/p1692133959198219 https://github.com/scientist-softserv/dev-ops/issues/729 https://drive.google.com/drive/folders/1yOx1jW4WBXjk3_zAI5CHfen-PDkXoXV9 https://assaydepot.slack.com/archives/C030UPFFP2A/p1689857107484079?thread_ts=1689782401.682349&cid=C030UPFFP2A

Huddle NOTES:

- Did we implement export for GBH? This would require a Bulkrax upgrade.
- The link is disabled in their UI.
- Miranda doesn't use Bulkrax at all.
- Look in the current version of Bulkrax to see if search-based export functionality exists.
- Spin up GBH and see if there's anything in the UI that hints at this functionality.
- Get an estimate of how much it will take to implement this with Bulkrax.
- Create a batch ingest export with the found set. Import it.
- Ask Tim or Miranda how this currently works. If it's not broken, we may proceed with keeping batch ingest.
- Example use case: all interviews by Jamie Oliver. Be able to search for Jamie Oliver and export the results.
- Export a collection of work IDs after we figure out which collection of work IDs we need.

DOCS

https://drive.google.com/drive/folders/1pSELhG57A7S4Cy1YiARHn0NwfVE0qrNj

ShanaLMoore commented 1 year ago

Bulkrax

LOCALLY

Locally, AMS currently has UI access to Bulkrax Import and Export:

# Import

![image](https://github.com/scientist-softserv/ams/assets/10081604/8c0ca10f-0a48-4aa4-b4cf-b34a4317a788)

# Export

![image](https://github.com/scientist-softserv/ams/assets/10081604/4115305a-e156-4661-8321-229edd1ab35f)

https://ams2-demo.wgbh-mla.org/exporters/new?locale=en

Successfully able to import a CSV:

![image](https://github.com/scientist-softserv/ams/assets/10081604/b047dc42-b304-41eb-bb34-72d7bc77dd0d)

Test file: [stewart_donald.csv](https://github.com/scientist-softserv/ams/files/12346601/stewart_donald.csv)

STAGING

:warning: In staging, it looks like we've hidden the export link; however, the user can still access it from the UI: :warning:

- It's probably safe to assume that exporting is not actively being used via Bulkrax.
- Export by importer from the example above results in a 504 Gateway Timeout error.
- No Sentry errors reported.

![image](https://github.com/scientist-softserv/ams/assets/10081604/a6cb42e7-35d4-4e50-b321-90058f3fc66c)

ShanaLMoore commented 1 year ago

hyrax batch ingest

met w Drew:

If Bulkrax is enabled, a user gets redirected to the bulkrax importers page when hitting /batches endpoint.

ref: https://github.com/scientist-softserv/ams/blob/develop/app/controllers/hyrax/batch_ingest/batches_controller_decorator.rb#L27-L34
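For readers without the repo open, the redirect behavior described above can be sketched roughly as follows. This is an illustration only, not the code from the linked decorator: the method name `batches_destination`, the `bulkrax_enabled` flag, and the `/importers` path are assumptions made for the example.

```ruby
# Hypothetical sketch of the /batches redirect described in this comment.
# It is NOT the actual decorator from the AMS repository.

# Returns the path a request to /batches would end up on.
def batches_destination(bulkrax_enabled:)
  if bulkrax_enabled
    "/importers" # assumed path of the Bulkrax importers page
  else
    "/batches"   # the hyrax-batch_ingest page is served as usual
  end
end

puts batches_destination(bulkrax_enabled: true)
puts batches_destination(bulkrax_enabled: false)
```

Disabling the decorator locally corresponds to always taking the second branch.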

I temporarily disabled this overridden controller locally to test out the batch ingest w the upgrades ...

Batches is available in the sidebar, but it is hidden behind the Bulkrax env setting.

search bar => facets > ability to export

Screenshot 2023-08-15 at 10 05 24 AM

if it's small enough it'll download immediately. if not, you'll get an email w a link to an s3 bucket.
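A minimal sketch of that delivery decision, assuming the cutoff is a simple result count. The constant `INLINE_EXPORT_LIMIT`, its value, and the method name are hypothetical; the real threshold was not stated in the meeting.

```ruby
# Hypothetical cutoff; the real limit used by the export feature is not
# given in this thread.
INLINE_EXPORT_LIMIT = 100

# Decides how an export would be delivered for a given number of results.
def export_delivery(result_count, limit: INLINE_EXPORT_LIMIT)
  if result_count <= limit
    :download_now   # small export: the file downloads immediately
  else
    :email_s3_link  # large export: user is emailed a link to an S3 bucket
  end
end

puts export_delivery(25)
puts export_delivery(5_000)
```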

image

image

Need to change the format.

/batches => upload choosing corresponds to csv used to upload assets

image

Bulkrax is only used for PBCore ingest from AMS.

If there are major changes to how the actor stack works, Drew expects we'll have problems.

batch level ingest config:

Reader step: parsing and maps. You'll have a batch record in the database and a bunch of batch items attached to it. The batch items in the batch item table correspond to Sidekiq jobs.
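To make the batch / batch item relationship concrete, here is a small in-memory sketch. The struct names, fields, and statuses are assumptions for illustration and do not mirror the actual hyrax-batch_ingest schema; in the real system each item would be handed to a Sidekiq job rather than held in an array.

```ruby
# Hypothetical stand-ins for the batch and batch item database records.
Batch     = Struct.new(:id, :status, :items)
BatchItem = Struct.new(:id, :batch_id, :source_row, :status)

# Builds one batch with one item per parsed row (the "reader" step output).
def build_batch(id, rows)
  items = rows.each_with_index.map do |row, index|
    BatchItem.new(index + 1, id, row, :enqueued)
  end
  Batch.new(id, :running, items)
end

# Counts items per status, similar to what a batch status page would show.
def batch_summary(batch)
  batch.items.group_by(&:status).transform_values(&:count)
end

batch = build_batch(1, ["row one", "row two", "row three"])
batch.items.first.status = :completed # pretend the first job finished
p batch_summary(batch)
```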

image

I am not seeing code-level evidence that search-based export is supported.

ShanaLMoore commented 1 year ago

blocked by UI issue.

I'm unable to go through the workflow Drew demonstrated today because when I perform an empty search, the export buttons do not render. This will likely be blocked until the Bootstrap 4 upgrade is complete (#32).

image
ShanaLMoore commented 1 year ago

However, I was able to test the batch ingest and it appears to be working despite the upgrades.

image

associated error seems to be a result of bad data:

Error:
invalid source file submitted: /tmp/RackMultipart20230815-1921-11xzc97.csv <br>Unknown column `` Unable to parse CSV.<br>["/app/samvera/hyrax-webapp/app/services/aapb/batch_ingest/csv_reader.rb:34:in `block in validate_csv_header'", "/app/samvera/hyrax-webapp/app/services/aapb/batch_ingest/csv_reader.rb:33:in `each'", "/app/samvera
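One plausible way to get an empty column name like the one in this message is a trailing comma (or otherwise blank cell) in the CSV header row. The sketch below reproduces that with Ruby's standard `csv` library. It borrows the method name `validate_csv_header` from the stack trace but is not the application's implementation, and `ALLOWED_HEADERS` is a made-up list.

```ruby
require "csv"

# Hypothetical list of accepted column names; the real list lives in the app.
ALLOWED_HEADERS = %w[id title description].freeze

# Checks every cell of the first CSV row against the allowed column names.
# A blank header cell is parsed as nil, which becomes the empty name "".
def validate_csv_header(csv_text, allowed: ALLOWED_HEADERS)
  headers = CSV.parse_line(csv_text) || []
  headers.each do |header|
    name = header.to_s.strip
    raise ArgumentError, "Unknown column `#{name}`" unless allowed.include?(name)
  end
  headers
end

good = "id,title\n1,Example\n"
bad  = "id,title,\n1,Example,\n" # trailing comma adds a blank header cell

p validate_csv_header(good)
begin
  validate_csv_header(bad)
rescue ArgumentError => e
  puts e.message
end
```

If the client's file has a blank header cell, removing it (or naming the column) would be the first thing to try.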

Bulkrax appears to be working. I tested CSV and XML imports. I just need to get appropriate XML/manifest data from the client to test its parser.

image

image

Overall, it looks like both hyrax-batch_ingest and Bulkrax are working with the upgrades. At this point it may be best to let the client continue using their current workflow, because the lift to implement parity in Bulkrax would be big. I do recommend moving to Bulkrax eventually, so that the client can easily stay current with Bulkrax and community standards.

Now I will dig into the pipeline to see if I can get our hyrax-batch-ingest branch merged. Unless they tell us to do this, Rob recommends that Drew and his team take this on.

ShanaLMoore commented 1 year ago

Update:

TL;DR: they should keep using batch ingest after all. It still appears to be working, but I would like them to test and confirm once we provide an environment for them.

My current bulkrax estimate is at least a 13 but it also may be because I don't totally understand it. They have 6 different parsers for every import use case - would we need to do that for bulkrax? I would need more time to study them and note their differences.

The search-to-export functionality on its own should be smaller and doable, but I don't see evidence that we've done this before. Perhaps we put it into an individual project (we should ask the team), but I don't see it in Bulkrax proper.

Should I spend more time breaking this down or can we move on and save it for another time?

related tickets:

  1. https://github.com/scientist-softserv/ams/issues/58
  2. https://github.com/scientist-softserv/ams/issues/57
  3. https://github.com/scientist-softserv/ams/issues/69
  4. hyrax batch ingest docs
ShanaLMoore commented 1 year ago

batch ingest Export works and produced this CSV:

export-assets-2023-08-16_154905.csv

ShanaLMoore commented 1 year ago

From what I can tell so far, upgrading hyrax-batch_ingest's dependencies doesn't seem to negatively affect its functionality.

I would like to have the clients test and confirm this, but I think it's OK and they should continue using it for now as the lift to make parity for bulkrax would be much larger.

Also to add per Rob, getting our https://github.com/samvera-labs/hyrax-batch_ingest/pull/152 merged into main should fall on Drew and his team since main's pipeline is broken (unless they want our devs to spend time trying to resolve this).

cc @jillpe