scientist-softserv / adventist_knapsack

Apache License 2.0

CSV import not splitting multiple values on ; or | #333

Closed: KatharineV closed this issue 7 months ago

KatharineV commented 1 year ago

During early bug fixes for Bulkrax CSV imports, I know that the team fixed our instance so that the subject field would import and split multiple values on the semicolon. I have also tested multiple file upload via CSV, and the multiple files are recognized with the semicolon delimiter. However, today in some tests on staging I see works with multiple authors are not splitting this field on either the semicolon or the pipe (which I tried just to see). I don't know if we ever tested the author field before, so I can't say if this is just a problem with that field or a larger issue. Thanks for looking into it and helping us!

Example with pipe: https://sdapi.s2.adventistdigitallibrary.org/concern/journal_articles/record_spd_2023_07_15_8a_1_1_1_1?locale=en
Importer: https://sdapi.s2.adventistdigitallibrary.org/importers/64?locale=en

Example with semicolon: https://sdapi.s2.adventistdigitallibrary.org/concern/journal_articles/recordx_spd_2023_07_15_5_kellyville_church_celebrates_130_years?locale=en
Importer: https://sdapi.s2.adventistdigitallibrary.org/importers/63?locale=en

Testing Instructions

NOTE: This is not a knapsack ticket, so it should be tested on adventist proper's staging.

A user should be able to import a CSV with multi-value cells. Multi-value means the cell could look like A; B; C or A | B | C. After that field is saved, the work's show page should display each value separately.

NOTE: This is only applicable to properties/fields that have a 'split' set in the bulkrax.rb configuration. Additionally, not every property is displayed on the show page, so verify that the field is supposed to be displayed before marking the ticket for rework.
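
As a rough illustration of the behavior being tested, the sketch below splits a cell on `;` or `|` in plain Ruby. The helper name `split_cell` and the delimiter pattern are assumptions made for this example. This is not Bulkrax's implementation; the real behavior depends on the 'split' setting for each field in bulkrax.rb.

```ruby
# Hypothetical helper, not part of Bulkrax: splits one CSV cell into
# multiple values on ';' or '|' and drops surrounding whitespace.
# The delimiter pattern is an assumption for illustration only.
ASSUMED_DELIMITERS = /\s*[;|]\s*/

def split_cell(cell)
  # nil-safe: a missing cell becomes an empty list of values
  cell.to_s.split(ASSUMED_DELIMITERS).map(&:strip).reject(&:empty?)
end

split_cell('A; B; C')   # => ["A", "B", "C"]
split_cell('A | B | C') # => ["A", "B", "C"]
```

A field without a split setting would instead be saved as the single string "A; B; C", which is the symptom reported above.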

Sample CSV: Mission_Spotlight_bulk_upload_NewCaledonia.csv

KatharineV commented 1 year ago

Hi team, I just ran a CSV import today with ; as the delimiter for two fields. The work type was Generic. The split worked for Subjects but it did not split multiple entries for the creator field.

Importer: https://adl.b2.adventistdigitallibrary.org/importers/109?locale=en

The CSV I used is attached.

Mission_Spotlight_bulk_upload_NewCaledonia.csv

kirkkwang commented 1 year ago

@KatharineV

Here is the CSV parser configuration. Every field marked with split: ';' will split on the ; delimiter. Currently, the creator field does not have that set. We can definitely add the split for creator, but while I'm in here, which other fields should we put the split on?

    config.field_mappings['Bulkrax::CsvParser'] = {
        'abstract' => { from:  ['description.abstract'] },
        'aark_id' => { from:  ['identifier.ark'] },
        'identifier' => { from:  ['identifier'], source_identifier: true },
        'bibliographic_citation' => { from:  ['identifier.bibliographicCitation'] },
        'creator' => { from:  ['creator'] },
        'contributor' => { from:  ['contributor'] },
        'edition' => { from:  ['title.release'] },
        'alternative_title' => { from:  ['title.alternative'] },
        'resource_type' => { from:  ['type'] },
        'issue_number' => { from:  ['relation.isPartOfIssue'] },
        'language' => { from:  ['language'] },
        'description' => { from:  ['description'] },
        'pagination' => { from:  ['format.extent'] },
        'extent' => { from:  ['format.extent'], split: ';' },
        'source' => { from:  ['source'] },
        'date_issued' => { from:  ['date'] },
        'alt' => { from:  ['coverage.spatial'] },
        'publisher' => { from:  ['publisher'], split: ';' },
        'rights_statement' => { from:  ['rights'] },
        'part_of' => { from:  ['relation.isPartOf'] },
        'part' => { from:  ['relation.isPartOf'] },
        'date_created' => { from:  ['date.other'] },
        'title' => { from:  ['title'] },
        'subject' => { from:  ['subject'], split: ';' },
        'volume_number' => { from:  ['relation.isPartOfVolume'] },
        'keyword' => { from: ['keyword'], split: ';' },
        'location' => { from: ['location'], split: ';' },
        'model' => { from: ['work_type'] },
        'remote_files' => { from: ['related_url'], split: ';', parsed: true },
        'remote_url' => { from: ['official_url', 'remote_url'], split: ';' },
        'thumbnail_url' => { from: ['thumbnail_url'], default_thumbnail: true, parsed: true },
        'video_embed' => { from: ['video_embed'] },
        'refereed' => { from: ['peer_reviewed'] }
    }
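
One quick way to see which fields currently split is to partition the mapping by the presence of a `split` key. The sketch below uses a small excerpt of the hash above; the variable names `mappings`, `with_split`, and `without_split` are made up for this example.

```ruby
# Illustrative excerpt of the mappings above; variable names are hypothetical.
mappings = {
  'creator'   => { from: ['creator'] },
  'publisher' => { from: ['publisher'], split: ';' },
  'subject'   => { from: ['subject'], split: ';' },
  'title'     => { from: ['title'] }
}

# Partition field names by whether a split delimiter is configured.
with_split, without_split = mappings.keys.partition do |field|
  mappings[field].key?(:split)
end

puts "splits on a delimiter: #{with_split.join(', ')}"
puts "stored as one value:   #{without_split.join(', ')}"
```

With this excerpt, creator lands in the second group, which matches what the import showed.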

In my local dev environment, I set the split for creator and this is how it looks:

image
KatharineV commented 1 year ago

Kirk, the example above looks perfect and like what I expected. Thanks for adding the split delimiter on additional fields. I assume we could use enumerated headers for any of the fields? But adding the ; delimiter will let us import multiple values a second way? Please correct me if I'm wrong.

I tried to imagine every possible scenario where we'd need to import multiple values for a field. These are the fields to add the semicolon delimiter to:

Creator, contributor, language, description, source, part_of
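
For reference, a minimal sketch of what those six entries might look like with the split added. The `from:` values are copied from the configuration quoted earlier in this thread; the variable name `requested_split_mappings` is hypothetical, and how the entries get merged into `config.field_mappings['Bulkrax::CsvParser']` is left to the actual change.

```ruby
# Hypothetical sketch: the six requested fields with split: ';' added.
# The from: values come from the mappings quoted earlier in the thread.
requested_split_mappings = {
  'creator'     => { from: ['creator'], split: ';' },
  'contributor' => { from: ['contributor'], split: ';' },
  'language'    => { from: ['language'], split: ';' },
  'description' => { from: ['description'], split: ';' },
  'source'      => { from: ['source'], split: ';' },
  'part_of'     => { from: ['relation.isPartOf'], split: ';' }
}

# Sanity check: every requested field now carries the semicolon delimiter.
missing = requested_split_mappings.reject { |_field, opts| opts[:split] == ';' }
puts missing.empty? ? 'all requested fields split on ;' : "missing split: #{missing.keys.join(', ')}"
```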

Thanks!

ShanaLMoore commented 1 year ago

@kirkkwang Would you have a sample CSV you could attach to the testing instructions of this ticket? Also, can you confirm whether this work has also been done for adv knapsack?

jillpe commented 1 year ago

SoftServ QA: ✅ Pass!

Import URL Work URL

https://github.com/scientist-softserv/adventist-dl/assets/84697174/ff711feb-514c-4144-861e-50308d7f5657

KatharineV commented 1 year ago

Tested this on ADL staging (importer here) and it worked as expected. Thank you!