mjordan / islandora_workbench

A command-line tool for managing content in an Islandora 2 repository
MIT License

Provide option to copy input CSV fields into output CSV #336

Closed mjordan closed 2 years ago

mjordan commented 2 years ago

From a conversation in Slack with @dmer and his colleague Margaret Youngberg: it would be useful to include the field values from the input CSV in the output CSV created using the output_csv option.

mjordan commented 2 years ago

@dmer can you test the issue-336 branch? You will need to add the following two options to your YAML config file:

output_csv: /path/to/your/output.csv
output_csv_include_input_csv: true

If the output_csv_include_input_csv option is omitted, the current behaviour stays in place (no copy of the input CSV data). To append the input CSV columns to the output CSV, include this option with a true value.

Note that this only works so far for non-paged content; we'll add it to paged content once we finalize how this works.
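
To make the intended behaviour concrete, here is a minimal Python sketch of appending input CSV columns to an output row. It is not the code in the issue-336 branch: build_output_row, write_rows, and the node_id column are hypothetical names used only for illustration, and the sketch assumes values returned from Drupal should win when a column name appears in both places.

```python
import csv
import io


def build_output_row(node_fields, input_row, include_input_csv=False):
    """Return the row to write to the output CSV for one created node."""
    output_row = dict(node_fields)
    if include_input_csv:
        for column, value in input_row.items():
            # Assumption: keep the value that came back from Drupal if the
            # same column name exists in both dictionaries.
            output_row.setdefault(column, value)
    return output_row


def write_rows(rows):
    """Serialize rows to CSV text, using the union of their columns as headers."""
    fieldnames = []
    for row in rows:
        for column in row:
            if column not in fieldnames:
                fieldnames.append(column)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical values: node_id stands in for whatever Drupal returns.
node_fields = {"node_id": "4027", "title": "California Building"}
input_row = {"id": "1", "title": "California Building", "field_model": "Compound Object"}

print(write_rows([build_output_row(node_fields, input_row, include_input_csv=True)]))
```

With include_input_csv left at its default of False, the sketch returns only the node fields, which mirrors the option being omitted from the config file.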

dmer commented 2 years ago

@mjordan does this issue branch require the updated Drupal module? I don't want to test this on my client or demo site if it's going to blow up my taxonomy lists! If it does, I'll look into getting that module upgraded so I can test it asap. Thanks!

mjordan commented 2 years ago

Sorry about the poor testing instructions. No, the issue-336 branch doesn't require the recently updated Drupal module, but it also doesn't include the performance optimizations from the issue-312 branch. It will be safe to test the issue-336 branch if you haven't updated the Drupal module in a while.

The update to the Drupal module I did over the weekend (and tagged as 1.0.0) doesn't remove the View that worked pre issue-312, but does include the View required by issue-312. So it is backwards compatible with changes made in issue-312.

TLDR - go ahead and test issue-336 if you haven't updated the Drupal module in a while. Changes in that branch are unrelated to those in issue-312.

dmer commented 2 years ago

@mjordan I can report partial success. I have a sample CSV w/ 3 records that I tested in MAIN and it ingests fine. I switched to the branch issue-336 and ran the same ingest - only one record was created. It was successfully created and the output CSV did have all of the fields and looks correct - again only for the first record.

Nothing got logged to the workbench.log, but I did get this output on the command line when I ran the ingest:

OK, connection to Drupal at https://digital-staging.wolfsonian.org verified.
Node for "California Building : Panama-California Exposition, San Diego, Cal. 1915." (record 1) created at https://digital-staging.wolfsonian.org/node/4027.
Traceback (most recent call last):
  File "./workbench", line 1072, in <module>
    create()
  File "./workbench", line 397, in create
    print('+ No file specified in CSV for ' + row['title'])
KeyError: 'title'

Running the same with --check while still in the issue-336 branch gave the following output:

OK, configuration file has all required values (did not check for optional values).
OK, CSV file input_data/Lib-test-IssueBranch336.csv found.
OK, all 3 rows in the CSV file have the same number of columns as there are headers (29).
Output CSV already exists at Lib-test-IssueBranch336_output.csv, records will be appended to it.
OK, CSV column headers match Drupal field names.
OK, required Drupal fields are present in the CSV file.
OK, EDTF field values in the CSV file validate.
OK, term IDs/names in CSV file exist in their respective taxonomies.
Warning: Issues detected with validating taxonomy field values in the CSV file. See the log for more detail.
OK, term IDs/names used in typed relation fields in the CSV file exist in their respective taxonomies.
Warning: Issues detected with validating typed relation field values in the CSV file. See the log for more detail.
OK, files named in the CSV "file" column are all present; the "allow_missing_files" option is enabled and empty "file" values exist.
Configuration and input data appear to be valid.

Here's the input CSV file contents:

id,field_resource_type,field_model,parent_id,field_weight,field_member_of,file,media_use_tid,field_display_hints,field_identifier,title,field_alternative_title,field_place_published,field_date_text,field_edtf_date_issued,field_extent,field_height,field_width,field_depth,field_description_long,field_language,field_collection_note,field_genre,field_subjects_name,field_subject,field_subject_pictured,field_temporal_subject,field_geographic_subject,field_linked_agent
1,Collection,Compound Object,,,3141,,,,XB2000.45.178,"California Building : Panama-California Exposition, San Diego, Cal. 1915.",,San Diego,c1914.,1914~,9 x 14 cm. 1 postcard : |color illustrations ,9,14,,,English,,Postcards,,"subject:Exhibitions|subject:Exhibition buildings|subject:Architecture, Baroque|subject:Pavilions|subject:Spectators",,20th century|1910-1920,United States|California|San Diego|UnitedStates,
2,Collection,Compound Object,,,3141,,,,XB1990.1973,Les trente-six vues de la Tour Eiffel,,Paris,1888-1902.,1888/1902,"24 x 30 cm. [12] pages, [36] leaves of plates, [3] pages : |chiefly color illustrations ",24,30,,,French,,Books,,"subject:Tour Eiffel (Paris, France) in art|subject:Eiffel Tower (Paris, France) in art|subject:Artists' books|subject:Printing",subject:Specimens,,France,"relators:pbl:corporate_body:Eugène Verneau|relators:oth:person:Alexandre, Arsène|relators:tyg:person:Auriol, George|relators:oth:person:Baron, Charles"
3,Collection,Compound Object,,,3141,,,,XB1990.1060,History of the World's Fair : being a complete and authentic description of the Columbian Exposition from its inception,,"Philadelphia, Pa.",c1893.,1893~,"27 x 21 cm. 610 pages : |illustrations, portraits ",27,21,,,English,,Books,,subject:Exhibitions|subject:Exhibition buildings|subject:Exhibition grounds|subject:Demographic surveys,,19th century|1893~,United States|Illinois|Chicago,"relators:pbl:corporate_body:Syndicate Publishing Co.|relators:oth:person:Davis, Geo. R."
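
The column-count line in the --check output can be reproduced independently of Workbench. The following is a rough sketch using Python's standard csv module, not Workbench's own validation code; find_bad_rows is a hypothetical helper, and the three-column sample is a cut-down illustration whose second record is deliberately malformed (it is not taken from the file above). Using a real CSV parser matters here because titles such as the first one contain commas inside quotes.

```python
import csv
import io


def find_bad_rows(csv_text):
    """Return (header_count, [(row_number, column_count), ...]) for mismatched rows."""
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)
    bad_rows = []
    for row_number, row in enumerate(reader, start=1):
        # A quoted field counts as one column even if it contains commas.
        if len(row) != len(headers):
            bad_rows.append((row_number, len(row)))
    return len(headers), bad_rows


# Illustrative sample: record 2 has one value too many on purpose.
sample = (
    "id,title,field_place_published\n"
    '1,"California Building : Panama-California Exposition, San Diego, Cal. 1915.",San Diego\n'
    "2,Les trente-six vues de la Tour Eiffel,Paris,unexpected extra value\n"
)

header_count, bad_rows = find_bad_rows(sample)
print(header_count, bad_rows)  # prints: 3 [(2, 4)]
```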

mjordan commented 2 years ago

Sorry, I hadn't tested with the allow_missing_files: true option. I've pushed a fix to the issue-336 branch if you want to try again.
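
For anyone reading the traceback: the KeyError means the row object that reached line 397 had no 'title' key when the "no file" message was printed. The snippet below only illustrates that failure mode and one defensive pattern in plain Python. It is not the change pushed to the issue-336 branch, and missing_file_message is a hypothetical helper.

```python
def missing_file_message(row):
    """Build the 'no file' notice without assuming 'title' is present in row."""
    # dict.get returns the fallback instead of raising KeyError.
    title = row.get("title", "(title not available)")
    return "+ No file specified in CSV for " + title


complete_row = {"id": "1", "title": "California Building", "file": ""}
row_without_title = {"id": "1", "file": ""}

print(missing_file_message(complete_row))
print(missing_file_message(row_without_title))
```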

dmer commented 2 years ago

@mjordan I'm still getting the same result

Node for "California Building : Panama-California Exposition, San Diego, Cal. 1915." (record 1) created at https://digital-staging.wolfsonian.org/node/4028.
Traceback (most recent call last):
  File "./workbench", line 1072, in <module>
    create()
  File "./workbench", line 397, in create
    print('+ No file specified in CSV for ' + row['title'])
KeyError: 'title'

Here's the create.yml I'm using, minus the repository info:

input_dir: input_data
input_csv: 'Lib-test-IssueBranch336.csv'
output_csv: 'Lib-test-IssueBranch336_output.csv'
output_csv_include_input_csv: true
allow_missing_files: true
allow_adding_terms: true
media_use: "'Thumbnail Image'|'Original File'|'Service File'"
log_json: true
published: true

mjordan commented 2 years ago

Can you confirm you pulled in the most recent changes to the issue-336 branch? The change is there: https://github.com/mjordan/islandora_workbench/commit/cf250ea3aefcc713cb0e78d989b08aeec147bdf3 .

dmer commented 2 years ago

Works! I repeated my test w/ a known good ingest - ran it through w/ the issue-336 branch and my output.csv has all of the fields from the input csv. This will be very helpful in working w/ this large collection - thanks @mjordan !

mjordan commented 2 years ago

Great, I'll merge that into main now. It should not conflict with some other branches I'm working on.

mjordan commented 2 years ago

Will also update the docs.

dmer commented 2 years ago

Thanks Mark!

mjordan commented 2 years ago

Thanks for suggesting this feature, I am sure others will find it useful.