Can no longer run a "create" task with additional media files after updating to latest Workbench

dara2 commented 4 months ago

I updated to the latest version of Workbench. My CSV includes 4 columns for additional media files, called PDF, mediatrack, HOCR, and transcript. These are defined in my YML with their Media Use term IDs:

additional_files:
 - HOCR: 49 
 - PDF: 51
 - transcript: 9
 - mediatrack: 50
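
For readers unfamiliar with this setting, here is a minimal sketch of how a list like the one above maps CSV column names to Media Use term IDs. It assumes a YAML parser returns the setting as a list of single-key dictionaries. The function name `media_use_tids_by_column` and the sample header list are hypothetical and are not taken from Workbench's code.

```python
# Assumed shape of the parsed "additional_files" setting: a list of
# single-key dicts, one per CSV column (values are Media Use term IDs).
additional_files = [{"HOCR": 49}, {"PDF": 51}, {"transcript": 9}, {"mediatrack": 50}]

# Hypothetical CSV header row mixing ordinary columns with the four
# additional-file columns named in the config.
csv_headers = ["id", "title", "file", "PDF", "mediatrack", "HOCR", "transcript", "field_model"]


def media_use_tids_by_column(additional_files):
    """Flatten the list of single-key dicts into one column -> term ID dict."""
    mapping = {}
    for entry in additional_files:
        for column, tid in entry.items():
            mapping[column] = tid
    return mapping


column_to_tid = media_use_tids_by_column(additional_files)

# Columns listed in "additional_files" hold file paths, not Drupal field
# values, so they need to be separated from the rest of the header row.
additional_columns = [h for h in csv_headers if h in column_to_tid]
remaining_columns = [h for h in csv_headers if h not in column_to_tid]
```

Any code that treats every CSV header as a Drupal field name would need to work from something like `remaining_columns` rather than the full header row.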

I ran --check and everything validated. Then when I tried to run the ingest, it errored immediately, and from the error it seems that it still thinks these column headers should match Drupal field names:

islandora@rcl-isle-up01:/opt/islandora_workbench$ ./workbench --config Batch_1.yml
OK, connection to Drupal at https://dc-i2-prod.lib.rochester.edu verified.
"Create" task started using config file Batch_1.yml.
Only node IDs for parents created during this session will be used (not using the CSV ID to node ID map).
Node for "Ellwanger and Barry Horticultural Prints" (record ur:5918) created at https://dc-i2-prod.lib.rochester.edu/node/707.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /opt/islandora_workbench/./workbench:3427 in <module>                        │
│                                                                              │
│   3424                                                                       │
│   3425 try:                                                                  │
│   3426 │   if config["task"] == "create":                                    │
│ ❱ 3427 │   │   create()                                                      │
│   3428 │   if config["task"] == "update":                                    │
│   3429 │   │   update()                                                      │
│   3430 │   if config["task"] == "delete":                                    │
│                                                                              │
│ /opt/islandora_workbench/./workbench:397 in create                           │
│                                                                              │
│    394 │   │   │   │   node_uri,                                             │
│    395 │   │   │   )                                                         │
│    396 │   │   │   if "output_csv" in config.keys():                         │
│ ❱  397 │   │   │   │   write_to_output_csv(config, id_field, node_response.t │
│    398 │   │   else:                                                         │
│    399 │   │   │   message = "Node for CSV record " + id_field + " not creat │
│    400 │   │   │   print("ERROR: " + message + ".")                          │
│                                                                              │
│ /opt/islandora_workbench/workbench_utils.py:7918 in write_to_output_csv      │
│                                                                              │
│    7915 │   │   │   │   │   config, field_definitions, field_name, node_dict │
│    7916 │   │   │   │   )                                                    │
│    7917 │   │   row.update(input_csv_row)                                    │
│ ❱  7918 │   writer.writerow(row)                                             │
│    7919 │   csvfile.close()                                                  │
│    7920                                                                      │
│    7921                                                                      │
│                                                                              │
│ /usr/lib/python3.10/csv.py:154 in writerow                                   │
│                                                                              │
│   151 │   │   return (rowdict.get(key, self.restval) for key in self.fieldna │
│   152 │                                                                      │
│   153 │   def writerow(self, rowdict):                                       │
│ ❱ 154 │   │   return self.writer.writerow(self._dict_to_list(rowdict))       │
│   155 │                                                                      │
│   156 │   def writerows(self, rowdicts):                                     │
│   157 │   │   return self.writer.writerows(map(self._dict_to_list, rowdicts) │
│                                                                              │
│ /usr/lib/python3.10/csv.py:149 in _dict_to_list                              │
│                                                                              │
│   146 │   │   if self.extrasaction == "raise":                               │
│   147 │   │   │   wrong_fields = rowdict.keys() - self.fieldnames            │
│   148 │   │   │   if wrong_fields:                                           │
│ ❱ 149 │   │   │   │   raise ValueError("dict contains fields not in fieldnam │
│   150 │   │   │   │   │   │   │   │    + ", ".join([repr(x) for x in wrong_f │
│   151 │   │   return (rowdict.get(key, self.restval) for key in self.fieldna │
│   152                                                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
ValueError: dict contains fields not in fieldnames: 'PDF', 'mediatrack', 'HOCR',
'transcript'
islandora@rcl-isle-up01:/opt/islandora_workbench$

dara2 commented 3 months ago

This is blocking our ability to run ingests. (I hope it's not anything I'm doing incorrectly in the yml file!) We're going to roll back to an October version: we know the September version was working, but we need the URL alias fix that was made in October.

mjordan commented 3 months ago

Do you also have output_csv in your config file?

dara2 commented 3 months ago

I have these:

output_csv: Batch_1_WithoutVideo-output.csv
output_csv_include_input_csv: true

These are the same settings I have always had in my config files. Did something change there?

mjordan commented 3 months ago

I don't know what happened. Additional files work fine without output_csv (and there are pretty complete integration tests for that), but the error is occurring in the code that writes out the CSV file. I will investigate.
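
The ValueError in the traceback comes from Python's `csv.DictWriter`, which by default raises when a row contains keys that are not in its `fieldnames`. The sketch below reproduces that behaviour with hypothetical data and shows two generic ways to avoid it. It is not Workbench's actual `write_to_output_csv` code, and the helper `write_row` is invented for illustration.

```python
import csv
import io

# Hypothetical data: the header list omits the additional-file columns
# that are present in the row, mirroring the ValueError in the traceback.
fieldnames = ["id", "title"]
row = {"id": "ur:5918", "title": "Ellwanger and Barry Horticultural Prints", "PDF": "", "HOCR": ""}


def write_row(fieldnames, row, extrasaction="raise"):
    """Write a header and one row to an in-memory CSV; return the text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction=extrasaction)
    writer.writeheader()
    writer.writerow(row)
    return buffer.getvalue()


# Default behaviour: keys missing from fieldnames raise ValueError.
try:
    write_row(fieldnames, row)
    raised = False
    message = ""
except ValueError as error:
    raised = True
    message = str(error)

# Option 1: extend the header list with any extra keys before writing.
extended = fieldnames + [key for key in row if key not in fieldnames]
with_extra_columns = write_row(extended, row)

# Option 2: tell DictWriter to silently drop keys it does not know about.
without_extra_columns = write_row(fieldnames, row, extrasaction="ignore")
```

Option 1 keeps the additional-file columns in the output CSV, while option 2 discards them. Which is appropriate depends on whether the output file is meant to echo the whole input row.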

mjordan commented 3 months ago

@dara2 please check out the updates to the main branch and test. Passes all tests on my end but let me know how it goes for you.

dara2 commented 3 months ago

Thanks, Mark! Checking now.

dara2 commented 3 months ago

Hi Mark - My ingest passed the --check, but then when I tried to run it I got a new error:

(awesome) Born-Digitals-MacBook-Air-4:islandora_workbench Dara$ ./workbench --config demoBDcreate.yml
OK, connection to Drupal at https://bd-i8-stage.born-digital.com verified.
"Create" task started using config file demoBDcreate.yml.
Only node IDs for parents created during this session will be used (not using the CSV ID to node ID map).
Node for "Library Bulletin" (record ur:786) created at https://bd-i8-stage.born-digital.com/node/1000003025.
- Media for "additional_files" CSV column "HOCR" in row with ID "ur:786" (node URL "https://bd-i8-stage.born-digital.com/node/1000003025") not created. See log for more information.
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│                                                                              │
│ /Users/Dara_1/islandora_workbench/./workbench:3432 in <module>               │
│                                                                              │
│   3429                                                                       │
│   3430 try:                                                                  │
│   3431 │   if config["task"] == "create":                                    │
│ ❱ 3432 │   │   create()                                                      │
│   3433 │   if config["task"] == "update":                                    │
│   3434 │   │   update()                                                      │
│   3435 │   if config["task"] == "delete":                                    │
│ /Users/Dara_1/islandora_workbench/./workbench:271 in create                  │
│                                                                              │
│    268 │   │   │   # Entity reference fields (taxonomy_term and node).       │
│    269 │   │   │   if field_definitions[custom_field]["field_type"] == "enti │
│    270 │   │   │   │   entity_reference_field = workbench_fields.EntityRefer │
│ ❱  271 │   │   │   │   node = entity_reference_field.create(                 │
│    272 │   │   │   │   │   config, field_definitions, node, row, custom_fiel │
│    273 │   │   │   │   )                                                     │
│    274                                                                       │
│                                                                              │
│ /Users/Dara_1/islandora_workbench/workbench_fields.py:759 in create          │
│                                                                              │
│    756 │   │   │   target_type = "media_type"                                │
│    757 │   │                                                                 │
│    758 │   │   field_values = []                                             │
│ ❱  759 │   │   subvalues = row[field_name].split(config["subdelimiter"])     │
│    760 │   │   subvalues = self.dedupe_values(subvalues)                     │
│    761 │   │   for subvalue in subvalues:                                    │
│    762 │   │   │   subvalue = str(subvalue)                                  │
╰──────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'int' object has no attribute 'split'
(awesome) Born-Digitals-MacBook-Air-4:islandora_workbench Dara$

Here's my config, if that helps:

host: "https://bd-i8-stage.born-digital.com/"
username: xxx
password: xxx
input_dir: i8demo_BD
input_csv: Newspapers_bd-base.csv
allow_missing_files: true
allow_adding_terms: true
perform_soft_checks: true
standalone_media_url: true
field_for_media_title: field_pid
field_for_remote_filename: field_pid
delete_tmp_upload: true
adaptive_pause: 2
adaptive_pause_threshold: 2.5 
log_term_creation: false
http_cache_storage: memory
http_cache_storage_expire_after: 600
# query_csv_id_to_node_id_map_for_parents: false
additional_files:
 - HOCR: 833

And here are the first few rows of my sheet:

id,field_pid,parent_id,field_weight,HOCR,field_model,title,field_linked_agent,field_identifier,field_resource_type,field_description_long,field_genre,field_language,field_edtf_date_issued,field_place_published,field_extent,field_physical_form,field_note,field_geographic_subject,field_subjects_name,field_subject,field_table_of_contents,field_alternative_title,field_display_hints,field_member_of,url_alias,file
ur:786,ur:786,,,,Newspaper,Library Bulletin,,ur:786,Collection,"The University of Rochester Library Bulletin was published on a regular, and later irregular, basis from 1945 until 1994. Its contributors were library staff and University of Rochester professors, as well as other scholars and researchers.",newspaper,,,,,,,,,,,,,,/islandora/object/ur:786,
ur:3217,ur:3217,ur:786,,,Publication Issue,"University of Rochester Library Bulletin, v. 6, no. 3",relators:cre:corporate_body:University of Rochester. Library (Creator),ur:3217,Collection,,Periodicals,English (eng),~1951-06,"Rochester, N.Y.",41-56 pages,electronic,,,,,,,,,/islandora/object/ur:3217,https://digitalcollections.lib.rochester.edu/islandora/object/ur:3217/datastream/PDF/view

dara2 commented 3 months ago

I see in the workbench.log file that it is treating the fact that there is no file in the HOCR column for an object as an ERROR:

01-Apr-24 18:32:54 - INFO - Using directory defined in the 'temp_dir' config setting (/var/folders/w9/hdsz9cws5b906b6g4spmglm80000gn/T) as the temporary directory (already exists).
01-Apr-24 18:32:55 - INFO - OK, connection to Drupal at https://bd-i8-stage.born-digital.com verified.
01-Apr-24 18:32:55 - INFO - OK, Islandora Workbench Integration module installed on https://bd-i8-stage.born-digital.com is at version 1.0.0.
01-Apr-24 18:32:55 - INFO - Client-side request caching is enabled.
01-Apr-24 18:32:56 - INFO - "Create" task started using config file demoBDcreate.yml.
01-Apr-24 18:32:56 - INFO - Writing rollback CSV to i8demo_BD/rollback.csv
01-Apr-24 18:32:56 - INFO - 'log_term_creation' configuration setting is False. Creation of new taxonomy terms will not be logged.
01-Apr-24 18:33:26 - WARNING - Only node IDs for parents created during this session will be used (not using the CSV ID to node ID map).
01-Apr-24 18:33:28 - INFO - Node for "Library Bulletin (record ur:786)" created at https://bd-i8-stage.born-digital.com/node/1000003025.
01-Apr-24 18:33:28 - WARNING - No media for https://bd-i8-stage.born-digital.com/node/1000003025 created since its "file" column in the input CSV (row with ID "ur:786") is empty.
01-Apr-24 18:33:28 - ERROR - Media for "additional_files" CSV column "HOCR" in row with ID "ur:786" (node URL "https://bd-i8-stage.born-digital.com/node/1000003025") not created because CSV field is empty.

mjordan commented 3 months ago

The ERROR issue is being addressed in #757. I haven't seen the 'int' object has no attribute 'split' exception before, so I'll need to closely inspect your sample CSV to figure that one out. Thanks for including it.
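
For context, the AttributeError arises whenever `.split()` is called on a value that is an integer rather than a string. The sketch below reproduces that with a hypothetical row and shows one defensive pattern: coercing the value to a string before splitting. The helper `split_subvalues`, the sample field values, and the `"|"` subdelimiter are assumptions for illustration. This is not necessarily how the problem was resolved in Workbench.

```python
def split_subvalues(value, subdelimiter="|"):
    """Split a CSV cell into subvalues, tolerating non-string input."""
    if value is None:
        return []
    text = str(value).strip()
    if text == "":
        return []
    return [part.strip() for part in text.split(subdelimiter)]


# Hypothetical row in which one field holds an int instead of a string.
row = {"field_member_of": 707, "field_genre": "newspaper|Periodicals"}

# Calling .split() directly on the int fails, as in the traceback.
try:
    row["field_member_of"].split("|")
    failed = False
except AttributeError:
    failed = True

# The defensive helper handles both the int and the ordinary string.
member_of = split_subvalues(row["field_member_of"])
genres = split_subvalues(row["field_genre"])
```

Coercing with `str()` at the point of use is the least invasive guard. Normalising every cell to a string when the row is first read would be an alternative.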

mjordan commented 3 months ago

@dara2 can you check out the issue_756 branch and test to see if the 'int' object has no attribute 'split' problem has been fixed? Please use the same CSV data that you were using when you discovered it.

dara2 commented 3 months ago

Confirmed that issue_756 branch works for me, with the same CSV data from before!

mjordan commented 3 months ago

Excellent, thanks, I'll merge into main now.

(Done with c2d6146f337ed93e45c020cf5f1c019092387a60).

mjordan commented 3 months ago

@dara2 since the ERROR issue is being worked on in another issue (#757) can I reclose this one?

dara2 commented 3 months ago

Yes, thanks!

mjordan / islandora_workbench

Can no longer run a "create" task with additional media files after updating to latest Workbench #756