sul-dlss / dlme-airflow

This repository captures the work related to the DLME ETL pipeline and sets up Apache Airflow.
Apache License 2.0

Harvard SCW failing harvest #271

Closed: jacobthill closed this issue 1 year ago

jacobthill commented 1 year ago

  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/operators/python.py", line 171, in execute
    return_value = self.execute_callable()
  File "/home/airflow/.local/lib/python3.9/site-packages/airflow/operators/python.py", line 189, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/dlme_airflow/harvester/source_harvester.py", line 17, in data_source_harvester
    dataframe_to_file(collection)
  File "/home/airflow/.local/lib/python3.9/site-packages/dlme_airflow/utils/dataframe.py", line 36, in dataframe_to_file
    source_df = collection.catalog.read().drop_duplicates(
  File "/home/airflow/.local/lib/python3.9/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 6122, in drop_duplicates
    duplicated = self.duplicated(subset, keep=keep)
  File "/home/airflow/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 6259, in duplicated
    labels, shape = map(list, zip(*map(f, vals)))
  File "/home/airflow/.local/lib/python3.9/site-packages/pandas/core/frame.py", line 6232, in f
    labels, shape = algorithms.factorize(vals, size_hint=len(self))
  File "/home/airflow/.local/lib/python3.9/site-packages/pandas/core/algorithms.py", line 763, in factorize
    codes, uniques = factorize_array(
  File "/home/airflow/.local/lib/python3.9/site-packages/pandas/core/algorithms.py", line 560, in factorize_array
    uniques, codes = table.factorize(
  File "pandas/_libs/hashtable_class_helper.pxi", line 5394, in pandas._libs.hashtable.PyObjectHashTable.factorize
  File "pandas/_libs/hashtable_class_helper.pxi", line 5310, in pandas._libs.hashtable.PyObjectHashTable._unique
TypeError: unhashable type: 'list'
[2022-09-27, 17:00:24 UTC] {taskinstance.py:1415} INFO - Marking task as FAILED. dag_id=harvard, task_id=HARVARD_ETL.scw_etl.harvard_scw_harvest, execution_date=20220926T170000, start_date=20220927T170021, end_date=20220927T170024
[2022-09-27, 17:00:24 UTC] {standard_task_runner.py:92} ERROR - Failed to execute job 4223 for task HARVARD_ETL.scw_etl.harvard_scw_harvest (unhashable type: 'list'; 951)
[2022-09-27, 17:00:24 UTC] {local_task_job.py:156} INFO - Task exited with return code 1
[2022-09-27, 17:00:24 UTC] {local_task_job.py:273} INFO - 0 downstream tasks scheduled from follow-on schedule check
```
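
The TypeError comes out of pandas' deduplication: `drop_duplicates` hashes every cell, and some of the harvested columns hold Python lists, which are unhashable. A minimal sketch outside of Airflow (not the harvester code itself) reproduces the same failure:

```python
import pandas as pd

# Columns whose cells are Python lists (as in the Harvard SCW records shown
# further down) cannot be hashed, so drop_duplicates raises the same
# TypeError seen in the task log.
df = pd.DataFrame(
    {
        "id": [["22535138", "8000738690_URN-3:FHCL:35350111"], ["22535139"]],
        "title": ["Triumph of Radha", "Another record"],
    }
)

df.drop_duplicates()  # TypeError: unhashable type: 'list'
```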
edsu commented 1 year ago

@jacobthill do you see this error when running `bin/get harvard scw`?

I noticed that it returns OK, but there are multiple values in the id column (a list rather than a single value), which could be problematic?

```
id,title,type,genre,place,start_date,end_date,physical_description,subject,physical_location,shelf_locator,record_identifier,language,cdwastyle,cdwaculture,shown_at,preview
"['22535138', '8000738690_URN-3:FHCL:35350111']","['Radha and Krishna (Kanoria Collection)', 'Triumph of Radha', 'Detail']","['still image', 'still image']","['paintings', 'color slide']","Kishangarh, Rājasthān, India",1770,1770,23 x 18.5 cm,"['lovers', 'Hinduism', 'Hindu gods', 'female heads']","Gopi Krishna Kanoria Collection, Patna, Bihār, India",SCW2016.15325,"['22535138', '8000738690_URN-3:FHCL:35350111']",zxx,Rajasthani,Indian,https://id.lib.harvard.edu/images/8000738690/urn-3:FHCL:35350111/catalog,https://nrs.harvard.edu/urn-3:FHCL:35350111?width=150&height=150&usethumb=y
```