sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0

airflow harvest converts numeric strings to floats causing transform errors #482

Closed jacobthill closed 4 months ago

jacobthill commented 4 months ago

In the LoC collection greek-and-armenian-patriarchates, the date field is getting converted to a float dtype, which raises an error when we try to parse it in traject. This does not happen when the collection is harvested with bin/get, and it does not show up in the CSV file produced by the Airflow harvest. To reproduce, run Airflow locally and compare the CSV and JSON files.

This is the first record in the JSON harvested by Airflow:

{
  "id": "http:\/\/www.loc.gov\/item\/00271073823-jo\/",
  "date": 1600000,
  "identifier": "Ethiopic: Reel 11",
  "language": ["amharic", "geez"],
  "format": ["manuscript\/mixed material"],
  "preview": ["https:\/\/tile.loc.gov\/image-services\/iiif\/service:amed:amedmonastery:00271073823-jo:0005\/full\/pct:6.25\/0\/default.jpg#h=175&w=259", "https:\/\/tile.loc.gov\/image-services\/iiif\/service:amed:amedmonastery:00271073823-jo:0005\/full\/pct:12.5\/0\/default.jpg#h=351&w=518", "https:\/\/tile.loc.gov\/image-services\/iiif\/service:amed:amedmonastery:00271073823-jo:0005\/full\/pct:25\/0\/default.jpg#h=702&w=1036", "https:\/\/tile.loc.gov\/image-services\/iiif\/service:amed:amedmonastery:00271073823-jo:0005\/full\/pct:50\/0\/default.jpg#h=1404&w=2072", "https:\/\/tile.loc.gov\/image-services\/iiif\/service:amed:amedmonastery:00271073823-jo:0005\/full\/pct:100\/0\/default.jpg#h=2808&w=4145"],
  "shown_at": "https:\/\/www.loc.gov\/item\/00271073823-jo\/",
  "subject": ["armenian church. erusaghēmi patriarkʻutʻiwn", "greek orthodox patriarchate of jerusalem", "manuscripts, greek", "jerusalem", "manuscripts", "manuscripts, armenian"],
  "title": "Ethiopic 11. Evangelion (John). 17th cent. 76 f. Pg. 1 illum. 11 ft.",
  "type": ["Manuscript"],
  "description": null
}

And this is the same record in the CSV:

http://www.loc.gov/item/00271073823-jo/,1600,Ethiopic: Reel 11,"['amharic', 'geez']",['manuscript/mixed material'],"['https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073823-jo:0005/full/pct:6.25/0/default.jpg#h=175&w=259', 'https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073823-jo:0005/full/pct:12.5/0/default.jpg#h=351&w=518', 'https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073823-jo:0005/full/pct:25/0/default.jpg#h=702&w=1036', 'https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073823-jo:0005/full/pct:50/0/default.jpg#h=1404&w=2072', 'https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073823-jo:0005/full/pct:100/0/default.jpg#h=2808&w=4145']",https://www.loc.gov/item/00271073823-jo/,"['armenian church. erusaghēmi patriarkʻutʻiwn', 'greek orthodox patriarchate of jerusalem', 'manuscripts, greek', 'jerusalem', 'manuscripts', 'manuscripts, armenian']",Ethiopic 11. Evangelion (John). 17th cent. 76 f. Pg. 1 illum. 11 ft.,['Manuscript'],

And this is how the JSON looks when harvested with bin/get:

{
  "id": "http:\/\/www.loc.gov\/item\/00271073823-jo\/",
  "date": "1600",
  "identifier": "Ethiopic: Reel 11",
  "language": ["amharic", "geez"],
  "format": ["manuscript\/mixed material"],
  "preview": ["https:\/\/tile.loc.gov\/image-services\/iiif\/service:amed:amedmonastery:00271073823-jo:0005\/full\/pct:6.25\/0\/default.jpg#h=175&w=259", "https:\/\/tile.loc.gov\/image-services\/iiif\/service:amed:amedmonastery:00271073823-jo:0005\/full\/pct:12.5\/0\/default.jpg#h=351&w=518", "https:\/\/tile.loc.gov\/image-services\/iiif\/service:amed:amedmonastery:00271073823-jo:0005\/full\/pct:25\/0\/default.jpg#h=702&w=1036", "https:\/\/tile.loc.gov\/image-services\/iiif\/service:amed:amedmonastery:00271073823-jo:0005\/full\/pct:50\/0\/default.jpg#h=1404&w=2072", "https:\/\/tile.loc.gov\/image-services\/iiif\/service:amed:amedmonastery:00271073823-jo:0005\/full\/pct:100\/0\/default.jpg#h=2808&w=4145"],
  "shown_at": "https:\/\/www.loc.gov\/item\/00271073823-jo\/",
  "subject": ["armenian church. erusaghēmi patriarkʻutʻiwn", "greek orthodox patriarchate of jerusalem", "manuscripts, greek", "jerusalem", "manuscripts", "manuscripts, armenian"],
  "title": "Ethiopic 11. Evangelion (John). 17th cent. 76 f. Pg. 1 illum. 11 ft.",
  "type": ["Manuscript"]
}

I'm not sure where the error creeps in, but ideally bin/get would use the same code as the Airflow DAG so that the two don't produce different results.
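To make the comparison above easier to repeat, here is a minimal sketch that reports fields whose JSON types differ between two harvested records. The function name `type_mismatches` and the trimmed-down sample records are made up for illustration; nothing here comes from the dlme-airflow codebase.

```python
import json


def type_mismatches(record_a: dict, record_b: dict) -> dict:
    """Return {field: (type_a, type_b)} for shared fields whose types differ."""
    mismatches = {}
    # Only compare fields that exist in both records.
    for field in record_a.keys() & record_b.keys():
        type_a = type(record_a[field]).__name__
        type_b = type(record_b[field]).__name__
        if type_a != type_b:
            mismatches[field] = (type_a, type_b)
    return mismatches


# Trimmed-down versions of the two records shown above (hypothetical input).
airflow_record = json.loads('{"id": "x", "date": 1600000, "description": null}')
bin_get_record = json.loads('{"id": "x", "date": "1600"}')

result = type_mismatches(airflow_record, bin_get_record)
print(result)  # {'date': ('int', 'str')}
```

Fields that appear in only one record (like `description` here) are ignored, so this only flags type changes, not missing keys.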

This ticket is related: https://github.com/sul-dlss/dlme-airflow/issues/436

edsu commented 4 months ago

When running the DAG, it at first appears that the transform_validation step is failing:

[2024-04-02, 15:21:42 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/airflow/dlme_airflow/tasks/transform_validation.py", line 35, in validate_transformation
    raise Exception(
Exception: ERROR: failed to transform all harvested records: harvested record count (1009) != transformed record count (0)
[2024-04-02, 15:21:42 UTC] {taskinstance.py:1345} INFO - Marking task as FAILED. dag_id=loc, task_id=LOC_ETL.greek-and-armenian-patriarchates_etl.loc_greek-and-armenian-patriarchates_transform_validation, execution_date=20240324T070000, start_date=20240402T152141, end_date=20240402T152142
[2024-04-02, 15:21:42 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 10 for task LOC_ETL.greek-and-armenian-patriarchates_etl.loc_greek-and-armenian-patriarchates_transform_validation (ERROR: failed to transform all harvested records: harvested record count (1009) != transformed record count (0); 342)
[2024-04-02, 15:21:42 UTC] {local_task_job_runner.py:225} INFO - Task exited with return code 1
[2024-04-02, 15:21:42 UTC] {taskinstance.py:2653} INFO - 0 downstream tasks scheduled from follow-on schedule check

But this failure happens because the previous transform step had an error that doesn't show up in Airflow as an error (which should be a separate issue):

[2024-04-02, 15:21:20 UTC] {docker.py:403} INFO - 2024-04-02T15:21:20+00:00 ERROR Unexpected error on record <record #5 (/opt/***/working/loc/greek_and_armenian_patriarchates/data.json #5), output_id:loc-00271073872-jo>
    while executing (to_field "cho_date_range_norm" at traject_configs/loc.rb:61)
    Record: {"id"=>"http://www.loc.gov/item/00271073872-jo/", "date"=>1500000, "identifier"=>"Ethiopic: Reel 15", "language"=>["geez"], "format"=>["manuscript/mixed material"], "preview"=>["https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073872-jo:0005/full/pct:12.5/0/default.jpg#h=315&w=442", "https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073872-jo:0005/full/pct:25/0/default.jpg#h=630&w=884", "https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073872-jo:0005/full/pct:50/0/default.jpg#h=1261&w=1768", "https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073872-jo:0005/full/pct:100/0/default.jpg#h=2523&w=3537"], "shown_at"=>"https://www.loc.gov/item/00271073872-jo/", "subject"=>["armenian church. erusaghēmi patriarkʻutʻiwn", "greek orthodox patriarchate of jerusalem", "manuscripts, greek", "jerusalem", "manuscripts", "manuscripts, armenian"], "title"=>"Ethiopic 15. Psalter. 16th/17th cent. 115 f. Pg. 14 ft.", "type"=>["Manuscript"], "description"=>nil}
    Exception: NoMethodError: undefined method `strip' for 1500000:Integer
    /opt/traject/lib/macros/date_parsing.rb:100:in `block (2 levels) in parse_range'
[2024-04-02, 15:21:20 UTC] {docker.py:403} INFO - [ERROR] undefined method `strip' for 1500000:Integer
[2024-04-02, 15:21:20 UTC] {docker.py:403} INFO - /opt/traject/lib/macros/date_parsing.rb:100:in `block (2 levels) in parse_range': undefined method `strip' for 1500000:Integer (NoMethodError)
    from /opt/traject/lib/macros/date_parsing.rb:99:in `each'
    from /opt/traject/lib/macros/date_parsing.rb:99:in `block in parse_range'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer/step.rb:140:in `block in execute'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer/step.rb:135:in `each'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer/step.rb:135:in `execute'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:472:in `block (2 levels) in map_to_context!'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:512:in `handle_mapping_errors'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:471:in `block in map_to_context!'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:465:in `each'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:465:in `map_to_context!'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:590:in `block (3 levels) in process'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/thread_pool.rb:123:in `block in maybe_in_thread_pool'
    from /usr/local/bundle/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/abstract_executor_service.rb:94:in `block in fallback_action'
    from /usr/local/bundle/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_executor_service.rb:27:in `post'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/thread_pool.rb:121:in `maybe_in_thread_pool'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:589:in `block (2 levels) in process'
    from /usr/local/bundle/gems/traject_plus-1.3.0/lib/traject_plus/json_reader.rb:20:in `each'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:553:in `block in process'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:546:in `each'
    from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:546:in `process'
    from /opt/traject/lib/transformer.rb:22:in `transform'
    from /opt/traject/lib/cli.rb:78:in `block in transform_all'
    from /opt/traject/lib/cli.rb:72:in `each'
    from /opt/traject/lib/cli.rb:72:in `transform_all'
    from /opt/traject/lib/cli.rb:56:in `transform'
    from /usr/local/bundle/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
    from /usr/local/bundle/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
    from /usr/local/bundle/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
    from /usr/local/bundle/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
    from exe/transform:8:in `<main>'

I commented out the filter_data operations and the DAG ran fine, so I think something in filter_data is causing the year to get mangled. filter_data is not currently exercised as part of bin/get, which would explain why the bin/get output is unaffected.
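Until the cause in filter_data is tracked down, one possible stopgap is to force the field back to a string before the JSON is written. This is only a sketch under assumptions: `stringify_numeric_field` is a hypothetical helper, not part of dlme-airflow, and it assumes the records are available as a list of dicts. It would avoid the `strip` NoMethodError in traject, but it cannot recover a value that has already been mangled (1500000 would just become "1500000").

```python
import math


def stringify_numeric_field(records, field="date"):
    """Return new records in which a numeric `field` value becomes a string.

    Strings, lists, None and NaN are left untouched.
    """
    fixed = []
    for record in records:
        value = record.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or (isinstance(value, float) and math.isnan(value)):
            fixed.append(record)
            continue
        if isinstance(value, float) and value.is_integer():
            value = int(value)  # 1600.0 -> 1600, so we emit "1600" not "1600.0"
        fixed.append({**record, field: str(value)})
    return fixed


# Hypothetical sample input covering float, string and missing dates.
records = [
    {"id": "a", "date": 1600.0},
    {"id": "b", "date": "1500"},
    {"id": "c", "date": None},
]
cleaned = stringify_numeric_field(records)
print([r["date"] for r in cleaned])  # ['1600', '1500', None]
```

The input records are not modified; changed records are shallow copies, so this could be dropped in as a last step before serialization without side effects on earlier steps.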