When running the DAG, it at first appears that the transform_validation step is failing:
[2024-04-02, 15:21:42 UTC] {taskinstance.py:1824} ERROR - Task failed with exception
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 181, in execute
return_value = self.execute_callable()
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/airflow/.local/lib/python3.11/site-packages/airflow/operators/python.py", line 198, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/airflow/dlme_airflow/tasks/transform_validation.py", line 35, in validate_transformation
raise Exception(
Exception: ERROR: failed to transform all harvested records: harvested record count (1009) != transformed record count (0)
[2024-04-02, 15:21:42 UTC] {taskinstance.py:1345} INFO - Marking task as FAILED. dag_id=loc, task_id=LOC_ETL.greek-and-armenian-patriarchates_etl.loc_greek-and-armenian-patriarchates_transform_validation, execution_date=20240324T070000, start_date=20240402T152141, end_date=20240402T152142
[2024-04-02, 15:21:42 UTC] {standard_task_runner.py:104} ERROR - Failed to execute job 10 for task LOC_ETL.greek-and-armenian-patriarchates_etl.loc_greek-and-armenian-patriarchates_transform_validation (ERROR: failed to transform all harvested records: harvested record count (1009) != transformed record count (0); 342)
[2024-04-02, 15:21:42 UTC] {local_task_job_runner.py:225} INFO - Task exited with return code 1
[2024-04-02, 15:21:42 UTC] {taskinstance.py:2653} INFO - 0 downstream tasks scheduled from follow-on schedule check
But this failure is because the previous transform step had an error, which doesn't show up in Airflow as an error (which should be a separate issue):
[2024-04-02, 15:21:20 UTC] {docker.py:403} INFO - 2024-04-02T15:21:20+00:00 ERROR Unexpected error on record <record #5 (/opt/***/working/loc/greek_and_armenian_patriarchates/data.json #5), output_id:loc-00271073872-jo>
while executing (to_field "cho_date_range_norm" at traject_configs/loc.rb:61)
Record: {"id"=>"http://www.loc.gov/item/00271073872-jo/", "date"=>1500000, "identifier"=>"Ethiopic: Reel 15", "language"=>["geez"], "format"=>["manuscript/mixed material"], "preview"=>["https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073872-jo:0005/full/pct:12.5/0/default.jpg#h=315&w=442", "https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073872-jo:0005/full/pct:25/0/default.jpg#h=630&w=884", "https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073872-jo:0005/full/pct:50/0/default.jpg#h=1261&w=1768", "https://tile.loc.gov/image-services/iiif/service:amed:amedmonastery:00271073872-jo:0005/full/pct:100/0/default.jpg#h=2523&w=3537"], "shown_at"=>"https://www.loc.gov/item/00271073872-jo/", "subject"=>["armenian church. erusaghēmi patriarkʻutʻiwn", "greek orthodox patriarchate of jerusalem", "manuscripts, greek", "jerusalem", "manuscripts", "manuscripts, armenian"], "title"=>"Ethiopic 15. Psalter. 16th/17th cent. 115 f. Pg. 14 ft.", "type"=>["Manuscript"], "description"=>nil}
Exception: NoMethodError: undefined method `strip' for 1500000:Integer
/opt/traject/lib/macros/date_parsing.rb:100:in `block (2 levels) in parse_range'
[2024-04-02, 15:21:20 UTC] {docker.py:403} INFO - [ERROR] undefined method `strip' for 1500000:Integer
[2024-04-02, 15:21:20 UTC] {docker.py:403} INFO - /opt/traject/lib/macros/date_parsing.rb:100:in `block (2 levels) in parse_range': undefined method `strip' for 1500000:Integer (NoMethodError)
from /opt/traject/lib/macros/date_parsing.rb:99:in `each'
from /opt/traject/lib/macros/date_parsing.rb:99:in `block in parse_range'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer/step.rb:140:in `block in execute'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer/step.rb:135:in `each'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer/step.rb:135:in `execute'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:472:in `block (2 levels) in map_to_context!'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:512:in `handle_mapping_errors'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:471:in `block in map_to_context!'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:465:in `each'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:465:in `map_to_context!'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:590:in `block (3 levels) in process'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/thread_pool.rb:123:in `block in maybe_in_thread_pool'
from /usr/local/bundle/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/abstract_executor_service.rb:94:in `block in fallback_action'
from /usr/local/bundle/gems/concurrent-ruby-1.2.3/lib/concurrent-ruby/concurrent/executor/ruby_executor_service.rb:27:in `post'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/thread_pool.rb:121:in `maybe_in_thread_pool'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:589:in `block (2 levels) in process'
from /usr/local/bundle/gems/traject_plus-1.3.0/lib/traject_plus/json_reader.rb:20:in `each'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:553:in `block in process'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:546:in `each'
from /usr/local/bundle/gems/traject-3.8.2/lib/traject/indexer.rb:546:in `process'
from /opt/traject/lib/transformer.rb:22:in `transform'
from /opt/traject/lib/cli.rb:78:in `block in transform_all'
from /opt/traject/lib/cli.rb:72:in `each'
from /opt/traject/lib/cli.rb:72:in `transform_all'
from /opt/traject/lib/cli.rb:56:in `transform'
from /usr/local/bundle/gems/thor-0.20.3/lib/thor/command.rb:27:in `run'
from /usr/local/bundle/gems/thor-0.20.3/lib/thor/invocation.rb:126:in `invoke_command'
from /usr/local/bundle/gems/thor-0.20.3/lib/thor.rb:387:in `dispatch'
from /usr/local/bundle/gems/thor-0.20.3/lib/thor/base.rb:466:in `start'
from exe/transform:8:in `<main>'
I commented out the filter_data operations and these worked fine, so I think there's something going on in filter_data which is causing the Year to get mangled. filter_data is not currently exercised as part of bin/get.
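The traject failure above happens because the date arrives as a number, and `strip` only exists on strings. A minimal sketch of a guard that could run on each date value before the JSON is written, assuming the value may reach that point as an int or float; `normalize_date_value` is a hypothetical helper, not something that exists in dlme-airflow, and where filter_data actually changes the type is still unknown:

```python
import math


def normalize_date_value(value):
    """Return a date value as a string (hypothetical helper).

    Whole-number floats such as 1500.0 become "1500", integers become
    their decimal string, None and NaN become None, and strings pass
    through unchanged.
    """
    if value is None:
        return None
    if isinstance(value, bool):
        # bool is a subclass of int; treat it as bad input, not a year
        raise TypeError(f"unexpected boolean date value: {value!r}")
    if isinstance(value, float):
        if math.isnan(value):
            return None
        if value.is_integer():
            # drop the trailing ".0" that a float dtype would add
            return str(int(value))
    return str(value)
```

This only hides the symptom at serialization time; it does not explain why the value changes type in the first place.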
In the LoC collection greek-and-armenian-patriarchates the date field is getting converted to a float dtype, which raises an error when we try to parse it in traject. This does not happen when harvested with bin/get or when harvested in airflow for the csv file. To reproduce, run airflow locally and compare the csv and json files.

This is the first record in json:

and this is the same record in csv:

And this is how the json looks when harvested with bin/get:

I'm not sure where the error creeps in, but ideally bin/get would use the same code so it doesn't produce different results from the airflow DAG.

This ticket is related: https://github.com/sul-dlss/dlme-airflow/issues/436
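When comparing the two harvests, it may help to check the field types programmatically rather than by eye. A rough sketch, assuming the harvested records have already been loaded into Python dicts (the loading step depends on how data.json is laid out and is left out here); `find_non_string_values` is a hypothetical name and the second sample record is made up for illustration:

```python
def find_non_string_values(records, field="date"):
    """Return (id, value, type name) for every value of `field` that is not a string.

    Missing or None values are skipped; list values are checked element by element.
    """
    problems = []
    for record in records:
        value = record.get(field)
        values = value if isinstance(value, list) else [value]
        for item in values:
            if item is None or isinstance(item, str):
                continue
            problems.append((record.get("id"), item, type(item).__name__))
    return problems


# Illustrative records only; the first mirrors the failing record in the log above.
records = [
    {"id": "http://www.loc.gov/item/00271073872-jo/", "date": 1500000},
    {"id": "example-2", "date": "1500"},
]
print(find_non_string_values(records))
```

Running the same check against the bin/get output and the Airflow output should show whether only the Airflow JSON carries numeric dates.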