sul-dlss-labs / rialto-airflow

Airflow for harvesting data for research intelligence and open access analysis
Apache License 2.0

openalex_harvest_pubs task fails during results pagination #57

Closed: lwrubel closed this issue 1 month ago

lwrubel commented 2 months ago

When running with a dev_limit of 1000, the openalex_harvest_pubs task sometimes fails with:

[2024-06-25, 18:29:26 UTC] {taskinstance.py:2905} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 465, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 432, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py", line 401, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/decorators/base.py", line 265, in execute
    return_value = super().execute(context)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py", line 401, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/operators/python.py", line 235, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/operators/python.py", line 252, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/airflow/rialto_airflow/dags/harvest.py", line 95, in openalex_harvest_pubs
    openalex.publications_csv(dois, csv_file)
  File "/opt/airflow/rialto_airflow/harvest/openalex.py", line 76, in publications_csv
    for pub in publications_from_dois(dois):
  File "/opt/airflow/rialto_airflow/harvest/openalex.py", line 89, in publications_from_dois
    for page in Works().filter(doi=doi_list).paginate(per_page=200):
  File "/home/airflow/.local/lib/python3.12/site-packages/pyalex/api.py", line 152, in __next__
    self._next_value = meta["next_cursor"]
                       ~~~~^^^^^^^^^^^^^^^
KeyError: 'next_cursor'
[2024-06-25, 18:29:26 UTC] {taskinstance.py:1206} INFO - Marking task as FAILED. dag_id=harvest, task_id=openalex_harvest_pubs, run_id=manual__2024-06-25T17:23:53.071843+00:00, execution_date=20240625T172353, start_date=20240625T182053, end_date=20240625T182926
[2024-06-25, 18:29:26 UTC] {standard_task_runner.py:110} ERROR - Failed to execute job 61 for task openalex_harvest_pubs ('next_cursor'; 857)
[2024-06-25, 18:29:26 UTC] {local_task_job_runner.py:240} INFO - Task exited with return code 1
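
The KeyError above is raised inside pyalex while it reads the cursor for the next page, and the failure is intermittent, so one possible workaround on the calling side is to retry a whole batch of DOIs when pagination breaks. The following is only a minimal sketch under that assumption: `batched` and `fetch_with_retry` are hypothetical helpers that do not exist in rialto-airflow, and `fetch` stands in for any callable that wraps the `Works().filter(doi=...).paginate(per_page=200)` loop shown in the traceback. Whether a retry actually clears the error has not been confirmed in this thread.

```python
import time
from typing import Callable, Iterable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` entries."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fetch_with_retry(
    fetch: Callable[[List[str]], Iterable[dict]],
    batch: List[str],
    attempts: int = 3,
    delay: float = 1.0,
) -> List[dict]:
    """Call `fetch(batch)`, retrying when pagination raises KeyError.

    `fetch` is a hypothetical wrapper around the pyalex lookup; it is an
    assumption of this sketch, not part of the original code.
    """
    for attempt in range(1, attempts + 1):
        try:
            # Materialize the whole batch so a failure part-way through
            # pagination never leaves partial results behind.
            return list(fetch(batch))
        except KeyError:
            if attempt == attempts:
                raise  # give up and let the Airflow task fail as before
            time.sleep(delay)
    return []  # only reached when attempts < 1
```

Collecting each batch into a list before returning means a retry starts the batch over instead of yielding some publications twice. The cost is holding one batch of results in memory at a time.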

lwrubel commented 2 months ago

Failed again, this time with a different error:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 465, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py", line 432, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py", line 401, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/decorators/base.py", line 265, in execute
    return_value = super().execute(context)
                   ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py", line 401, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/operators/python.py", line 235, in execute
    return_value = self.execute_callable()
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/operators/python.py", line 252, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/airflow/rialto_airflow/dags/harvest.py", line 95, in openalex_harvest_pubs
    openalex.publications_csv(dois, csv_file)
  File "/opt/airflow/rialto_airflow/harvest/openalex.py", line 76, in publications_csv
    for pub in publications_from_dois(dois):
  File "/opt/airflow/rialto_airflow/harvest/openalex.py", line 89, in publications_from_dois
    for page in Works().filter(doi=doi_list).paginate(per_page=200):
  File "/home/airflow/.local/lib/python3.12/site-packages/pyalex/api.py", line 147, in __next__
    results, meta = self.endpoint_class.get(
                    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/pyalex/api.py", line 293, in get
    return self._get_from_url(self.url, return_meta=return_meta)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/airflow/.local/lib/python3.12/site-packages/pyalex/api.py", line 265, in _get_from_url
    raise QueryError(res.json()["message"])
pyalex.api.QueryError: 4_2|10.58530/2022/1705|10.7910/dvn/b4uo2l/grm2dr|10.1016/j.carbon.2015.01.017|10.1016/j.jnoncrysol.2014.04.003|10.1053/j.jrn.2010.09.007|10.1186/gb-2010-11-1-r10|10.2312/vcbm.20161282|10.1002/jmri.23614|10.1007/s00401-018-1859-2|10.1145/3322126|10.7910/dvn/b4uo2l/rkccfs|10.1152/ajpheart.00790.2002|10.1007/978-981-15-3449-2_6|10.1056/nejmoa032520|10.1128/aem.03950-14|10.1109/ectc.2018.00061|10.1016/j.healun.2015.10.039|10.3847/1538-4357/ac0053|10.1038/leu.2011.213|10.1353/rus.2016.0005|10.1002/adfm.201201848|10.1007/s10955-019-02249-9|10.1016/j.athoracsur.2021.07.058|10.21203/rs.3.rs-2883579/v1|10.1115/imece1998-0246|10.7910/dvn/ttlqrn/o1w8s7|10.1038/s41598-022-21510-y|10.7910/dvn/8dushz/budenx|10.7189/jogh.14.04011|10.1126/sciimmunol.aat8116|10.3802/jgo.2021.32.e14|10.1111/j.1552-6909.2012.01379.x|10.1182/blood-2018-99-113328|10.1145/3449101|10.1021/acsnano.5b02432|10.1016/j.jss.2024.01.009|10.1046/j.1440-0952.2002.00932.x|10.1101/767988|10.1016/j.gca.2022.12.008|10.1089/wound.2016.0709 is not a valid parameter. Valid parameters are: apc_sum, cited_by_count_sum, cursor, filter, format, group_by, group-by, group_bys, group-bys, mailto, page, per_page, per-page, q, sample, seed, search, select, sort.
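
In the message above, the tail of the DOI filter value (starting at `4_2|`) is reported as a parameter name. That would be consistent with one DOI containing an unencoded `&`, which splits the query string in two, but this is a guess and is not confirmed anywhere in this thread. The sketch below illustrates the idea with the standard library only. The DOI `10.1000/example&4_2` is made up for the demonstration, and `split_safe_dois` and `RESERVED` are hypothetical names, not part of rialto-airflow or pyalex.

```python
from typing import Iterable, List, Tuple
from urllib.parse import parse_qsl

# Characters assumed to be risky in a DOI filter: "&", "#" and "?" are
# URL delimiters, while "," and "|" separate values in OpenAlex filters.
RESERVED = set("&#?,|")


def split_safe_dois(dois: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Separate DOIs that contain a reserved character from the rest."""
    safe: List[str] = []
    suspect: List[str] = []
    for doi in dois:
        (suspect if RESERVED & set(doi) else safe).append(doi)
    return safe, suspect


# Hypothetical input: the first DOI is invented to show the effect.
dois = ["10.1000/example&4_2", "10.58530/2022/1705"]

# Build the query string without any encoding and parse it the way a
# server would: the "&" ends the filter and starts a new "parameter".
query = "filter=doi:" + "|".join(dois)
params = parse_qsl(query, keep_blank_values=True)

safe, suspect = split_safe_dois(dois)
```

Here `params` holds two entries, and the second one has the name `4_2|10.58530/2022/1705`, the same shape as the invalid parameter in the error. Logging the `suspect` list before querying would show whether such DOIs are actually present in a failing run.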

lwrubel commented 1 month ago

Closing since we haven't seen this in recent runs.