ucldc / rikolti

calisphere harvester 2.0
BSD 3-Clause "New" or "Revised" License
Errors generating validation reports: nuxeo (2 repositories only) #598

Status: Closed (christinklez closed 9 months ago)

christinklez commented 1 year ago

Errors reported during the mapping task:

Repository 6: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/validate_by_mapper_type/grid?dag_run_id=manual__2024-01-31T16%3A46%3A52%2B00%3A00

[2024-01-31, 16:49:04 UTC] {{logging_mixin.py:150}} INFO - 26864 : start mapping 2/4      
[2024-01-31, 16:49:24 UTC] {{taskinstance.py:1824}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/decorators/base.py", line 220, in execute
    return_value = super().execute(context)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/dags/rikolti/dags/utils_by_mapper_type.py", line 120, in map_endpoint_task
    mapper_job_results = map_endpoint(endpoint, fetched_versions, limit)
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/map_registry_collections.py", line 56, in map_endpoint
    map_result = lambda_shepherd.map_collection(
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/lambda_shepherd.py", line 105, in map_collection
    mapped_page = map_page(
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/lambda_function.py", line 121, in map_page
    mapped_records = [record.solr_updater() for record in mapped_records]
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/lambda_function.py", line 121, in <listcomp>
    mapped_records = [record.solr_updater() for record in mapped_records]
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/mappers/mapper.py", line 1654, in solr_updater
    self.mapped_data = map_couch_to_solr_doc(self.mapped_data)
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/mappers/mapper.py", line 1594, in map_couch_to_solr_doc
    solr_doc['sort_title'] = normalize_sort_field(sort_title)
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/mappers/mapper.py", line 1224, in normalize_sort_field
    sort_field = sort_field.lower()
AttributeError: 'NoneType' object has no attribute 'lower'
[2024-01-31, 16:49:24 UTC] {{taskinstance.py:1345}} INFO - Marking task as FAILED. dag_id=validate_by_mapper_type, task_id=map_endpoint_task, execution_date=20240131T164652, start_date=20240131T164856, end_date=20240131T164924
[2024-01-31, 16:49:24 UTC] {{standard_task_runner.py:104}} ERROR - Failed to execute job 3181 for task map_endpoint_task ('NoneType' object has no attribute 'lower'; 7027)
[2024-01-31, 16:49:24 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 1
[2024-01-31, 16:49:24 UTC] {{taskinstance.py:2653}} INFO - 0 downstream tasks scheduled from follow-on schedule check

Repository 25: https://7a8067cb-3b99-477e-a883-7e311175a9b4.c3.us-west-2.airflow.amazonaws.com/dags/validate_by_mapper_type/grid?dag_run_id=manual__2024-01-31T16%3A59%3A43%2B00%3A00&task_id=map_endpoint_task&tab=logs

[2024-01-31, 17:15:00 UTC] {{logging_mixin.py:150}} INFO - 10707 : start mapping 8/73     
[2024-01-31, 17:15:01 UTC] {{taskinstance.py:1824}} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/decorators/base.py", line 220, in execute
    return_value = super().execute(context)
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 181, in execute
    return_value = self.execute_callable()
  File "/usr/local/airflow/.local/lib/python3.10/site-packages/airflow/operators/python.py", line 198, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/airflow/dags/rikolti/dags/utils_by_mapper_type.py", line 120, in map_endpoint_task
    mapper_job_results = map_endpoint(endpoint, fetched_versions, limit)
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/map_registry_collections.py", line 56, in map_endpoint
    map_result = lambda_shepherd.map_collection(
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/lambda_shepherd.py", line 105, in map_collection
    mapped_page = map_page(
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/lambda_function.py", line 121, in map_page
    mapped_records = [record.solr_updater() for record in mapped_records]
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/lambda_function.py", line 121, in <listcomp>
    mapped_records = [record.solr_updater() for record in mapped_records]
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/mappers/mapper.py", line 1654, in solr_updater
    self.mapped_data = map_couch_to_solr_doc(self.mapped_data)
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/mappers/mapper.py", line 1594, in map_couch_to_solr_doc
    solr_doc['sort_title'] = normalize_sort_field(sort_title)
  File "/usr/local/airflow/dags/rikolti/metadata_mapper/mappers/mapper.py", line 1224, in normalize_sort_field
    sort_field = sort_field.lower()
AttributeError: 'NoneType' object has no attribute 'lower'
[2024-01-31, 17:15:02 UTC] {{taskinstance.py:1345}} INFO - Marking task as FAILED. dag_id=validate_by_mapper_type, task_id=map_endpoint_task, execution_date=20240131T165943, start_date=20240131T171457, end_date=20240131T171502
[2024-01-31, 17:15:02 UTC] {{standard_task_runner.py:104}} ERROR - Failed to execute job 3217 for task map_endpoint_task ('NoneType' object has no attribute 'lower'; 7464)
[2024-01-31, 17:15:02 UTC] {{local_task_job_runner.py:225}} INFO - Task exited with return code 1
[2024-01-31, 17:15:02 UTC] {{taskinstance.py:2653}} INFO - 0 downstream tasks scheduled from follow-on schedule check

barbarahui commented 9 months ago

@christinklez I've fixed this bug so that you can continue harvesting. However, we should definitely run a report later before pushing collections to production to check for records without required fields (title, type).
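
The actual fix is not shown in this thread. As a minimal sketch of how a `None` title could be kept from reaching `.lower()`, assuming a simplified, hypothetical stand-in for `normalize_sort_field` (the real function in `mapper.py` probably does more than lowercasing):

```python
from typing import Optional


def normalize_sort_field_safe(sort_field: Optional[str]) -> Optional[str]:
    """Hypothetical None-safe stand-in for normalize_sort_field.

    Returns None for a missing value instead of raising AttributeError.
    """
    if sort_field is None:
        return None
    return sort_field.lower()
```

With a guard like this, a record without a title would map without crashing the task, which is why a later report on missing required fields is still needed.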

Nuxeo seems to have some half-created records in it, for example:

https://nuxeo.cdlib.org/nuxeo/nxpath/default/asset-library/UCB/UCB%20EDA/jekyll/6827693808047028933@view_documents?tabIds=%3A&conversationId=0NXMAIN4

We can have the mapper skip this record so it doesn't get into the index, but I'm guessing you'd also like to be alerted about these kinds of records so that you can tell the user to delete them from Nuxeo?
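
A rough sketch of what skipping and alerting could look like, assuming mapped records are plain dicts with `id`, `title`, and `type` keys. That record shape and the function name `partition_records` are assumptions for illustration, not the mapper's actual API:

```python
import logging

logger = logging.getLogger(__name__)

# Required fields named earlier in this thread.
REQUIRED_FIELDS = ("title", "type")


def partition_records(records):
    """Split records into those safe to index and the ids of those skipped.

    A record is skipped when any required field is missing or empty.
    """
    keep, skipped_ids = [], []
    for record in records:
        missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
        if missing:
            # Log a warning so the skipped record can be followed up in Nuxeo.
            logger.warning(
                "Skipping record %s: missing %s",
                record.get("id"), ", ".join(missing),
            )
            skipped_ids.append(record.get("id"))
        else:
            keep.append(record)
    return keep, skipped_ids
```

The returned `skipped_ids` list could feed a report, so half-created records are surfaced to someone rather than silently dropped.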

barbarahui commented 9 months ago

I created an issue for having the Nuxeo mapper handle these half-created records: https://github.com/ucldc/rikolti/issues/755

christinklez commented 9 months ago

Thank you so much, @barbarahui! I was able to generate validation reports for these two repositories. Thanks!