Open aaron-collier opened 2 years ago
Focusing on tests to start.
@lwrubel one area of refactoring to think about is at https://github.com/sul-dlss/dlme-airflow/blob/main/dlme_airflow/tasks/harvest_report.py#L188
This is loading the whole ndjson file before moving on. These can be large files, anywhere from a couple of MBs to a couple of hundred MBs.
One bug fix I'm working on will strip blank lines on this load, but I'm not focusing on the refactor there. Happy to discuss tomorrow.
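For the whole-file load mentioned above, one possible direction is to read the ndjson one line at a time. This is only a rough sketch, not what harvest_report.py does today: `iter_ndjson` is a hypothetical helper name, and it assumes each non-blank line is one complete JSON record.

```python
import json
from typing import Iterator


def iter_ndjson(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-blank line (hypothetical helper).

    Only the current line is held in memory, so a file of a couple of
    hundred MBs is never loaded all at once.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines instead of failing on them
            yield json.loads(line)
```

Anything that needs totals (record counts, per-field counts) could then accumulate them while iterating rather than building a full list first.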
Checklist:
agg_data_provider_collection_id seems to have a bug in the way average values are counted. AUB is getting 10.27 values per record. As far as I can tell this should be just 1 value per record, making the average also 1. The code might be counting every letter instead of every word. The average is 4.0 for aims and the value in the field is always 'aims'.
to_field some_dlme_field, column('some_column_name'),...
There are lines that look like 'wr_dc_rights' => [column('rights')],
but I'm ok with ignoring these for now. I'm also ok with dropping the third column of this crosswalk so that it just shows in and out fields and doesn't try to explain transformations.
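On the agg_data_provider_collection_id average in the checklist: the 4.0 for 'aims' is what you would get if `len()` were called on the string itself, since `len("aims")` is 4. I haven't confirmed that this is the actual cause in harvest_report.py. The sketch below just shows one way to count values so a bare string counts as one. `count_values` and `average_values` are hypothetical names, and the flat-dict record shape is an assumption.

```python
def count_values(field_value) -> int:
    """Count the values in a field (hypothetical helper).

    A bare string is one value; calling len() on it directly would
    count its characters instead.
    """
    if field_value is None:
        return 0
    if isinstance(field_value, str):
        return 1
    return len(field_value)  # list/tuple of values


def average_values(records, field) -> float:
    """Average number of values per record for one field."""
    counts = [count_values(record.get(field)) for record in records]
    return sum(counts) / len(counts) if counts else 0.0


# Assumed record shape: a flat dict with the field holding a string or a list.
records = [{"agg_data_provider_collection_id": "aims"}] * 3
print(len("aims"))  # character count of the string
print(average_values(records, "agg_data_provider_collection_id"))
```

A test along these lines (single string vs. list of strings) would probably be a cheap first one to add for the report.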
https://codeclimate.com/github/sul-dlss/dlme-airflow/dlme_airflow/tasks/harvest_report.py
For context, this needs a major refactor. It was written initially piece by piece and run manually one collection at a time. It would fail from time to time. The code captures the general functionality needed but it needs tests, it probably has some bugs, and it could probably benefit from performance improvements.