sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0
1 stars 0 forks source link

Refactor harvest_report to improve maintainability and test coverage #120

Open aaron-collier opened 2 years ago

aaron-collier commented 2 years ago

https://codeclimate.com/github/sul-dlss/dlme-airflow/dlme_airflow/tasks/harvest_report.py

For context, this needs a major refactor. It was written initially piece by piece and run manually one collection at a time. It would fail from time to time. The code captures the general functionality needed but it needs tests, it probably has some bugs, and it could probably benefit from performance improvements.

lwrubel commented 1 year ago

Focusing on tests to start.

aaron-collier commented 1 year ago

@lwrubel one area of refactoring to think about is at https://github.com/sul-dlss/dlme-airflow/blob/main/dlme_airflow/tasks/harvest_report.py#L188

This is loading the whole ndjson file before moving on. These can be large files, anywhere from a couple of MBs to a couple of hundred MBs.

One bug I'm working on will strip blank lines on this load, but not focusing on the refactor there. Happy to discuss tomorrow.

jacobthill commented 1 year ago

Checklist: