sul-dlss / dlme-airflow

This is a new repository to capture the work related to the DLME ETL Pipeline and establish airflow
Apache License 2.0

XML driver harvesting only the first element #243

Closed by jacobthill 2 years ago

jacobthill commented 2 years ago

Currently the XML driver harvests only the first element with a given name; it should harvest all of them. To verify, see the metadata here and, using this branch, run bin/get harvard scw --limit 20 > data.csv

There are multiple elements under .//mods:subject/mods:topic but only the first is being harvested.
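To illustrate the kind of behavior described here, the following is a minimal sketch using Python's standard-library xml.etree.ElementTree. It is not the dlme-airflow XML driver's code; the sample record, the topic values, and the helper names first_only and all_values are hypothetical. It only shows how a "first match" lookup drops repeated elements while a "find all" lookup keeps them.

```python
import xml.etree.ElementTree as ET

# Hypothetical MODS record for illustration; not taken from the Harvard feed.
RECORD = """
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:subject><mods:topic>Architecture</mods:topic></mods:subject>
  <mods:subject><mods:topic>Mosques</mods:topic></mods:subject>
  <mods:subject><mods:topic>Minarets</mods:topic></mods:subject>
</mods:mods>
""".strip()

NS = {"mods": "http://www.loc.gov/mods/v3"}
PATH = ".//mods:subject/mods:topic"


def first_only(record, path):
    """Return the text of the first matching element (the behavior reported here)."""
    element = record.find(path, NS)
    return element.text if element is not None else None


def all_values(record, path):
    """Return the text of every matching element (the behavior asked for)."""
    return [element.text for element in record.findall(path, NS)]


record = ET.fromstring(RECORD)
first = first_only(record, PATH)
topics = all_values(record, PATH)

print(first)
print(topics)
```

With three mods:topic elements in the record, first_only yields a single string while all_values yields a three-item list, which is the difference that would show up in data.csv.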

edsu commented 2 years ago

This was a good find. It looks like the XML driver is pretty broken at the moment for all providers. If you harvest the entire dataset with bin/get harvard scw > harvard.csv and then count how many times each id value appears, you can see that there are duplicate rows:

>>> import pandas
>>> df = pandas.read_csv('harvard.csv')
>>> df
          id  ...                                            preview
0   22535138  ...  https://nrs.harvard.edu/urn-3:FHCL:35350111?wi...
1   22535185  ...  https://nrs.harvard.edu/urn-3:FHCL:35350205?wi...
2   10898498  ...  https://nrs.harvard.edu/urn-3:FHCL:23018086?wi...
3   10898499  ...  https://nrs.harvard.edu/urn-3:FHCL:23018088?wi...
4   10898505  ...  https://nrs.harvard.edu/urn-3:FHCL:23018100?wi...
..       ...  ...                                                ...
95  16199346  ...  https://nrs.harvard.edu/urn-3:FHCL:27735221?wi...
96  16199347  ...  https://nrs.harvard.edu/urn-3:FHCL:27735229?wi...
97  20329473  ...  https://nrs.harvard.edu/urn-3:FHCL:32602011?wi...
98  20329474  ...  https://nrs.harvard.edu/urn-3:FHCL:32602013?wi...
99  20329475  ...  https://nrs.harvard.edu/urn-3:FHCL:32602015?wi...

[100 rows x 12 columns]

>>> df.id.value_counts()
22535138    10
22535185    10
10898498    10
10898499    10
10898505    10
16199346    10
16199347    10
20329473    10
20329474    10
20329475    10

The same is true for bin/get aims aims, except there are 144 duplicate rows...
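The duplicate check above can also be scripted without pandas. This is a rough sketch using only the standard library; the sample CSV content and the helper name duplicate_ids are made up for illustration, and the only assumption about the real output is that it has an id column, as shown in the DataFrame above.

```python
import csv
import io
from collections import Counter

# Hypothetical harvest output for illustration; real files come from bin/get.
SAMPLE = """id,title
22535138,A
22535138,A
10898498,B
10898498,B
10898498,B
16199346,C
"""


def duplicate_ids(csv_file, column="id"):
    """Map each id that appears more than once to its number of rows."""
    counts = Counter(row[column] for row in csv.DictReader(csv_file))
    return {value: n for value, n in counts.items() if n > 1}


dupes = duplicate_ids(io.StringIO(SAMPLE))
# Rows beyond the first occurrence of each repeated id.
extra_rows = sum(n - 1 for n in dupes.values())

print(dupes)
print(extra_rows)
```

To run it against a real harvest, pass an open file instead of the in-memory sample, for example open("harvard.csv", newline="").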

I've added a simple test to first capture what we think should be the correct behavior over in https://github.com/sul-dlss/dlme-airflow/pull/245