XML driver harvesting only the first element

This was a good find. It looks like the XML driver is pretty broken at the moment for all providers. If you harvest the entire dataset with bin/get harvard scw > harvard.csv and then count how many times each id value appears, you can see that there are duplicate rows:

>>> import pandas
>>> df = pandas.read_csv('harvard.csv')
>>> df
          id  ...                                            preview
0   22535138  ...  https://nrs.harvard.edu/urn-3:FHCL:35350111?wi...
1   22535185  ...  https://nrs.harvard.edu/urn-3:FHCL:35350205?wi...
2   10898498  ...  https://nrs.harvard.edu/urn-3:FHCL:23018086?wi...
3   10898499  ...  https://nrs.harvard.edu/urn-3:FHCL:23018088?wi...
4   10898505  ...  https://nrs.harvard.edu/urn-3:FHCL:23018100?wi...
..       ...  ...                                                ...
95  16199346  ...  https://nrs.harvard.edu/urn-3:FHCL:27735221?wi...
96  16199347  ...  https://nrs.harvard.edu/urn-3:FHCL:27735229?wi...
97  20329473  ...  https://nrs.harvard.edu/urn-3:FHCL:32602011?wi...
98  20329474  ...  https://nrs.harvard.edu/urn-3:FHCL:32602013?wi...
99  20329475  ...  https://nrs.harvard.edu/urn-3:FHCL:32602015?wi...

[100 rows x 12 columns]

>>> df.id.value_counts()
22535138    10
22535185    10
10898498    10
10898499    10
10898505    10
16199346    10
16199347    10
20329473    10
20329474    10
20329475    10

The same is true for bin/get aims aims, except there are 144 duplicate rows...
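
To put a number on the duplication for any provider, something like the following could work. It is only a sketch: `duplicate_summary` is a name I made up, and the only thing it assumes about the CSV is the `id` column shown above. The inline sample stands in for a real harvest file.

```python
import io

import pandas


def duplicate_summary(df, column="id"):
    """Return (redundant row count, distinct value count) for a column."""
    # duplicated() flags every occurrence of a value after its first one
    redundant = int(df[column].duplicated().sum())
    distinct = int(df[column].nunique())
    return redundant, distinct


# Tiny stand-in for a harvested CSV such as harvard.csv
sample = pandas.read_csv(io.StringIO("id,title\n1,a\n1,a\n2,b\n2,b\n2,b\n"))
print(duplicate_summary(sample))  # (3, 2)
```

Run against the harvard.csv shown above (100 rows, 10 ids appearing 10 times each), this would report 90 redundant rows across 10 distinct ids.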

I've added a simple test to first capture what we think should be the correct behavior over in https://github.com/sul-dlss/dlme-airflow/pull/245
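
For anyone following along, the behavior we want can be sketched roughly like this. This is not the test from the PR and not the real driver: `parse_records`, the `<record>` element name, and the sample XML are all hypothetical, and it uses only the standard library. The point is just that each record in the source should come out exactly once.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; real provider XML will look different
SAMPLE = """<records>
  <record><id>1</id><title>first</title></record>
  <record><id>2</id><title>second</title></record>
  <record><id>3</id><title>third</title></record>
</records>"""


def parse_records(xml_text):
    """Hypothetical stand-in for the driver: one dict per <record> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter("record"):
        # Read fields from the current record's own children
        rows.append({child.tag: child.text for child in record})
    return rows


def test_each_record_harvested_once():
    ids = [row["id"] for row in parse_records(SAMPLE)]
    # Every record appears, in order, with no repeats
    assert ids == ["1", "2", "3"]
    assert len(ids) == len(set(ids))


test_each_record_harvested_once()
```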

XML driver harvesting only the first element #243