This was a good find. It looks like the XML driver is pretty broken at the moment for all providers. If you harvest the entire dataset with `bin/get harvard scw > harvard.csv` and then count how many times each `id` value appears, you can see that there are duplicate rows:
```python
>>> import pandas
>>> df = pandas.read_csv('harvard.csv')
>>> df
          id  ...                                            preview
0   22535138  ...  https://nrs.harvard.edu/urn-3:FHCL:35350111?wi...
1   22535185  ...  https://nrs.harvard.edu/urn-3:FHCL:35350205?wi...
2   10898498  ...  https://nrs.harvard.edu/urn-3:FHCL:23018086?wi...
3   10898499  ...  https://nrs.harvard.edu/urn-3:FHCL:23018088?wi...
4   10898505  ...  https://nrs.harvard.edu/urn-3:FHCL:23018100?wi...
..       ...  ...                                                ...
95  16199346  ...  https://nrs.harvard.edu/urn-3:FHCL:27735221?wi...
96  16199347  ...  https://nrs.harvard.edu/urn-3:FHCL:27735229?wi...
97  20329473  ...  https://nrs.harvard.edu/urn-3:FHCL:32602011?wi...
98  20329474  ...  https://nrs.harvard.edu/urn-3:FHCL:32602013?wi...
99  20329475  ...  https://nrs.harvard.edu/urn-3:FHCL:32602015?wi...

[100 rows x 12 columns]
>>> df.id.value_counts()
22535138    10
22535185    10
10898498    10
10898499    10
10898505    10
16199346    10
16199347    10
20329473    10
20329474    10
20329475    10
```
The same is true for `bin/get aims aims`, except there are 144 duplicate rows.
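If you want to check the duplicate count yourself, here is a minimal sketch (it assumes the harvest was saved to a hypothetical `aims.csv`, mirroring the `harvard.csv` example above):

```python
import pandas

# Assumes `bin/get aims aims > aims.csv` was run first (filename is hypothetical)
df = pandas.read_csv('aims.csv')

# Rows that are exact duplicates of an earlier row
print(df.duplicated().sum())

# Or: how many distinct ids appear more than once
print((df.id.value_counts() > 1).sum())
```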
I've added a simple test over in https://github.com/sul-dlss/dlme-airflow/pull/245 to first capture what we think the correct behavior should be.
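For anyone who hasn't opened the PR, a rough sketch of the kind of end-to-end check such a test could make, using only the `bin/get` CLI shown in this issue (the test name and output path are made up, and this is not necessarily what the PR's test actually does):

```python
import subprocess

import pandas


def test_harvest_has_no_duplicate_rows():
    # Harvest a small sample with the same CLI used elsewhere in this issue
    with open('data.csv', 'w') as out:
        subprocess.run(['bin/get', 'harvard', 'scw', '--limit', '20'],
                       stdout=out, check=True)
    df = pandas.read_csv('data.csv')
    # Every id should appear exactly once
    assert df.id.value_counts().max() == 1
```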
Currently the XML driver only harvests the first element with a given name; it should harvest all of them. To verify, see the metadata here and, using this branch, run `bin/get harvard scw --limit 20 > data.csv`. There are multiple elements matching `.//mods:subject/mods:topic`, but only the first is being harvested.
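I haven't traced the driver code itself, but this looks like the classic `find()` vs `findall()` distinction in ElementTree/lxml. A self-contained illustration with a made-up MODS fragment (the topic values are invented; the namespace URI is the real MODS v3 one):

```python
import xml.etree.ElementTree as ET

# Made-up MODS fragment with two topic subjects
record = ET.fromstring("""
<mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:subject><mods:topic>Calligraphy</mods:topic></mods:subject>
  <mods:subject><mods:topic>Manuscripts</mods:topic></mods:subject>
</mods>
""")

ns = {'mods': 'http://www.loc.gov/mods/v3'}

# find() stops at the first match -- the behavior we're seeing
first = record.find('.//mods:subject/mods:topic', ns)
print(first.text)  # Calligraphy

# findall() returns every match -- the behavior we want
topics = [el.text for el in record.findall('.//mods:subject/mods:topic', ns)]
print(topics)  # ['Calligraphy', 'Manuscripts']
```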