surfedushare / pol-harvester

A repository that harvests different sources for content

Investigate the "multiple technical locations" problem #134

Closed fako closed 4 years ago

fako commented 4 years ago

There are some materials that have multiple locations. We're wondering how often this occurs and if we need to adjust our approach in some way.

fako commented 4 years ago

I couldn't really check (yet) whether there are differences between the API and OAI-PMH, but in the API there are only about 60 materials that have location duplicates and 8 that have no location at all.
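For illustration, a count like this could be produced roughly as follows. This is only a sketch: the `materials` mapping (material id to list of location URLs) and the function name `summarize_location_counts` are hypothetical and not taken from the harvester code.

```python
def summarize_location_counts(materials):
    """Split material ids into those with several locations and those with none.

    Assumption: `materials` maps a material id to the list of location URLs
    found in its record. This shape is hypothetical, chosen for the example.
    """
    multiple = []
    missing = []
    for material_id, locations in materials.items():
        if len(locations) > 1:
            multiple.append(material_id)
        elif not locations:
            missing.append(material_id)
    return {"multiple": multiple, "missing": missing}


# Made-up example data, not real harvest results.
example = {
    "material-a": ["https://example.org/a"],
    "material-b": ["https://example.org/b1", "https://example.org/b2"],
    "material-c": [],
}
summary = summarize_location_counts(example)
```

Here `summary["multiple"]` holds the ids with more than one location and `summary["missing"]` the ids without any.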

If you read the NL-LOM specification, then apparently it's OK to have multiple locations, but the preferred location should be at the top. So in that case our code is doing fine, because it takes the first location it encounters (within a record).
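A minimal sketch of that "first location wins" behaviour, assuming a LOM-style record with `technical/location` elements. The namespace URI, the sample record, and the function names are assumptions for the example; this is not the actual harvester code.

```python
import xml.etree.ElementTree as ET

# Assumption: illustrative namespace; use whatever the harvested records declare.
LOM_NS = {"lom": "http://www.imsglobal.org/xsd/imsmd_v1p2"}

# Made-up record with two locations, the preferred one on top.
SAMPLE_RECORD = """
<lom xmlns="http://www.imsglobal.org/xsd/imsmd_v1p2">
  <technical>
    <location>https://example.org/preferred</location>
    <location>https://example.org/alternative</location>
  </technical>
</lom>
"""


def get_locations(record_xml):
    """Return all non-empty technical locations in document order."""
    root = ET.fromstring(record_xml.strip())
    return [
        element.text.strip()
        for element in root.findall("lom:technical/lom:location", LOM_NS)
        if element.text and element.text.strip()
    ]


def first_location(record_xml):
    """Return the first location of a record, or None if it has no location."""
    locations = get_locations(record_xml)
    return locations[0] if locations else None
```

With this approach a record without any location simply yields `None`, which matches the offline-material case.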

However, it's doubtful whether in our example case (id=oai:surfsharekit.nl:5be6dfeb-b9ad-41a8-b4f5-94b9438e4257) the preferred location is truly the first location, because the source location is actually the bottom location. That would be something to discuss with Sharekit.

The case of 0 locations is also valid, BTW! That's the case for offline materials. Perhaps we should enquire whether the 8 materials without a location (also some in Sharekit) are truly offline materials. That would be true for books, I guess.

jelmerderonde commented 4 years ago

Can you get me a list with some examples?

fako commented 4 years ago

Sure thing boss ;) This is the list of 0 locations:

fako commented 4 years ago

This is a CSV with all multi locations: multi_locations.csv.zip

fako commented 4 years ago

I see a pattern! I think you'll spot it pretty quickly too :$