Open · kimdurante opened 2 years ago
@kimdurante would this possibly be resolved by https://github.com/sul-dlss/dor_indexing_app/issues/846 and the new Argo update that allows filtering for items that were not released to SearchWorks/EarthWorks?
Hi @thatbudakguy. I saw this and it looks like a promising fix. I have notified the maps team as well and we are going to test it out. Thanks!
per the GIS team meeting on 11/28:
The Argo facet doesn't solve this because it just gives a report of druids for which the release workflow ran; we have GIS data that doesn't show up in EarthWorks despite the workflow running. This is probably due to bad data, but finding those items is largely what this issue is about.
Peter Crandall has a script for web scraping that compares druids to determine which ones are actually available in EarthWorks, which is the current workaround — obviously not ideal.
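That scraping workaround could look roughly like the sketch below. The catalog URL pattern (`stanford-<druid>`) is taken from the EarthWorks link cited later in this thread; everything else (function names, treating a non-200 response as "missing") is an assumption, not the actual script.

```python
# Sketch of the availability check the scraping workaround performs.
# Assumption: a druid's catalog page resolving with HTTP 200 means it is
# available in EarthWorks; any HTTP error means it is not.
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def earthworks_url(druid: str) -> str:
    """Build the EarthWorks catalog URL for a bare druid."""
    return f"https://earthworks.stanford.edu/catalog/stanford-{druid}"

def is_available(druid: str) -> bool:
    """True if the druid's EarthWorks catalog page resolves."""
    try:
        with urlopen(Request(earthworks_url(druid), method="HEAD")) as resp:
            return resp.status == 200
    except (HTTPError, URLError):
        return False

def missing_from_earthworks(released_druids):
    """Compare a released-druid list against what actually resolves."""
    return [d for d in released_druids if not is_available(d)]
```

Run against the druid list Argo reports as released, this yields the "released but not actually there" set.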
moved from earthworks to argo pending further discussion about how to address this.
Thanks for bringing this to the Argo repo. I think Argo would be the best place to produce a report for the reasons discussed above, namely the already-existing facet that is supposed to represent a release. What I think we will need is better two-way communication between systems. In fact, I don't think there's any two-way communication when a release happens right now. Argo only gets a signal that the Infrastructure systems involved in a release have completed their steps, but no signal that the release actually led to the item appearing in a discovery system.*
As best I know, the release process is carried out through a chain of steps that finally leads to the discovery system on the other end:
What I think is needed is for some signal to come back to Argo at the end of that chain to indicate that the release worked. In my experience, a failed release can usually be traced back to one of the intermediate points in the chain: either there is no tag in the Purl XML, or the tag is there but was never picked up (though it sounds like EW has different release requirements than SW). But you have to know the release failed before you know to start looking.
*SearchWorks release has the same problem, with an extra bit of complexity for the scenario where a catalog record is involved. I've done some custom scripting against SearchWorks to check on releases for particular projects, but have not audited all 800,000+ releases to see how widespread the problem is.
@thatbudakguy - I think we might want some cross team brainstorming here, as my naive thought is to query solr for each druid, but it might be better to get a signal (a push to a queue??) when a record is indexed for EarthWorks or SearchWorks and then we can track which records never get that signal. I suspect more/better minds might come up with something better as a possible solution for knowing whether or not something gets indexed.
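The "track which records never get that signal" idea might be sketched like this. All names here, and the 24-hour grace period, are illustrative assumptions, not an existing API:

```python
# Sketch: record each release event and each indexing signal, then report
# druids whose release was never followed by a signal within a grace period.
# The 24-hour window and all class/method names are assumptions.
from datetime import datetime, timedelta

GRACE = timedelta(hours=24)  # assumed time allowed for indexing to complete

class ReleaseTracker:
    def __init__(self):
        self.released = {}   # druid -> time of release
        self.indexed = {}    # druid -> time of index signal

    def record_release(self, druid, when):
        self.released[druid] = when

    def record_index_signal(self, druid, when):
        self.indexed[druid] = when

    def unconfirmed(self, now):
        """Druids released more than GRACE ago with no later index signal."""
        out = []
        for druid, released_at in self.released.items():
            signal = self.indexed.get(druid)
            if (signal is None or signal < released_at) and now - released_at > GRACE:
                out.append(druid)
        return sorted(out)
```

Whether the signal arrives via a queue push or a Solr query per druid, the bookkeeping is the same: the interesting set is the druids that stay in `unconfirmed` past the grace period.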
Another stupid thought is that the access indexing code could directly call the workflow service -- we have a gem for that and could help set it up if we knew where in the indexing code to do it, blah blah.
ways to know if the release fails:
After discussing this with the access team, we would like to address this earlier in the data lifecycle if possible:
Some known DRUIDs that failed to make it to EarthWorks: jr116tb9031, bb847xp3575, bq428sj1172, by513wd0839
> After discussing this with the access team, we would like to address this earlier in the data lifecycle if possible
This seems like the best approach. But note that if we stick with the current release implementation, we will need a way to prevent DSA (I think DSA is responsible for this) from writing the data that Argo uses to determine whether an item is released.
For example, both jr116tb9031 and bb847xp3575 have "releaseTags" in the Cocina that show a release status. Argo will show an item as "released" if the most recent releaseTag has a value of release:true, as it does here:
```json
"administrative": {
  "releaseTags": [
    {
      "who": "blalbrit",
      "what": "self",
      "date": "2016-11-22T07:02:11.000+00:00",
      "to": "Searchworks",
      "release": false
    },
    {
      "who": "lkrueger",
      "what": "self",
      "date": "2021-12-28T02:04:49.000+00:00",
      "to": "Earthworks",
      "release": true
    },
    {
      "who": "bergeraj",
      "what": "self",
      "date": "2023-04-14T01:04:35.000+00:00",
      "to": "Earthworks",
      "release": true
    }
  ]
}
```
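The "most recent releaseTag wins" rule described above could be expressed as a small helper. This mirrors the rule as stated in this thread, not Argo's actual implementation:

```python
# Sketch of the "most recent releaseTag wins" rule described above.
# This follows the rule as described in this thread, not Argo's actual code.
def released_to(cocina_administrative: dict, target: str) -> bool:
    """True if the most recent releaseTag for `target` has release: true."""
    tags = [
        t for t in cocina_administrative.get("releaseTags", [])
        if t.get("to") == target
    ]
    if not tags:
        return False
    # ISO-8601 dates in the same offset sort correctly as strings.
    latest = max(tags, key=lambda t: t["date"])
    return bool(latest.get("release"))
```

Against the example above, this returns true for Earthworks (the 2023 tag) and false for Searchworks (the 2016 tag), which is exactly why Argo shows both druids as released even though neither appears in EarthWorks.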
If we know ahead of time that a release will fail, we could prevent the tag from being written. But if the release fails after the tag is written, the tag would have to be modified or discarded when a message comes back (from purl-fetcher/SW/EW/some other method) indicating failure. Otherwise, how would Argo be able to determine when a releaseTag is accurate and when it is not?
Note that a similar consideration applies to the public XML, which will contain a releaseData element that will be true or false. Under some circumstances, only the public XML will indicate release status and the Cocina will not.
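For illustration, checking that element might look like the sketch below. The exact layout (`<releaseData><release to="...">true</release>`) is an assumption based on the description above, not a verified purl schema:

```python
# Sketch of reading release status from a purl public XML document.
# Assumption: releaseData contains one <release> child per target, with
# a "to" attribute and a true/false text value.
import xml.etree.ElementTree as ET

def release_status(public_xml: str, target: str):
    """Return True/False for `target`, or None if no entry exists."""
    root = ET.fromstring(public_xml)
    for rel in root.findall("./releaseData/release"):
        if rel.get("to") == target:
            return (rel.text or "").strip().lower() == "true"
    return None
```

A reconciliation job could compare this against the Cocina releaseTags and flag druids where the two disagree.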
If we're open to reconsidering how the release process has been implemented, other possibilities may open up.
Here is a list of 477 DRUIDs that are released to EarthWorks but do not appear there. I'm not sure if this is all of them, but it should be enough to identify some of the major errors.
I suspect most of these failed due to missing coordinates, such as this one: https://purl.stanford.edu/xf797pf2604
This one has asterisks instead of degree symbols in the coordinate metadata: https://purl.stanford.edu/vz593ws4651
In other cases, coordinates seem to be in a projection other than EPSG:4326: https://purl.stanford.edu/mw737qf7422
I'll try to conduct a larger analysis of the metadata for these and discuss possible remediation strategies with the Maps team.
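A first-pass triage of the failure modes above (missing coordinates, asterisks in place of degree symbols, and out-of-range values suggesting an unconverted projection) might look like this sketch; the input shape, a raw coordinate string plus an optional decimal bounding box, is illustrative:

```python
# Rough triage for the coordinate failure modes described above.
# Assumption: the caller supplies the raw 255/034-derived coordinate string
# and, where available, a decoded (west, south, east, north) bounding box.
def triage(coord_string, bbox=None):
    """Return a list of suspected problems; an empty list means nothing flagged."""
    problems = []
    if not coord_string:
        problems.append("missing coordinates")
        return problems
    if "*" in coord_string:
        problems.append("asterisk instead of degree symbol")
    if bbox is not None:
        west, south, east, north = bbox
        # EPSG:4326 bounds; values far outside suggest projected coordinates
        if not (-180 <= west <= 180 and -180 <= east <= 180):
            problems.append("longitude out of EPSG:4326 range")
        if not (-90 <= south <= 90 and -90 <= north <= 90):
            problems.append("latitude out of EPSG:4326 range")
    return problems
```

This wouldn't catch every bad record, but it would sort the 477 druids into buckets before anyone does manual metadata review.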
Problem: There are at least 334 “problematic” 034 fields in DRMC records in SearchWorks, which, along with other metadata issues, affects what is released to EW. There is currently no way to identify what has or has not been released at a more granular level.
Need: Ability for catalogers/librarians to generate reports of items released to EarthWorks
Why: Maps are made available in EarthWorks when they have geospatial metadata (the 034 and 255 fields in MARC)
- It was not always policy to include these fields when cataloging, so this metadata is often missing and/or contains errors
- When releasing batches of items to EarthWorks, Argo will say that all of the druids were accessioned, even if the geospatial metadata is missing
- This means that there's no straightforward way to ensure that all of the records actually made it over to EarthWorks
- Currently, Peter Crandall has a Python script that scrapes the data and generates a list of the druids that made it to EarthWorks, but that is not a viable long-term solution
Statements: As map librarians, it’s our responsibility to ensure that there are as few errors in the original metadata as possible. However, mistakes happen, so it is important to provide ways to check and make sure that everything was released properly. Otherwise, those materials are not available to patrons/researchers in Earthworks.
Example: The Caroline Batchelor collection was released to EarthWorks, but not all items from the collection made it over, due to missing coordinate metadata. https://earthworks.stanford.edu/catalog/stanford-ct961sj2730