sul-dlss / argo

The administrative discovery interface for Stanford's Digital Object Registry

Some DRUIDs show as released to EarthWorks but do not appear in EarthWorks #3893

Open kimdurante opened 2 years ago

kimdurante commented 2 years ago

Problem: There are at least 334 “problematic” 034 fields in DRMC records in SearchWorks, which has an impact on what is released to EW, along with other metadata issues. There is currently no way to identify what has been released or not on a more specific level.

Need: Ability for catalogers/librarians to generate reports of items released to EarthWorks

Why: Maps are made available in EarthWorks when they have geospatial metadata (these are the 034 and 255 fields in MARC)

It was not always policy to include these fields in the metadata when cataloging, so often this metadata is missing and/or contains errors

When releasing batches of items to EarthWorks, Argo will say that all of the druids were accessioned, even if the geospatial metadata is missing

This means that there’s no straightforward way to ensure that all of the records actually made it over to EarthWorks

As a current workaround, Peter Crandall has created a Python script to scrape the data and generate a list of the druids that made it to EarthWorks, but that is not a viable long-term solution

Statements: As map librarians, it’s our responsibility to ensure that there are as few errors in the original metadata as possible. However, mistakes happen, so it is important to provide ways to check and make sure that everything was released properly. Otherwise, those materials are not available to patrons/researchers in EarthWorks.

Example: The Caroline Batchelor collection was released to EarthWorks, but all items from the collection were not released due to missing coordinate metadata. https://earthworks.stanford.edu/catalog/stanford-ct961sj2730

thatbudakguy commented 2 years ago

@kimdurante would this possibly be resolved by https://github.com/sul-dlss/dor_indexing_app/issues/846 and the new Argo update that allows filtering for items that were not released to SearchWorks/EarthWorks?

kimdurante commented 2 years ago

Hi @thatbudakguy. I saw this and it looks like a promising fix. I have notified the maps team as well and we are going to test it out. Thanks!

thatbudakguy commented 1 year ago

per the GIS team meeting on 11/28:

The Argo facet doesn't solve this because it just gives a report of druids for which the release workflow ran — we have GIS data that doesn't show up in EarthWorks despite the workflow running. This is probably due to bad data, but finding it is largely dependent on this issue.

Peter Crandall has a script for web scraping that compares druids to determine which ones are actually available in EarthWorks, which is the current workaround — obviously not ideal.
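
For reference, the comparison described above could be sketched roughly as follows. This is not Peter's script. It is a minimal illustration that assumes a released item gets a catalog page at the `stanford-<druid>` URL pattern seen in the example link earlier in this thread, and that an item that never arrived answers with HTTP 404. The function names are made up for illustration, and the page check is passed in as a parameter so the comparison can be exercised without touching the network.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable

# URL pattern taken from the example link in this thread.
EARTHWORKS_CATALOG = "https://earthworks.stanford.edu/catalog/stanford-{druid}"


def earthworks_url(druid: str) -> str:
    """Build the catalog URL for a druid, with or without a 'druid:' prefix."""
    return EARTHWORKS_CATALOG.format(druid=druid.removeprefix("druid:"))


def missing_from_earthworks(
    druids: Iterable[str], exists: Callable[[str], bool]
) -> list[str]:
    """Return the druids whose catalog page is reported absent by `exists`."""
    return [d for d in druids if not exists(earthworks_url(d))]


def http_exists(url: str) -> bool:
    """Assumption: a record that never reached EarthWorks answers with HTTP 404."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # anything else (500, 403, ...) is not evidence either way
```

Calling `missing_from_earthworks(druids, http_exists)` with the list of druids that Argo reports as released would then give the ones to investigate.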

thatbudakguy commented 1 year ago

moved from earthworks to argo pending further discussion about how to address this.

andrewjbtw commented 1 year ago

Thanks for bringing this to the Argo repo. I think Argo would be the best place to produce a report for the reasons discussed above, namely the already-existing facet that is supposed to represent a release. What I think we will need is better two-way communication between systems. In fact, I don't think there's any two-way communication when a release happens right now. Argo only gets a signal that the Infrastructure systems involved in a release have completed their steps, but no signal that the release actually led to the item appearing in a discovery system.*

As best I know, the release process is carried out through a chain of steps that finally leads to the discovery system on the other end:

  1. User initiates release from Argo
  2. releaseWF runs (details of what happens here actually differ if the release is on a whole collection or an item)
  3. the "publish" step in the releaseWF updates the Purl for the item to be released with an XML tag that says the item is released. This is the point where SDR systems hand off to access/discovery systems. I think it may also be when Argo gets the signal that the release happened (via an update to the Argo solr index?).
  4. another system (purl-fetcher?) reads the newly-updated purl, sees the release tag, and then does something that leads the item to appear in the discovery system
  5. the item ultimately appears in the discovery system between 20 and 60 minutes after the release is run, if all goes well

What I think is needed is for some signal to come back to Argo at the end of that chain to indicate that the release worked. In my experience, a failed release can be traced back to one of the intermediate points in the chain: usually either there is no tag on the Purl XML, or the tag is there but was never picked up (though it sounds like EW has different release requirements than SW). But you have to know the release failed to know that you should start looking.
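
To make the "trace it back through the chain" idea concrete, here is a small sketch of how the evidence for one release could be walked in order to find the first point where it broke. The field names and messages are hypothetical, and nothing here says how each piece of evidence would actually be gathered from the systems involved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReleaseEvidence:
    """What we know about one release attempt (all field names are hypothetical)."""
    workflow_completed: bool    # releaseWF ran to completion
    purl_has_release_tag: bool  # the Purl XML carries the release tag
    picked_up_downstream: bool  # the downstream system read the updated Purl
    visible_in_discovery: bool  # the item shows up in SW/EW


def first_failed_step(evidence: ReleaseEvidence) -> Optional[str]:
    """Return a description of the earliest failed step, or None if all passed."""
    checks = [
        (evidence.workflow_completed, "releaseWF did not complete"),
        (evidence.purl_has_release_tag, "no release tag on the Purl XML"),
        (evidence.picked_up_downstream, "release tag was never picked up downstream"),
        (evidence.visible_in_discovery, "item not visible in the discovery system"),
    ]
    for passed, message in checks:
        if not passed:
            return message
    return None
```

The last check is the one Argo has no signal for today, which is why the earlier ones only get looked at once someone notices a missing item.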

*SearchWorks release has the same problem, with an extra bit of complexity for the scenario where a catalog record is involved. I've done some custom scripting against SearchWorks to check on releases for particular projects, but have not audited all 800,000+ releases to see how widespread the problem is.

ndushay commented 1 year ago

@thatbudakguy - I think we might want some cross team brainstorming here, as my naive thought is to query solr for each druid, but it might be better to get a signal (a push to a queue??) when a record is indexed for EarthWorks or SearchWorks and then we can track which records never get that signal. I suspect more/better minds might come up with something better as a possible solution for knowing whether or not something gets indexed.

Another stupid thought is that the access indexing code could directly call the workflow service -- we have a gem for that and could help set it up if we knew where in the indexing code to do it, blah blah.
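
A rough sketch of the "track which records never get that signal" idea, only to show the bookkeeping involved. The class and method names are invented, the queue itself is left out, and the 60-minute grace period is simply the upper end of the 20 to 60 minute window mentioned earlier in this thread.

```python
from datetime import datetime, timedelta, timezone


class ReleaseTracker:
    """Remember releases and forget them once an 'indexed' signal arrives."""

    def __init__(self, grace: timedelta = timedelta(minutes=60)):
        self.grace = grace
        self._pending = {}  # (druid, target) -> time the release was run

    def released(self, druid: str, target: str, at: datetime) -> None:
        """Record that a release was run for this druid and target."""
        self._pending[(druid, target)] = at

    def indexed(self, druid: str, target: str) -> None:
        """Record the signal (e.g. a queue message) that the item was indexed."""
        self._pending.pop((druid, target), None)

    def overdue(self, now: datetime) -> list:
        """Releases that have waited longer than the grace period with no signal."""
        return sorted(
            key for key, at in self._pending.items() if now - at > self.grace
        )


# Example: two releases, only one of which is ever confirmed as indexed.
t0 = datetime(2023, 4, 14, 1, 0, tzinfo=timezone.utc)
tracker = ReleaseTracker()
tracker.released("jr116tb9031", "Earthworks", t0)
tracker.released("ct961sj2730", "Earthworks", t0)
tracker.indexed("ct961sj2730", "Earthworks")
unconfirmed = tracker.overdue(t0 + timedelta(hours=2))
```

Whatever is still in `overdue` after the grace period is the report the map librarians are asking for.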

ndushay commented 1 year ago

ways to know if the release fails:

thatbudakguy commented 8 months ago

After discussing this with the access team, we would like to address this earlier in the data lifecycle if possible:

  1. https://github.com/sul-dlss/cocina-models/issues/662
  2. https://github.com/sul-dlss/cocina-models/issues/663
  3. https://github.com/sul-dlss/searchworks_traject_indexer/issues/1299

thatbudakguy commented 8 months ago

Some known DRUIDs that failed to make it to Earthworks: jr116tb9031 bb847xp3575 bq428sj1172 by513wd0839

andrewjbtw commented 8 months ago

> After discussing this with the access team, we would like to address this earlier in the data lifecycle if possible

This seems like the best approach. But note that what will be needed if we stick with the current release implementation is a way to prevent DSA (I think DSA is responsible for this) from writing the data that Argo uses to determine if an item is released.

For example, both jr116tb9031 and bb847xp3575 have "releaseTags" in the Cocina that show a release status. Argo will show an item as "released" if the most recent releaseTag has a value of release:true, as it is here:

"administrative": {
    "releaseTags": [
      {
        "who": "blalbrit",
        "what": "self",
        "date": "2016-11-22T07:02:11.000+00:00",
        "to": "Searchworks",
        "release": false
      },
      {
        "who": "lkrueger",
        "what": "self",
        "date": "2021-12-28T02:04:49.000+00:00",
        "to": "Earthworks",
        "release": true
      },
      {
        "who": "bergeraj",
        "what": "self",
        "date": "2023-04-14T01:04:35.000+00:00",
        "to": "Earthworks",
        "release": true
      }
    ]

If that release will fail and we know it will fail, we could prevent the tag from being written. But if the release fails after the tag is written, then the tag would have to be modified or discarded if a message came back (from purl-fetcher/SW/EW/some other method) indicating failure. Otherwise, how would Argo be able to determine when a releaseTag is accurate and when it is not?
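
To illustrate why the tag alone is not enough, here is a minimal sketch of the "most recent releaseTag wins" rule applied to tags shaped like the ones above. It is not Argo's actual code. Grouping tags by their `to` value and comparing that value case-insensitively are my assumptions, and the function name is made up.

```python
from datetime import datetime
from typing import Optional


def latest_release(tags: list, target: str) -> Optional[bool]:
    """Return the `release` value of the newest tag for `target`, or None.

    Assumes each tag is a dict with "to", "date" (ISO 8601) and "release" keys,
    as in the Cocina excerpt above.
    """
    matching = [tag for tag in tags if tag["to"].lower() == target.lower()]
    if not matching:
        return None  # never tagged for this target
    newest = max(matching, key=lambda tag: datetime.fromisoformat(tag["date"]))
    return newest["release"]
```

For the tags shown above this returns `True` for Earthworks, even though the item never appeared there, because nothing in the tag records the outcome of the release.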

Note that a similar consideration applies to the public XML, which will contain a releaseData element that will be true or false. Under some circumstances, only the public XML will indicate release status and the Cocina will not.

If we're open to reconsidering how the release process has been implemented, other possibilities may open up.

kimdurante commented 8 months ago

Here is a list of 477 DRUIDs that are released to EarthWorks but do not appear. I'm not sure if this is all of them but it should be enough to identify some of the major errors.

I suspect most of these failed due to missing coordinates such as this one: https://purl.stanford.edu/xf797pf2604

This one has asterisks instead of degree symbols in the coordinate metadata: https://purl.stanford.edu/vz593ws4651

In other cases, coordinates seem to be in a projection other than EPSG:4326: https://purl.stanford.edu/mw737qf7422
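
The three failure types above could be screened for with checks along these lines. This is only a sketch. It assumes the coordinate statement is available as a plain string and, for the range check, that a bounding box has already been parsed into decimal numbers. Both function names are invented. Values outside the valid longitude/latitude ranges suggest a projected coordinate system, but in-range values do not prove the data is in EPSG:4326.

```python
import re
from typing import Optional


def coordinate_problems(statement: Optional[str]) -> list:
    """List obvious problems in a coordinate statement string."""
    if statement is None or not statement.strip():
        return ["missing coordinates"]
    problems = []
    # A digit followed by '*' is taken to mean '*' was typed instead of the degree sign.
    if re.search(r"\d\s*\*", statement):
        problems.append("asterisk in place of degree symbol")
    return problems


def bbox_outside_wgs84(west: float, south: float, east: float, north: float) -> bool:
    """True if any bound falls outside valid longitude/latitude degrees."""
    longitudes_ok = all(-180 <= lon <= 180 for lon in (west, east))
    latitudes_ok = all(-90 <= lat <= 90 for lat in (south, north))
    return not (longitudes_ok and latitudes_ok)
```

Running something like this over the metadata before release would flag most of the druids in the list, assuming missing or malformed coordinates are the main cause.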

I'll try to conduct a larger analysis of the metadata for these and discuss possible strategies for fixing them with the Maps team.

druids.csv