jar398 / plotter

Utilities for updating an EOL graphdb

Synonym metadata #6

Open jar398 opened 3 years ago

jar398 commented 3 years ago
KatjaSchulz commented 3 years ago

I'm not sure page id should be tied to the identity of a synonym. Ideally, the identity would be constructed from canonical + supplier + taxonID from supplier.

jar398 commented 3 years ago

Hmmm.... I see how this is the case for the relational DB but we're talking here about the graphdb which currently has no supplier taxonIDs at all. We'd have to make up something Node-like to make the association between supplier taxonID and page id. I sort of thought that was out of scope for the graphdb.

I mean, this could be done, but I think it has very large implications for the graphdb.

jhammock commented 3 years ago

Do we have to rely only on the graph to provide the input for synonym IDs?

jar398 commented 3 years ago

We haven't discussed synonym ids or use cases for them. And my use of 'identity' is not the same as 'identifier' (although one could construct an identifier by knowing a record's identity criteria). I would hope the webapp wouldn't have to consult the graph. If identifier construction were deterministic from properties then the identifier formation method could be replicated where needed (webapp for mysql and neo4j, plotter for neo4j, maybe elsewhere).
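
A minimal sketch of what "deterministic from properties" could look like, assuming the identity criteria are the canonical name, the supplier, and the supplier's taxonID as suggested earlier in the thread. The function name, the hash choice, and the truncation length are hypothetical, not something the webapp or plotter is known to do.

```python
import hashlib


def synonym_identifier(canonical: str, supplier: str, taxon_id: str) -> str:
    """Build a reproducible identifier from a synonym's identity criteria."""
    # Collapse runs of whitespace so incidental spacing does not change the result.
    parts = [" ".join(part.split()) for part in (canonical, supplier, taxon_id)]
    # Join with a separator that is unlikely to occur in these fields.
    key = "\x1f".join(parts)
    # SHA-1 and the 16-character truncation are arbitrary assumptions.
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:16]
```

Because the result depends only on the three properties, any component that knows them (webapp, plotter, or something else) could recompute the same identifier without consulting the graph.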

We haven't talked about synonym updates or what to do (in the graph) if a single resource provides multiple synonyms with the same name and page. Use cases welcome.

KatjaSchulz commented 3 years ago

I agree it's a good idea to think about use cases and not get lost in pondering about how things should work in an ideal world. And of course, pageID has to be part of the synonym identity in the graph. If the main purpose of synonyms in the graph is to help API users to access trait data based on synonyms, we can probably get away with simplified synonym identities. Users will undoubtedly encounter ambiguities, but even with detailed synonym metadata we wouldn't be able to resolve all of these cases.

"what to do (in the graph) if a single resource provides multiple synonyms with the same name and page."

We'll definitely encounter those, assuming by name you mean canonical name. I don't have concrete examples on hand right now (wouldn't be too hard to track down), but I know I have seen quite a few lists of synonyms where a canonical occurred multiple times with variations in authorships. I don't think we need to worry about these cases in the graph. If you encounter them, I think it would be ok to treat them as duplicates and ignore all but the first instance.
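
The "ignore all but the first instance" rule could be sketched as follows, assuming each synonym record is a dict with hypothetical `page_id` and `canonical` keys (the real record layout is not specified in this thread).

```python
def drop_duplicate_synonyms(records):
    """Keep only the first record for each (page_id, canonical) pair."""
    seen = set()
    kept = []
    for record in records:
        key = (record["page_id"], record["canonical"])
        if key in seen:
            # Same canonical on the same page, e.g. differing only in authorship.
            continue
        seen.add(key)
        kept.append(record)
    return kept
```

Input order decides which authorship variant survives, so the result is only as stable as the order in which the resource lists its synonyms.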

jar398 commented 3 years ago

@jhammock Do you have use cases for synonyms in the graphdb? I'm guessing -

  1. somebody using cypher for exploration or to make tables does a query based on canonical name and wants to come up with a Page node with that name, which might be a synonym. Then the question is, what do they do if there are multiple such Pages? That is, what additional information would they be in a position to provide to help discriminate between them? - my guess is there is none, they have to suffer with the multiple results somehow. (N.b. hierarchy could be used - 'the mollusc with that name' vs. 'the aster with that name'.)
  2. somebody with a cypher query that lists multiple Pages, and for each Page, they want to know what synonyms are known. What information could we provide to help them get further details describing each synonym? ... [e.g. scientific name, provider, id at provider, node id in DH]

If the provider id or provider node id is given I don't know how someone is expected to make use of that information, since nothing in the graphdb is keyed to it.

jhammock commented 3 years ago

Case 1 is the most likely, I think. If there are multiple matches, I think the other thing the users are likely to be able to provide in order to narrow it down will be ancestry. Does that help, or make it hideous? I think suffering the multiple matches is acceptable if that's life.

Case 2 will also happen, I expect. @KatjaSchulz would know more about what these searchers may want.

KatjaSchulz commented 3 years ago

Case 1: ancestors and descendants of the pages would be the most important criteria to help users choose between multiple options.

Case 2: yes, scientific name, provider, id at provider would be the most important metadata here.

Another related use case for synonyms in the graph would be to allow users to understand/investigate EOL mappings of data if the scientific name in the resource is different from the preferred name of the EOL page.

jar398 commented 2 years ago

There was also discussion in February and April in a google doc.

What I have as work in progress is Name graphdb nodes with rank, canonical name, scientific name, and landmark status, plus relationships :accepted_name and :synonym relating Pages to Names.

Some of the Name properties might be stored redundantly on Page nodes (e.g. the canonical name of an accepted Name could be copied to the canonical property of its Page).
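
For illustration, the two use cases discussed above could be expressed against this design roughly as below. This is a sketch only: the relationship direction (Page to Name) and the property names `canonical`, `scientific_name`, and `page_id` are assumptions, and the helper functions are hypothetical. They just pair a Cypher string with its parameters and do not talk to a database.

```python
# Assumed schema: (:Page)-[:accepted_name|synonym]->(:Name); property names are guesses.
PAGES_FOR_NAME = """
MATCH (p:Page)-[:accepted_name|synonym]->(n:Name {canonical: $canonical})
RETURN DISTINCT p.page_id AS page_id
"""

SYNONYMS_FOR_PAGES = """
MATCH (p:Page)-[:synonym]->(n:Name)
WHERE p.page_id IN $page_ids
RETURN p.page_id AS page_id,
       collect(n.scientific_name) AS synonyms
"""


def pages_for_name(canonical):
    """Case 1: find Pages whose accepted name or a synonym has this canonical."""
    return PAGES_FOR_NAME, {"canonical": canonical}


def synonyms_for_pages(page_ids):
    """Case 2: list the known synonyms for each of several Pages."""
    return SYNONYMS_FOR_PAGES, {"page_ids": list(page_ids)}
```

The first query may return several Pages for one name; narrowing by ancestry, as suggested above, would need an extra hierarchy match that is not shown here.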

Adding more information from the DH DwCA, such as provider designator and provider's id, would be fairly easy. (Harder if providers have to have their own nodes.)

I'm continuing the assumption that there will only ever be a single hierarchy in the graphdb. There needn't be a :supplier for Names since only one resource at a time will supply names. But perhaps I am wrong about this.

jhammock commented 2 years ago

I am with Katja that identity of a Name node ideally would be defined by canonical + supplier + taxonID from supplier. I still don't understand your concern, @jar398 , that "we're talking here about the graphdb which currently has no supplier taxonIDs at all". Is it a problem that this information would need to be fetched from outside the graph?

jar398 commented 2 years ago

As long as the information comes from the DH's DwCA it is easy to include. But the graphdb properties would have to be documented, and we'd probably want to make some effort to ensure that queries and query results continue to work across versions of the DH. For example, if the source/provider designator (occurring in the DwCA) for, say, WoRMS changes, then queries might break or their results may not be interpreted properly, without substantial intervention, because WoRMSv1 is not seen as the same as WoRMSv2. Maybe we'd end up saying buyer beware, but better to say that than nothing. (Changing node ids within a provider is of course something we don't control, but for many of these sources node ids are pretty stable. You do have to have a way to know which source is in play in order to be able to generate and interpret them.)

I don't see a use case where the identity of a Name matters. I was thinking of these as representing what a taxonomist calls a 'name', not a name occurrence or TNU, but something more TNU-like (different Names for one scientificName, one for each source) is OK. It doesn't matter much to me, but if there is any criterion or principle that lets us decide when multiple Name nodes would be an error due to redundancy or contradiction (i.e. an identity criterion), that would be a wonderful thing.

If they're more like TNUs maybe they should be called Tnu nodes?

jar398 commented 2 years ago

Hmm. Here is a typical 'source' value from DH 1.1:

trunk:be97d60f-6568-4cba-92e3-9d068a1a85cf,NCBI:2,WOR:6

This says that this node (Page or synonym) in this DH comes (via smasher) from record be97d60f-6568-4cba-92e3-9d068a1a85cf in source trunk, and that two other records aligned with it. Exploding this to three new graphdb nodes (Tnu nodes maybe) seems useful, if there is documentation somewhere as to what trunk, NCBI, and WOR mean (something like this). Or, we could just keep the first.
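
Splitting such a value into its parts is straightforward. A minimal sketch, assuming the format is always a comma-separated list of `dataset:id` entries as in the example above (the function name is hypothetical):

```python
def parse_source(source):
    """Split a DH 'source' value into (dataset, record_id) pairs, in order."""
    entries = []
    for item in source.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first colon only, in case an id itself contains colons.
        dataset, _, record_id = item.partition(":")
        entries.append((dataset, record_id))
    return entries
```

"Just keep the first" then amounts to taking `parse_source(value)[0]`, while exploding to several nodes would iterate over the whole list.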

The DwCA also seems to have a URL in some cases, e.g. this, but this is not present for the previous example.

IIRC one of the smasher outputs is metadata for the sources. Maybe this could go into the graphdb.

KatjaSchulz commented 2 years ago

The DH DwC-A currently has http://rs.tdwg.org/dwc/terms/datasetID values which are persistent across different versions of the same dataset, e.g., we always use WOR for WoRMS, NCBI for NCBI, etc. The datasetID mappings are in a Gdoc.

For some datasets, we have http://rs.tdwg.org/ac/terms/furtherInformationURL values that usually contain the provider taxonID, but not all records have a value for this field.

Getting provider taxonIDs would be a bit complicated with the current DH DwC-A and actually impossible for most synonym records, unless they have a http://rs.tdwg.org/ac/terms/furtherInformationURL value. But we can change that for the next version of the DH, so that the http://purl.org/dc/terms/source field for all records will have a single entry of the form datasetID:taxonID.

jar398 commented 2 years ago

OK, I see, thanks... in DH1.1 it looks like datasetID is always the same as X when source = X:N,Y:M,... so for now I can extract the id N from the source field.

I think it would be really cool if these Name nodes were actually TNUs in the sense used in the TDWG TNC interest group. To make this happen all we'd need is associated data set version or snapshot or timestamp information in the spreadsheet. (Just naming the resource isn't enough, since that isn't a citation if the resource keeps 'changing' as many do.) All or at least most of these sources provide stable URLs to particular archived versions (example, example), so it wouldn't be difficult to provide these in an additional column.

Regarding your first message in this issue, yes, page id is not part of the identity of a Name, no matter how it's construed. The pair (datasetID, id-in-dataset) would do it.

KatjaSchulz commented 2 years ago

"in DH1.1 it looks like datasetID is always the same as X when source = X:N,Y:M,... so for now I can extract the id N from the source field."

I think that's true for all data sets, except for COL and CLP, which were both derived from Catalogue of Life, and we reference the COL sub-dataset in the datasetID field, e.g.:

taxonID: EOL-000000027296
source: CLP:a7249de82918047da4fd47282a6cb752
furtherInformationURL: http://www.catalogueoflife.org/col/details/species/id/a7249de82918047da4fd47282a6cb752
acceptedNameUsageID: (empty)
parentNameUsageID: EOL-000000027293
scientificName: Obertrumia gracilis Foissner, 1989
higherClassification: Life|Cellular|Eukaryota|SAR|Alveolata|Ciliophora|Intramacronucleata|Nassophorea|Nassulida|Nassulidae|Obertrumia
taxonRank: species
taxonomicStatus: valid
taxonRemarks: (empty)
datasetID: COL-113
canonicalName: Obertrumia gracilis
EOLid: 31259858
EOLidAnnotations: (empty)
Landmark: (empty)

We will clean that up in the next DH version.
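
Until then, code that extracts the provider id from the source field could flag records where the source prefix and datasetID disagree. A hedged sketch, assuming each DwC-A row is available as a dict keyed by column name (the function name and return shape are hypothetical):

```python
def name_identity(row):
    """Return (datasetID, id_in_dataset, prefix_matches) for a DH row."""
    # Only the first entry of the source field is considered here.
    first = row["source"].split(",")[0].strip()
    prefix, _, record_id = first.partition(":")
    # For COL-derived records the prefix (e.g. CLP) differs from datasetID.
    prefix_matches = prefix == row["datasetID"]
    return row["datasetID"], record_id, prefix_matches
```

For the record quoted above this yields datasetID `COL-113` with the CLP record id and `prefix_matches` set to `False`, so such rows can be handled separately rather than silently mislabelled.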

"All or at least most of these sources provide stable URLs to particular archived versions (example example), so it wouldn't be difficult to provide these in an additional column."

Yes, we could provide links to particular dataset versions for the major providers and even for our own patches, since those are versioned on opendata.eol.org. I'll have to do some research to see where people put that information in DwC-A. The source field may actually be the best place for this, but we are already using it for the provider taxonID info. Can you think of an example of a DwC-A resource that has record-by-record dataset version information?

jar398 commented 2 years ago

I'm playing with this, and having graphdb nodes for TNUs may be overkill. We have to figure out how Name nodes interact with hierarchy deltas and how they affect the time required to apply a delta or advance to a new hierarchy version (#4). My understanding of the requirements and proposed update process is pretty dim but I'll try to hash out one or two possible designs so that we have a better basis for discussion.