monarch-initiative / mondo-ingest

Coordinating the mondo-ingest with external sources
https://monarch-initiative.github.io/mondo-ingest/
5 stars 3 forks source link

Handling multiple `config/*property-map*.sssom.tsv`s #482

Closed joeflack4 closed 5 months ago

joeflack4 commented 5 months ago

Overview

There are multiple config/*property-map*.sssom.tsv files. De-proliferate these.

Sub-task list

Sub-task details

2. Implement solution

a. Single file (if allowable, though there may be some weird ROBOT issue prohibiting it) b. Merge all mappings into 1 file, and have a second, empty file (to get around ROBOT issue) c. Keep property-map-1 and property-map-2 as-is, and then add other ones, e.g.icd11foundation-property-map file into one of these. One disadvantage of that is this.

Additional details about solutions: https://github.com/monarch-initiative/mondo-ingest/pull/434#discussion_r1544921463

joeflack4 commented 5 months ago

Context: this comment

@twhetzel I'm making this one high priority just because it is such a quick change to make.

joeflack4 commented 5 months ago

And also one advantage of splitting them up is that we can drop --allow-missing-entities true, and if the source changes and a mapping is no longer there, we'll know about it because it will error.

joeflack4 commented 5 months ago

There's also the possible problem that if we merge them into 1 file and leave the other one blank, then when ROBOT fixes this problem, it might get annoyed by the blank file? Maybe not.

joeflack4 commented 5 months ago

We've decided this issue is a bit of a distraction. We can always re-open it if needed, especially if ROBOT fixes the issue where 2 property maps / more than one are arbitrarily required.

joeflack4 commented 5 months ago

Nico wrote:

there cant be a single one due to some problems with ROBOT rename

@matentzn BTW, I couldn't find an issue for this in the robot GitHub.If there's not one there, should I make one?

matentzn commented 5 months ago

If you like, the issue would go something like this:

Add --allow-duplicate-entities as option in ROBOT rename.

In many pipelines, especially ETL, we want to unify a few different properties like their:label, skos:prefLabel, wikidata:name to a single one, eg rdfs:label. This is currently not possible as the rename command fails hard multiple properties are mapped to a single one. The reverse is obviously bogus (renaming 1 property to many), but it would be nice if we could bypass this precaution with an optional flag, --allow-duplicate-entities, similar as to how we treat missing entities --allow-missing-entities.

joeflack4 commented 4 months ago

I may be following you, but let me check.

Cases:

  1. Many -> 1
  2. 1 -> Many

Are you saying that --allow-duplicate-entities is something that would enable (1) or (2)? It seems like it'd be for (2). But our problem with mondo-ingest involves (1). So wouldn't I need to create 2 GitHub issues, one for each of these cases?

Example files from mondo-ingest It looks like for mondo-ingest, we only have case (1). We're trying to map multiple fields to oboInOwl:source and oboInOwl:hasExactSynonym. And that's why we split into two files.

property map TSV files

`property-map-1.sssom.tsv`: ```tsv subject_id object_id http://purl.org/dc/elements/1.1/source oboInOwl:source http://www.w3.org/2004/02/skos/core#prefLabel http://www.w3.org/2000/01/rdf-schema#label http://www.w3.org/2004/02/skos/core#altLabel oboInOwl:hasExactSynonym ``` `property-map-2.sssom.tsv`: ```tsv subject_id object_id http://www.ebi.ac.uk/efo/definition http://purl.obolibrary.org/obo/IAO_0000115 http://www.ebi.ac.uk/efo/reason_for_obsolescence rdfs:comment http://www.ebi.ac.uk/efo/alternative_term oboInOwl:hasExactSynonym http://purl.obolibrary.org/obo/ECO_0000218 oboInOwl:source ```

So for mondo-ingest, if (1) were to be addressed, then the following could be changed to a single line:

        rename --mappings config/property-map-1.sssom.tsv --allow-missing-entities true \
        rename --mappings config/property-map-2.sssom.tsv --allow-missing-entities true \

--->

        rename --mappings config/property-map.sssom.tsv --allow-missing-entities true \
matentzn commented 4 months ago

Are you saying that --allow-duplicate-entities is something that would enable (1) or (2)? It seems like it'd be for (2). But our problem with mondo-ingest involves (1). So wouldn't I need to create 2 GitHub issues, one for each of these cases?

No it would only enable (1). Many different labels properties unify to a single rdfs:label property. The reverse makes no sense, so no need for a specific issue on this (Imagine you have rdfs:label and rename it to 3 different label properties - that would be duplicating information).

joeflack4 commented 4 months ago

Yeah that makes sense, I was just confused by what you'd written.

OK, so I can make an issue to enable (1) Many -> 1 situations, via --allow-duplicate-entities. Edit: Created!:


I wonder why should Many -> 1 even be an issue? Why should we have to add such a flag? I guess I figured out why.

Like for example, in mondo-ingest, we need to convert http://purl.org/dc/elements/1.1/source and http://purl.obolibrary.org/obo/ECO_0000218 to oboInOwl:source. I wonder: Why would robot care about that? I suppose it is because it expects that the source being modified/queried by the robot command would be a single, coherent ontology. If we're asking it to do a many -> 1 conversion, then that would indicate that there were 2 different properties in the ontology that were duplicative. But for us, we're re-using these property map files across multiple use cases; something that robot isn't really considering by design.

matentzn commented 4 months ago

Imagine you have an ontology that imports another ontology. The first one uses dcterms:contributor the other one oboInOwl:creator. Now before you create a merged release you want to unify both to dcelements:contributor. That's a use case for ROBoT! Thanks for making the issue!

joeflack4 commented 4 months ago

@matentzn hanks. That's not what I was pondering. I wondered why robot doesn't allow these duplicates by default but I think I pondered the reason.