Exclude some synonym types from duplicate_exact_synonym report query

allenbaron commented 9 months ago

Resolves #1175

[x] docs/ have been added/updated
[ ] tests have been added/updated
[x] mvn verify says all tests pass
[ ] mvn site says all JavaDocs correct
[x] CHANGELOG.md has been updated

As discussed in issue #1175, this PR excludes exact synonyms annotated with the abbreviation (OMO:0003000) and acronym (OMO:0003012) synonym type from the duplicate_exact_synonym test of ROBOT report. It also now ignores case when making comparisons.

Exclusion of abbreviations and acronyms should be backward compatible and only incur a time hit for those not using them, while ignoring case will bring up new errors.

I'm happy to change this PR as needed and will attempt to update the documentation and tests if I can (I have no experience with Java).

allenbaron commented 9 months ago

All tests passed but it doesn't look like there's one specifically for the output of the duplicate_exact_synonym query.

After checking that ROBOT built correctly with this PR, I ran some ROBOT report tests (default profile) comparing the current release (1.9.5) and the build based on this PR on both the DO's edit file (with and without acronym annotations) and the uberon.owl file.

All the results came out as expected: some new case-insensitive matches were identified in the DO (oops, 😅) and the acronym exclusion worked as expected. The execution time for 5 runs each can be seen in the plot and summaries below (cpu = user + sys; not showing the run against DO with the acronym annotations added but it was the same as shown below for the doid-edit.owl file before I added them). I think the increased time is reasonable and expect most people won't really notice.

There was one problem I can't explain or fix: the values column of the report output was empty for the robot.jar with the new duplicate_exact_synonym query robot.jar built on this PR (executed from upper dir of this repo as java -jar bin/robot.jar report -i <path_to_owl_file> -o DEL.tsv). It's not a problem with the query; using the query command with the same jar file produced the expected output.

Time Summaries

### CPU Time Summary (user + sys) input | version | cpu_mean | cpu_sd | cpu_min | cpu_max | cpu_diff -- | -- | -- | -- | -- | -- | -- doid-edit | report-1.9.5 | 36.16 | 0.8 | 33.74 | 38.57 | 1 doid-edit | report-dev | 40.35 | 1.16 | 36.87 | 43.84 | 1.12 uberon | report-1.9.5 | 102.32 | 12.23 | 65.62 | 139.02 | 1 uberon | report-dev | 111.5 | 13.91 | 69.78 | 153.23 | 1.09 ### Real Time Summary input | version | real_mean | real_sd | real_min | real_max | real_diff -- | -- | -- | -- | -- | -- | -- doid-edit | report-1.9.5 | 11.57 | 0.18 | 11.02 | 12.13 | 1 doid-edit | report-dev | 12.62 | 0.31 | 11.69 | 13.55 | 1.09 uberon | report-1.9.5 | 36.03 | 0.33 | 35.04 | 37.03 | 1 uberon | report-dev | 45.73 | 1.07 | 42.51 | 48.94 | 1.27

allenbaron commented 9 months ago

I agree that it was a pretty smart way to solve the problem but it wasn't mine. The query is essentially the one @anitacaron shared from Uberon with only slight modification to improve performance and additionally exclude acronym. Having her review it is a good idea.

allenbaron commented 9 months ago

In this query I decided to pass the property up from the subquery to improve performance slightly. In thinking about this just now while working on another query, it occurred to me that this might not work as desired in a specific scenario.

Here's the scenario: An ontology uses both IAO:0000118 and oio:hasExactSynonym and has the same synonym once with each of these properties on different terms. These synonyms would not be identified by this query as errors since they use different properties (false negative). Essentially, passing the property up prevents cross-property identification of duplicate synonyms.

Does this matter?

I've actually been wondering about the inclusion of IAO:0000118. I left it in because it was in the original duplicate_exact_synonym query but I've wondered if it should be removed, since it appears that IAO:0000118 is the annotation property grouping (i.e. parent) for all of the oboInOwl synonym terms, which include broad, narrow and related types. Based on comments I've heard about annotation properties, my assumption is that annotation property hierarchy doesn't mean anything, so maybe it's more about how IAO:0000118 is used in practice?

DO & Uberon don't use IAO:0000118 but RO, BFO, SO, GENO, and a number of other ontologies do appear to use it.

matentzn commented 9 months ago

@allenbaron very good thinking - I totally missed this.

Here is my position:

The query may be a tiny bit more permissive on the case that an entity has both an obo:hasExactSynonym and an IAO:0000118 associated with it. A bit more permissive wont annoy anyone.

The case is not very frequent in the OBO World:

 PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
  PREFIX rh:<http://rdf.rhea-db.org/>

  SELECT DISTINCT ?cls ?x ?y WHERE {
    ?cls <http://purl.obolibrary.org/obo/IAO_0000118> ?x .
    ?cls <http://www.geneontology.org/formats/oboInOwl#hasExactSynonym> ?y .
  } LIMIT 1000

Yields 25 results on Ubergraph and ca. 5128 cases on ontobee, the vast majority of which are some stray EFO classes, and some CHEBI and NCBITaxon classes.

I personally think that if people choose to use both of these properties, they intend to duplicate them for interoperability reasons. So indeed, I believe they should not raise a red flag, as it currently does.

My position is that this "scenario" you identified actually makes the situation better, not worse.

ontodev / robot

Exclude some synonym types from duplicate_exact_synonym report query #1179

Time Summaries