ncbo / bioportal-project

Serves to consolidate (in Zenhub) all public issues in BioPortal
BSD 2-Clause "Simplified" License

ATC: REST API not returning parents for some classes #249

Open jvendetti opened 1 year ago

jvendetti commented 1 year ago

Received a report from end user @piehld that the newly generated CSV file for the ATC ontology (formerly known as UATC) has some empty Parents fields for classes that used to have parents:

I noticed that the "Parents" field is missing for a handful of classes which previously were present in the "UATC" version.

The following examples were provided of classes with missing parents:

http://purl.bioontology.org/ontology/ATC/J05AJ03
http://purl.bioontology.org/ontology/ATC/R05CA02
http://purl.bioontology.org/ontology/ATC/N02BA59
http://purl.bioontology.org/ontology/ATC/R06AA04
http://purl.bioontology.org/ontology/ATC/C02BC
http://purl.bioontology.org/ontology/ATC/B01AX07
http://purl.bioontology.org/ontology/ATC/A06AG20
http://purl.bioontology.org/ontology/ATC/J05AE
http://purl.bioontology.org/ontology/ATC/C01BD06
http://purl.bioontology.org/ontology/ATC/J06AA02

For debugging purposes, I checked the first term in the above list in the UMLS Metathesaurus Browser. The hierarchy pane indicates parents should be present:

Screen Shot 2022-07-28 at 5 42 07 PM

I also did a basic sanity check and looked at the parsing log file for ATC. The latest parsing run shows no errors in the log file.

jvendetti commented 1 year ago

I opened the TTL file that we generated in Protege for the ATC ontology. The hierarchy correctly shows the "dolutegravir" class as a subclass of "Integrase Inhibitors, antiinfectives for systematic use".

Screen Shot 2022-07-28 at 6 14 08 PM

jvendetti commented 1 year ago

If you issue a REST API call to get the children for the "Integrase Inhibitors, antiinfectives for systematic use" class, the API only returns 3 children:

Screen Shot 2022-07-28 at 6 17 58 PM

REST call: https://data.bioontology.org/ontologies/ATC/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FATC%2FJ05AJ/children?display=prefLabel&display_context=false&display_links=false
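For anyone who wants to reproduce this call from a script, here is a minimal sketch of how the URL above can be assembled. The helper name `children_url` is made up for illustration; the host, path, and query parameters are copied from the REST call above. The sketch only builds the URL; actually sending the request also requires a BioPortal API key.

```python
from urllib.parse import quote, urlencode

API_BASE = "https://data.bioontology.org"  # host taken from the REST call above


def children_url(ontology: str, class_iri: str) -> str:
    """Build the /children URL for a class (hypothetical helper)."""
    # safe="" so that ':' and '/' in the class IRI are percent-encoded too.
    encoded = quote(class_iri, safe="")
    query = urlencode(
        {
            "display": "prefLabel",
            "display_context": "false",
            "display_links": "false",
        }
    )
    return f"{API_BASE}/ontologies/{ontology}/classes/{encoded}/children?{query}"


if __name__ == "__main__":
    print(children_url("ATC", "http://purl.bioontology.org/ontology/ATC/J05AJ"))
```

The class IRI has to be encoded as a single path segment, which is why `quote` is called with an empty `safe` set rather than its default.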

mdorf commented 1 year ago

A few additional observations:

  1. The class declarations in the source TTL file for "raltegravir" and "dolutegravir" look identical, with both definitions clearly including the rdfs:subClassOf notations that point to "Integrase Inhibitors, antiinfectives for systematic use" class:
    
    <http://purl.bioontology.org/ontology/ATC/J05AJ01> a owl:Class ;
    skos:prefLabel """raltegravir"""@en ;
    skos:notation """J05AJ01"""^^xsd:string ;
    rdfs:subClassOf <http://purl.bioontology.org/ontology/ATC/J05AJ> ;
    <http://purl.bioontology.org/ontology/ATC/ATC_LEVEL> """5"""^^xsd:string ;
    umls:cui """C1871526"""^^xsd:string ;
    umls:tui """T114"""^^xsd:string ;
    umls:tui """T121"""^^xsd:string ;
    umls:hasSTY <http://purl.bioontology.org/ontology/STY/T114> ;
    umls:hasSTY <http://purl.bioontology.org/ontology/STY/T121> ;
    .

    <http://purl.bioontology.org/ontology/ATC/J05AJ03> a owl:Class ;
    skos:prefLabel """dolutegravir"""@en ;
    skos:notation """J05AJ03"""^^xsd:string ;
    rdfs:subClassOf <http://purl.bioontology.org/ontology/ATC/J05AJ> ;
    <http://purl.bioontology.org/ontology/ATC/ATC_LEVEL> """5"""^^xsd:string ;
    umls:cui """C3253985"""^^xsd:string ;
    umls:tui """T109"""^^xsd:string ;
    umls:tui """T121"""^^xsd:string ;
    umls:hasSTY <http://purl.bioontology.org/ontology/STY/T109> ;
    umls:hasSTY <http://purl.bioontology.org/ontology/STY/T121> ;
    .


Yet, when looking at the properties for each of these classes in the class endpoint in BioPortal, the `rdfs:subClassOf` relationship of "dolutegravir" is not shown:

https://data.bioontology.org/ontologies/ATC/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FATC%2FJ05AJ01?display=prefLabel,properties&no_links=true&no_context=true
Screen Shot 2022-08-04 at 3 01 11 PM

https://data.bioontology.org/ontologies/ATC/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FATC%2FJ05AJ03?display=prefLabel,properties&no_links=true&no_context=true
Screen Shot 2022-08-04 at 3 01 35 PM
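
To compare the source file against the API for more than two classes, the declared parents can be pulled out of the TTL text. This is a rough sketch, not a Turtle parser: it assumes every class block looks like the ones quoted above (subject line ending in `a owl:Class`, block terminated by a `.` on its own line), and the function name `declared_parents` is hypothetical.

```python
import re

# Rough patterns for the class blocks quoted above; this is not a Turtle parser.
SUBJECT_RE = re.compile(r"<([^>]+)>\s+a\s+owl:Class")
PARENT_RE = re.compile(r"rdfs:subClassOf\s+<([^>]+)>")
# Assumption: each block ends with a "." on a line of its own.
BLOCK_END_RE = re.compile(r"^\s*\.\s*$", flags=re.MULTILINE)


def declared_parents(ttl_text):
    """Map each owl:Class IRI to the parent IRIs declared in its block."""
    parents = {}
    for block in BLOCK_END_RE.split(ttl_text):
        subject = SUBJECT_RE.search(block)
        if subject:
            parents[subject.group(1)] = PARENT_RE.findall(block)
    return parents
```

Any class that has a non-empty list here but an empty parents response from the API is a candidate for the problem described in this issue.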
mdorf commented 1 year ago

I re-processed the ATC ontology, and the subclasses now appear correctly.

jvendetti commented 1 year ago

Reopening, as we received a follow up message from @piehld enumerating more classes with missing parents:

http://purl.bioontology.org/ontology/ATC/C02LX -> missing_parent -> http://purl.bioontology.org/ontology/ATC/C02L
http://purl.bioontology.org/ontology/ATC/D01AE54 -> missing_parent -> http://purl.bioontology.org/ontology/ATC/D01AE
http://purl.bioontology.org/ontology/ATC/B06AA03 -> missing_parent -> http://purl.bioontology.org/ontology/ATC/B06AA
http://purl.bioontology.org/ontology/ATC/J06A -> missing_parent -> http://purl.bioontology.org/ontology/ATC/J06
http://purl.bioontology.org/ontology/ATC/S03A -> missing_parent -> http://purl.bioontology.org/ontology/ATC/S03
http://purl.bioontology.org/ontology/ATC/A10AE05 -> missing_parent -> http://purl.bioontology.org/ontology/ATC/A10AE
http://purl.bioontology.org/ontology/ATC/G03AB06 -> missing_parent -> http://purl.bioontology.org/ontology/ATC/G03AB
http://purl.bioontology.org/ontology/ATC/A16AB05 -> missing_parent -> http://purl.bioontology.org/ontology/ATC/A16AB
http://purl.bioontology.org/ontology/ATC/L04AA39 -> missing_parent -> http://purl.bioontology.org/ontology/ATC/L04AA
http://purl.bioontology.org/ontology/ATC/A02BA51 -> missing_parent -> http://purl.bioontology.org/ontology/ATC/A02BA
http://purl.bioontology.org/ontology/ATC/C01BD07 -> missing_parent -> http://purl.bioontology.org/ontology/ATC/C01BD
http://purl.bioontology.org/ontology/ATC/A06AC53 -> missing_parent -> http://purl.bioontology.org/ontology/ATC/A06AC
http://purl.bioontology.org/ontology/ATC/A01AD06 -> missing_parent -> http://purl.bioontology.org/ontology/ATC/A01AD
http://purl.bioontology.org/ontology/ATC/C10AB02 -> missing_parent -> http://purl.bioontology.org/ontology/ATC/C10AB
http://purl.bioontology.org/ontology/ATC/M01AE01 -> missing_parent -> http://purl.bioontology.org/ontology/ATC/M01AE

I spot-checked the first class in the list and confirmed that the REST API returned an empty set for the parents:

https://data.bioontology.org/ontologies/ATC/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FATC%2FC02LX/parents

{  }

As a sanity check, I fully reprocessed ATC in production and cleared the caches. Afterward, I spot-checked all of the above classes in the BioPortal web application, and they all appear as intended in the tree hierarchy. The REST call I listed above now returns parents as well.

Before marking this as resolved, I'd like to hear back from Dennis about whether he's seeing any other anomalies.
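
To repeat the `/parents` spot check over a whole list of classes rather than one at a time, something along these lines could work. It is only a sketch: the function names are hypothetical, the `Authorization: apikey token=...` header is my assumption about how the API key is passed, and the fetch function is injected so the check itself can be exercised without network access.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

API_BASE = "https://data.bioontology.org"  # host taken from the calls above


def parents_url(ontology, class_iri):
    """Build the /parents URL for a class, encoding the IRI as one segment."""
    encoded = quote(class_iri, safe="")
    return f"{API_BASE}/ontologies/{ontology}/classes/{encoded}/parents"


def fetch_json(url, api_key):
    """Fetch and decode a JSON response (assumed auth header format)."""
    request = Request(url, headers={"Authorization": f"apikey token={api_key}"})
    with urlopen(request) as response:
        return json.load(response)


def classes_without_parents(class_iris, fetch, ontology="ATC"):
    """Return the IRIs whose /parents response comes back empty."""
    return [iri for iri in class_iris if not fetch(parents_url(ontology, iri))]
```

Usage would look like `classes_without_parents(iris, lambda url: fetch_json(url, "YOUR_API_KEY"))`, where `iris` holds the class IRIs from the left-hand side of the list above.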

jvendetti commented 1 year ago

@piehld reports:

... there's just one more that appears to be missing its parent:

http://purl.bioontology.org/ontology/ATC/A03CB01 –> missing parent: A03CB

A REST API call confirms this:

https://data.bioontology.org/ontologies/ATC/classes/http%3A%2F%2Fpurl.bioontology.org%2Fontology%2FATC%2FA03CB01/parents

{ }

So, basically, we've processed this ontology three separate times, and each time a small set of classes has ended up with missing parents. The TTL file still doesn't appear to be the root of the issue: I downloaded the latest version, and Protege is able to construct the tree properly:

Screen Shot 2022-08-05 at 2 48 13 PM

There also doesn't seem to be an issue with the owlapi.xrdf intermediary file we generate. It contains the expected subClassOf declaration:

<!-- http://purl.bioontology.org/ontology/UATC/A03CB01 -->

<Class rdf:about="http://purl.bioontology.org/ontology/UATC/A03CB01">
  <rdfs:subClassOf rdf:resource="http://purl.bioontology.org/ontology/UATC/A03CB"/>
</Class>    
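
The same check can be scripted against the intermediary file instead of eyeballing it. The sketch below uses only the standard library and assumes the `Class` elements are in the OWL namespace and carry `rdf:about`, as in the excerpt above; `xrdf_parents` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
OWL_CLASS = f"{{{NS['owl']}}}Class"
RDF_ABOUT = f"{{{NS['rdf']}}}about"
RDF_RESOURCE = f"{{{NS['rdf']}}}resource"


def xrdf_parents(root):
    """Map each named class IRI to its rdfs:subClassOf resource IRIs."""
    parents = {}
    for cls in root.iter(OWL_CLASS):
        iri = cls.get(RDF_ABOUT)
        if iri is None:
            continue  # anonymous class expressions have no rdf:about
        targets = [sub.get(RDF_RESOURCE) for sub in cls.findall("rdfs:subClassOf", NS)]
        # A class may be declared in more than one element, so accumulate.
        parents.setdefault(iri, []).extend(t for t in targets if t)
    return parents
```

For a file on disk, the call would be `xrdf_parents(ET.parse("owlapi.xrdf").getroot())`; classes that map to an empty list have no named parent in the intermediary file.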

Out of curiosity, I reprocessed ATC in our staging environment and wasn't able to reproduce the issue. At this staging URL, you can see the class correctly positioned in the tree:

Screen Shot 2022-08-05 at 3 02 03 PM

jvendetti commented 1 year ago

Information from @piehld about how they're testing:

... the way I've been checking these is through one of our organization's PyPI packages ...

# Install package
pip install rcsb.utils.chemref

# Download the test script
curl -O https://raw.githubusercontent.com/rcsb/py-rcsb_utils_chemref/master/rcsb/utils/tests-chemref/testAtcProvider.py

# Run the test (will refetch latest CSV and perform some processing)
python3 testAtcProvider.py
mdorf commented 1 year ago

On my system (macOS Monterey), the correct install command was:

python3 -m pip install rcsb.utils.chemref
mdorf commented 1 year ago

I got an error running the test script:

 ▲ ~ ▶ python3 testAtcProvider.py
testReadAtcInfo (__main__.AtcProviderTests) ... INFO:rcsb.utils.chemref.AtcProvider:ATC fetch status is True
INFO:rcsb.utils.chemref.AtcProvider:Length of name dictionary 6440
INFO:rcsb.utils.chemref.AtcProvider:Length of parent dictionary 6440
INFO:rcsb.utils.chemref.AtcProvider:ATC cache status True data length 6567 columns ['Class ID', 'Preferred Label', 'Synonyms', 'Definitions', 'Obsolete', 'CUI', 'Semantic Types', 'Parents', 'ATC LEVEL', 'Is Drug Class', 'Semantic type UMLS property'] names 6440 parents 6440
INFO:rcsb.utils.chemref.AtcProvider:nD 6440 pD 6440
ERROR:rcsb.utils.chemref.AtcProvider:Failing for 'A03CB01' with ''
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/rcsb/utils/chemref/AtcProvider.py", line 110, in getIdLineage
    pt = self.__atcD["parents"][pt]
KeyError: ''
INFO:root:length of tree list 6441
ok

----------------------------------------------------------------------
Ran 1 test in 0.230s

OK
piehld commented 1 year ago

Hi @mdorf, thanks for all your work on this. Yes, that's the error that indicated to me that A03CB01 is still missing its parent.
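
For reference, the empty `Parents` values behind that `KeyError` can also be listed straight from the CSV, without going through the lineage code. This is a small sketch that assumes the column names printed in the log above (`Class ID`, `Preferred Label`, `Parents`); the function name is hypothetical.

```python
import csv


def rows_missing_parents(csv_lines):
    """Yield (class_id, label) for rows whose Parents column is empty.

    csv_lines can be an open text file or any iterable of CSV lines.
    """
    for row in csv.DictReader(csv_lines):
        if not (row.get("Parents") or "").strip():
            yield row["Class ID"], row.get("Preferred Label", "")
```

Top-level classes may legitimately have no parent in the file, so the output is a list of candidates to review rather than a list of confirmed errors.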

mdorf commented 1 year ago

I downloaded the ATC TTL file from our Staging server, where ATC appears to have parsed properly. I created a new submission of ATC in production using that file and kicked off its processing. After the processing completed and our internal caches cleared, the Python script yielded this result:

▲ ~ ▶ python3 testAtcProvider.py
testReadAtcInfo (__main__.AtcProviderTests) ... INFO:rcsb.utils.chemref.AtcProvider:ATC fetch status is True
INFO:rcsb.utils.chemref.AtcProvider:Length of name dictionary 6440
INFO:rcsb.utils.chemref.AtcProvider:Length of parent dictionary 6440
INFO:rcsb.utils.chemref.AtcProvider:ATC cache status True data length 6567 columns ['Class ID', 'Preferred Label', 'Synonyms', 'Definitions', 'Obsolete', 'CUI', 'Semantic Types', 'Parents', 'ATC LEVEL', 'Is Drug Class', 'Semantic type UMLS property'] names 6440 parents 6440
INFO:rcsb.utils.chemref.AtcProvider:nD 6440 pD 6440
INFO:root:length of tree list 6440
ok

----------------------------------------------------------------------
Ran 1 test in 0.251s

OK

 ▲ ~ ▶

Does that mean it ran successfully, or could there be other cases that the script isn't catching?
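
One independent way to look for cases the script might not flag is to compare each recorded parent with the parent implied by the ATC code itself, since every example in this thread follows the standard code structure (A03CB01 under A03CB, J06A under J06, and so on). The sketch below assumes a plain dict mapping each ATC code to its recorded parent code, with an empty string when none was found; both function names are hypothetical, and codes that don't follow the standard lengths are skipped.

```python
# ATC code lengths by level: 1 (A), 3 (A03), 4 (A03C), 5 (A03CB), 7 (A03CB01).
# Maps the length of a code to the length of its parent code.
PARENT_LENGTH = {3: 1, 4: 3, 5: 4, 7: 5}


def expected_parent(code):
    """Return the parent code implied by the ATC code, or None if unknown."""
    length = PARENT_LENGTH.get(len(code))
    return code[:length] if length else None


def mismatched_parents(parents):
    """Yield (code, recorded, expected) where the recorded parent differs."""
    for code, recorded in parents.items():
        expected = expected_parent(code)
        if expected is not None and recorded != expected:
            yield code, recorded, expected
```

An empty result from `mismatched_parents` would suggest that no class in the dict is missing or misplacing its parent, at least for codes that follow the standard structure.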

piehld commented 1 year ago

@mdorf This looks great, thank you! I am getting the same result too.

mdorf commented 1 year ago

I loaded the version of ATC.ttl that resulted in missing parents into our Staging server and re-processed it there. I then pointed the Python script (testAtcProvider.py) to the staging server by hacking my local copy of its underlying library, AtcProvider.py. The result appears to be positive:

△ ~ ▶ python3 testAtcProvider.py
testReadAtcInfo (__main__.AtcProviderTests) ... INFO:rcsb.utils.chemref.AtcProvider:ATC fetch status is True
INFO:rcsb.utils.chemref.AtcProvider:Length of name dictionary 6440
INFO:rcsb.utils.chemref.AtcProvider:Length of parent dictionary 6440
INFO:rcsb.utils.chemref.AtcProvider:ATC cache status True data length 6567 columns ['Class ID', 'Preferred Label', 'Synonyms', 'Definitions', 'Obsolete', 'CUI', 'Semantic Types', 'Parents', 'ATC LEVEL', 'Is Drug Class', 'Semantic type UMLS property'] names 6440 parents 6440
INFO:rcsb.utils.chemref.AtcProvider:nD 6440 pD 6440
INFO:root:length of tree list 6440
ok

----------------------------------------------------------------------
Ran 1 test in 0.218s

OK