jokergoo / simona

Semantic Similarity in Bio-Ontologies
https://jokergoo.github.io/simona/

Missing ontology terms #6

Open bschilder opened 6 months ago

bschilder commented 6 months ago

Hi again!,

I've noticed something a bit strange when importing ontologies as ontology_DAG objects. There seem to be some terms that are available on the OLS but are missing when I import the file with simona.

ont <- simona::import_ontology("http://purl.obolibrary.org/obo/uberon.owl")
sum(grepl("UBERON:0001155",ont@terms))
# [1] 0

I can confirm both are pulling from the same remote OWL file. https://www.ebi.ac.uk/ols4/ontologies/uberon

I can also confirm that the term is searchable and not deprecated on OLS: https://www.ebi.ac.uk/ols4/ontologies/uberon/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0001155

This doesn't seem to be specific to UBERON, as I've noticed similar issues with CL. https://www.ebi.ac.uk/ols4/ontologies/cl

Do you have an idea of what might be going on here?

Thanks!, Brian

> sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
  [1] RColorBrewer_1.1-3        ggdendro_0.1.23          
  [3] rstudioapi_0.15.0         jsonlite_1.8.8           
  [5] shape_1.4.6               magrittr_2.0.3           
  [7] GlobalOptions_0.1.2       fs_1.6.3                 
  [9] vctrs_0.6.5               memoise_2.0.1.9000       
 [11] ggtree_3.10.0             rstatix_0.7.2            
 [13] gh_1.4.0                  htmltools_0.5.7          
 [15] progress_1.2.3            curl_5.2.0               
 [17] broom_1.0.5               gridGraphics_0.5-1       
 [19] htmlwidgets_1.6.4         httr2_1.0.0              
 [21] lubridate_1.9.3           plotly_4.10.4            
 [23] cachem_1.0.8              networkD3_0.4            
 [25] igraph_2.0.1.1            mime_0.12                
 [27] lifecycle_1.0.4           iterators_1.0.14         
 [29] pkgconfig_2.0.3           Matrix_1.6-5             
 [31] R6_2.5.1                  fastmap_1.1.1            
 [33] shiny_1.8.0               clue_0.3-65              
 [35] digest_0.6.34             aplot_0.2.2              
 [37] colorspace_2.1-0          ggnewscale_0.4.10        
 [39] patchwork_1.2.0           S4Vectors_0.40.2         
 [41] rprojroot_2.0.4           grr_0.9.5                
 [43] ggpubr_0.6.0              timechange_0.3.0         
 [45] fansi_1.0.6               httr_1.4.7               
 [47] KGExplorer_0.99.0         abind_1.4-5              
 [49] compiler_4.3.1            here_1.0.1               
 [51] bit64_4.0.5               withr_3.0.0              
 [53] doParallel_1.0.17         backports_1.4.1          
 [55] orthogene_1.9.1           carData_3.0-5            
 [57] viridis_0.6.5             homologene_1.4.68.19.3.27
 [59] dendextend_1.17.1         maps_3.4.2               
 [61] ggsignif_0.6.4            MASS_7.3-60.0.1          
 [63] rappdirs_0.3.3            rjson_0.2.21             
 [65] scatterplot3d_0.3-44      piggyback_0.1.5          
 [67] tools_4.3.1               ape_5.7-1                
 [69] httpuv_1.6.14             glue_1.7.0               
 [71] rols_2.30.0               nlme_3.1-164             
 [73] promises_1.2.1            grid_4.3.1               
 [75] cluster_2.1.6             generics_0.1.3           
 [77] gtable_0.3.4              tidyr_1.3.1              
 [79] data.table_1.15.0         hms_1.1.3                
 [81] tidygraph_1.3.1           xml2_1.3.6               
 [83] car_3.1-2                 utf8_1.2.4               
 [85] BiocGenerics_0.48.1       foreach_1.5.2            
 [87] pillar_1.9.0              stringr_1.5.1            
 [89] yulab.utils_0.1.4         babelgene_22.9           
 [91] pals_1.9                  later_1.3.2              
 [93] circlize_0.4.15           dplyr_1.1.4              
 [95] treeio_1.26.0             lattice_0.22-5           
 [97] bit_4.0.5                 tidyselect_1.2.0         
 [99] ComplexHeatmap_2.18.0     gitcreds_0.1.2           
[101] gridExtra_2.3             IRanges_2.36.0           
[103] stats4_4.3.1              Biobase_2.62.0           
[105] matrixStats_1.2.0         visNetwork_2.1.2         
[107] stringi_1.8.3             yaml_2.3.8               
[109] lazyeval_0.2.2            ggfun_0.1.4              
[111] codetools_0.2-19          tibble_3.2.1             
[113] ggplotify_0.1.2           Polychrome_1.5.1         
[115] cli_3.6.2                 xtable_1.8-4             
[117] munsell_0.5.0             dichromat_2.0-0.1        
[119] Rcpp_1.0.12               mapproj_1.2.11           
[121] gprofiler2_0.2.2          png_0.1-8                
[123] parallel_4.3.1            simona_1.0.10            
[125] ellipsis_0.3.2            ggplot2_3.4.4            
[127] prettyunits_1.2.0         viridisLite_0.4.2        
[129] tidytree_0.4.6            scales_1.3.0             
[131] purrr_1.0.2               crayon_1.5.2             
[133] GetoptLong_1.0.5          rlang_1.1.3              
[135] rvest_1.0.3
bschilder commented 6 months ago

Regarding robot

Ok, so something else I've noticed: setting the path to robot myself (via this function, which downloads robot from https://github.com/ontodev/robot/releases) yields different results than running simona::import_ontology twice in a row (the first time with an error). I didn't realize that simona sets the path of robot after it fails the first time.

Cell Ontology example

Here's another example from the Cell Ontology.

Attempt 1

ont <- simona::import_ontology("http://purl.obolibrary.org/obo/cl.owl")
Parsing [Term] sections in the obo file [15950/15950]
remove 187 obsolete terms
There are more than one root:
  BFO:0000002, BFO:0000003, CL:0000015, CL:0000019, CL:0000021,
    and other 222 terms ...
  A super root (~~all~~) is added.
[CHEBI:36080 ~ PR:000000001 ~ CHEBI:36080]
Error: Found isolated rings (one path is listed above). Set `remove_rings = TRUE` to remove them.

Attempt 2

Here I download robot and set the path to it myself.

KGExplorer:::get_ontology_robot()
ont <- simona::import_ontology("http://purl.obolibrary.org/obo/cl.owl", remove_rings=TRUE)
grep("CL:0002494",ont@terms)
# [1] 0
Parsing [Term] sections in the obo file [15950/15950]
remove 187 obsolete terms
There are more than one root:
  BFO:0000002, BFO:0000003, CL:0000015, CL:0000019, CL:0000021,
    and other 222 terms ...
  A super root (~~all~~) is added.
Removed 749 terms in isolated rings.

But cardiocytes (CL:0002494) is indeed a term in the current CL: https://www.ebi.ac.uk/ols4/ontologies/cl/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FCL_0002494?lang=en

Attempt 3

This time, I'll try using the method of running simona::import_ontology twice.


try({
ont <- simona::import_ontology("http://purl.obolibrary.org/obo/cl.owl", remove_rings=TRUE)
})
ont <- simona::import_ontology("http://purl.obolibrary.org/obo/cl.owl", remove_rings=TRUE)
grep("CL:0002494",ont@terms)
# [1] 1213
Parsing [Term] sections in the obo file [15950/15950]
remove 187 obsolete terms
There are more than one root:
  BFO:0000002, BFO:0000003, CL:0000015, CL:0000019, CL:0000021,
    and other 222 terms ...
  A super root (~~all~~) is added.
Removed 749 terms in isolated rings.

Success!

So it seems my attempts to avoid the initial error with simona::import_ontology are actually causing more problems than they resolve. Would it be possible to have simona::import_ontology detect and install robot without producing an error the first time?

bschilder commented 6 months ago

That said, I'm still noticing missing terms when using the OBO file directly from the CL GitHub: https://github.com/obophenotype/cell-ontology/releases

ont <- simona::import_ontology("https://github.com/obophenotype/cell-ontology/releases/download/v2024-02-13/cl-base.obo")
"CL:0002494" %in% ont@terms
# FALSE
Parsing [Term] sections in the obo file [2925/2925]
remove 186 obsolete terms
There are more than one root:
  CL:0000000, CL:0000014, CL:0000015, CL:0000018, CL:0000019,
    and other 188 terms ...
  A super root (~~all~~) is added.
bschilder commented 5 months ago

Any idea of what might be going on here @jokergoo? I'm in the process of publishing several papers that revolve around the use of simona and want to make sure there aren't any issues before we move forward.

jokergoo commented 5 months ago

The error of the path of robot.jar has been fixed. I just forgot to update the variable which saves the path after robot.jar is downloaded.

For the missing terms, that was a stupid bug. The code should have been something like x[l], but I wrote x[!l], so many terms were missing.
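To illustrate the kind of mistake described here, the following is a toy sketch and not simona's actual code. It assumes a logical vector `l` that marks the entries to keep; `x[l]` keeps those entries, while `x[!l]` silently returns the complement. The vectors `x` and `l` are made up for illustration.

```r
# Toy data: term IDs and a logical mask marking the ones to keep
# (both vectors are hypothetical, not taken from simona).
x <- c("CL:0000001", "CL:0000002", "CL:0000003", "CL:0000004")
l <- c(TRUE, FALSE, TRUE, TRUE)

kept_correct <- x[l]    # intended: entries where l is TRUE
kept_buggy   <- x[!l]   # negated mask: returns the complement instead

print(kept_correct)
print(kept_buggy)
```

Because both expressions return a valid character vector, nothing fails at runtime; the only symptom is that expected terms are absent from the result.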

Both bugs are now fixed. Please update from GitHub.

> ont = import_obo("~/Downloads/cl-base.obo")
Parsing [Typedef] sections in the obo file [5/5]
remove 2 obsolete terms
Parsing [Term] sections in the obo file [2925/2925]
remove 186 obsolete terms
There are more than one root:
  CL:0000006, CL:0000034, CL:0000039, CL:0000048, CL:0000056,
    and other 102 terms ...
  A super root (~~all~~) is added.
> "CL:0002494" %in% ont@terms
[1] TRUE

And

> ont = import_ontology("http://purl.obolibrary.org/obo/cl.owl", remove_rings=TRUE)
`robot_jar` was not set. Download `robot.jar` from GitHub...
trying URL 'https://github.com/ontodev/robot/releases/download/v1.9.5/robot.jar'
Content type 'application/octet-stream' length 92575534 bytes (88.3 MB)
==================================================
downloaded 88.3 MB

Downloading http://purl.obolibrary.org/obo/cl.owl...
Converting file2fb870b88517_cl.owl to the obo format.
  '/usr/bin/java'  -jar '/private/var/folders/g3/f2y6rp510nxf3t5sj6h902bc0000gr/T/Rtmp4CqZ8g/robot_temp_2fb835ce2469.jar' convert --input '/private/var/folders/g3/f2y6rp510nxf3t5sj6h902bc0000gr/T/Rtmp4CqZ8g/file2fb870b88517_cl.owl' --format obo --output '/var/folders/g3/f2y6rp510nxf3t5sj6h902bc0000gr/T//Rtmp4CqZ8g/file2fb86229e2d1.obo.gz' --check false
Parsing [Typedef] sections in the obo file [315/315]
remove 2 obsolete terms
Parsing [Term] sections in the obo file [15950/15950]
remove 187 obsolete terms
There are more than one root:
  CL:0000006, CL:0000034, CL:0000037, CL:0000039, CL:0000048,
    and other 337 terms ...
  A super root (~~all~~) is added.
> grep("CL:0002494",ont@terms)
[1] 592
bschilder commented 5 months ago

Awesome! I'll try it out, thanks

jokergoo commented 5 months ago

Just wait, I found another bug...

jokergoo commented 5 months ago

Just found I haven't considered the following tag in the obo file:

intersection_of
jokergoo commented 5 months ago

I would say, the obo/owl formats are more complex than I thought... I am not an expert in this field. I worked with GO very often but not with other ontologies.

It seems that intersection_of does not provide the subclass information, according to the EBI OLS website. If you use the cl.obo file rather than cl-base.obo (or the corresponding .owl file), all the subclasses will be there.

Currently, there are the following three ways to process ontology files.

  1. import_obo(): directly process the .obo file
  2. import_ontology(): if the input is .owl, it calls robot.jar to internally convert it to .obo, then uses import_obo() to import it.
  3. import_owl(): I have some R code which can directly parse the XML file.
> ont1 = import_obo("~/Downloads/cl.obo", remove_cyclic_paths = TRUE, remove_rings = TRUE)
> ont2 = import_ontology("~/Downloads/cl.owl", remove_cyclic_paths = TRUE, remove_rings = TRUE)
> ont3 = import_owl("~/Downloads/cl.owl", remove_cyclic_paths = TRUE, remove_rings = TRUE)

And if you only restrict the DAG object to CL terms:

> ont1 = dag_filter(ont1, terms = grep("^CL:", dag_all_terms(ont1), value = TRUE))
> ont2 = dag_filter(ont2, terms = grep("^CL:", dag_all_terms(ont2), value = TRUE))
> ont3 = dag_filter(ont3, terms = grep("^CL:", dag_all_terms(ont3), value = TRUE))

With the new GitHub version, you can also filter by namespace:

> ont1 = dag_filter(ont1, namespace = "CL")
> ont2 = dag_filter(ont2, namespace = "CL")
> ont3 = dag_filter(ont3, namespace = "CL")

Then the three objects ont1, ont2 and ont3 are basically the same:

> ont1
An ontology_DAG object:
  Source: cl, releases/2024-02-13
  2731 terms / 3812 relations
  Root: CL:0000000
  Terms: CL:0000000, CL:0000001, CL:0000005, CL:0000006, ...
  Max depth: 14
  Avg number of parents: 1.40
  Avg number of children: 1.42
  Aspect ratio: 35.91:1 (based on the longest distance from root)
                68.4:1 (based on the shortest distance from root)
  Relations: is_a

With the following columns in the metadata data frame:
  id, short_id, name, namespace, definition
> ont2
An ontology_DAG object:
  Source: cl, releases/2024-02-13
  2731 terms / 3812 relations
  Root: CL:0000000
  Terms: CL:0000000, CL:0000001, CL:0000005, CL:0000006, ...
  Max depth: 14
  Avg number of parents: 1.40
  Avg number of children: 1.42
  Aspect ratio: 35.91:1 (based on the longest distance from root)
                68.4:1 (based on the shortest distance from root)
  Relations: is_a

With the following columns in the metadata data frame:
  id, short_id, name, namespace, definition
> ont3
An ontology_DAG object:
  Source: Cell Ontology, 2024-02-13
  2731 terms / 3813 relations
  Root: CL:0000000
  Terms: CL:0000000, CL:0000001, CL:0000005, CL:0000006, ...
  Max depth: 14
  Avg number of parents: 1.40
  Avg number of children: 1.42
  Aspect ratio: 35.91:1 (based on the longest distance from root)
                68.4:1 (based on the shortest distance from root)
  Relations: is_a

With the following columns in the metadata data frame:
  id, short_id, name, namespace, definition

You still need to update the package from GitHub. I made some small changes.

bschilder commented 5 months ago

Thanks for the updates @jokergoo. I'm also not an expert in constructing/parsing ontologies, but I use them quite a lot myself. cc'ing some people from Monarch/HPO with more expertise than myself who might be able to help guide you. @cmungall @pnrobinson @matentzn

matentzn commented 5 months ago

"intersection_of" is OBO syntax for equivalent class statements. Better not to handle these if you don't know exactly what they mean (it does mean "AND"): you can assume that all the intersection_of lines of a term together correspond to one big equivalent class statement joined by AND. Lucky for you, technically, you can use this as an is_a, but this is really not what general tools should be doing.

There is a big push in OBO to make sure that the x-base.obo/owl files include all subclass statements, not just x.owl/obo. This is not yet true though for all ontologies, but it is for CL and Mondo for example.
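As a rough sketch of this advice (not how simona parses OBO files), the snippet below reads a made-up [Term] stanza and uses only the asserted is_a lines as parent edges, keeping the intersection_of lines separate. The stanza, its IDs and the helper `tag_values()` are hypothetical.

```r
# A made-up [Term] stanza in OBO syntax (all IDs are hypothetical).
stanza <- c(
  "[Term]",
  "id: EX:0000003",
  "name: example cell",
  "is_a: EX:0000001 ! parent cell",
  "intersection_of: EX:0000002 ! other cell",
  "intersection_of: part_of EX:0000009 ! some structure"
)

# Hypothetical helper: extract the values of one tag and drop
# the trailing "! comment" part.
tag_values <- function(lines, tag) {
  prefix <- paste0("^", tag, ": ")
  hits <- grep(prefix, lines, value = TRUE)
  vals <- sub(prefix, "", hits)
  trimws(sub("\\s*!.*$", "", vals))
}

is_a_parents  <- tag_values(stanza, "is_a")             # asserted subclass edges
intersections <- tag_values(stanza, "intersection_of")  # parts of one equivalence axiom

# Only the asserted is_a edges are used as DAG parents in this sketch.
dag_parents <- is_a_parents
print(dag_parents)
print(intersections)
```

Keeping the two tag types in separate vectors makes it an explicit decision, rather than a parsing side effect, whether intersection_of is ever used.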

cmungall commented 5 months ago

Nico is correct, you can ignore intersection_of.


bschilder commented 4 months ago

Hey @jokergoo, thanks again for the updates. Just tried using the current dev version of simona and running into a couple issues. I think it's related to some of the changes meant to address the above issues.

no slot of name "alternative_terms" for this object of class "ontology_DAG"

First, the latest version of simona doesn't seem to be backward compatible with ontology_DAG objects created with older versions. Using functions like simona::shortest_distances_via_NCA on older objects gives the error:

Error in term_to_node_id(dag, terms, strict = FALSE) : 
  no slot of name "alternative_terms" for this object of class "ontology_DAG"

Would be nice to gracefully handle older objects by not using slots that don't exist in the object.

Missing hasAlternativeId

Overall I'm getting far fewer missing IDs with the dev version of simona, which is great! But there are still a couple of cases where this comes up, such as "CL:0000111" in the Cell Ontology. This is listed under hasAlternativeId: https://ols.monarchinitiative.org/ontologies/upheno_patterns/terms?iri=http://purl.obolibrary.org/obo/CL_2000032

Is hasAlternativeId a slot that's available in the OBO/OWL file? If so, is it something you could consider in your mapping?

jokergoo commented 4 months ago

@bschilder Did you use the version from GitHub?

> dag = import_obo("~/workspace/ontology/OBOFoundry/cl/cl-basic.obo")
> dag@alternative_terms["CL:0000111"]
  CL:0000111
"CL:2000032"
> term_to_node_id(dag, "CL:0000111")
[1] 2393
> term_to_node_id(dag, "CL:2000032")
[1] 2393

Using functions like simona::shortest_distances_via_NCA on older objects

You cannot use it on older objects because the definition of the ontology_DAG class has been changed. You need to regenerate it.

bschilder commented 4 months ago

@bschilder Did you use the version from GitHub?

Yes, but it looks like you've made some additional changes since I last installed. Currently you're on 1.1.14 and I was using 1.1.13. I've just updated to the newer version.

Not sure where to find the exact version of the cl-basic.obo you're using, but here's an example that uses a version we can both access:

ont <- simona::import_ontology("http://purl.obolibrary.org/obo/cl/releases/2024-04-05/cl.owl", remove_cyclic_paths = TRUE, remove_rings = TRUE)

In my original report, this is how I was checking whether the term was available.

"CL:0000111" %in% ont@terms # FALSE
"CL:0000111" %in% ont@alternative_terms # FALSE
"CL:0000111" %in% names(ont@alternative_terms) # TRUE

But it seems @terms only includes the main IDs, not the alternative IDs. Is that intentional? Is there some unified way to grab all IDs, or do you recommend using unique(c(ont@terms, names(ont@alternative_terms))) to get the complete list? I have some use cases where I filter input terms to only those that the ontology_DAG will recognize, to avoid throwing errors.
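A minimal sketch of that filtering step, using toy stand-ins for ont@terms and ont@alternative_terms so that it runs without simona. It assumes the layout shown earlier in this thread: the names of alternative_terms are the alternative IDs and the values are the main IDs. Since unique() takes a single vector, the two sets are combined with c() first. The variable names and the unknown ID "CL:9999999" are illustrative.

```r
# Toy stand-ins for ont@terms and ont@alternative_terms.
terms <- c("CL:0000000", "CL:2000032")
alternative_terms <- c("CL:0000111" = "CL:2000032")  # names = alt IDs, values = main IDs

# Every ID the object should recognize: main IDs plus alternative IDs.
all_ids <- unique(c(terms, names(alternative_terms)))

# Keep only recognized input IDs ("CL:9999999" is a made-up unknown ID).
input <- c("CL:0000111", "CL:0000000", "CL:9999999")
recognized <- input[input %in% all_ids]

# Map alternative IDs onto their main IDs; main IDs pass through unchanged.
is_alt <- recognized %in% names(alternative_terms)
resolved <- recognized
resolved[is_alt] <- unname(alternative_terms[recognized[is_alt]])

print(recognized)
print(resolved)
```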

I can confirm that other downstream functions are able to use the alt IDs. So things are looking good in this regard!

simona::shortest_distances_via_NCA(ont, terms = "CL:0000111")


term_to_node_id isn't an exported function in simona. I think this is coming from an internal function accessible with simona:::term_to_node_id(). Is that correct?

> dag = import_obo("~/workspace/ontology/OBOFoundry/cl/cl-basic.obo")
> dag@alternative_terms["CL:0000111"]
  CL:0000111
"CL:2000032"
> term_to_node_id(dag, "CL:0000111")
[1] 2393
> term_to_node_id(dag, "CL:2000032")
[1] 2393

Using functions like simona::shortest_distances_via_NCA on older objects

You cannot use it on older objects because the definition of the ontology_DAG class has been changed. You need to regenerate it.

Ok, but if that's the case then it would be good to return an error that lets users know this. Otherwise, it's not obvious what the issue is. I only figured it out because I'm involved in this thread. Another option would be to provide a way to update old ontology_DAG objects to the new version. I imagine these sorts of issues may pop up again as simona changes over time.
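Both options could look roughly like the sketch below. It uses a toy S4 class, `toy_DAG`, in place of ontology_DAG, and the helpers `check_dag_version()` and `upgrade_dag()` are hypothetical names, not simona functions. The only real API relied on is `methods::.hasSlot()`, which tests whether an object actually carries a given slot.

```r
library(methods)

# Toy class standing in for an older ontology_DAG definition (hypothetical).
setClass("toy_DAG", slots = c(terms = "character"))
old_obj <- new("toy_DAG", terms = c("CL:0000000", "CL:2000032"))

# Redefine the class with an extra slot, as a newer package version might.
setClass("toy_DAG", slots = c(terms = "character",
                              alternative_terms = "character"))

# Hypothetical helper: fail with a clear message when the slot is missing.
check_dag_version <- function(obj) {
  if (!.hasSlot(obj, "alternative_terms")) {
    stop("This object was created with an older class definition; ",
         "please regenerate it with the current version.", call. = FALSE)
  }
  invisible(TRUE)
}

# Hypothetical helper: rebuild an old object under the new definition.
upgrade_dag <- function(obj) {
  if (.hasSlot(obj, "alternative_terms")) return(obj)
  new("toy_DAG", terms = obj@terms, alternative_terms = character(0))
}

msg <- tryCatch(check_dag_version(old_obj),
                error = function(e) conditionMessage(e))
new_obj <- upgrade_dag(old_obj)
print(msg)
```

An upgrade helper like this can only fill new slots with empty defaults; any information that was never stored in the old object (such as alternative IDs) would still require re-importing the ontology.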