jokergoo / simona

Semantic Similarity in Bio-Ontologies
https://jokergoo.github.io/simona/

Error: Too many cyclic paths (> 1000). #8

Open bschilder opened 4 months ago

bschilder commented 4 months ago

When trying to import the unified phenotype ontology (UPHENO), I run into some errors about Too many cyclic paths (> 1000). Is there something different about this ontology? If so, is there a way it can still be imported by simona?

Thanks in advance!

command:

dag <- simona::import_ontology("https://purl.obolibrary.org/obo/upheno/v2/upheno.owl")

output:

Parsing [Term] sections in the obo file [175632/175632]
remove 1822 obsolete terms
There are more than one root:
  APO:0000001, APO:0000006, APO:0000018, BFO:0000001, BTO:0000000,
    and other 47 terms ...
  A super root (~~all~~) is added.
Error: Too many cyclic paths (> 1000).

Running with the remove_cyclic_paths and/or remove_rings args doesn't seem to help (same error).

dag <- simona::import_ontology("https://purl.obolibrary.org/obo/upheno/v2/upheno.owl", remove_cyclic_paths = TRUE, remove_rings = TRUE)

Session info

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] simona_1.1.14

loaded via a namespace (and not attached):
 [1] ComplexHeatmap_2.18.0 jsonlite_1.8.8        Polychrome_1.5.1      compiler_4.3.1        rjson_0.2.21         
 [6] crayon_1.5.2          promises_1.3.0        Rcpp_1.0.12           Biobase_2.62.0        xml2_1.3.6           
[11] parallel_4.3.1        later_1.3.2           cluster_2.1.6         IRanges_2.36.0        png_0.1-8            
[16] fastmap_1.1.1         mime_0.12             R6_2.5.1              igraph_2.0.3          shape_1.4.6.1        
[21] httr2_1.0.1           BiocGenerics_0.48.1   iterators_1.0.14      GetoptLong_1.0.5      circlize_0.4.16      
[26] shiny_1.8.1.1         RColorBrewer_1.1-3    rlang_1.1.3           httpuv_1.6.15         rols_2.30.2          
[31] GlobalOptions_0.1.2   doParallel_1.0.17     cli_3.6.2             magrittr_2.0.3        digest_0.6.35        
[36] foreach_1.5.2         grid_4.3.1            rstudioapi_0.16.0     xtable_1.8-4          rappdirs_0.3.3       
[41] lifecycle_1.0.4       clue_0.3-65           S4Vectors_0.40.2      scatterplot3d_0.3-44  glue_1.7.0           
[46] codetools_0.2-20      stats4_4.3.1          colorspace_2.1-0      matrixStats_1.3.0     tools_4.3.1          
[51] pkgconfig_2.0.3       htmltools_0.5.8.1   
bschilder commented 4 months ago

Using another distribution of UPHENO gives a different set of errors.

command (setting remove_cyclic_paths = TRUE, remove_rings = TRUE doesn't make a difference):

dag <- simona::import_ontology("https://github.com/obophenotype/upheno/raw/master/upheno.owl", remove_cyclic_paths = TRUE, remove_rings = TRUE)

output:

Downloading https://github.com/obophenotype/upheno/raw/master/upheno.owl...
Converting file8e3674551214_upheno.owl to the obo format.
  '/usr/bin/java'  -jar '/private/var/folders/rd/rbc_wrdj4k3djf3brk6z0_dc0000gp/T/RtmpIPZrCZ/robot_temp_8e3627754717.jar' convert --input '/private/var/folders/rd/rbc_wrdj4k3djf3brk6z0_dc0000gp/T/RtmpIPZrCZ/file8e3674551214_upheno.owl' --format obo --output '/var/folders/rd/rbc_wrdj4k3djf3brk6z0_dc0000gp/T//RtmpIPZrCZ/file8e367d161ab6.obo.gz' --check false
Error in import_obo(output, verbose = verbose, ...) : 
  Cannot find any [Term].

I should mention that in previous versions of simona (sometime before the changes implemented to address #6) I was able to import this same ontology successfully.

bschilder commented 2 months ago

Hi @jokergoo this still seems to be an issue. Would really appreciate your help in fixing this.

Thanks, Brian

jokergoo commented 2 months ago

For this error:

Error: Too many cyclic paths (> 1000).

I make the function exit when the number of cyclic paths exceeds 1000, because I assume the DAG structure is wrong in that case. How do you think such scenarios should be handled?

And https://github.com/obophenotype/upheno/raw/master/upheno.owl is empty.

jokergoo commented 2 months ago

I think I used an inefficient way to count cyclic paths. The method used in the package is: if a subset of terms is cyclically connected, I count every possible combination of paths among them that forms a cycle. This can produce a very large number, for example if 5 terms are completely connected.
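
To give a feel for how quickly such a count can grow, here is a small base-R sketch. It is not simona's counting code; it only counts the distinct simple directed cycles among n terms under the assumption that every pair of terms is linked in both directions. The function name `count_simple_cycles` is made up for this illustration.

```r
# Number of distinct simple directed cycles among n completely connected terms.
# A cycle of length k picks one of choose(n, k) term subsets, and each subset
# can be arranged in (k - 1)! distinct cyclic orders.
count_simple_cycles <- function(n) {
  k <- 2:n
  sum(choose(n, k) * factorial(k - 1))
}

count_simple_cycles(5)  # 84
count_simple_cycles(7)  # 2365, already above the 1000 limit
```

Under this assumption a single tightly connected group of seven terms would be enough to pass the 1000 threshold, even though only a handful of links are actually wrong.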

jokergoo commented 2 months ago

And it seems there are quite a lot of duplications in the .obo converted from upheno.owl (the conversion is what import_ontology() does internally).

If you use the owl parser I wrote in the package import_owl(), it can be successfully imported:

> dag = import_owl("~/upheno.owl")
Parsing 421 <owl:ObjectProperty> ...
remove 9 obsolete terms
Parsing 176174 <owl:Class> ...
Parsing 147847 <rdf:Description> ...
remove 2411 obsolete terms
There are more than one root:
  APO:0000001, APO:0000006, APO:0000018, BFO:0000001, BTO:0000000,
    and other 60 terms ...
  A super root (~~all~~) is added.
> dag
An ontology_DAG object:
  Source: http://ontology.com/someuri.owl,
  173643 terms / 328749 relations
  Root: ~~all~~
  Terms: APO:0000001, APO:0000002, APO:0000003, APO:0000004, ...
  Max depth: 41
  Avg number of parents: 1.89
  Avg number of children: 1.95
  Aspect ratio: 730.79:1 (based on the longest distance from root)
                2354:1 (based on the shortest distance from root)
  Relations: is_a

With the following columns in the metadata data frame:
  id, short_id, name, namespace, definition

I will check why import_ontology() produces so many duplications, because import_ontology() is the suggested way to import general ontologies in practice.

bschilder commented 2 months ago

Thanks for looking into this @jokergoo

Maybe @nicolevasilevsky and @matentzn would have some insights as to whether this is expected.

I should also mention that there's the OBO Library URL for this same ontology, but it seems to have the same issue.

dag <- simona::import_ontology(  "https://purl.obolibrary.org/obo/upheno/v2/upheno.owl", remove_cyclic_paths = TRUE, remove_rings = TRUE)

But as you say, using the import_owl function works:

o=simona::import_owl("https://purl.obolibrary.org/obo/upheno/v2/upheno.owl")

I thought simona::import_ontology simply called the relevant function depending on the input file type, but it seems there's some other differences as well.

jokergoo commented 2 months ago

Now I have updated the package and you can update from GitHub.

The previous way of detecting cyclic paths was: for each node, I go to all its downstream terms and check whether there are cyclic paths (because I assumed cyclic paths are rare in a "correctly formatted ontology"). This means that if A and B are cyclic, i.e. A<->B, and they are both downstream of C and D:

C->...->A<->B
D->...->A<->B

the cyclic path A<->B is reported multiple times: for C, for D, and for the offspring of C and of D.

Now I have improved the code so that A<->B is reported only once.
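
One common way to get "reported only once" behaviour is a depth-first search that marks each term as new, active (on the current path) or done, and records an edge only when it points back to an active term. The base-R sketch below illustrates that idea on the C/D/A/B toy example. It is an assumption about the general technique, not the code in the package, and the names `children` and `find_back_edges` are invented here.

```r
# Toy graph as a named list: term -> children. A<->B is the only cycle,
# and it sits downstream of both C and D.
children <- list(C = "A", D = "A", A = "B", B = "A")

find_back_edges <- function(children) {
  nodes <- union(names(children), unlist(children, use.names = FALSE))
  state <- setNames(rep("new", length(nodes)), nodes)  # new / active / done
  found <- character(0)

  visit <- function(node) {
    state[node] <<- "active"
    for (child in children[[node]]) {
      if (state[child] == "active") {
        # child is on the current DFS path, so node -> child closes a cycle
        found <<- c(found, paste(node, child, sep = " -> "))
      } else if (state[child] == "new") {
        visit(child)
      }
      # "done" children were fully explored before and are not revisited
    }
    state[node] <<- "done"
  }

  for (node in nodes) {
    if (state[node] == "new") visit(node)
  }
  found
}

find_back_edges(children)  # "B -> A", reported once although C and D both reach it
```

The recursion is fine for a toy graph; an ontology with a depth of 41 and more than 170,000 terms would probably call for an iterative or compiled implementation.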

In the following code, upheno.obo is converted from upheno.owl by robot.jar (this is exactly what import_ontology() does internally). I also import with the native import_owl(). You can see that the two dags are not identical, but almost.

# the same as import_ontology("~/upheno.owl")
> dag1 = import_obo("~/upheno.obo", remove_cyclic_paths = TRUE)
Parsing [Typedef] sections in the obo file [443/443]
remove 10 obsolete terms
Parsing [Term] sections in the obo file [175632/175632]
remove 1822 obsolete terms
There are more than one root:
  APO:0000001, APO:0000006, APO:0000018, BFO:0000001, BTO:0000000,
    and other 47 terms ...
  A super root (~~all~~) is added.
Removed 13 cyclic paths.
> dag2 = import_owl("~/upheno.owl")
Parsing 421 <owl:ObjectProperty> ...
remove 9 obsolete terms
Parsing 176174 <owl:Class> ...
Parsing 147847 <rdf:Description> ...
remove 2411 obsolete terms
There are more than one root:
  APO:0000001, APO:0000006, APO:0000018, BFO:0000001, BTO:0000000,
    and other 60 terms ...
  A super root (~~all~~) is added.
> dag1
An ontology_DAG object:
  Source: http://ontology.com/someuri.owl,
  173674 terms / 328845 relations
  Root: ~~all~~
  Terms: 397443, APO:0000001, APO:0000002, APO:0000003, ...
  Max depth: 41
  Avg number of parents: 1.89
  Avg number of children: 1.95
  Aspect ratio: 732.83:1 (based on the longest distance from root)
                2357.19:1 (based on the shortest distance from root)
  Relations: is_a

With the following columns in the metadata data frame:
  id, short_id, name, namespace, definition
> dag2
An ontology_DAG object:
  Source: http://ontology.com/someuri.owl,
  173643 terms / 328749 relations
  Root: ~~all~~
  Terms: APO:0000001, APO:0000002, APO:0000003, APO:0000004, ...
  Max depth: 41
  Avg number of parents: 1.89
  Avg number of children: 1.95
  Aspect ratio: 730.79:1 (based on the longest distance from root)
                2354:1 (based on the shortest distance from root)
  Relations: is_a

With the following columns in the metadata data frame:
  id, short_id, name, namespace, definition
jokergoo commented 2 months ago

I thought simona::import_ontology simply called the relevant function depending on the input file type, but it seems there's some other differences as well.

This is the ultimate goal. Now the importing functions are not perfect. I will make import_ontology() an "entrance function" after I improve import_obo()/import_owl()/import_ttl().

matentzn commented 2 months ago

Hey both! I know some sources of cycles, but it would be best if you could share a handful of concrete ones (3-10). This way I can tell you concretely why they exist and how to avoid them!

jokergoo commented 2 months ago

Here they are

[CL:0000164 ~ CL:0000163 ~ CL:0000164]
[FBbt:00005106 ~ FBbt:00007693 ~ FBbt:00005106]
[FBbt:00005106 ~ FBbt:00007693 ~ FBbt:00007694 ~ FBbt:00005106]
[FBbt:00005106 ~ FBbt:00007693 ~ FBbt:00007696 ~ FBbt:00005106]
[FBbt:00005106 ~ FBbt:00007693 ~ FBbt:00007696 ~ FBbt:00007695 ~ FBbt:00005106]
[FBbt:00005106 ~ FBbt:00007693 ~ FBbt:00007696 ~ FBbt:00007697 ~ FBbt:00005106]
[FBbt:00005106 ~ FBbt:00007693 ~ FBbt:00007698 ~ FBbt:00005106]
[FBbt:00005106 ~ FBbt:00007693 ~ FBbt:00007698 ~ FBbt:00110127 ~ FBbt:00005106]
[FBbt:00005106 ~ FBbt:00007693 ~ FBbt:00047736 ~ FBbt:00005106]
[FBbt:00048472 ~ FBbt:00005106 ~ FBbt:00007693 ~ FBbt:00048472]
[FBbt:00005106 ~ FBbt:00007693 ~ FBbt:00052132 ~ FBbt:00005106]
[FBbt:00005106 ~ FBbt:00047097 ~ FBbt:00005106]
[CL:0000556 ~ CL:4033018 ~ CL:0000556]

What I do for removing cycles is to remove the last link.
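
As a rough illustration of "remove the last link", the base-R sketch below parses one of the cycle strings above into consecutive links and separates the final one. The function name `drop_last_link` and the column names `from`/`to` are hypothetical, and the direction of each `~` link is not stated in this thread, so the columns only reflect the order in which the terms are printed.

```r
cycle <- "[FBbt:00005106 ~ FBbt:00007693 ~ FBbt:00007694 ~ FBbt:00005106]"

drop_last_link <- function(cycle) {
  # Strip the surrounding brackets and split on the " ~ " separator
  terms <- strsplit(gsub("^\\[|\\]$", "", cycle), " ~ ", fixed = TRUE)[[1]]
  # Consecutive pairs of terms form the links of the cycle
  edges <- data.frame(from = head(terms, -1), to = tail(terms, -1),
                      stringsAsFactors = FALSE)
  list(kept = head(edges, -1), removed = tail(edges, 1))
}

res <- drop_last_link(cycle)
res$removed  # the link between FBbt:00007694 and FBbt:00005106
```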

matentzn commented 2 months ago

These are very surprising cycles. What relationships are you using? Just isa?

jokergoo commented 2 months ago

Only isa. Take [FBbt:00005106 ~ FBbt:00007693 ~ FBbt:00005106] for example:

(the obo file is converted from upheno.owl by robot.jar)

[Term]
id: FBbt:00005106
name: neuron
namespace: fly_anatomy.ontology
def: "Electrically active cell of the nervous system which has an axon and/or dendrite and which is synapsed with other cells and/or has other cells synapsed to it." [https://orcid.org/0000-0002-7073-9172]
def: "Electrically active cell of the nervous system which has an axon and/or dendrite and which is synapsed with other cells and/or has other cells synapsed to it." [FBC:DOS]
subset: cur
subset: EmbDevSlim
subset: FB_gloss
subset: larval_OF
subset: scrnaseq_slim
synonym: "nerve cell" RELATED []
xref: CL:0000540
xref: VFB:FBbt_00005106
is_a: CL:0000540 ! neuron
is_a: FBbt:00007693 {gci_relation="receives_synaptic_input_from_neuron", gci_filler="FBbt:00005124"} ! sensory system neuron
is_a: FBbt:00007694 {gci_filler="FBbt:00005918", gci_relation="receives_synaptic_input_from_neuron"} ! thermosensory system neuron
is_a: FBbt:00007695 {gci_filler="FBbt:00005927", gci_relation="receives_synaptic_input_from_neuron"} ! gustatory system neuron
is_a: FBbt:00007696 {gci_relation="receives_synaptic_input_from_neuron", gci_filler="FBbt:00005922"} ! chemosensory system neuron
is_a: FBbt:00007697 {gci_relation="receives_synaptic_input_from_neuron", gci_filler="FBbt:00005926"} ! olfactory system neuron
is_a: FBbt:00007698 {gci_filler="FBbt:00005919", gci_relation="receives_synaptic_input_from_neuron"} ! mechanosensory system neuron
is_a: FBbt:00047097 {gci_relation="develops_from", gci_filler="FBbt:00047097"} ! primary neuron
is_a: FBbt:00047736 {gci_relation="receives_synaptic_input_from_neuron", gci_filler="FBbt:00004211"} ! visual system neuron
is_a: FBbt:00048472 {gci_relation="receives_synaptic_input_from_neuron", gci_filler="FBbt:00048471"} ! nociceptive system neuron
is_a: FBbt:00052132 {gci_relation="receives_synaptic_input_from_neuron", gci_filler="FBbt:00051439"} ! proprioceptive system neuron
is_a: FBbt:00100318 ! somatic cell
is_a: FBbt:00110127 {gci_relation="receives_synaptic_input_from_neuron", gci_filler="FBbt:00007684"} ! auditory system neuron
intersection_of: CL:0000540 ! neuron
intersection_of: part_of NCBITaxon:7227 ! Drosophila melanogaster
relationship: part_of FBbt:00005093 ! nervous system
property_value: IAO:0000589 "neuron (drosophila)" xsd:string

and

[Term]
id: FBbt:00007693
name: sensory system neuron
namespace: fly_anatomy.ontology
def: "Any neuron (FBbt:00005106) that capable of part of some sensory perception (GO:0007600)." [FBC:MMC]
xref: VFB:FBbt_00007693
is_a: FBbt:00005106 ! neuron
intersection_of: FBbt:00005106 ! neuron
intersection_of: capable_of_part_of GO:0007600 ! sensory perception
relationship: part_of FBbt:00007692 ! sensory system
matentzn commented 2 months ago

😱 oooooooooo.. Nooooo! 😱

is_a: FBbt:00007693 {gci_relation="receives_synaptic_input_from_neuron", gci_filler="FBbt:00005124"} ! sensory system neuron

This is not an isa relation. Scaaary, it makes total sense for someone without deep knowledge of OBO that is building a parser to stumble across this. This is a GCI! It means something like (sorry I will definitely say this wrong, but you get the picture):

"every neuron that receives synaptic input from a sensory neuron is a sensory system neuron"

I am sure @dosumis will tell me off for relating this wrongly, but the point is: this is not a regular isa relation!

EDIT:

Solution:

If an isa line in the obo contains any key starting with the word "gci", it's probably best to ignore that isa relation.
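
A minimal base-R sketch of that rule, applied to a few lines taken from the FBbt:00005106 stanza quoted earlier. This is only an illustration of the filtering idea, not what simona does internally; the variable names are invented, and the regular expression assumes the qualifiers sit in a single `{...}` block on the same line.

```r
term_lines <- c(
  "id: FBbt:00005106",
  "name: neuron",
  "is_a: CL:0000540 ! neuron",
  "is_a: FBbt:00007693 {gci_relation=\"receives_synaptic_input_from_neuron\", gci_filler=\"FBbt:00005124\"} ! sensory system neuron",
  "is_a: FBbt:00100318 ! somatic cell"
)

is_isa  <- grepl("^is_a:", term_lines, perl = TRUE)
# TRUE when a {...} qualifier block contains a key starting with "gci"
has_gci <- grepl("\\{[^}]*\\bgci", term_lines, perl = TRUE)

# Keep only plain is_a lines and extract the parent term ID
plain_isa <- term_lines[is_isa & !has_gci]
parents   <- sub("^is_a:\\s*(\\S+).*$", "\\1", plain_isa, perl = TRUE)
parents  # "CL:0000540" "FBbt:00100318"
```

With the GCI line skipped, FBbt:00005106 no longer lists FBbt:00007693 as a parent, so the FBbt:00005106 ~ FBbt:00007693 cycle shown above cannot form.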

jokergoo commented 2 months ago

@matentzn Thanks for this information! I am not in this field and I mainly work with Gene Ontology.

Do you know or have any resource introducing such "gci"- or related things (link, book, articles)? Then I can integrate and improve this tool.

jokergoo commented 2 months ago

Package updated.

jokergoo commented 2 months ago

id: FBbt:00005106
name: neuron
is_a: FBbt:00007693 {gci_relation="receives_synaptic_input_from_neuron", gci_filler="FBbt:00005124"} ! sensory system neuron

"every neuron that receives synaptic input from a sensory neuron is a sensory system neuron"

Ha, now I understand it. Thanks!

matentzn commented 2 months ago

Do you know or have any resource introducing such "gci"- or related things (link, book, articles)? Then I can integrate and improve this tool.

The best place I can point you to is the spec (https://owlcollab.github.io/oboformat/doc/obo-syntax.html#5), in particular section 5.2.2.

My general sense is that it would be better to either invest in the RDFXML reader, or imo even better, in an obographs-json reader. The obographs-json format is easier to interpret for downstream users than the OBO format. However, I guess practically both are needed...

nicolevasilevsky commented 2 months ago

@matentzn An introduction to GCIs might be a good OBO Academy topic.

dosumis commented 2 months ago

I'd prefer it if OBO conversion didn't try to represent these. It can still be lossless by storing untranslated axioms in the header. I don't know if that's an option in our pipelines though.

I also second use of OBO graphs JSON over custom OBO parsers - if that's an option here