Closed: johnbradley closed this issue 2 years ago
Problem 1: The descendants/ancestors test started failing at the following line when using the v2-beta2 API:
https://github.com/phenoscape/rphenoscape/blob/531601b13fa80fdb8859252bcaaf3eaee4719abc/tests/testthat/test-classif.R#L61-L63
The problem is that the following line returns FALSE instead of TRUE:
is_descendant("paired fin", c("pelvic fin ray"), includeRels = "part_of")
The paired fin has an IRI of http://purl.obolibrary.org/obo/UBERON_0002534.
The pelvic fin ray has an IRI of http://purl.obolibrary.org/obo/UBERON_4300117.
The production and v2-beta APIs return 500+ records when finding term descendants of http://purl.obolibrary.org/obo/UBERON_0002534 with parts=true:
curl -X GET "https://kb.phenoscape.org/api/v2-beta/term/all_descendants?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0002534&parts=true" -H "accept: application/json"
The v2-beta2 API returns 28 records when finding term descendants of http://purl.obolibrary.org/obo/UBERON_0002534 with parts=true:
curl -X GET "https://kb.phenoscape.org/api/v2-beta2/term/all_descendants?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0002534&parts=true" -H "accept: application/json"
If you have jq installed you can approximate the number of records returned like so:
curl -X GET "https://kb.phenoscape.org/api/v2-beta2/term/all_descendants?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0002534&parts=true" -H "accept: application/json" | jq . | grep -c "@id"
Perhaps the parts argument is always treated as false in v2-beta2?
Problem 2: The corpus_size() function returns 0 for taxa and genes.
> corpus_size("taxa")
[1] 0
This is expected to be at least 100. The Swagger UI seems to have been updated to use v2-beta2, so you can reproduce this issue here: https://kb.phenoscape.org/apidocs/#/Semantic%20similarity/get_similarity_corpus_size
It can also be reproduced from the command line:
$ curl -X GET "https://kb.phenoscape.org/api/v2-beta2/similarity/corpus_size?corpus_graph=http%3A%2F%2Fkb.phenoscape.org%2Fsim%2Ftaxa" -H "accept: application/json"
{"total":0}
Problem 3: The term_freqs() function now fails with a 400 error:
> phens <- get_phenotypes(entity = "basihyal bone")
> term_freqs(phens$id, as = "phenotype", corpus = "taxa")
Error in get_csv_data(pkb_api("/similarity/frequency"), query = query, :
(400) Bad Request: Request is missing required form field 'path'
We are currently passing terms and corpus_graph ("http://kb.phenoscape.org/sim/taxa" or "http://kb.phenoscape.org/sim/genes").
The updated /similarity/frequency API only supports terms and path.
path: SPARQL property path composed of full IRIS. This is used to connect the data resource to count (RDF graph world) to the ontology world. E.g. /
I think the 'E.g. /' is a rendering problem and should be <http://purl.org/phenoscape/vocab.owl#exhibits_state>/<http://purl.org/phenoscape/vocab.owl#describes_phenotype> based on the raw swagger yaml.
I am not sure how to include the corpus (taxa or genes) in the path parameter.
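For reference, a property path of the kind the swagger description asks for is just the full IRIs wrapped in angle brackets and joined with "/". A minimal base R sketch follows; property_path() is a hypothetical helper, not part of rphenoscape, and it does not address how the corpus should be expressed.

```r
# Hypothetical helper (not part of rphenoscape): build a SPARQL property
# path string from a vector of full IRIs.
property_path <- function(iris) {
  paste0("<", iris, ">", collapse = "/")
}

path <- property_path(c(
  "http://purl.org/phenoscape/vocab.owl#exhibits_state",
  "http://purl.org/phenoscape/vocab.owl#describes_phenotype"
))
print(path)

# Percent-encode the whole path before placing it in a GET query string.
encoded <- utils::URLencode(path, reserved = TRUE)
print(encoded)
```

Note that the angle brackets and the "#" must be percent-encoded when the path is sent as a GET query parameter.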
Problem 4: The subsumer_matrix() function now fails with a 500 error:
> subsumer_matrix(c("http://purl.obolibrary.org/obo/UBERON_0000981"))
Error in get_csv_data(pkb_api("/similarity/matrix"), query = queryseq, :
(500) Internal Server Error: There was an internal server error.
The /similarity/matrix endpoint has a new path parameter, but it looks to be optional. The path parameter description from the raw swagger.yaml:
description: SPARQL property path composed of full IRIS. This is used to connect the data resource to count (RDF graph world) to the ontology world. E.g.
<http://purl.org/has_state>/<http://purl.org/describes_phenotype>
I tried hard-coding the path parameter with the example value above. The API returned a mostly empty response:
> subsumer_matrix(c("http://purl.obolibrary.org/obo/UBERON_0000981"))
[1] UBERON_0000981
<0 rows> (or 0-length row.names)
@balhoff Please see the above problems I encountered with v2-beta2.
In addition I noticed something strange in swagger.
Some endpoints that support GET and POST have different parameters.
The GET /similarity/frequency endpoint has parameters terms and path.
The POST /similarity/frequency endpoint has parameters terms and corpus_graph.
I expected both to have the same parameters.
Fix for problem 1: https://github.com/phenoscape/phenoscape-kb-services/pull/472
Fix for swagger issue : https://github.com/phenoscape/phenoscape-kb-services/pull/473
Re: problem 2 and problem 3: parameters have changed for all services related to similarity corpora. Instead of using an IRI to name one, you provide a SPARQL property path for which the subjects are items in the corpus (e.g. taxa), and the objects are the annotations (e.g. phenotype classes). You can also (optionally) provide a specifier_property and specifier_value, which can help select the corpus items when you don't want everything that could be a subject of the provided path.
For the previous "taxa" corpus, use:
path: <http://purl.org/phenoscape/vocab.owl#has_phenotypic_profile>/<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
specifier_property: http://www.w3.org/2000/01/rdf-schema#isDefinedBy
specifier_value: http://purl.obolibrary.org/obo/vto.owl
For the previous "genes" corpus, use:
path: <http://purl.org/phenoscape/vocab.owl#has_phenotypic_profile>/<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
specifier_property: http://www.w3.org/1999/02/22-rdf-syntax-ns#type
specifier_value: http://purl.obolibrary.org/obo/vto.owl#AnnotatedGene
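On the client side, the old corpus names could be mapped to these values with a small lookup. The sketch below is only an illustration: corpus_params() is a hypothetical helper, and it assumes the three values listed for each corpus are, in order, path, specifier_property, and specifier_value as named in this comment (the specifier_* names were later renamed in the thread).

```r
# Hypothetical lookup (not part of rphenoscape): map the old corpus names to
# the query parameters listed in the comment above.
profile_path <- paste0(
  "<http://purl.org/phenoscape/vocab.owl#has_phenotypic_profile>/",
  "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
)

corpus_params <- function(corpus = c("taxa", "genes")) {
  corpus <- match.arg(corpus)
  switch(corpus,
    taxa = list(
      path = profile_path,
      specifier_property = "http://www.w3.org/2000/01/rdf-schema#isDefinedBy",
      specifier_value = "http://purl.obolibrary.org/obo/vto.owl"
    ),
    genes = list(
      path = profile_path,
      specifier_property = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
      specifier_value = "http://purl.org/phenoscape/vocab.owl#AnnotatedGene"
    )
  )
}

str(corpus_params("taxa"))
```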
Problem 4 should be fixed by https://github.com/phenoscape/phenoscape-kb-services/pull/476 (which has been deployed). For the subsumer matrix, you may want to consider allowing a choice of relations to traverse (new feature).
@balhoff can you explain (or link to the documentation that explains) what the specifier parameters are for? I don't recall these from our biweekly discussion, but I may have missed it. Do these essentially act as filters for the initial subject of the property chain? (The parameter name seems rather confusing - can you say where that's coming from?)
Do these essentially act as filters for the initial subject of the property chain?
That's exactly right. I invented these today, so feedback on the name is entirely welcome! I realized that some additional specification was needed to target the corpus items. What do you think?
Also (but perhaps the documentation explains this?) for a path of <http://purl.org/phenoscape/vocab.owl#has_phenotypic_profile>/<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> the second component seems meaningless: doesn't it only mean that the object is an instance (which being the object of the property already implies) asserted to be of some type? I.e., is this weeding out phenotypic profiles for which a type is not asserted, and if so, why would there be such profiles?
If they're in essence a subject filter, then maybe just call them such? I.e., subject_filter_property and subject_filter_value?
The type predicate is needed to connect to the phenotypes; it's just how things are structured in the triplestore. There's an intermediate "profile" node, which is just a shadow of the taxon node, in between these two predicates. Where should this documentation live? Phenoscape wiki, or phenoscape-kb-services repo? I think we need some general topic docs outside of the swagger docs.
Why not the Swagger docs? Isn't that where someone would go to find it? Of course, if it's lengthy, you could put it on the wiki (but I would use the phenoscape-kb-services repo wiki), and then link to it from the Swagger docs.
The type predicate is needed to connect to the phenotypes; it's just how things are structured in the triplestore. There's an intermediate "profile" node, which is just a shadow of the taxon node, in between these two predicates.
So when you say "to connect to the phenotypes", what you mean is "to connect to the phenotype class(es)", because those, not the instance(s), are what we're interested in and where the semantics are codified?
@hlapp I updated the parameter names as you suggested: https://github.com/phenoscape/phenoscape-kb-services/pull/481
Problem 5: I updated the term_freqs() function to include the SPARQL property paths (path, specifier_property, and specifier_value) from https://github.com/phenoscape/rphenoscape/issues/235#issuecomment-912814751 above. The test above in problem 3 no longer fails. However, a test that passes 189 term IRIs now fails with a 500 error:
Error (test-semsim.R:140:3): profile similarity with Resnik
Error: (500) Internal Server Error: There was an internal server error.
Backtrace:
1. rphenoscape::term_freqs(...) test-semsim.R:140:2
2. rphenoscape::get_csv_data(...) /Users/jpb67/Documents/work/rphenoscape/R/term-weights.R:89:4
The above code uses the POST /similarity/frequency endpoint.
To reproduce in R I do the following:
phens <- get_phenotypes("maxilla", taxon = "Cyprinidae")
subs.mat <- subsumer_matrix(phens$id, .colnames = "label", .labels = phens$label,
preserveOrder = TRUE)
freqs <- term_freqs(rownames(subs.mat), as = "phenotype", corpus = "taxa")
If I reduce the term IRIs to 185, the API doesn't fail but does take 2m10s. For comparison, running the same code against the v2-beta API finishes in 15s.
In testing this out I noticed some data differences between the v2-beta and v2-beta2 API results:
                        v2-beta   v2-beta2
phenotypes found        12        66
subsumer matrix names   896       189
The subsumer matrix is created by calling the /similarity/matrix API endpoint.
The v2-beta version of the /similarity/frequency API endpoint can handle 896 term IRIs.
v2-beta2 IRIs: iris.txt
@johnbradley could you paste the list of terms here?
@balhoff I updated my comment to include a link to a text file of IRIs.
Problem 6: A test started failing when switching to the v2-beta2 API:
Error (test-pk.R:176:3): labels for pre-generated post-comps
Error: cannot take a sample larger than the population when 'replace = FALSE'
Backtrace:
1. base::sample(rownames(subsumer_matrix(phen)), size = 30) test-pk.R:176:2
The R code to reproduce the problem is:
phen <- sample(get_phenotypes("basihyal bone")$id, size = 1)
subs <- sample(rownames(subsumer_matrix(phen)), size = 30)
The test fetches phenotype IRIs from /phenotype/query for the "basihyal bone" entity and chooses a random IRI from the results.
The IRI is sent to the /similarity/matrix API, which returns fewer IRIs than previously expected.
The test then tries to sample 30 random IRIs from the returned IRIs, which typically fails because fewer than 30 IRIs are returned.
In v2-beta I see over 300 results returned for all the examples I checked.
Example of receiving only 24 IRIs back from /similarity/matrix for a "basihyal bone" phenotype IRI:
curl -X GET "https://kb.phenoscape.org/api/v2-beta2/similarity/matrix?terms=%5B%22http%3A%2F%2Fpurl.org%2Fphenoscape%2Fexpression%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000051%253E%2Bsome%2B%250A%2B%2B%2B%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPATO_0000117%253E%250A%2B%2B%2B%2B%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FRO_0000052%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0011618%253E%2529%2529%22%5D" -H "accept: text/csv" | wc -l
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 5109 100 5109 0 0 33834 0 --:--:-- --:--:-- --:--:-- 33834
24
If you switch to the v2-beta API in the above curl command 385 items are returned.
The phenotype IRI in question:
http://purl.org/phenoscape/expression?value=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000051%3E+some+%0A++++%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0000117%3E%0A+++++and+%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0000052%3E+some+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0011618%3E%29%29
Is the difference in the number of returned IRIs expected? If so, I could reduce the sample size.
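One way to make the test robust to a smaller subsumer set is to cap the sample size at the number of rows available. This is only a sketch of that idea with stand-in data; safe_sample() is a hypothetical helper and the 24 placeholder names stand in for the IRIs returned by /similarity/matrix.

```r
# Draw at most `size` items, never more than are available.
safe_sample <- function(x, size) {
  x[sample.int(length(x), size = min(size, length(x)))]
}

subsumers <- paste0("subsumer_", 1:24)  # stand-in for 24 returned IRIs
picked <- safe_sample(subsumers, 30)
length(picked)
```

Indexing through sample.int() avoids the base::sample() behaviour of treating a single number as the range 1:x.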
Problem 7: A test started failing when switching to the v2-beta2 API:
Failure (test-pk.R:190:3): labels for pre-generated post-comps
sum(is.na(subs.l$label)) is not less than 1. Difference: 15
The R test code: https://github.com/phenoscape/rphenoscape/blob/f45e325e4e4764e86b565f96cdad173cc17af0bd/tests/testthat/test-pk.R#L180-L190
A quick way to see the data in R is:
get_term_label(rownames(subsumer_matrix("http://purl.obolibrary.org/obo/UBERON_0000981")))$label
The femur IRI is sent to the /similarity/matrix API and from the results 30 IRIs are sampled. The code fetches labels for the 30 IRIs. Any CARO IRIs with no labels are removed. Then the code checks that all remaining IRIs do not have NA for their labels.
I ran the quick R example above using v2-beta and v2-beta2.
In v2-beta 103 IRIs are returned from /similarity/matrix and 102 have labels (The only NA is CARO which the test excludes).
In v2-beta2 136 IRIs are returned from /similarity/matrix and 59 have labels.
Outside of a single CARO IRI the IRIs that do not have labels start with: http://purl.org/phenoscape/term/relation/
Example v2-beta2 IRI that has no label:
http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0004288
This IRI is not valid for v2-beta.
Some "has part ..." labels show up in v2-beta but did not show up in v2-beta2.
Should the http://purl.org/phenoscape/term/relation/ IRIs have labels? If not I can filter them out like we do the CARO IRIs.
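If filtering turns out to be the right answer, it could look like the following sketch, which drops IRIs by prefix in the same spirit as the CARO exclusion. is_relation_iri() is a hypothetical helper, not existing test code.

```r
relation_prefix <- "http://purl.org/phenoscape/term/relation/"

# TRUE for IRIs that start with the relation prefix.
is_relation_iri <- function(iris) startsWith(iris, relation_prefix)

iris <- c(
  "http://purl.obolibrary.org/obo/UBERON_0000981",
  paste0(
    relation_prefix,
    "http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/",
    "http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0004288"
  )
)

# Keep only the IRIs that are expected to have labels.
label_candidates <- iris[!is_relation_iri(iris)]
print(label_candidates)
```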
Problem 8: A test started failing when switching to the v2-beta2 API:
Failure (test-freqs.R:65:3): success rate for entity subsumer terms
mean(is.na(tt.types)) is not strictly less than 0.1. Difference: 0.441
The R test code:
A quick way to see the IRIs in R is:
> subs.mat <- head(subsumer_matrix(c("http://purl.obolibrary.org/obo/UBERON_4400005","http://purl.obolibrary.org/obo/UBERON_0003097","http://purl.obolibrary.org/obo/UBERON_4000164")))
> subs.mat$tc <- term_category(rownames(subs.mat))
> subs.mat[c("tc")]
tc
http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0010000 <NA>
http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000002 <NA>
http://purl.obolibrary.org/obo/CARO_0010000 entity
http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0000468 <NA>
http://purl.obolibrary.org/obo/UBERON_0000061 entity
http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCARO_0000003 <NA>
In the above example, scroll to the right to see the term_category (tc) for each IRI.
The test looks up IRIs for "fin ray", "dorsal fin", and "caudal fin", then passes these IRIs to /similarity/matrix. The code then tries to determine the term category for each IRI returned by /similarity/matrix. The term category is determined by looking at the results of /term/all_ancestors and /term/classification for each IRI. The code expects 90% of the IRIs to have a term category.
The IRIs for which we can't determine a term category have the http://purl.org/phenoscape/term/relation/ prefix.
Example IRI:
http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCARO_0000000
Fetch ancestors for a relation IRI:
curl -X GET "https://kb.phenoscape.org/api/v2-beta2/term/all_ancestors?iri=http%3A%2F%2Fpurl.org%2Fphenoscape%2Fterm%2Frelation%2Fhttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%2Fhttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FCARO_0000000&parts=false" -H "accept: application/json"
{"results":[]}
Fetch term classification for a relation IRI:
curl -X GET "https://kb.phenoscape.org/api/v2-beta2/term/classification?iri=http%3A%2F%2Fpurl.org%2Fphenoscape%2Fterm%2Frelation%2Fhttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%2Fhttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FCARO_0000000" -H "accept: application/json" | jq
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 367 100 367 0 0 2352 0 --:--:-- --:--:-- --:--:-- 2352
{
"label": "http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCARO_0000000",
"subClassOf": [],
"equivalentTo": [],
"superClassOf": [],
"@id": "http://purl.org/phenoscape/term/relation/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCARO_0000000"
}
Should these ../term/relation IRIs return data for /term/classification and/or /term/all_ancestors? If not, is there a way to determine a term category ("entity", "quality", "phenotype", or "taxon") for these IRIs?
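A side note on reproducing the two curl calls above from R: the relation IRI already contains percent escapes, so it has to be encoded a second time when used as a query parameter (hence the %253A sequences). A minimal sketch with utils::URLencode(), which by default leaves strings that already look encoded untouched:

```r
relation_iri <- paste0(
  "http://purl.org/phenoscape/term/relation/",
  "http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/",
  "http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FCARO_0000000"
)

# repeated = TRUE forces encoding even though the IRI already contains
# percent escapes; with the default (FALSE) URLencode() returns it unchanged.
encoded <- utils::URLencode(relation_iri, reserved = TRUE, repeated = TRUE)
print(encoded)
```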
Problem 9: A test started failing when switching to the v2-beta2 API:
Failure (test-semsim.R:74:3): Resnik similarity
all(sm.ic > 0) is not TRUE
The R test code: https://github.com/phenoscape/rphenoscape/blob/f45e325e4e4764e86b565f96cdad173cc17af0bd/tests/testthat/test-semsim.R#L63-L74
So sm.ic should only have positive values, but that is no longer the case in v2-beta2:
The IRI for anatomical projection and (part_of some (posterior margin and (part_of some basihyal bone))) absent above in sm.ic is:
http://purl.org/phenoscape/expression?value=%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000051%3E+some+%0A++++%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0002000%3E%0A+++++and+%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0000052%3E+some+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0000468%3E%29%0A+++++and+%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0002503%3E+value+%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fsubclassof%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0004529%253E%250A%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%250A%2B%2B%2B%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBSPO_0000672%253E%250A%2B%2B%2B%2B%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0011618%253E%2529%2529%2529%23e1332d8d-9c88-4a4d-b2c4-04a424b481cd%3E%29%29%29%0A+and+%28%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fvocab.owl%23phenotype_of%3E+some+%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fsubclassof%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0004529%253E%250A%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%250A%2B%2B%2B%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBSPO_0000672%253E%250A%2B%2B%2B%2B%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0011618%253E%2529%2529%2529%23e1332d8d-9c88-4a4d-b2c4-04a424b481cd%3E%29
The test uses /phenotype/query with "basihyal bone" and taxon = "Cyprinidae" to create a list of IRIs.
Some of the IRIs returned by /phenotype/query include "absent" in the label.
A sample of these IRIs is sent to the /similarity/matrix API endpoint.
The test then calculates Resnik similarity in R for the returned matrix.
This is done by calculating term frequencies: the row names of the subsumer matrix (IRIs) are passed to the /similarity/frequency endpoint, and the integer values returned from the endpoint are divided by the corpus size.
The values are then passed through -log() and some additional math.
Some of the resulting Resnik similarity values are 0, which causes the test to fail.
For an IRI that had 0 Resnik similarity, we received 797 for the "frequency score (subsumed items)" returned by /similarity/frequency:
$ curl -X GET "https://kb.phenoscape.org/api/v2-beta2/similarity/frequency?terms=%5B%22http%3A%2F%2Fpurl.org%2Fphenoscape%2Fexpression%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.org%252Fphenoscape%252Fvocab.owl%2523implies_presence_of%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0001015%253E%22%5D&path=%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fvocab.owl%23has_phenotypic_profile%3E%2F%3Chttp%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23type%3E&subject_filter_property=http%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23isDefinedBy&subject_filter_value=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2Fvto.owl" -H "accept: text/csv"
http://purl.org/phenoscape/expression?value=%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fvocab.owl%23implies_presence_of%3E+some+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0001015%3E,797
797 is the same as the size of the taxa corpus. Since we take log(797/797), we end up with zero for the term weight.
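The arithmetic can be checked in isolation. The sketch below assumes the term weight is simply -log(count / corpus size), as described above; information_content() is a hypothetical name and the count of 8 is made up for contrast.

```r
# Term weight as described above: -log(count / corpus size).
information_content <- function(count, corpus_size) {
  -log(count / corpus_size)
}

corpus_size <- 797
ic_all <- information_content(797, corpus_size)  # term annotated to every taxon
ic_rare <- information_content(8, corpus_size)   # hypothetical rarer term
print(c(all = ic_all, rare = ic_rare))
```

Any term whose count equals the corpus size therefore carries no weight, however specific its label looks.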
It seems like the IRIs that are problematic have a label ending in "... absent". These IRIs are coming from /phenotype/query. Should these "absent" IRIs be filtered out at some point?
Re: problem 9, this (Resnik similarity score of zero) can only really come about if two terms do not have any common subsumers in the matrix.
This could be because of an error in the /similarity/matrix endpoint (in that it doesn't return some subsumers that it nonetheless should). It could also be an effect of a correction for unexpected zero or NA term frequencies being obtained for some subsumers:
https://github.com/phenoscape/rphenoscape/blob/4855e6cc78b81114ea3ab6c5222a623c8caba2a2/R/semsim.R#L249-L259
This may inadvertently remove, for some terms, the only common subsumer(s) that there are.
It seems more likely that there are some common subsumers that are returned from the matrix endpoint, but then erroneously receive no count or a count of zero in the frequencies endpoint.
@hlapp Re: problem 9: I removed the logic that removes rows and the problem persisted. It looks like I missed a pretty big part of what happens in problem 9 (fetching frequencies from /similarity/frequency), so I'm going to update my comment above to have better details.
For problem 5—I made a PR to perform many queries instead of one big one: https://github.com/phenoscape/phenoscape-kb-services/pull/489
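The PR applies this idea on the server. Purely as an illustration of the same batching idea, a client could also split a long term list into smaller requests. In the sketch below chunk_terms() and fetch_freqs() are hypothetical, the IRIs are placeholders, and it assumes each term's frequency does not depend on which other terms are sent in the same request.

```r
# Split a vector of term IRIs into batches of at most `batch_size`.
chunk_terms <- function(terms, batch_size = 50) {
  split(terms, ceiling(seq_along(terms) / batch_size))
}

terms <- paste0("http://example.org/term/", 1:189)  # placeholder IRIs
batches <- chunk_terms(terms, batch_size = 50)
print(lengths(batches))

# `fetch_freqs` stands in for one request per batch (for example a
# term_freqs() call); here it is a dummy returning one value per term.
fetch_freqs <- function(batch) rep(0.5, length(batch))
freqs <- unlist(lapply(batches, fetch_freqs), use.names = FALSE)
```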
@johnbradley for problem 6, the reduced number of subsumers in the matrix for a phenotype is expected. You will get more if you add more arguments to the relations parameter. For the phenotype IRI you mentioned, if you add the relation http://purl.org/phenoscape/vocab.owl#phenotype_of_reflexive_part_of, you get 83 subsumers.
Note to myself—object properties for different situations is one of the topics that needs documentation.
object properties for different situations is one of the topics that needs documentation.
Yes. For example, when would I and when would I not want to add http://purl.org/phenoscape/vocab.owl#phenotype_of_reflexive_part_of? On the surface, it would seem that I should not add it if I wanted phenotypes only of true parts, rather than of things or any of their parts. But without more visibility into the data model that's practically impossible to verify.
For problem 7, the http://purl.org/phenoscape/term/relation/ IRIs do not have labels. However, we could handle these specially in the label and term info queries if we think that's a good idea. Those terms are built from two components, each of which typically has a label.
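To illustrate the two-component structure, the sketch below splits a relation IRI into its parts. It assumes, based only on the example IRIs in this thread, that the parts are a property IRI and a class IRI, each percent-encoded and separated by "/". split_relation_iri() is a hypothetical helper, not an existing function.

```r
relation_prefix <- "http://purl.org/phenoscape/term/relation/"

# Hypothetical helper: split a relation IRI into its property and filler IRIs.
split_relation_iri <- function(iri) {
  stopifnot(startsWith(iri, relation_prefix))
  rest <- substring(iri, nchar(relation_prefix) + 1)
  parts <- strsplit(rest, "/", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 2)
  decoded <- vapply(parts, utils::URLdecode, character(1), USE.NAMES = FALSE)
  list(property = decoded[1], filler = decoded[2])
}

parts <- split_relation_iri(paste0(
  relation_prefix,
  "http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000050/",
  "http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0004288"
))
str(parts)
```

A combined label could then be assembled from the labels of the two decoded IRIs, if that is the behaviour we want.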
For problem 6: I will send an example relation list to @johnbradley which most closely mimics the previous results.
@johnbradley and @balhoff just to clarify from our discussion: The Resnik similarity between two terms is zero if (a) they have no subsumers in common (in a graph with a root shared by all terms this should never happen), or if (b) the subsumer(s) that they do have in common either are the root term(s) or have the same frequency as the root term (i.e., their frequency is equal to the corpus size).
Note that, unlike Jaccard, Resnik cannot distinguish between a root term and a term descending from the root term that nonetheless has the same frequency as the root term. (This means, for example, that if the only change we made to a graph is adding a line of subsumer terms to a term that currently is the root term in a graph, then Resnik similarities for any pair of terms would be unchanged. Jaccard similarities would change, however, because now we've added terms into the union and intersection sets of subsumers for any pair of terms.)
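A toy example of that difference, using made-up numbers: resnik() and jaccard() below are simplified stand-ins written for this illustration, not the rphenoscape implementations, and the information content values are invented.

```r
# Rows are subsumers, columns are terms; 1 means "row subsumes column".
subs <- matrix(
  c(1, 1,   # root
    1, 1,   # shared intermediate class
    1, 0,   # subsumes A only
    0, 1),  # subsumes B only
  ncol = 2, byrow = TRUE,
  dimnames = list(c("root", "shared", "a_only", "b_only"), c("A", "B"))
)
ic <- c(root = 0, shared = 0.7, a_only = 2.3, b_only = 2.6)  # made-up -log(freq)

# Resnik: highest information content among the common subsumers.
resnik <- function(m, ic, i, j) max(m[, i] * m[, j] * ic)
# Jaccard: number of common subsumers over number of subsumers of either term.
jaccard <- function(m, i, j) sum(m[, i] & m[, j]) / sum(m[, i] | m[, j])

# Add two new ancestors above the old root. Both subsume everything, so their
# frequency equals the corpus size and their information content is 0.
subs2 <- rbind(new_root = c(1, 1), new_mid = c(1, 1), subs)
ic2 <- c(new_root = 0, new_mid = 0, ic)

print(c(resnik(subs, ic, "A", "B"), resnik(subs2, ic2, "A", "B")))
print(c(jaccard(subs, "A", "B"), jaccard(subs2, "A", "B")))
```

The Resnik value stays the same after the extra ancestors are added, while the Jaccard value increases because both the intersection and the union grow by two.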
Hence, if there isn't a bug with frequency calculations, one question is whether it is "correct" (however we define this) that for anatomical projection and (part_of some (posterior margin and (part_of some basihyal bone))) its presence and absence phenotypes should only have common subsumer(s) whose frequency is equal to the corpus size.
Problem 9: The test currently samples 10 rows from the subsumer matrix (subs.mat): https://github.com/phenoscape/rphenoscape/blob/531601b13fa80fdb8859252bcaaf3eaee4719abc/tests/testthat/test-semsim.R#L66-L69 Could the test be removing the only common subsumer for some terms?
Using the idea from #239 I checked jaccard similarity on the sampled subsumer matrix:
> min(jaccard_similarity(subs.mat1))
[1] 0
> min(jaccard_similarity(subs.mat))
[1] 0.02272727
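The effect can be reproduced with a toy matrix. The row subset below is picked by hand rather than at random so the outcome is deterministic, and jaccard_pair() is a simplified stand-in rather than rphenoscape's jaccard_similarity().

```r
# Jaccard similarity of two columns of a subsumer matrix.
jaccard_pair <- function(m, i, j) sum(m[, i] & m[, j]) / sum(m[, i] | m[, j])

subs.mat <- matrix(
  c(1, 1,   # s1: the only subsumer shared by both terms
    1, 0,   # s2
    1, 0,   # s3
    0, 1),  # s4
  ncol = 2, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), c("t1", "t2"))
)

# A row subsample that happens to miss "s1" (chosen by hand, not random).
subs.mat1 <- subs.mat[c("s2", "s3", "s4"), ]

print(jaccard_pair(subs.mat, "t1", "t2"))
print(jaccard_pair(subs.mat1, "t1", "t2"))
```

Once the shared row is left out, the two terms no longer have any subsumer in common and the similarity drops to zero.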
@johnbradley good catch, and it seems your check shows this to indeed be a (the?) problem. The subsampling is there because originally obtaining the frequencies took more time than seemed tolerable. If you disable the subsampling, does the runtime become prohibitive for a test suite?
You can disable the subsampling simply by reassigning subs.mat1 and commenting out as follows:
# subs1 <- rownames(subs.mat)[s]
# subs.mat1 <- subs.mat[s,]
subs.mat1 <- subs.mat
# rownames(subs.mat1) <- subs1
Problem 9: Even after removing subsampling the test is still failing. So I simplified the test to create a Resnik similarity grid for two phenotypes.
phenotype1: 'anatomical projection and (part_of some (posterior margin and (part_of some basihyal bone))) absent'
http://purl.org/phenoscape/expression?value=%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000051%3E+some+%0A++++%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0002000%3E%0A+++++and+%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0000052%3E+some+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0000468%3E%29%0A+++++and+%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0002503%3E+value+%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fsubclassof%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0004529%253E%250A%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%250A%2B%2B%2B%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBSPO_0000672%253E%250A%2B%2B%2B%2B%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0011618%253E%2529%2529%2529%23e1332d8d-9c88-4a4d-b2c4-04a424b481cd%3E%29%29%29%0A+and+%28%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fvocab.owl%23phenotype_of%3E+some+%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fsubclassof%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0004529%253E%250A%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%250A%2B%2B%2B%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBSPO_0000672%253E%250A%2B%2B%2B%2B%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0011618%253E%2529%2529%2529%23e1332d8d-9c88-4a4d-b2c4-04a424b481cd%3E%29
phenotype2: 'anterior margin and (part_of some basihyal bone) straight'
http://purl.org/phenoscape/expression?value=%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FBFO_0000051%3E+some+%0A++++%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FPATO_0002180%3E%0A+++++and+%28%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FRO_0000052%3E+some+%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fsubclassof%3Fvalue%3D%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBSPO_0000671%253E%250A%2Band%2B%2528%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FBFO_0000050%253E%2Bsome%2B%253Chttp%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FUBERON_0011618%253E%2529%23d91f9091-f506-4348-8caf-6760c015fbaa%3E%29%29
The code then produced the following grid:
Resnic Similarity matrix:
[...absent] [...straight]
[...absent] 2.299398 0.000000
[...straight] 0.000000 2.600428
The above matrix is created by combining the subsumer matrix with the frequency values.
Below is the subsumer matrix with an additional nlog_term_freq column:
...ome basihyal bone))) absent ...ome basihyal bone) straight nlog_term_freq
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0000464> 0 1 0.03164011
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0007844> 0 1 0.06703762
http://purl.org/phenoscape/ex...06-4348-8caf-6760c015fbaa>)) 0 1 2.60042833
http://purl.org/phenoscape/ex...f506-4348-8caf-6760c015fbaa> 0 1 2.60042833
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0002513> 0 1 0.07863668
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0013702> 0 1 0.01047872
http://phenoscape.org/not/htt...9c88-4a4d-b2c4-04a424b481cd> 1 0 1.86006564
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0001630> 0 1 0.00000000
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0001555> 0 1 0.18545498
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0005884> 0 1 0.47170604
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0008895> 0 1 0.20597664
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0011615> 0 1 1.53973049
http://purl.org/phenoscape/ex...ibrary.org/obo/BSPO_0000006> 0 1 0.30376314
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0013701> 0 1 0.01047872
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0000468> 1 1 0.00000000
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0002418> 0 1 0.05388566
http://purl.org/phenoscape/ex...c88-4a4d-b2c4-04a424b481cd>) 1 0 2.29939833
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0002100> 0 1 0.01159660
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0000033> 0 1 0.18796778
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0011618> 0 1 1.64618582
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0004111> 0 1 0.03340196
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0011153> 0 1 0.62730047
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0011614> 0 1 1.27820903
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0001015> 0 1 0.00000000
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0010323> 0 1 0.19218836
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0000153> 0 1 0.07603220
http://purl.org/phenoscape/ex...rary.org/obo/UBERON_0001474> 0 1 0.03458051
If you scroll to the right you can see that the only subsumer with 1 for both phenotypes is the 15th subsumer (ending in UBERON_0000468>). This term also has a -log(term_freq()) of 0. When calculating the Resnik similarity we multiply these three numbers together.
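The grid can be reproduced from a handful of rows of the table above. The sketch assumes the similarity of two terms is the largest of the row-wise products, which matches the numbers shown here but may not be exactly how rphenoscape computes it; resnik_grid() is a hypothetical name and the row comments are abbreviated by hand.

```r
# Four rows taken from the table above.
subs <- matrix(
  c(1, 0,   # http://phenoscape.org/not/...04a424b481cd>
    1, 1,   # ...UBERON_0000468> (row 15, the only shared subsumer)
    1, 0,   # ...04a424b481cd>)
    0, 1),  # ...6760c015fbaa>
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("absent", "straight"))
)
nlog_term_freq <- c(1.86006564, 0, 2.29939833, 2.60042833)

# For each pair of terms, multiply the two membership columns by the term
# weight row by row and keep the largest product.
resnik_grid <- function(m, ic) {
  terms <- colnames(m)
  grid <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
  for (i in terms) for (j in terms) grid[i, j] <- max(m[, i] * m[, j] * ic)
  grid
}

grid <- resnik_grid(subs, nlog_term_freq)
print(grid)
```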
Details about 15th subsumer IRI:
row 15 subsumer IRI: http://purl.org/phenoscape/expression?value=%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fvocab.owl%23implies_presence_of%3E+some+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0000468%3E
row 15 decoded subsumer IRI: http://purl.org/phenoscape/expression?value=<http://purl.org/phenoscape/vocab.owl#implies_presence_of>+some+<http://purl.obolibrary.org/obo/UBERON_0000468>
Label for UBERON_0000468: "multicellular organism"
> term_freqs("http://purl.org/phenoscape/expression?value=%3Chttp%3A%2F%2Fpurl.org%2Fphenoscape%2Fvocab.owl%23implies_presence_of%3E+some+%3Chttp%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FUBERON_0000468%3E", as = "phenotype", corpus = "taxa")
[1] 1
So these two phenotypes have a common subsumer of "implies_presence_of multicellular organism", but this subsumer has a term frequency of 1, which -log() converts to 0.
It seems this shows that the cause of zero Resnik similarity is in the database. It's certainly expected that the frequency of "implies_presence_of some 'multicellular organism'" would be equal to the corpus size for corpus "taxa".
There are nevertheless two things that are surprising, but they're both presumably due to the database content and how it's generated. One is why anatomical projection and (part_of some (posterior margin and (part_of some basihyal bone))) absent implies (i.e., is a subclass of) "implies_presence_of some 'multicellular organism'". The other is why the two phenotypes don't have closer subsumers, for example "phenotype_of some 'anatomical projection'" and/or "phenotype_of some (part_of some 'basihyal bone')". @balhoff?
@johnbradley which relations are you requesting for Problem 9? I think this may be the cause of missing common subsumers. Also I think I neglected to send you a suggested list to use, is that right?
which relations are you requesting for Problem 9?
When creating the subsumer matrix using the /similarity/matrix endpoint we only specify the terms array. So we are using the default values for relations and path. I don't recall a suggested list.
You did give me some defaults for path, subject_filter_property, and subject_filter_value that are being used by the /similarity/frequency and /similarity/corpus_size endpoints.
Problem 5 is no longer occurring. Fixed by https://github.com/phenoscape/rphenoscape/issues/235#issuecomment-922040832
The KB v2-beta2 API was replaced by a new version (currently https://dev.phenoscape.org/api/v2-beta).
I tested the above issues: with the current baseline-v0.3.0 branch, the above tests all pass.
Note the issue mentioned here https://github.com/phenoscape/rphenoscape/issues/235#issuecomment-928004735 was fixed by 10b22066b4ae7268f1af46956a9d316bd4ca11e9.
Closing this issue since we aren't using the v2-beta2 KB API and the items found in this issue have been resolved.
Add support for the v2-beta2 API.
Problems