monarch-initiative / monarch-legacy

Monarch web application and API
BSD 3-Clause "New" or "Revised" License

why does Sp7 have a 100% similarity to Melnick-Needles Syndrome? #392

Closed nlwashington closed 8 years ago

nlwashington commented 10 years ago

I don't understand what it means for mouse gene Sp7 to have a 100% similarity score to Melnick-Needles Syndrome (http://monarchinitiative.org/disease/OMIM_309350), especially when the top human hit is only 88% similar to itself. Why isn't the self-comparison 100%?

@cmungall can you explain? This doesn't make any sense to me.

Also, I don't understand how the top zebrafish hit, that only shares failure to thrive/decreased growth rate, is a 100% hit.

cmungall commented 10 years ago

The scores are dynamically calculated and are relative to a background set. Thus if you're searching against mouse, the top mouse hit will likely score on the order of 100.
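
To illustrate why the top hit within a species can land at 100 even for a weak match, here is a minimal sketch. It assumes the displayed percentage is the raw score divided by the best raw score in the background set being searched; that assumption, the function name, and the raw scores are all hypothetical, and this is not owlsim's actual code.

```python
def normalize_to_background(raw_scores):
    """Scale raw scores so the best hit in this background set becomes 100.

    Assumption: percentage = 100 * raw / max(raw) within one background set.
    """
    if not raw_scores:
        return {}
    best = max(raw_scores.values())
    if best <= 0:
        return {name: 0 for name in raw_scores}
    return {name: round(100 * score / best) for name, score in raw_scores.items()}


# Hypothetical raw scores for one disease query against a mouse background set.
# The absolute values are low, yet the best mouse hit still normalizes to 100.
mouse_hits = {"Sp7": 4.0, "other_gene": 2.0}
mouse_pct = normalize_to_background(mouse_hits)
print(mouse_pct)
```

Under this assumption a 100 only says "best in this background set", not "identical to the query".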

Self-matches should be 100. One possibility is that the data/ontologies in owlsim are not in sync with what is coming from fed.

Actions:

  1. The background set issue is confusing. The fix should not be hard, but I'd rather not delve there this close to release. Will do some investigating.
  2. As soon as the closure index is done, we will freeze everything and redump the datasets for owlsim, to rule out sync issues.
  3. Will do some additional tests on self-queries to ensure this is robust.

cmungall commented 10 years ago

Will recruit @hdietze to help

nlwashington commented 10 years ago

ok, so one thing i just noticed is the discrepancy with the data (as chris suggested), and we have to figure out what to do here, because it's at the intersection of the data and nif query behavior. and it's very confusing.

for example, http://localhost:8080/disease/OMIM_304120

this one indicates 95 phenotypes on its phenotype tab. while all of those phenotypes seem to indicate they are drawn from OMIM:304120, many are actually also for ORPHANET:669.

see this query in the NIF interface: https://neuinfo.org/mynif/search.php?q=OMIM:304120&t=indexable&nif=nlx_151835-1 or for the underlying data: http://beta.neuinfo.org/services/v1/federation/data/nlx_151835-1.csv?q=OMIM:304120&exportType=data

what is happening is that when we dump the data for owlsim, we are treating OMIM:304120 as distinct from ORPHANET:669. but when we do the nif query for OMIM:304120, they are both being returned because OMIM:304120 is being indexed for the ORPHANET:669 records (because that identifier is referenced as a "publication").

in the phenogrid, they use the phenotypes that are retrieved by a query to monarch (which in turn queries nif) for that disease id, which has the combined 95 phenotypes for both diseases. however, what is loaded into owlsim for that identifier is only the subset that are annotated directly to OMIM:304120. therefore, there are more records being shipped off to the owlsim server for comparison, leading to the self-comparison looking like it isn't identical.
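
a minimal sketch of that mismatch, using Jaccard overlap as a stand-in for the real similarity measure (owlsim's actual scoring is different; the phenotype names and set sizes below are hypothetical):

```python
def jaccard(profile_a, profile_b):
    """Jaccard overlap of two phenotype sets: |A & B| / |A | B|."""
    a, b = set(profile_a), set(profile_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# What was dumped into owlsim: only phenotypes annotated directly to the disease.
stored_profile = {"phenotype_1", "phenotype_2", "phenotype_3"}

# What the page sends as the query: the direct phenotypes plus the ones
# pulled in from the second disease by the broader text query.
query_profile = stored_profile | {"phenotype_4"}

true_self_score = jaccard(stored_profile, stored_profile)
observed_self_score = jaccard(query_profile, stored_profile)
print(true_self_score, observed_self_score)
```

the "self" comparison only reaches the maximum when both sides use the same phenotype set; any extra phenotypes on the query side pull it below that.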

so, is the problem the index? the nif query behavior? the dumping for owlsim? missing equivalence axioms? filtering in the phenogrid? or what is shipped off to the owlsim comparison engine?

nlwashington commented 10 years ago

@ccondit, this would require a new kind of filtered query for services, as in https://support.crbs.ucsd.edu/browse/NIF-10865

nlwashington commented 10 years ago

ok, the NIF ticket has been fixed. we can now do specific filter queries with closures, like:

http://beta.neuinfo.org/services/v1/federation/data/nlx_151835-1.json?filter=phenotype_id:HP:0000707&exportType=data&includeSubclasses=true

(compared to this, which only gives exact matches)

http://beta.neuinfo.org/services/v1/federation/data/nlx_151835-1.json?filter=phenotype_id:HP:0000707&exportType=data&includeSubclasses=false

so, for the example above the filter query gives 95 results: http://beta.neuinfo.org/services/v1/federation/data/nlx_151835-1.json?filter=disorder_id:OMIM:304120&exportType=data&includeSubclasses=true&limit=200

while the general query gives 137: http://beta.neuinfo.org/services/v1/federation/data/nlx_151835-1.json?q="OMIM:304120"&exportType=data&includeSubclasses=true&limit=200
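
for reference, a small sketch that builds the two kinds of query URL from the parameters shown above. the helper name `build_query` is hypothetical; the base URL and parameter names are copied from the links in this comment, and nothing here actually calls the service.

```python
from urllib.parse import urlencode

# Base URL copied from the links above.
BASE = "http://beta.neuinfo.org/services/v1/federation/data/nlx_151835-1.json"


def build_query(params):
    """Return the full URL for a federation data query (hypothetical helper)."""
    # safe=":" keeps CURIE-style identifiers such as OMIM:304120 readable.
    return BASE + "?" + urlencode(params, safe=":")


common = {"exportType": "data", "includeSubclasses": "true", "limit": 200}

# Filter query: restricted to records whose disorder_id field matches.
filter_url = build_query({"filter": "disorder_id:OMIM:304120", **common})

# General query: matches the identifier in any indexed field.
general_url = build_query({"q": '"OMIM:304120"', **common})

print(filter_url)
print(general_url)
```

the only difference between the two is `filter=disorder_id:...` versus `q=...`, which is what accounts for the 95 vs 137 results.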

nlwashington commented 10 years ago

it looks like the self-matches are now showing up as either 100 or 99 percent based on the d0ec923 fix, so that's good. unfortunately, the mouse and zebrafish hits are showing up as >100%, which is still really awkward. @cmungall should we fix that normalization for this release?

jmcmurry commented 8 years ago

@nlwashington @cmungall: can this be closed?