Open cbizon opened 3 years ago
This needs to be fixed. Looking at results from our benchmarks (particularly furosemide vs edema), we are not getting any furosemide omnicorp results because of this issue. In particular, furosemide has 2 PUBCHEM.COMPOUND ids: the one in omnicorp and the one that we are querying over.
Consider this query:
Currently, this returns a bunch of chemicals, normalized to pubchem ids. Omnicorp knows about pubchem ids but, I guess because the names are different in pubchem, there are lots of cases where we have results in omnicorp for chebi but not for pubchem.
Originally, I was thinking that omnicorp overlay should look in the equivalent identifiers on the input graph and query the cache/postgres for those identifiers as well.
But I think that's wrong. First, you only get back counts, so if you get results for 2 equivalent identifiers, there's no good way to combine them or decide between them. Second, it leads to a lot of (probably repeated) double querying. Now I think we should resolve this upstream, when we build the omnicorp database and cache. All that we need to do is normalize identifiers at the point where we still have the actual pubmed ids, so that we can combine things.
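A rough sketch of what normalizing at build time could look like, to show why merging has to happen on the pubmed id sets rather than on counts. Everything here is hypothetical: the identifiers are made up, `NORMALIZED` stands in for whatever the node normalization step returns, and `merge_by_normalized_id` and `shared_count` are illustrative names, not existing omnicorp code.

```python
from collections import defaultdict

# Hypothetical equivalence map: raw CURIE -> preferred (normalized) CURIE.
# The identifiers are invented for illustration only.
NORMALIZED = {
    "PUBCHEM.COMPOUND:111": "PUBCHEM.COMPOUND:111",
    "PUBCHEM.COMPOUND:222": "PUBCHEM.COMPOUND:111",
    "CHEBI:333": "PUBCHEM.COMPOUND:111",
}


def merge_by_normalized_id(curie_to_pmids, normalized):
    """Union the pubmed id sets of all CURIEs that share a preferred id."""
    merged = defaultdict(set)
    for curie, pmids in curie_to_pmids.items():
        # CURIEs missing from the map are kept under their own id.
        preferred = normalized.get(curie, curie)
        merged[preferred].update(pmids)
    return dict(merged)


def shared_count(merged, curie_a, curie_b):
    """Number of pubmed ids that mention both (already normalized) CURIEs."""
    return len(merged.get(curie_a, set()) & merged.get(curie_b, set()))


# Made-up raw data: CURIE -> set of pubmed ids.
raw = {
    "PUBCHEM.COMPOUND:111": {1, 2},
    "PUBCHEM.COMPOUND:222": {2, 3},
    "CHEBI:333": {3, 4},
    "MONDO:999": {2, 4, 5},
}

merged = merge_by_normalized_id(raw, NORMALIZED)
chemical_total = len(merged["PUBCHEM.COMPOUND:111"])
naive_sum = sum(len(raw[c]) for c in NORMALIZED)
pair_count = shared_count(merged, "PUBCHEM.COMPOUND:111", "MONDO:999")
```

In this toy data the three equivalent identifiers cover 4 distinct pubmed ids, while adding their individual counts gives 6, because some articles are counted more than once. That overlap is only recoverable while the pubmed ids themselves are still in hand.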
The downside of this approach is that it will tie the cache to the normalization and biolink prefix ordering.