ranking-agent / AnswerCoalesce

A TRAPI-compliant component of ARAGORN that groups together similar results.
MIT License

Crummy Enrichments #99

Open cbizon opened 2 years ago

cbizon commented 2 years ago

https://arax.ncats.io/?source=ARS&id=e2800952-aae8-4605-97ce-4cfbc596934e

The query is https://github.com/NCATSTranslator/testing/blob/main/ars-requests/not-none/1.2/risk.json

There are numerous results I don't like, such as "disease" and "blood".

Also, it's systematically preferring gene answers to chemical answers. Is that ok? Maybe.

Also, the first hits are things that are near-synonyms of the input. This isn't wrong, but it's not terribly helpful.

cbizon commented 2 years ago

Interestingly this query: https://arax.ncats.io/?source=ARS&id=bf04c388-b4d2-482e-9ddc-abb92c6c81c8

which is the same but uses "ChemicalEntity", produces much nicer results. I think it's because the original NamedThing sets the denominator of the enrichment to something giant, so even things like disease get linked in. Maybe we need some kind of dynamic denominator.
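To make the denominator effect concrete, here is a rough sketch. It assumes the enrichment is a one-sided hypergeometric test, which this thread doesn't actually state, and the function name `enrichment_p` and all counts are made up for illustration.

```python
from fractions import Fraction
from math import comb


def enrichment_p(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    k: answers linked to the candidate enriching node
    n: total number of answers
    K: nodes in the population linked to the candidate node
    N: population size (the "denominator" discussed above)
    """
    total = comb(N, n)
    tail = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    # Exact integer arithmetic first, then convert to float once.
    return float(Fraction(tail, total))


# Hypothetical counts: 4 of 10 answers link to "disease",
# and 5000 nodes in the population link to it.
p_named_thing = enrichment_p(k=4, n=10, K=5000, N=10_000_000)
p_chemical = enrichment_p(k=4, n=10, K=5000, N=50_000)

print(p_named_thing < p_chemical)
```

With the giant NamedThing-sized population, the same 4-of-10 overlap gets a far smaller p-value, which is how a generic node like "disease" could end up looking enriched.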

cbizon commented 2 years ago

A similar issue can happen with e.g. chemicals. CHEBI is a subset of chemicals, but it has subclasses in it. Suppose you use "all chemicals" as the denominator size and you have more CHEBIs than randomly expected (which is reasonable, given that CHEBI contains the 'most interesting' or at least most annotated chemicals). Then it will look like you've chosen a meaningful set of chemicals, because they're all descended from some high-level chemical class.

cbizon commented 2 years ago

I'm probably overthinking some of this. Our edges are based on what's in our local graph. So the denominators should be based on that, and we should just ignore edges that don't occur in that graph. There are perhaps other approaches but this is the most straightforward. So the main thing to do is first remove any answers that don't occur in our local graph.
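A minimal sketch of that first step, assuming the local graph can hand back its set of node identifiers. The function name `restrict_to_local_graph` and the CURIEs below are hypothetical, not taken from the AnswerCoalesce code.

```python
def restrict_to_local_graph(answer_ids, local_node_ids):
    """Split answers into those present in the local graph and those not.

    Returns (kept, dropped, denominator), where denominator is the number
    of nodes in the local graph rather than "all chemicals" or "all things".
    """
    local = set(local_node_ids)
    kept = [a for a in answer_ids if a in local]
    dropped = [a for a in answer_ids if a not in local]
    return kept, dropped, len(local)


# Made-up identifiers, for illustration only.
answers = ["CHEBI:1", "CHEBI:2", "EXAMPLE:999"]
local_nodes = ["CHEBI:1", "CHEBI:2", "CHEBI:3", "NCBIGene:10"]

kept, dropped, denominator = restrict_to_local_graph(answers, local_nodes)
print(kept, dropped, denominator)
```

The enrichment would then run only on `kept`, with `denominator` (or a per-type count taken from the same local graph) as the population size.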