pantherdb / pango

BSD 3-Clause "New" or "Revised" License

When inferring slim terms, only retain the most specific terms if multiple are related #27

Closed dustine32 closed 1 year ago

dustine32 commented 1 year ago

In the generalize_term function, prevent a general term from being inferred if a more specific term is also present in ancestors of the annotated term.

Example case

Annotated term: single-stranded DNA helicase activity (GO:0017116)

Inferred slim terms: catalytic activity, acting on DNA (GO:0140097); catalytic activity (GO:0003824); ATP-dependent activity (GO:0140657)

Here, the is_a path is: GO:0017116 -is_a-> GO:0140097 -is_a-> GO:0003824

This fix should stop the is_a path traversal at the first ancestor term encountered that is also in the goslim_generic subset. The resulting inferred slim terms should then no longer contain "catalytic activity" (GO:0003824).
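A minimal sketch of that traversal rule, for illustration only. It is not the actual generalize_term code: the function name `first_slim_ancestors`, the `PARENTS` dict, and the `SLIM` set are hypothetical, and the toy graph holds only the single is_a path quoted above.

```python
from typing import Dict, List, Set

# Toy is_a graph (child -> direct parents), limited to the path quoted above.
PARENTS: Dict[str, List[str]] = {
    "GO:0017116": ["GO:0140097"],
    "GO:0140097": ["GO:0003824"],
    "GO:0003824": [],
}

# Assumed subset membership for this example only.
SLIM: Set[str] = {"GO:0140097", "GO:0003824"}


def first_slim_ancestors(
    term: str, parents: Dict[str, List[str]], slim: Set[str]
) -> Set[str]:
    """Walk is_a parents upward, stopping each path at its first slim term."""
    found: Set[str] = set()
    seen: Set[str] = set()
    stack = list(parents.get(term, []))  # start from the term's direct parents
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
            continue  # do not climb above a slim term on this path
        stack.extend(parents.get(current, []))
    return found


print(first_slim_ancestors("GO:0017116", PARENTS, SLIM))  # {'GO:0140097'}
```

Note that the stop is applied per path, so a term reachable through a second, slim-free path would still be collected.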

dustine32 commented 1 year ago

Looks like this term GO:0017116 also has a path to GO:0003824 where GO:0003824 is the first encountered slim term:

GO:0017116 -is_a-> GO:0003678 -is_a-> GO:0004386 -is_a-> GO:0140640 -is_a-> GO:0003824

So the solution is to collect all inferred terms first, then compare them and remove any term that is an ancestor of another inferred term.
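A rough sketch of the collect-then-prune approach, under stated assumptions. The helper names (`all_ancestors`, `most_specific_slim_terms`) and the `IS_A` / `SLIM_TERMS` structures are hypothetical, not the repository's code. The toy graph contains only the two is_a paths quoted in this issue, so GO:0140657 does not appear in it.

```python
from typing import Dict, List, Set

# Toy is_a graph (child -> direct parents) built from the two quoted paths.
IS_A: Dict[str, List[str]] = {
    "GO:0017116": ["GO:0140097", "GO:0003678"],
    "GO:0140097": ["GO:0003824"],
    "GO:0003678": ["GO:0004386"],
    "GO:0004386": ["GO:0140640"],
    "GO:0140640": ["GO:0003824"],
    "GO:0003824": [],
}

# Assumed subset membership for this example only.
SLIM_TERMS: Set[str] = {"GO:0140097", "GO:0003824"}


def all_ancestors(term: str, is_a: Dict[str, List[str]]) -> Set[str]:
    """Return every term reachable from `term` via is_a (excluding itself)."""
    seen: Set[str] = set()
    stack = list(is_a.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(is_a.get(current, []))
    return seen


def most_specific_slim_terms(
    term: str, is_a: Dict[str, List[str]], slim: Set[str]
) -> Set[str]:
    """Collect all slim ancestors, then drop any that is an ancestor of another."""
    candidates = all_ancestors(term, is_a) & slim
    redundant: Set[str] = set()
    for candidate in candidates:
        redundant |= all_ancestors(candidate, is_a) & candidates
    return candidates - redundant


print(most_specific_slim_terms("GO:0017116", IS_A, SLIM_TERMS))  # {'GO:0140097'}
```

Because pruning happens after collection, GO:0003824 is removed even though the second path reaches it without passing through another slim term.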