tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License

Potential bug: "WARNING: GO:0006379 NOT FOUND IN DAG" #276

Closed Maxim-Karpov closed 1 year ago

Maxim-Karpov commented 1 year ago

Hello, I've realised that there may be a bug in the enrichment tool when a GO term in the OBO file is marked obsolete. For example, the following IDs trigger warnings:

WARNING: GO:0000469 NOT FOUND IN DAG
WARNING: GO:0006379 NOT FOUND IN DAG
WARNING: GO:0010862 NOT FOUND IN DAG
WARNING: GO:0014065 NOT FOUND IN DAG
WARNING: GO:0014066 NOT FOUND IN DAG
WARNING: GO:0016307 NOT FOUND IN DAG
WARNING: GO:0030579 NOT FOUND IN DAG
WARNING: GO:0031532 NOT FOUND IN DAG
WARNING: GO:0035551 NOT FOUND IN DAG
WARNING: GO:0042779 NOT FOUND IN DAG
WARNING: GO:0043046 NOT FOUND IN DAG
WARNING: GO:0043629 NOT FOUND IN DAG
WARNING: GO:0043631 NOT FOUND IN DAG
WARNING: GO:0047690 NOT FOUND IN DAG
WARNING: GO:0048017 NOT FOUND IN DAG
WARNING: GO:0061088 NOT FOUND IN DAG
WARNING: GO:0070084 NOT FOUND IN DAG
WARNING: GO:0090502 NOT FOUND IN DAG
WARNING: GO:0098789 NOT FOUND IN DAG
WARNING: GO:0102176 NOT FOUND IN DAG
WARNING: GO:0102756 NOT FOUND IN DAG
WARNING: GO:1903204 NOT FOUND IN DAG

These terms have been replaced by different GO terms, but goatools treats them as absent. It would be nice if the program replaced these terms for the user (if a replacement is present), or offered an option to count them regardless of their obsolete status.

tanghaibao commented 1 year ago

Thank you, this is a great suggestion. I am trying to understand this better and see how we can do this in an unambiguous way.

What should happen if there are multiple replacements? Are the replacements semantically the same, or are they split from the old terms, in which case the semantic meaning changes?

Maxim-Karpov commented 1 year ago

Perhaps all of the available replacements could be substituted into the analysis. As far as I've seen, the replacements tend to be very similar to their obsolete categories. For example, the obsolete GO term "cleavage involved in rRNA processing GO:0000469" is replaced by "rRNA processing GO:0006364".

Maxim-Karpov commented 1 year ago

This seems to be more complicated than I thought, as there are also consider tags. Furthermore, some term replacements can be crude simplifications or abstractions of the originals, e.g. "obsolete chaperonin ATPase activity GO:0003763" is replaced by "ATP hydrolysis activity GO:0016887". Given that only one replacement term is ever available per obsolete ID, it is arguably justifiable to simply substitute it in the analysis.

Here's the code to extract the IDs, replacements, and consider suggestions for all obsolete entries in the OBO file, FYI (credit: @iquasere):

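awk 'BEGIN {print "id\treplaced_by\tconsider"}
# Print the stanza collected so far if it was obsolete, then reset the state
function flush() {if (is_obsolete) print id"\t"replaced_by"\t"consider; is_obsolete=id=replaced_by=consider=""}
# Any stanza header ends the previous stanza; only [Term] stanzas are tracked
/^\[/ {flush(); in_term = ($0 ~ /^\[Term\]/); next}
!in_term {next}
/^id:/ {id=$2}
/^is_obsolete: true/ {is_obsolete=1}
/^replaced_by:/ {replaced_by = replaced_by ? (replaced_by";"$2) : $2}
/^consider:/ {consider = consider ? (consider";"$2) : $2}
# Flush the last stanza at end of file
END {flush()}' go-basic.obo > go-basic.tsv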
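
For anyone who would rather do the same extraction in Python, here is a minimal sketch using only the standard library. It is not part of goatools; the function name parse_obsolete_terms and the ObsoleteTerm tuple are made up for illustration, and the sample stanzas are hand-written (the GO:9999998 and GO:9999997 entries are deliberately fake IDs, used only to show a consider tag).

```python
from collections import namedtuple

# Hypothetical container, not a goatools class
ObsoleteTerm = namedtuple("ObsoleteTerm", "go_id replaced_by consider")


def parse_obsolete_terms(obo_text):
    """Return {go_id: ObsoleteTerm} for every obsolete [Term] stanza."""
    obsolete = {}
    stanza_type, fields = None, {}

    def flush():
        # Record the stanza just finished if it was an obsolete [Term]
        if stanza_type == "Term" and "true" in fields.get("is_obsolete", []):
            go_id = fields["id"][0]
            obsolete[go_id] = ObsoleteTerm(
                go_id, fields.get("replaced_by", []), fields.get("consider", [])
            )

    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # A new stanza header closes the previous stanza
            flush()
            stanza_type, fields = line[1:-1], {}
        elif stanza_type and ": " in line:
            tag, value = line.split(": ", 1)
            # Drop trailing OBO comments such as "GO:0006364 ! rRNA processing"
            fields.setdefault(tag, []).append(value.split(" ! ")[0].strip())
    flush()  # the last stanza has no following header
    return obsolete


# Hand-written sample; the last stanza uses fake IDs for illustration only
SAMPLE = """\
[Term]
id: GO:0000469
name: obsolete cleavage involved in rRNA processing
is_obsolete: true
replaced_by: GO:0006364

[Term]
id: GO:0006364
name: rRNA processing

[Term]
id: GO:9999998
name: obsolete made-up term for illustration
is_obsolete: true
consider: GO:9999997
"""

for term in parse_obsolete_terms(SAMPLE).values():
    print(term.go_id, ";".join(term.replaced_by), ";".join(term.consider), sep="\t")
```

To run it on the real file, read go-basic.obo into a string and pass it in place of SAMPLE. Keeping replaced_by and consider as separate lists makes it easy to treat them differently later.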

tanghaibao commented 1 year ago

@Maxim-Karpov

Thank you for the deep dive on this. I'll attempt a fix this weekend - perhaps bringing in both replaced_by and consider.

tanghaibao commented 1 year ago

@Maxim-Karpov

I just pushed a commit that adds an --obsolete option to find_enrichment.py.

  --obsolete {keep,replace,skip}
                        Strategy for handling obsolete GO terms (default: skip)

The replace strategy updates the obsolete GO term with terms suggested in replaced_by and consider. Please note that the default behavior stays the same, which is to skip the obsolete terms.
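
To make the three strategies concrete, here is a rough sketch of how they could behave on one gene's set of GO IDs. This is not the goatools implementation: the function resolve_obsolete and the obsolete_map structure are hypothetical, and the reading of keep as "leave the obsolete ID in place" is an assumption. The two mappings are the examples given earlier in this thread.

```python
def resolve_obsolete(go_ids, obsolete_map, strategy="skip"):
    """Apply a keep/replace/skip strategy to a set of GO IDs.

    obsolete_map (hypothetical structure) maps an obsolete GO ID to the list
    of suggested successor IDs taken from its replaced_by and consider tags.
    """
    if strategy not in {"keep", "replace", "skip"}:
        raise ValueError(f"unknown strategy: {strategy}")
    resolved = set()
    for go_id in go_ids:
        if go_id not in obsolete_map:
            resolved.add(go_id)  # current term, always retained
        elif strategy == "keep":
            resolved.add(go_id)  # assumption: keep the obsolete ID unchanged
        elif strategy == "replace":
            # Substitute every suggested successor; an empty list drops the term
            resolved.update(obsolete_map[go_id])
        # strategy == "skip": the obsolete term is dropped
    return resolved


# Obsolete -> successor pairs quoted earlier in this thread
obsolete_map = {
    "GO:0000469": ["GO:0006364"],
    "GO:0003763": ["GO:0016887"],
}
annotations = {"GO:0000469", "GO:0016887"}

for strategy in ("keep", "replace", "skip"):
    print(strategy, sorted(resolve_obsolete(annotations, obsolete_map, strategy)))
```

With replace, a term that has several suggestions contributes all of them, which is one possible answer to the multiple-replacements question above. Whether that is semantically safe when a term was split is still a judgment call.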

Thank you again for the great idea - and please let me know if there's an issue.

dieunelderilus commented 5 months ago

Hello @tanghaibao, I ran into the same problem mentioned in this issue! I downloaded the latest version of goatools, but find_enrichment.py -h did not show the --obsolete option. Do you have any idea why?

tanghaibao commented 5 months ago

@dieunelderilus

Did you try updating goatools? Run pip install -U goatools.

Also, the latest version as of today:

python -c "import goatools; print(goatools.__version__)"
1.4.11

should have the --obsolete option:

find_enrichment.py -h | grep obsolete
  --obsolete {keep,replace,skip}
                        Strategy for handling obsolete GO terms (default: