tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License

basic validity checks of id2gos association file #213

Open orenmn opened 3 years ago

orenmn commented 3 years ago

It seems to me that basic validity checks could be added to _init_id2gos() in anno/init/reader_idtogos.py, such that an error is raised in case of an invalid association file.

Rationale: I was careless enough to build an association file with "; " separating the GO IDs, and goatools silently used only a single GO ID from every line of the association file. Of course, that was my fault, but I think it would be better to raise an error in case of invalid GO IDs. Maybe even add a regex to verify that all GO IDs in the association file are as expected? (I guess the only disadvantage of this is runtime, but I don't think it would be significant.)
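For illustration, the regex idea could look roughly like the sketch below. It is not goatools code: the helper name `find_invalid_go_ids` is hypothetical, and it assumes that GO IDs have the usual form of `GO:` followed by seven digits and that the IDs on a line are separated by a bare ";".

```python
import re

# Assumption: a well-formed GO ID is "GO:" followed by exactly seven digits,
# e.g. GO:0008150.
GO_ID_PATTERN = re.compile(r"GO:[0-9]{7}")


def find_invalid_go_ids(go_field, sep=";"):
    """Return the tokens in go_field that are not well-formed GO IDs.

    Hypothetical helper, not part of goatools. Tokens are compared as-is,
    so a stray space after the separator makes a token invalid.
    """
    return [tok for tok in go_field.split(sep)
            if not GO_ID_PATTERN.fullmatch(tok)]
```

With a check like this, a field written with "; " as the separator is reported instead of being silently truncated, because every token after the first starts with a space.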

dvklopfenstein commented 3 years ago

Thank you for your interest in GOATOOLS and for taking the time to write to us.

I have implemented a number of checks for gaf and gpad files. We have found many bugs in the annotation files and have reported them to the Gene Ontology Consortium.

Having a checker is a great idea. I would prefer to keep the checking separate from the reading of the annotation files so that it does not slow down the reading of the annotations, which can be quite large. Also, the format of an annotation file only needs to be checked once for a new annotation file, not every time the annotations are read for an analysis.

It would be an extremely welcome addition if you would like to add a stand-alone checker for the id2gos annotation format. Your pull request would be very easy to accept. Also, it would likely not be a burden for you to support your new software because of the nature of the addition.
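As a rough starting point, a stand-alone checker that runs once per file, separately from reading the annotations, might be sketched as follows. This is not an existing goatools module: the function names are hypothetical, and the layout it checks (one item ID, a tab, then ";"-separated GO IDs per line) is an assumption about the id2gos format that should be confirmed against the reader.

```python
import re

# Assumption: a well-formed GO ID is "GO:" followed by exactly seven digits.
GO_ID_RE = re.compile(r"GO:[0-9]{7}")


def check_id2gos_lines(lines):
    """Yield (line_number, message) for each problem found.

    Hypothetical checker. Assumes each non-blank line is
    '<item ID><TAB><GO ID>;<GO ID>;...'.
    """
    for num, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            yield num, "expected 2 tab-separated fields, found %d" % len(parts)
            continue
        item_id, go_field = parts
        if not item_id:
            yield num, "empty item ID"
        for tok in go_field.split(";"):
            if not GO_ID_RE.fullmatch(tok):
                yield num, "invalid GO ID %r" % tok


def check_id2gos_file(path):
    """Return the list of problems found in the file at path."""
    with open(path) as handle:
        return list(check_id2gos_lines(handle))
```

Calling `check_id2gos_file` on a new association file and refusing to proceed if the returned list is non-empty keeps the cost of validation out of the normal read path.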

What do you think about writing a checker and becoming a GOATOOLS contributor?

orenmn commented 3 years ago

Sorry, but I lack time :|