tanghaibao / goatools

Python library to handle Gene Ontology (GO) terms
BSD 2-Clause "Simplified" License

Unhelpful Documentation, Missing High Level Interface #241

Open fratajcz opened 2 years ago

fratajcz commented 2 years ago

Hi!

I really value your effort in trying to bring GOEA to Python, but please, please, please go all the way. This whole library feels like someone did the hard things (programming the actual functionality) and then just abandoned it without giving users a chance to actually use it. As was already mentioned, the documentation is not helpful. I can't find a comprehensive explanation of the task that I want to achieve with this library.

My task is simple, I want to do a GO term enrichment analysis. From what I understand I should need three things to do that:

A user friendly library should even hide the last point and let me choose it by specifying fitting criteria in a high-level interface.

Let me show how your library should work with some Python pseudocode:

results = [...]     # a small number of HGNC identifiers
background = [...]  # a larger number of HGNC identifiers

from goatools import GOEnrichmentStudyNS

study = GOEnrichmentStudyNS(results, background, organism="human", ontology="GO", identifiers="hgnc")
df = study.do_analysis()

Et voilà, the final df should hold my GOEA results. Two lines, not counting the initialization of the iterables. Sadly, the user has to search through undescribed example scripts, each containing hundreds of lines of code that don't explain much of what they're doing and keep the reader guessing. Even if I could make it work, I would have to spend days writing a high-level wrapper for myself; otherwise I'd end up with brittle and bloated scripts that become unreadable within weeks.
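To make concrete what such a wrapper would compute underneath, here is a minimal sketch of an over-representation test in plain Python. It is not the goatools API and not how goatools implements anything: the names `hypergeom_sf` and `enrich`, the gene-to-term dictionary, and the toy identifiers are all made up for illustration, and the multiple-testing step is a plain Bonferroni correction chosen for brevity.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) when N genes are drawn from M, of which n carry the term."""
    upper = min(n, N)
    hits = sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1))
    return hits / comb(M, N)


def enrich(study, population, gene2terms, alpha=0.05):
    """Toy enrichment analysis (hypothetical helper, not part of goatools)."""
    population = set(population)
    study = set(study) & population
    # Invert gene -> terms into term -> genes, restricted to the population.
    term2genes = {}
    for gene in population:
        for term in gene2terms.get(gene, ()):
            term2genes.setdefault(term, set()).add(gene)
    rows = []
    for term, genes in term2genes.items():
        k = len(genes & study)
        if k == 0:
            continue  # terms without any study gene are not tested
        p = hypergeom_sf(k, len(population), len(genes), len(study))
        rows.append({"term": term, "study_count": k,
                     "pop_count": len(genes), "p_uncorrected": p})
    rows.sort(key=lambda r: r["p_uncorrected"])
    # Bonferroni correction over the number of tested terms.
    for r in rows:
        r["p_bonferroni"] = min(1.0, r["p_uncorrected"] * len(rows))
        r["significant"] = r["p_bonferroni"] < alpha
    return rows


# Made-up toy data: 20 background genes and two invented term IDs.
background = [f"gene{i}" for i in range(1, 21)]
results = ["gene1", "gene2", "gene3"]
gene2terms = {f"gene{i}": ["TERM:A"] for i in range(1, 5)}
gene2terms["gene1"] = ["TERM:A", "TERM:B"]
for i in range(11, 20):
    gene2terms[f"gene{i}"] = ["TERM:B"]

table = enrich(results, background, gene2terms)
for row in table:
    print(row)
```

The list of dicts can be passed straight to `pandas.DataFrame(table)` if a data frame is wanted. A real wrapper would additionally have to load the ontology and the annotations for the chosen organism and identifier type, which is exactly the part a high-level interface should hide.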

I am sorry if I sound frustrated but I kinda am. Issues like these are some of the reasons why bioinformatics is needlessly hard.

fratajcz commented 2 years ago

For example, the R package clusterProfiler has a high-level interface that works exactly like that: https://yulab-smu.top/biomedical-knowledge-mining-book/clusterprofiler-go.html#clusterprofiler-go-ora

szarecor commented 2 years ago

@fratajcz: You should delete this issue and apologize to the goatools contributors.

Your comments are belittling and not constructive. Apologizing for being frustrated does not excuse or justify your tone.

If you don't like the free work that others have made available to you, feel free to write your own library.

fratajcz commented 2 years ago

@szarecor I agree, the tone of my original wording was not constructive, hence I redacted it. However, I still stand by the content of the issue. I think it is constructive, as it actually provides an example for a real improvement that could be made that would greatly increase the usability of this library. I think it is relevant, since I have seen several other Issues going in the same direction. On top of that, I point out an R library that does it exactly as in the example that I gave, adding further relevance and credibility to the issue.

I am not sorry for calling out a problem that has cost me several hours, since it has probably cost several hours to several hundreds of grad students. With the clusterProfiler package I had my solution within 15 minutes, so people should be advised to rather invest their time in learning rpy2 to integrate R in their python workflow.

lkondratova commented 2 years ago

I ran goatools enrichment smoothly, both from the command line and incorporated into my custom script (with no previous experience with GO). I used the example notebooks for the specific functions, and enrichment analysis works just fine. Quite surprised to see this issue. Not sure what your implication is, though.

fratajcz commented 2 years ago

I didn't imply anything; I directly addressed the issue that I have. It is nice that you guys figured it out; however, that doesn't mean it can't be improved. And again, I'm not saying the library is bad; it just has unnecessary hurdles that keep people from using it (and, thus, from citing the library's paper in their work). Just as an example of what could easily be improved:

For the command-line usage, what should the three files look like? Does the ontology have to be an .owl file or an .obo file, or is .csv okay, too? Does it work with Entrez IDs, Ensembl IDs or gene symbols? The only help refers to tests/data/, which contains roughly 20 files, none of which is named population_example_file or study_example_file or anything similar. Why is the user kept guessing?

All I wanted is to point out how easy it would be to make this library ten times more accessible. I also get the feeling that this kind of improvement is not welcome here. I have now built my solution in python using lightgeoa and implemented missing features myself.