merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[DISCUSSION] Common misconceptions and mistakes for anvi'o beginners that we should clarify in a blog post #2230

Open ivagljiva opened 4 months ago

ivagljiva commented 4 months ago

@FlorianTrigodet and I have noticed several common issues emerging from recent Discord questions. These are not really bugs, but quirks of the anvi'o ecosystem that are not obvious to beginners. Many of them are even documented across our various tutorials and help pages, but clearly people aren't finding or understanding those pages because they keep making the same mistakes.

We thought this could be addressed with a new blog post, something like "Important things to learn when starting with anvi'o" or "FAQ and common issues for anvi'o beginners", in which we could have a section for each issue that 1) summarizes the convention, 2) explains why we do it this way, 3) provides links to related tutorials or documentation, and 4) explains what to do if you've already made the mistake. Then, when people ask these common questions, we'll be able to send them the URL to the appropriate section. We'll also be able to direct new users to read through the page first so that they can hopefully avoid some headaches.

Of course, we should also update the relevant anvi'o help pages associated with each issue (but it would still be useful to have these common issues described in one central location, IMO).

Here we are starting a list of these 'common issues', and once we collect several, we can put them together into a post. We welcome ideas and contributions from the community for this effort :)

To start the list, I scanned through the most recent Discord threads to identify the common themes (and I'm also drawing from my memory of things that I always find myself explaining to people in workshops and stuff).

Mismatch between reformatted contig headers and BAM files

A lot of people run metagenomic read recruitment against reference FASTA files whose contig headers are incompatible with anvi'o. They then use anvi-script-reformat-fasta so that they can make the contigs database, but run into issues later when trying to run anvi-profile because the contig names in their contigs db don't match those in their BAM files.
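To make the mismatch concrete, here is a minimal sketch in plain Python (not anvi'o code). It assumes the read mapper stored only the first whitespace-delimited token of each defline as the reference name, which is typical, and it uses made-up headers. The function names `fasta_contig_names` and `compare_contig_names` are hypothetical helpers for illustration.

```python
def fasta_contig_names(fasta_text):
    """Return the name of every sequence in a FASTA string.

    Only the first whitespace-delimited token of each defline is kept,
    since read mappers typically record reference names that way.
    """
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.append(tokens[0])
    return names


def compare_contig_names(contigs_db_names, bam_names):
    """Report which names are shared and which exist on one side only."""
    db, bam = set(contigs_db_names), set(bam_names)
    return {
        "shared": sorted(db & bam),
        "only_in_contigs_db": sorted(db - bam),
        "only_in_bam": sorted(bam - db),
    }


# Hypothetical data: reads were mapped to the ORIGINAL FASTA, but the
# contigs database was built from a REFORMATTED copy of it.
original_fasta = (
    ">NODE_1_length_900_cov_3.2 sample=A\nACGT\n"
    ">NODE_2_length_400_cov_1.1 sample=A\nGGCC\n"
)
reformatted_fasta = ">c_000000000001\nACGT\n>c_000000000002\nGGCC\n"

bam_names = fasta_contig_names(original_fasta)
db_names = fasta_contig_names(reformatted_fasta)
report = compare_contig_names(db_names, bam_names)

# No shared names means the BAM file and the contigs db cannot be matched up.
print(report["shared"])
```

In a real project the BAM-side names would come from the BAM header (for example the @SQ lines shown by `samtools view -H`) rather than from the FASTA. Reformatting the FASTA before read recruitment, and mapping to the reformatted file, avoids the mismatch in the first place.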

See the following Discord questions:

What we need to tell people is this:

General confusion about profile databases vs single profiles and merging

The main problem here is that many people don't really think about what is going on behind the scenes with profiling and merging. They probably just see those steps in the metagenomic workflow and assume that they always have to run them.

Which leads to Discord questions like these:

I think we should direct new users to:

If you want to add/remove genomes from a pangenome, you need to re-compute the pangenome

... with the caveat that if you are doing enrichment on the pangenome with categorical variables, you can exclude genomes from the enrichment analysis without removing them from the pangenome (as discussed here). Here are the related Discord questions:

General confusion about importing misc data and data orders into databases

(or sometimes people aren't even aware that this is a thing they can do).

What information does anvi-summarize provide in each input case?

I often find myself recommending that people use anvi-summarize to get the data they want, but I always forget what output files it gives you, so it is hard to determine whether it is the appropriate solution for someone's question. Even when people find and use anvi-summarize by themselves, they sometimes have questions about what each data type means. For example:

This doesn't necessarily need its own section in the FAQ post, but is more of a note that we should update the help page for anvi-summarize to describe the output files you get when you run it on a contigs + profile db vs a pangenome, etc.

General confusion about external vs internal genomes

... and making people aware that you CAN combine multiple genomes into the same contigs DB for combined analysis by reformatting the contig headers with --prefix and importing a collections txt.
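As a rough illustration of that idea, the sketch below (plain Python, not anvi'o code) renames the contigs of each genome with a genome-specific prefix and writes a matching collection table. In a real workflow the renaming would be done by anvi-script-reformat-fasta with --prefix. The two-column TAB-delimited layout (item name, then bin name) reflects my understanding of what anvi-import-collection accepts, so please check its help page. The function name `combine_genomes` and the example genomes are hypothetical.

```python
def combine_genomes(genomes):
    """Merge several genomes into one FASTA plus a collection table.

    `genomes` maps a genome name to its FASTA text. Every contig is renamed
    to "<genome>_<counter>", and each renamed contig is assigned to a bin
    named after the genome it came from.
    """
    fasta_lines = []
    collection_lines = []
    for genome_name, fasta_text in genomes.items():
        counter = 0
        for line in fasta_text.splitlines():
            if line.startswith(">"):
                counter += 1
                new_name = f"{genome_name}_{counter:012d}"
                fasta_lines.append(f">{new_name}")
                # Two TAB-separated columns: item name, then bin name.
                collection_lines.append(f"{new_name}\t{genome_name}")
            elif line.strip():
                fasta_lines.append(line.strip())
    return "\n".join(fasta_lines) + "\n", "\n".join(collection_lines) + "\n"


# Hypothetical input: both genomes contain a contig called "contig1",
# which would collide without the genome-specific prefix.
genomes = {
    "Genome_A": ">contig1 len=8\nACGTACGT\n>contig2\nGGCC\n",
    "Genome_B": ">contig1\nTTAA\n",
}
merged_fasta, collection_txt = combine_genomes(genomes)
print(merged_fasta)
print(collection_txt)
```

The prefix keeps contig names unique across genomes, and the collection table records which contigs belong to which genome so that each genome can still be treated as its own bin after everything shares one contigs DB.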

meren commented 4 months ago

This is a great point and a welcome attempt to improve the situation. I had hoped our help pages would address these issues, but as you point out, I guess they are not enough by themselves.

Of course, we should also update the relevant anvi'o help pages associated with each issue (but it would still be useful to have these common issues described in one central location, IMO).

But I couldn't agree more with this statement above.

The funny thing is, we're using Discord so that the answers accumulate over time, so we don't have to respond to the same questions over and over again. But then, we realize we do that still, and now we are trying to put together an F.A.Q. by going through Discord :p Kind of funny and sad at the same time.

ivagljiva commented 4 months ago

Yeah, it is a little bit frustrating. I think one reason this doesn't work:

we're using Discord so that the answers accumulate over time, so we don't have to respond to the same questions over and over again.

Is that the search function for posts is really bad (much worse than in Slack). From my experience, it seems like the search function only looks through the titles of posts, not the content of each thread. The titles of posts are generally very poorly written, so of course people don't find anything. And sometimes people post questions within other threads that are only marginally related, so they get lost that way.

And more likely than not, a lot of people just don't bother to read what was posted before, or to search the help pages at all. But I'm not sure how to discourage this behavior without refusing to answer people who haven't done their due diligence first, which feels wrong. Especially since it is not always clear whether someone tried to look through previous posts or help pages (unless they explicitly say so).

Hopefully this effort will yield improvements to the most commonly needed help pages so that we have multiple links to throw at people with these specific issues 😞

FlorianTrigodet commented 4 months ago

Is that the search function for posts is really bad (much worse than in Slack). From my experience, it seems like the search function only looks through the titles of posts, not the content of each thread.

There are (unfortunately) two search bars in Discord. There is the big one that is very inviting but only searches for terms in post titles. And there is a second, smaller one in the top right corner that is a proper search bar and works nicely.

[Image: Screenshot 2024-02-27 at 10 44 56]

I can modify that screenshot and we could add it to the Discord's rule-and-guidelines channel.

meren commented 4 months ago

And more likely than not, a lot of people just don't bother to read what was posted before, or to search the help pages at all. But I'm not sure how to discourage this behavior without refusing to answer people who haven't done their due diligence first, which feels wrong.

This highlights so well the dilemma inflicted upon people whose goal is to develop solutions that try to match the sophistication of the questions they aim to address.

While we don't want to alienate or push away those who don't have the time or interest to read even the clearest error message, one that already explains the problem and the solution to them, we are taking more and more time from our primary tasks to help them.

The more I think about it, the more I realize that we need a revolution rather than yet another solution that will not go beyond what we have already been doing: trying to help those who will have time to read things (and who often don't need our help).

So what would be the revolution in this context? Well, probably developing a language model that periodically processes all our code, documentation, and Discord material to give access to that nebula of wisdom through a chatbot. In an ideal world, the precious time of those who are genuinely thinking about the future of this community would be better spent on investigating available technologies to establish such a long-term solution than on a blog post. But I know we do not live in an ideal world, and we are just trying to put out fires most of the time. Which is also admirable and needed, and this is what that blog post will do. So I am not saying let's stop doing this and do the other thing. But I just wanted to share my 2 cents in case it turns on a light bulb in someone else's mind.

Ge0rges commented 3 months ago

Perhaps an online anvi'o forum that would get indexed by Google would help with this in the long term? For example, a hosted Discourse.