Thank you very much for your detailed report, @okcoskun.
TL;DR: everything is working on the anvi'o side, and this is a user-side issue.
This was very interesting and unexpected, as it is coming from one of the most well-tested parts of anvi'o that has been working for everyone, so it certainly is not due to a bug on our end. What one could do to end up here was not clear to me at first, but I think I figured it out.
What makes this interesting is this: you are using the same contigs database to profile two different BAM files. But somehow in the first profiling attempt the output says this,
num_contigs ........................: 6,246
num_contigs_after_M ................: 6,246
num_splits .........................: 6,328
in the other, it says this:
num_contigs ........................: 2,410
num_contigs_after_M ................: 2,410
num_splits .........................: 2,559
The num_contigs information does not come from the contigs database; it is learned from the BAM file itself. So even though num_contigs_after_M can differ based on the parameters you set through anvi-profile, there is no reason for num_contigs to differ across BAM files if you truly used the same FASTA file for read recruitment. Just for future reference, here is the code evidence from the relevant module that shows num_contigs in the anvi-profile output messages is coming from the BAM file:
(...)
self.contig_names = self.bam.references
(...)
self.run.info('num_contigs', pp(len(self.contig_names)))
(...)
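To check this without running anvi-profile, one can count the reference sequences declared in each BAM header. The following is only a minimal sketch, assuming the header text has already been dumped to a string (for example with samtools view -H); the function name and the example header are hypothetical and not part of anvi'o:

```python
def reference_names_from_sam_header(header_text):
    """Return the reference (contig) names declared in @SQ header lines."""
    names = []
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs; SN holds the name.
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
                break
    return names


# Hypothetical header declaring two reference sequences.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:contig_1\tLN:1500\n"
    "@SQ\tSN:contig_2\tLN:2300\n"
    "@PG\tID:mapper\n"
)
print(len(reference_names_from_sam_header(example_header)))
```

If two BAM files were produced by recruiting reads to the same FASTA file, this count should be identical for both.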
If the FASTA file from which you generated the contigs database had contig names that didn't match the names in the BAM files, anvi'o would have caught it. So I think the only way for the number of contigs to differ between two BAM files generated by recruiting reads to the same FASTA file is that the default parameters CLC uses to create these BAM files are doing something unexpected. (Or you generated two different FASTA files with different numbers of contigs but matching contig names, did read recruitment separately, and used only one of them to create a contigs database, which is too elaborate an evil scheme for any end user to implement. But if that is the case, CLC is innocent.)
Probably there is a checkbox to click or something to make CLC produce BAM files the way everyone else is generating them. You can do that and everything should work out. But maybe this is a good opportunity to consider ditching CLC and using all the open-source alternatives such as BWA, Bowtie2, samtools, etc.
My friends were using CLC at the time, so the data we profiled and published in the original anvi'o publication came from CLC, and back then things worked out of the box. But then we stopped using CLC, and we haven't missed it since.
Apart from these 2 cents of mine, long story short: everything is working on the anvi'o side as expected, as far as I can tell.
Best,
Thank you for the prompt reply! I really appreciate it.
Himmmm, If I get it right, BAM files should be produced for each sample using the same FASTA file. So, this should be my fault.
I think it is okay to add images taken from CLC here. If not, please delete them; I couldn't see any instructions prohibiting uploading images from commercial software. So let me explain my workflow using these.
1) I analyzed my metagenomes separately. Then right clicked to get the BAM files and FASTA files from each assembly subset files.
2) So, now I have two FASTA files and two BAM files. Then, I concatenate two exported fasta files (The first step in the issue).
cat TwoMetagenomes/Assembly/*.fa >contigs_combined.fasta
3) Use Anvio metagenomic workflow.
Since I am using separate BAM files, they have different numbers of contigs. But if you sum up the number of contigs in the BAM files (6,246 + 2,410 = 8,656), it matches the number of contigs in the concatenated contigs.fa.
Contigs with at least one gene call ..........: 8656 of 8656 (100.0%)
Contigs database .............................: A new database, contigs.db, has been created.
Number of contigs ............................: 8,656
Number of splits .............................: 8,887
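The sum matching the total suggests that each BAM file only knows the contigs of its own assembly. A rough sketch of how that could be checked, assuming the contig names from the FASTA file and from each BAM header are already available as plain lists; the function name, labels, and contig names below are made up for illustration:

```python
def diagnose_reference_sets(fasta_names, bam_reference_sets):
    """Compare each BAM file's reference names with the FASTA contig names."""
    fasta = set(fasta_names)
    report = {}
    for label, refs in bam_reference_sets.items():
        refs = set(refs)
        report[label] = {
            "n_refs": len(refs),
            # Contigs in the FASTA file that this BAM file has never heard of.
            "missing_from_bam": len(fasta - refs),
            # References in this BAM file that are absent from the FASTA file.
            "unknown_to_fasta": len(refs - fasta),
        }
    return report


# Hypothetical names: a FASTA made by concatenating two separate assemblies.
fasta_names = ["A_1", "A_2", "A_3", "B_1", "B_2"]
bams = {"sample_A": ["A_1", "A_2", "A_3"], "sample_B": ["B_1", "B_2"]}
for label, stats in diagnose_reference_sets(fasta_names, bams).items():
    print(label, stats)
```

A non-zero missing_from_bam for every sample is the pattern described here: the BAM files were made against the individual assemblies rather than against the combined FASTA file.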
It seems that I am making a simple mistake, but this might be helpful for others in the community who are just starting with metagenomic analysis and want to use this beautifully designed software, Anvi'o.
Thanks in advance for your comment, again.
Ömer
Himmmm, If I get it right, BAM files should be produced for each sample using the same FASTA file.
Exactly.
So, now I have two FASTA files and two BAM files. Then, I concatenate two exported fasta files
As you identified, this is the problem. You should concatenate the two FASTA files. Then you should produce 2 BAM files from the concatenated FASTA.
Thank you very much Evan! Sorry for taking Murat's and your time for this very simple mistake.
Ömer
No problem. Glad it is clear to you now.
So, now I have two FASTA files and two BAM files. Then, I concatenate two exported fasta files
As you identified, this is the problem. You should concatenate the two FASTA files. Then you should produce 2 BAM files from the concatenated FASTA.
In fact it is just a little more than that.
I presume those two FASTA files are the assemblies of each metagenome independently. If you simply concatenate the two FASTA files and do read recruitment, you will run into another, more theoretical problem: the likely redundancy of contigs between the two files (and there is no easy solution for that).
So the concatenation of the two metagenomes should take place before the assembly. You should combine all metagenomes involved (i.e., by concatenating R1s and R2s, or by asking the assembler to use all related FASTQ files), and then use the resulting FASTA file of contigs to do independent read recruitments (this strategy is typically referred to as "co-assembly" and has advantages and disadvantages compared to single assemblies).
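As an illustration of the "concatenating R1s and R2s" option, here is a minimal sketch that pools paired FASTQ files before assembly. The function names, file names, and the list-of-pairs layout are assumptions made for this example, and it presumes all inputs are of the same kind (all plain text or all gzipped):

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_files(inputs, output):
    """Append the inputs, in the given order, into a single output file."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
    return Path(output)


def pool_paired_reads(samples, out_dir):
    """Pool R1 and R2 files separately, using the same sample order for both.

    samples is a list of (r1_path, r2_path) tuples (hypothetical layout).
    """
    out_dir = Path(out_dir)
    r1 = concatenate_files([s[0] for s in samples], out_dir / "pooled_R1.fastq")
    r2 = concatenate_files([s[1] for s in samples], out_dir / "pooled_R2.fastq")
    return r1, r2


# Tiny demonstration with made-up reads written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    samples = []
    for name in ("sample_A", "sample_B"):
        r1, r2 = tmp / f"{name}_R1.fastq", tmp / f"{name}_R2.fastq"
        r1.write_text(f"@{name}/1\nACGT\n+\nIIII\n")
        r2.write_text(f"@{name}/2\nTGCA\n+\nIIII\n")
        samples.append((r1, r2))
    pooled_r1, pooled_r2 = pool_paired_reads(samples, tmp)
    print(pooled_r1.read_text().count("@sample"))
```

Keeping the sample order identical for R1 and R2 matters, because the assembler pairs reads by their position in the two files. The pooled files go to the assembler, while the read recruitment afterwards still uses each sample's own reads.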
Best,
Hi Murat
This is the first time I am using Anvi'o. I installed it using conda on my laptop, which runs Linux.
anvi-self-test --suite mini works fine. Anyway, I have a bunch of metagenomic libraries that I would like to profile using Anvi'o. I haven't done any binning yet, but first I would like to analyze them following your metagenomic tutorial (http://merenlab.org/2016/06/22/anvio-tutorial-v2/). Before starting to work with my dataset, I used the contigs.fa and BAM files provided in this tutorial, and it worked perfectly fine. However, after I gave it 2 of my metagenomic libraries, I got an error in the anvi-merge step, which is:
All the BAM files and contigs were generated in CLC with minimum contig length 1000, exported properly and an example of the names looks like this:
Kestanbol_trimmed_contig_12_mapping
which should be okay based on the tutorial. Since I couldn't see any option to index and sort BAM files in CLC, I ran anvi-init-bam Kestanboltrimmed\ assembly\ subset.bam -o Kes1.bam.
Let me provide all the scripts:
1)
cat TwoMetagenomes/Assembly/*.fa >contigs_combined.fasta
2)
3)
4) Here are 2 warnings from Anvi'o.
5) Sorting BAM files exported from CLC.
6)
7) Merging
I used the same flags and parameters for all analyses. Am I missing something? Or is the environmental data that I am trying to analyse here not appropriate for this type of analysis? I can send you the data if needed. Thank you very much in advance.
Cheers
Ömer Kürsat Coskun