merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
436 stars 144 forks source link

Anvi-merge gives an error - The number of nucleotides described are not identical for all profiles to be merged #1403

Closed okcoskun closed 4 years ago

okcoskun commented 4 years ago

Hi Murat

This is the first time I am using Anvi'o and installed using conda in my laptop which has Linux operation system. anvi-self-test --suite mini works fine.

Anvi'o version ...............................: esther (v6.1)
    Profile DB version ...........................: 31
    Contigs DB version ...........................: 14
    Pan DB version ...............................: 13
    Genome data storage version ..................: 6
    Auxiliary data storage version ...............: 2
    Structure DB version .........................: 1

Anyway, I have bunch of metagenomic libraries that I would like to profile using Anvi'o. I haven't done any binning yet, but first I would like to analyze them following your metagenomic tutorial (http://merenlab.org/2016/06/22/anvio-tutorial-v2/). Before starting working with my dataset, I used the provided contigs.fa and BAM files in this tutorial and worked perfectly fine. However, after I gave 2 of my metagenomic libraries, I got an error in anvio-merge step which is:

Ouch. The number of nucleotides described are not identical for all profiles to be merged, which is a deal breaker. All profiles that are going to be merged must be run with identical flags and parameters :/ You really shouldn't but if you want to try to force things because you believe this is due to a misunderstanding, you can use the flag --force. While you are considering this as an option, please also remember that this we advice against it..

All the BAM files and contigs were generated in CLC with minimum contig length 1000, exported properly and an example of the names looks like this: Kestanbol_trimmed_contig_12_mapping which should be okay based on the tutorial. Since I couldn't see any option to index and sort BAM files in CLC, I performed anvi-init-bam Kestanboltrimmed\ assembly\ subset.bam -o Kes1.bam.

Let me provide all the scripts: 1) cat TwoMetagenomes/Assembly/*.fa >contigs_combined.fasta

2)

anvio-6.1) ocoskun@mla085:~/OmerWorkingFolder/Turkey_data$ anvi-gen-contigs-database -f contigs_combined.fasta -o contigs.db -n ExampleMetagenomes
Input FASTA file .............................: /home/ocoskun/OmerWorkingFolder/Turkey_data/contigs_combined.fasta
Name .........................................: ExampleMetagenomes
Description ..................................: No description is given
Split Length .................................: 20,000                                                                                              
K-mer size ...................................: 4
Skip gene calling? ...........................: False
External gene calls provided? ................: None
Ignoring internal stop codons? ...............: False
Splitting pays attention to gene calls? ......: True

Finding ORFs in contigs
===============================================
Genes ........................................: /tmp/tmpiqrtqi0y/contigs.genes
Amino acid sequences .........................: /tmp/tmpiqrtqi0y/contigs.amino_acid_sequences
Log file .....................................: /tmp/tmpiqrtqi0y/00_log.txt
Result .......................................: Prodigal (v2.6.3) has identified 36374 genes.                                                       

Contigs with at least one gene call ..........: 8656 of 8656 (100.0%)                                                                               
Contigs database .............................: A new database, contigs.db, has been created.
Number of contigs ............................: 8,656
Number of splits .............................: 8,887
Total number of nucleotides ..................: 30,985,319
Gene calling step skipped ....................: False
Splits broke genes (non-mindful mode) ........: False
Desired split length (what the user wanted) ..: 20,000
Average split length (wnat anvi'o gave back) .: 22,943

3)

(anvio-6.1) ocoskun@mla085:~/OmerWorkingFolder/Turkey_data$ anvi-run-hmms -c contigs.db
Target found .................................: RNA:CONTIG                                                                                          
Target found .................................: AA:GENE                                                                                             

HMM Profiling for Ribosomal_RNAs                                                                                                                    
===============================================                                                                                                     
Reference ....................................: Seemann T, https://github.com/tseemann/barrnap
Kind .........................................: Ribosomal_RNAs
Alphabet .....................................: RNA
Context ......................................: CONTIG
Domain .......................................: N\A
HMM model path ...............................: /home/ocoskun/anaconda3/envs/anvio-6.1/lib/python3.6/site-packages/anvio/data/hmm/Ribosomal_RNAs/genes.hmm.gz                                                                                                                                           
Number of genes ..............................: 12
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
Temporary work dir ...........................: /tmp/tmp1eglq4g5
Log file .....................................: /tmp/tmp1eglq4g5/00_log.txt
Number of raw hits ...........................: 27                                                                                                  
Pruned .......................................: 9 out of 27 hits were removed due to redundancy
Gene calls added to db .......................: 18 (from source "Ribosomal_RNAs")                                                                   

HMM Profiling for Bacteria_71                                                                                                                       
===============================================                                                                                                     
Reference ....................................: Lee modified, https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: bacteria
HMM model path ...............................: /home/ocoskun/anaconda3/envs/anvio-6.1/lib/python3.6/site-packages/anvio/data/hmm/Bacteria_71/genes.hmm.gz                                                                                                                                              
Number of genes ..............................: 71
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
Temporary work dir ...........................: /tmp/tmp5lw3xm38
Log file .....................................: /tmp/tmp5lw3xm38/00_log.txt
Number of raw hits ...........................: 896                                                                                                 

HMM Profiling for Protista_83
===============================================
Reference ....................................: Delmont, http://merenlab.org/delmont-euk-scgs
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: eukarya
HMM model path ...............................: /home/ocoskun/anaconda3/envs/anvio-6.1/lib/python3.6/site-packages/anvio/data/hmm/Protista_83/genes.hmm.gz
Number of genes ..............................: 83
Noise cutoff term(s) .........................: -E 1e-25
Number of CPUs will be used for search .......: 1
Temporary work dir ...........................: /tmp/tmp5lw3xm38
Log file .....................................: /tmp/tmp5lw3xm38/00_log.txt
Number of raw hits ...........................: 93                                                                                                  

HMM Profiling for Archaea_76
===============================================
Reference ....................................: Lee, https://doi.org/10.1093/bioinformatics/btz188
Kind .........................................: singlecopy
Alphabet .....................................: AA
Context ......................................: GENE
Domain .......................................: archaea
HMM model path ...............................: /home/ocoskun/anaconda3/envs/anvio-6.1/lib/python3.6/site-packages/anvio/data/hmm/Archaea_76/genes.hmm.gz
Number of genes ..............................: 76
Noise cutoff term(s) .........................: --cut_ga
Number of CPUs will be used for search .......: 1
Temporary work dir ...........................: /tmp/tmp5lw3xm38
Log file .....................................: /tmp/tmp5lw3xm38/00_log.txt
Number of raw hits ...........................: 698                                                                                                 

✓ anvi-run-hmms took 0:04:43.909206

4) Here are 2 warnings from Anvi'o.

(anvio-6.1) ocoskun@mla085:~/OmerWorkingFolder/Turkey_data$ anvi-run-ncbi-cogs -c contigs.db --num-threads 8
COG data directory ...........................: /home/ocoskun/anaconda3/envs/anvio-6.1/lib/python3.6/site-packages/anvio/data/misc/COG
COG data source ..............................: The anvi'o default.
COG data directory ...........................: /home/ocoskun/anaconda3/envs/anvio-6.1/lib/python3.6/site-packages/anvio/data/misc/COG
Searching with ...............................: diamond
Directory to store temporary files ...........: /tmp/tmp8yky_zpi
Directory will be removed after the run ......: True

WARNING
===============================================
18 entries in the sequences table had blank sequences :/ This is related to the
issue at https://github.com/merenlab/anvio/issues/565. If this is like mid-2020
and you still are getting this warning, please find an anvi'o developer and make
them feel embarrassed. If it is earlier than that, then take this as a simple
warning to remember that some gene calls in your downstream analyses may have no
amino acid sequences, and that's actuall OK. This is a very minor issue due to
on-the-fly addition of Ribosomal RNA gene calls to the contigs database, and
will unlikely affect anything major. This warning will go away when anvi'o can
seamlessly work with multiple gene callers (which we are looking forward to
implement in the future).

Sequences ....................................: 36374 sequences reported.
FASTA ........................................: /tmp/tmp8yky_zpi/aa_sequences.fa
DIAMOND is set to be .........................: Fast
Diamond blastp results .......................: /tmp/tmp8yky_zpi/diamond-search-results.daa                                                         
Diamond tabular output file ..................: /tmp/tmp8yky_zpi/diamond-search-results.txt                                                         
COG data directory ...........................: /home/ocoskun/anaconda3/envs/anvio-6.1/lib/python3.6/site-packages/anvio/data/misc/COG
COG data source ..............................: The anvi'o default.
Gene functions ...............................: 55158 function calls from 2 sources for 27579 unique gene calls has been added to the contigs database.

WARNING
===============================================
Well. Your COGs were successfully added to the database, but there were some
garbage anvi'o brushed off under the rug. There were 2 genes in your database
that hit 1 protein IDs in NCBIs COGs database, but since NCBI did not release
what COGs they correspond to in the database they made available (that helps us
to resolve protein IDs to COG ids), we could not annotate those genes with
functions. Anvi'o apologizes on behalf of all computer scientists for half-done
stuff we often force biologists to deal with. If you want to do some Googling,
these were the offending protein IDs: '557694907'.

5) Sorting BAM files exported from CLC.

(anvio-6.1) ocoskun@mla085:~/OmerWorkingFolder/Turkey_data$ anvi-init-bam TwoMetagenomes/BAM/Kestanboltrimmed\ assembly\ subset.bam -o Kestan.bam
Sorted BAM File ..............................: /home/ocoskun/OmerWorkingFolder/Turkey_data/Kestan.bam                                              
BAM File Index ...............................: /home/ocoskun/OmerWorkingFolder/Turkey_data/Kestan.bam.bai                                          

(anvio-6.1) ocoskun@mla085:~/OmerWorkingFolder/Turkey_data$ anvi-init-bam TwoMetagenomes/BAM/HidSed2trimmed\ assembly\ subset.bam -o HidSed2.bam
Sorted BAM File ..............................: /home/ocoskun/OmerWorkingFolder/Turkey_data/HidSed2.bam                                             
BAM File Index ...............................: /home/ocoskun/OmerWorkingFolder/Turkey_data/HidSed2.bam.bai                

6)

(anvio-6.1) ocoskun@mla085:~/OmerWorkingFolder/Turkey_data$ anvi-profile -i Kestan.bam -c contigs.db             
Contigs DB .........................: Initialized: contigs.db (v. 14)                                                                               

WARNING                                                                                                                                             
=====================================                                                                                                               
Amino acid linkmer frequencies will not be characterized for this profile.                                                                          

anvio ..............................: 6.1
profiler_version ...................: 31
sample_id ..........................: KESTAN
description ........................: None
profile_db .........................: /home/ocoskun/OmerWorkingFolder/Turkey_data/Kestan.bam-ANVIO_PROFILE/PROFILE.db
contigs_db .........................: True
contigs_db_hash ....................: hashf66068e5
cmd_line ...........................: /home/ocoskun/anaconda3/envs/anvio-6.1/bin/anvi-profile -i Kestan.bam -c contigs.db
merged .............................: False
blank ..............................: False
split_length .......................: 20,000
min_contig_length ..................: 1,000
max_contig_length ..................: 9,223,372,036,854,775,807
min_mean_coverage ..................: 0
clustering_performed ...............: False
min_coverage_for_variability .......: 10
skip_SNV_profiling .................: False
profile_SCVs .......................: False
report_variability_full ............: False

WARNING                                                                                                                                             
=====================================                                                                                                               
Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take                                                                      
into consideration anything below that. If you need to kill this an restart your                                                                    
analysis with another minimum contig length value, feel free to press CTRL+C.                                                                       

input_bam ..........................: /home/ocoskun/OmerWorkingFolder/Turkey_data/Kestan.bam                                                        
output_dir .........................: /home/ocoskun/OmerWorkingFolder/Turkey_data/Kestan.bam-ANVIO_PROFILE
total_reads_mapped .................: 3,187,400
num_contigs ........................: 2,410
num_contigs_after_M ................: 2,410
num_splits .........................: 2,559
total_length .......................: 13,456,649
max_coverage_depth .................: 8,000

New data for 'layers' in data group 'default'                                                                                                       
=====================================                                                                                                               
Data key "total_reads_mapped" ......: Predicted type: int                                                                                           
Data key "num_SNVs_reported" .......: Predicted type: int

NEW DATA                                                                                                                                            
=====================================                                                                                                               
Database ...........................: profile
Data group .........................: default
Data table .........................: layers
New data keys ......................: total_reads_mapped, num_SNVs_reported.

* Happy 😇

✓ anvi-profile took 0:10:27.740613
(anvio-6.1) ocoskun@mla085:~/OmerWorkingFolder/Turkey_data$ anvi-profile -i HidSed2.bam -c contigs.db
Contigs DB .........................: Initialized: contigs.db (v. 14)                                                                               

WARNING                                                                                                                                             
=====================================                                                                                                               
Amino acid linkmer frequencies will not be characterized for this profile.                                                                          

anvio ..............................: 6.1
profiler_version ...................: 31
sample_id ..........................: HIDSED2
description ........................: None
profile_db .........................: /home/ocoskun/OmerWorkingFolder/Turkey_data/HidSed2.bam-ANVIO_PROFILE/PROFILE.db
contigs_db .........................: True
contigs_db_hash ....................: hashf66068e5
cmd_line ...........................: /home/ocoskun/anaconda3/envs/anvio-6.1/bin/anvi-profile -i HidSed2.bam -c contigs.db
merged .............................: False
blank ..............................: False
split_length .......................: 20,000
min_contig_length ..................: 1,000
max_contig_length ..................: 9,223,372,036,854,775,807
min_mean_coverage ..................: 0
clustering_performed ...............: False
min_coverage_for_variability .......: 10
skip_SNV_profiling .................: False
profile_SCVs .......................: False
report_variability_full ............: False

WARNING
=====================================
Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take
into consideration anything below that. If you need to kill this an restart your
analysis with another minimum contig length value, feel free to press CTRL+C.

input_bam ..........................: /home/ocoskun/OmerWorkingFolder/Turkey_data/HidSed2.bam                                                       
output_dir .........................: /home/ocoskun/OmerWorkingFolder/Turkey_data/HidSed2.bam-ANVIO_PROFILE
total_reads_mapped .................: 1,727,607
num_contigs ........................: 6,246
num_contigs_after_M ................: 6,246
num_splits .........................: 6,328
total_length .......................: 17,528,670
max_coverage_depth .................: 8,000

New data for 'layers' in data group 'default'
=====================================
Data key "total_reads_mapped" ......: Predicted type: int
Data key "num_SNVs_reported" .......: Predicted type: int

NEW DATA
=====================================
Database ...........................: profile
Data group .........................: default
Data table .........................: layers
New data keys ......................: total_reads_mapped, num_SNVs_reported.

* Happy 😇

7) Merging

(anvio-6.1) ocoskun@mla085:~/OmerWorkingFolder/Turkey_data$ anvi-merge Kestan.bam-ANVIO_PROFILE/PROFILE.db HidSed2.bam-ANVIO_PROFILE/PROFILE.db -o SAMPLES-MERGED -c contigs.db

Config Error: Ouch. The number of nucleotides described are not identical for all profiles to
              be merged, which is a deal breaker. All profiles that are going to be merged   
              must be run with identical flags and parameters :/ You really shouldn't but if 
              you want to try to force things because you believe this is due to a           
              misunderstanding, you can use the flag --force. While you are considering this 
              as an option, please also remember that this we advice against it..         

I used same flags and parameters for all analysis. Am I missing something? Or the environmental data that I am trying to analyse here is not appropriate for this type of analysis? I can send you the data if needed. Thank you very much in advance.

Cheers

Ömer Kürsat Coskun

meren commented 4 years ago

Thank you very much for your detailed report, @okcoskun.

TL;DR everything is working on anvi'o side and this is a user-side issue.


This was very interesting and unexpected as it is coming from one of the most well-tested parts of anvi'o that has been working for everyone, so it certainly is not due to a bug on our end. But what could one do to get this was not clear to me. But I think I figured this out.

What makes this interesting is this: you are using the same contigs database to profile two different BAM files. But somehow in the first profiling attempt the output says this,

num_contigs ........................: 6,246
num_contigs_after_M ................: 6,246
num_splits .........................: 6,328

in the other, it says this:

num_contigs ........................: 2,410
num_contigs_after_M ................: 2,410
num_splits .........................: 2,559

The num_contigs information is not coming from the contigs database, but it is learned from the BAM file itself, so even though num_contigs_after_M can differ based on parameters you set through anvi-profile, there is not reason for num_contigs to not be identical to each other across BAM files if you truly used the same FASTA file for read recruitment. Just for future reference, here is the code evidence from the relevant module that shows num_contigs in the anvi-profile output messages is coming from the BAM file:

(...)
self.contig_names = self.bam.references
(...)
self.run.info('num_contigs', pp(len(self.contig_names)))
(...)

If the FASTA file from which you generated the contigs database had contig names that didn't match to names in BAM files, anvi'o would have caught it. So I think the only way for the number of contigs to be different in two BAM files that are generated by recruiting reads using the same FASTA file is that the default parameters in CLC to create these BAM files is doing something unexpected (or you generated two different FASTA files with different number of contigs but with matching contig names and did read recruitment separately yet have used only one of them to create a contigs database .. which is too elaborate of an evil scheme for any end-user to implement, but if that is the case, CLC is innocent).

Probably there is a checkbox to click or something to make CLC produce BAM files the way everyone else is generating them. You can do that and everything should work out. But maybe this is a good opportunity to consider ditching CLC and using all the open-source alternatives such as BWA, Bowtie2, samtools, etc.

My friends were using CLC at the time, so the data we had profiled and published in the original anvi'o publication was coming from CLC. So at that time things have worked out-of-the-box. But then we stopped using CLC, and we didn't miss it for anything since then.

Apart from these 2 cents of mine, the long story short, everything is working on anvi'o side as expected as far as I can tell.

Best,

okcoskun commented 4 years ago

Thank you for the prompt reply! I really appreciate it.

Himmmm, If I get it right, BAM files should be produced for each sample using the same FASTA file. So, this should be my fault.

I think it is okay to add images here taken from CLC. If not, please delete it, since I couldn't see any instructions prohibiting uploading images from a commercial software. So let me explain my workflow using these.

1) I analyzed my metagenomes separately. Then right clicked to get the BAM files and FASTA files from each assembly subset files.
Seperately analyzed fasta files

2) So, now I have two FASTA files and two BAM files. Then, I concatenate two exported fasta files (The first step in the issue).

cat TwoMetagenomes/Assembly/*.fa >contigs_combined.fasta

3) Use Anvio metagenomic workflow.

Since I am using separate BAM files, they have different number of contigs. But if you sum up the number of contigs in the BAM files (6,246 + 2,410 = 8656), It matches to the number of contigs in the concatenated contigs.fa.

Contigs with at least one gene call ..........: 8656 of 8656 (100.0%)                                                                               
Contigs database .............................: A new database, contigs.db, has been created.
Number of contigs ............................: 8,656
Number of splits .............................: 8,887

It seems that I am making a simple mistake, but might be helpful for the community who is just starting the metagenomic analysis and want to use this beautifully designed software -- Anvi'o.

Thanks in advance for your comment, again.

Ömer

ekiefl commented 4 years ago

Himmmm, If I get it right, BAM files should be produced for each sample using the same FASTA file.

Exactly.

So, now I have two FASTA files and two BAM files. Then, I concatenate two exported fasta files

As you identified, this is the problem. You should concatenate the two FASTA files. Then you should produce 2 BAM files from the concatenated FASTA.

okcoskun commented 4 years ago

Thank you very much Evan! Sorry for taking Murat's and your time for this very simple mistake.

Ömer

ekiefl commented 4 years ago

No problem. Glad it is clear to you now.

meren commented 4 years ago

So, now I have two FASTA files and two BAM files. Then, I concatenate two exported fasta files

As you identified, this is the problem. You should concatenate the two FASTA files. Then you should produce 2 BAM files from the concatenated FASTA.

In fact it is just a little more than that.

I presume those two FASTA files are the assembly of each metagenome independently. If you simply concatenate the two FASTA files and do read recruitment, you will run into another, more of a theoretical, problem: the likely redundancy of contigs in both files (and there is no easy solution for that).

So the concatenation of the two metagenomes should take place before the assembly. You should combine all metagenomes involved (i.e., by concatenating R1s and R2s, or by asking assembler to use all related FASTQ files), and then use the resulting FASTA file of contigs to do independent read recruitments (this strategy is typically referred to as "co-assembly" and has advantages and disadvantages compared to single assemblies).

Best,