merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
440 stars 145 forks source link

`anvi-profile` error #2047

Closed tarunaaggarwal closed 1 year ago

tarunaaggarwal commented 1 year ago

Hello - I'm trying to run anvi-profile with 6 bam files and a contigs.db using the following command and got a similar looking error multiple times that I don't understand. My for loop is below and the error file is attached. Can someone please help me resolve these errors? Thank you so much!

#!/bin/bash

#SBATCH --job-name="anvioProfiles4OA"
#SBATCH --nodes=1
#SBATCH --ntasks=12
#SBATCH --time=30-00:00:00
#SBATCH --mail-user=taruna@ucsb.edu
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH -e slurm-%j.err-%N
#SBATCH -o slurm-%j.out-%N

Assign dir variables
ANVIOUT=/home/taruna/sargassum/metagenomes/clean_data/anvio-7.1-results/OA

source ~/.bash_profile
conda activate anvio-7.1

sort and index bam files
samplePrefixes=("OA" "OB" "OC" "SA" "SB" "SC")
for sampleprefix in ${samplePrefixes[@]}; do 
    anvi-profile \
    -i ${ANVIOUT}/sorted-bams/OAasmSimple-vs-${sampleprefix}reads.sorted.indexed.bam \
    -c ${ANVIOUT}/contigs.db \
    --output-dir ${ANVIOUT}/profiles_15Feb2023/OAasmSimplevs${sampleprefix} \
    --sample-name OAasmSimplevs${sampleprefix}reads \
    --num-threads 12 \
    --profile-SCVs \
    --min-contig-length 2500 \
    --min-mean-coverage 5
done

slurm-517415-err-node59.txt

metehaansever commented 1 year ago

As I understand from the error message, your table does not comply with the following rule; Each item in the view data list must be a list or tuple with three items (item name, layer name, and data value). error line 43 If you have used less or more than 3 items, you need to fix this.

meren commented 1 year ago

Hey @metehaansever, that is a bug on anvi'o side, there is nothing @tarunaaggarwal can do to fix it :) anvi'o profiling step generates a data product, passes it to another function that expects a certain format for every item, and the procedure fails because the data doesn't fit the expected structure.

@tarunaaggarwal, can you confirm that this error doesn't happen when you remove the flag --min-mean-coverage 5?

Thank you for the report,

tarunaaggarwal commented 1 year ago

Hi @meren - thank you for responding! I'm rerunning the script now. It takes about a 1/1.5 days to run because I have lots of host contigs in my assembly, not just microbial (an unfortunate consequence of working with macroalgae). But I will write back as soon as it finishes. Thank you again for your help!

meren commented 1 year ago

Thank you for your patience with anvi'o!

tarunaaggarwal commented 1 year ago

Dear @meren - I can confirm that removing --min-mean-coverage did not result in any errors. Hope this information helps.

FlorianTrigodet commented 1 year ago

Here is a quick reproducible example with this contig.db and bam file: test.tar.gz

There is an error when both --min-mean-coverage and -T are used simultaneously:

anvi-profile -c CONTIGS.db -i SAMPLE-03.bam -o PROFILE --profile-SCVs -T 3 -S toto --min-mean-coverage 5

No error if you use one, or the other exclusively. Strange: seems to work fine with -T 2

meren commented 1 year ago

I can reproduce this and can see where the issue is. Will take care of it soon :)

meren commented 1 year ago

OK. The more I think about this the more I am convinced that we must remove the --min-mean-coverage flag altogether.

The reason to that is the following: in the vast majority of metagenomics projects, coverages of contigs will change from sample to sample. As a result, the --min-mean-coverage flag will end up removing different sets of contigs from each single profile database... And when the user wants to merge those profile databases, they will get the following error:

Config Error: Ouch. The number of nucleotides described are not identical for all profiles to
              be merged, which is a deal breaker. All profiles that are going to be merged
              must be run with identical flags and parameters :/ You really shouldn't but if
              you want to try to force things because you believe this is due to a
              misunderstanding, you can use the flag --force. While you are considering this
              as an option, please also remember that this we advice against it..

(note: the --force works for merging, but then other downstream processes fail inevitably .. because data for some contigs are present in only a subset of resulting single profiles (this is a problem we could solve in theory, but in practice the necessary design will complicate things way too much in the codebase without any actual gain since I have a better solution for this)).

So. What is the solution when wants to remove contigs of bad coverage everywhere from the project altogehter?

A better solution is to run anvi-profile-blitz on all BAM files, remove contigs based on their coverages reported in the resulting bam-stats-txt file, create a final contigs-of-interest.txt file from that, and pass it to the anvi-profile with the flag --contigs-of-interest.

This alternative works perfectly as one would expect, and is a much more explicit way to focus on a subset of contigs for profiling.

I hope this makes sense. I will now make the necessary changes to say goodbye to our good old --min-mean-coverage flag.

Thank you very much for bringing this to our attention, @tarunaaggarwal. This flag worked well when we didn't have anvi-profile-blitz for single profiles, but it clearly became obsolete over time as anvi'o evolved.

FlorianTrigodet commented 1 year ago

As a result, the --min-mean-coverage flag will end up removing different sets of contigs from each single profile database...

I thought this option would set the coverage to 0, but it actually filters out/remove contigs to consider in the profile.db?

How about moving that argument to anvi-merge? You profile everything per sample, then for the final merged profile.db you can filter out contigs with min thresholds. Easier for the user than having to run the blitz profiler. Of course if you want to filter contigs for min-mean-coverage to reduce computing time, you should use the blitz profiler as you described it.

meren commented 1 year ago

I thought this option would set the coverage to 0, but it actually filters out/remove contigs to consider in the profile.db?

Yes indeed.

How about moving that argument to anvi-merge? You profile everything per sample, then for the final merged profile.db you can filter out contigs with min thresholds.

This is a very good idea, but it will lead to many other issues since individual single profiles will not know which contigs they exclude, and it will create inconsistencies across profile databases that are linked to the same contigs-db in a manner that we can't track explicitly with the information stored in self tables. I don't think that will make things any better.

Also, adding more structural burden on anvi-merge is not a good solution in general. The user should explicitly define which contigs they wish to profile except those properties that are common across all samples (such as contig length) so everything is more reproducible.

But still, if there is too much protest we can add something to anvi-merge :)