merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] anvi-split need `variability_splits` in profile.db #2288

Open soojunglee98 opened 6 days ago

soojunglee98 commented 6 days ago

I am running anvi-split but cannot proceed because PROFILE-refined.db does not seem to have a table named `variability_splits`.

anvi'o version

Anvi'o .......................................: marie (v8)
Python .......................................: 3.10.13
Profile database .............................: 38
Contigs database .............................: 21
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

anvi-self-test --version

Detailed description of the issue

Here’s a structured way to present your situation:

  1. I tried to manually refine the bins (Collection: METABAT) using anvi-refine.
  2. When I attempted to refine a specific bin, the job was killed.
  3. I discovered that if a bin has many contigs, it could cause the job to be killed. Therefore, I need to separate the profile.db for each bin. This particular bin has 100% completeness but 5094.29% contamination. However, when I checked the taxonomy result using another tool, there was a taxonomy assigned (based on 368,704/389,499 ORFs).
  4. I attempted to use anvi-split, but I encountered this error message:
    Config Error: The database at PROFILE-refined.db does not seem to have a table named          
    `variability_splits` :/ Here is a list of table names this database knows: self,
    item_additional_data, item_orders, layer_additional_data, layer_orders,         
    variable_nucleotides, variable_codons, indels, views, collections_info,         
    collections_bins_info, collections_of_contigs, collections_of_splits, states,   
    std_coverage_contigs, mean_coverage_contigs, mean_coverage_Q2Q3_contigs,        
    detection_contigs, abundance_contigs, std_coverage_splits, mean_coverage_splits,
    mean_coverage_Q2Q3_splits, detection_splits, abundance_splits 
  5. How do I fix this issue? Do I need to redo the steps, which is quite frustrating...?

ivagljiva commented 2 days ago

Hey @soojunglee98 ,

this is a strange error, because it doesn't look like variability_splits is a table we use anymore. So it is a good thing that your profile database does not have this table, but it is strange that anvi-split is looking for it.

Could you re-run your command with the --debug flag and paste the error traceback here so we can see exactly where the code is failing?

soojunglee98 commented 1 day ago

Hi, I ran again but got the same error....

start

ANVI'O TRICKY OPERATIONS DEPARTMENT

Anvi'o is about to start splitting your bins into individual, self-contained anvi'o profiles. As of 2021, we have tested this feature quite extensively and we trust that it will do well. But this is still quite a tricky operation and you must double-check things once your split data is ready.

Contigs DB ...................................: Initialized: contigs-refined.db (v. 21)

WARNING

ProfileSuperClass found a collection focus, which means it will be initialized using only the splits in the profile database that are affiliated with the collection METABAT and all bins it describes.

Profile Super ................................: Initialized with 563858 of 1052137 splits: PROFILE-refined.db (v. 38)

THE MORE YOU KNOW 🌈

Someone asked the Contigs Superclass to initialize only a subset of contig sequences. Usually this is a good thing and means that some good code somewhere is looking after you. Just FYI, this class will only know about 521,529 contig sequences instead of all the things in the database.

/home/lsoojung/miniconda3/envs/anvio-7/lib/python3.10/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator DecisionTreeClassifier from version 0.24.0 when using version 1.2.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to: https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(
/home/lsoojung/miniconda3/envs/anvio-7/lib/python3.10/site-packages/sklearn/base.py:318: UserWarning: Trying to unpickle estimator RandomForestClassifier from version 0.24.0 when using version 1.2.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to: https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(

WARNING

It seems you have more than 20,000 splits in this particular bin. This is the soft limit for anvi'o to attempt to create a hierarchical clustering of your splits (which becomes the center tree in all anvi'o displays). If you want a hierarchical clustering to be done anyway, you can re-run the splitting process only for this bin by adding these parameters to your run: '--bin-id METABAT_1-contigs --enforce-hierarchical-clustering'. If you feel like you are lost, don't hesitate to get in touch with anvi'o developers.

Merged database ..............................: True

Traceback for debugging

  File "/home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-split", line 66, in <module>
    splitter.DBSplitter(args).get()(args).process()
  File "/home/lsoojung/miniconda3/envs/anvio-7/lib/python3.10/site-packages/anvio/splitter.py", line 225, in process
    b.do_profile_db()
  File "/home/lsoojung/miniconda3/envs/anvio-7/lib/python3.10/site-packages/anvio/splitter.py", line 658, in do_profile_db
    self.migrate_data(tables, self.profile_db_path, self.bin_profile_db_path)
  File "/home/lsoojung/miniconda3/envs/anvio-7/lib/python3.10/site-packages/anvio/splitter.py", line 301, in migrate_data
    data = source_db.get_some_rows_from_table(table_name, where_clause)
  File "/home/lsoojung/miniconda3/envs/anvio-7/lib/python3.10/site-packages/anvio/db.py", line 497, in get_some_rows_from_table
    self.is_table_exists(table_name)
  File "/home/lsoojung/miniconda3/envs/anvio-7/lib/python3.10/site-packages/anvio/db.py", line 482, in is_table_exists
    raise ConfigError(f"The database at {self.db_path} does not seem to have a table named {table_name} :/ "

Config Error: The database at PROFILE-refined.db does not seem to have a table named
variability_splits :/ Here is a list of table names this database knows: self,
item_additional_data, item_orders, layer_additional_data, layer_orders,
variable_nucleotides, variable_codons, indels, views, collections_info,
collections_bins_info, collections_of_contigs, collections_of_splits, states,
std_coverage_contigs, mean_coverage_contigs, mean_coverage_Q2Q3_contigs,
detection_contigs, abundance_contigs, std_coverage_splits, mean_coverage_splits,
mean_coverage_Q2Q3_splits, detection_splits, abundance_splits

done

ivagljiva commented 1 day ago

Hi @soojunglee98 , thanks for sending the error traceback. It helped me figure out where to look in the codebase to find out what could be going wrong.

Here is what is happening: anvi'o keeps a list of tables in the profile database from which it extracts data (only the data relevant to each bin) and stores it in the new, bin-specific profile databases. How does it know which tables to process? In splitter.py, it gets the list of tables required by profile DBs from constants.essential_data_fields_for_anvio_profiles:

for table_name in constants.essential_data_fields_for_anvio_profiles:
    for target in ['splits', 'contigs']:
        new_table_name = '_'.join([table_name, target])
        new_table_structure = t.view_table_structure
        new_table_types = t.view_table_types
        bin_profile_db.db.create_table(new_table_name, new_table_structure, new_table_types)

        tables[new_table_name] = ('item', self.split_names)
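To make the "extract only the rows relevant to a bin" step more concrete, here is a minimal standalone sketch using Python's built-in sqlite3 module. It is not the anvi'o implementation: the function name `copy_rows_for_items` is hypothetical, and it assumes the table already exists in the target database (as the loop above arranges) and that rows are keyed by a column such as 'item', which is the key named in the tuple above.

```python
import sqlite3


def copy_rows_for_items(source_db, target_db, table, key_column, item_names):
    """Copy rows of `table` whose `key_column` value is in `item_names`.

    Hypothetical helper for illustration only. Assumes `table` already
    exists in the target database with the same column layout.
    Returns the number of rows copied.
    """
    wanted = set(item_names)
    src = sqlite3.connect(source_db)
    dst = sqlite3.connect(target_db)
    try:
        # Raises sqlite3.OperationalError if the source table is missing.
        cursor = src.execute(f'SELECT * FROM "{table}"')
        column_names = [description[0] for description in cursor.description]
        key_index = column_names.index(key_column)

        # Filter in Python so that very long item lists do not run into
        # SQLite's limit on bound parameters.
        rows = [row for row in cursor if row[key_index] in wanted]

        placeholders = ", ".join("?" for _ in column_names)
        with dst:  # commits on success, rolls back on error
            dst.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', rows
            )
        return len(rows)
    finally:
        src.close()
        dst.close()
```

If the source database lacks the requested table, the SELECT fails right away, which is the same kind of failure the traceback above shows for variability_splits.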

constants.essential_data_fields_for_anvio_profiles contains the following list of tables:

essential_data_fields_for_anvio_profiles = ['std_coverage',
                                            'mean_coverage',
                                            'mean_coverage_Q2Q3',
                                            'detection',
                                            'abundance',
                                            'variability']

And as you can see in the loop I pasted before, we add either '_splits' or '_contigs' to the end of those table names. So that means that I have to take back what I said before:

this is a strange error, because it doesn't look like variability_splits is a table we use anymore.

^^ that was wrong. We do use the variability_splits table. I just didn't find it in the codebase before because we usually don't hard-code the full table name. Sorry about the confusion.
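For reference, the naming scheme can be reproduced outside of anvi'o with a few lines of Python. This is only a sketch: the list is copied from the snippet above, and the function name `expected_view_tables` is made up for illustration.

```python
# Field names copied from constants.essential_data_fields_for_anvio_profiles
# as quoted above; the suffixes are the two targets used in the loop.
ESSENTIAL_FIELDS = [
    "std_coverage",
    "mean_coverage",
    "mean_coverage_Q2Q3",
    "detection",
    "abundance",
    "variability",
]
TARGETS = ["splits", "contigs"]


def expected_view_tables(fields=ESSENTIAL_FIELDS, targets=TARGETS):
    """Return every '<field>_<target>' table name the loop would build."""
    return ["_".join([field, target]) for field in fields for target in targets]


if __name__ == "__main__":
    for name in expected_view_tables():
        print(name)
```

Because the full names are assembled at run time, searching the codebase for the literal string variability_splits does not find them.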

What does this mean for your situation? It means that your profile database is missing an essential table of data. There was probably an error earlier in your workflow. Most likely anvi-profile was killed prematurely, leaving you with a partial database (if I had to guess, the profile job ran out of memory because your dataset is quite large).

If you take a look at the logs (or terminal output) from when you ran anvi-profile, you might be able to figure out what happened to result in an unfinished profile database. But regardless, the solution is to run anvi-profile again and make sure that it finishes successfully (and contains the variability_splits and variability_contigs tables) before trying to run anvi-split.
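One way to do that check without launching anvi-split is to list the tables directly. The sketch below assumes the profile database is a regular SQLite file that Python's sqlite3 module can open; the helper name `missing_tables` is hypothetical and not part of anvi'o.

```python
import os
import sqlite3


def missing_tables(db_path, required):
    """Return the names in `required` that are absent from the database.

    Hypothetical helper for illustration; assumes `db_path` points to a
    SQLite file.
    """
    # sqlite3.connect() would silently create an empty file for a wrong
    # path, so fail early instead.
    if not os.path.isfile(db_path):
        raise FileNotFoundError(db_path)

    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        con.close()

    present = {name for (name,) in rows}
    return [table for table in required if table not in present]
```

For example, `missing_tables("PROFILE.db", ["variability_splits", "variability_contigs"])` returns an empty list when both tables are present.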

soojunglee98 commented 1 day ago

Just to make sure before I re-run the script... Can you look at the log of the PROFILE.db and let me know what kind of error happened? Thank you so much for your help!!! This is the log when I ran

anvi-merge *_sorted/PROFILE.db -o SAMPLES-MERGED -c /scratch/raskin_root/raskin1/lsoojung/Metaspades/anvio/contigs.db

And this is the log

WARNING

It seems you have more than 20,000 splits in your samples to be merged. This is the soft limit for anvi'o to attempt to create a hierarchical clustering of your splits (which becomes the center tree in all anvi'o displays). If you want a hierarchical clustering to be done anyway, please see the flag --enforce-hierarchical-clustering. But more importantly, please take a look at the anvi'o tutorial to make sure you know your better options to analyze large metagenomic datasets with anvi'o.

profiler_version .............................: 38
output_dir ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/SAMPLES-MERGED
sample_id ....................................: SAMPLES_MERGED
description ..................................: None
profile_db ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/SAMPLES-MERGED/PROFILE.db
merged .......................................: True
contigs_db_hash ..............................: hash18ad4680
num_runs_processed ...........................: 133
merged_sample_ids ............................: s10_sorted, s14_sorted, s16_sorted, s1_sorted, s2_sorted, s3_sorted, s4_sorted, s5_sorted, s6_sorted, s7_sorted, s8_sorted, s9_sorted
fetch_filter .................................: None, None, None, None, None, None, None, None, None, None, None, None
min_percent_identity .........................: 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0
Common layer additional data keys ............: default
total_reads_mapped ...........................: 165197268, 267799958, 117167554, 146112257, 166750315, 226980673, 241437744, 164637598, 163144391, 202208261, 215063249, 174011014
cmd_line .....................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-merge s10_sorted/PROFILE.db s14_sorted/PROFILE.db s16_sorted/PROFILE.db s1_sorted/PROFILE.db s2_sorted/PROFILE.db s3_sorted/PROFILE.db s4_sorted/PROFILE.db s5_sorted/PROFILE.db s6_sorted/PROFILE.db s7_sorted/PROFILE.db s8_sorted/PROFILE.db s9_sorted/PROFILE.db -o SAMPLES-MERGED -c /scratch/raskin_root/raskin1/lsoojung/Metaspades/anvio/contigs.db
clustering_performed .........................: False

WARNING

SNVs were not profiled, variable nucleotides positions tables will be empty in the merged profile database.

WARNING

Codon frequencies were not profiled, hence, these tables will be empty in the merged profile database.

WARNING

Indels were not profiled, hence, these tables will be empty in the merged profile database.

Auxiliary Data ...............................: Found: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/SAMPLES-MERGED/AUXILIARY-DATA.db (v. 2)
Profile Super ................................: Initialized with all 1052137 splits: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/SAMPLES-MERGED/PROFILE.db (v. 38)

Layer orders added

Data groups added

soojunglee98 commented 1 day ago

Or if you need the log file for anvi-profile.. This is the one (but kinda long...)

1
Sorted BAM File ..............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/1.sorted.bam
BAM File Index ...............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/1.sorted.bam.bai
1 bam finished
Sample name set ...................................: s1_sorted
Description .......................................: None
Profile DB path ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s1_sorted/PROFILE.db
Contigs DB path ...................................: contigs.db
Contigs DB hash ...................................: hash18ad4680
Command line ......................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-profile -i 1.sorted.bam -c contigs.db -T 16 --skip-SNV-profiling -M 1000

Minimum percent identity of reads to be profiled ..: None
Fetch filter engaged ..............................: None

Is merged profile? ................................: False
Is blank profile? .................................: False
Skip contigs shorter than .........................: 1,000
Skip contigs longer than ..........................: 9,223,372,036,854,775,807
Perform hierarchical clustering of contigs? .......: False

Profile single-nucleotide variants (SNVs)? ........: False
Profile single-codon variants (SCVs/+SAAVs)? ......: False
Profile insertion/deletions (INDELs)? .............: False
Minimum coverage to calculate SNVs ................: 10
Report FULL variability data? .....................: False

WARNING

Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take into consideration anything below that. If you need to kill this an restart your analysis with another minimum contig length value, feel free to press CTRL+C.

Input BAM .........................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/1.sorted.bam
Output directory path .............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s1_sorted

Number of reads in the BAM file ...................: 146,112,257
Number of sequences in the contigs DB .............: 1,007,698
Number of contigs to be conisdered (after -M) .....: 1,007,698
Number of splits ..................................: 1,052,137
Number of nucleotides .............................: 4,248,216,033

Additional data added to the new profile DB .......: total_reads_mapped, total_reads_kept

✓ anvi-profile took 0:37:44.985820
1 done
2
Sorted BAM File ..............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/2.sorted.bam
BAM File Index ...............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/2.sorted.bam.bai
2 bam finished
Sample name set ...................................: s2_sorted
Description .......................................: None
Profile DB path ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s2_sorted/PROFILE.db
Contigs DB path ...................................: contigs.db
Contigs DB hash ...................................: hash18ad4680
Command line ......................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-profile -i 2.sorted.bam -c contigs.db -T 16 --skip-SNV-profiling -M 1000

Minimum percent identity of reads to be profiled ..: None
Fetch filter engaged ..............................: None

Is merged profile? ................................: False
Is blank profile? .................................: False
Skip contigs shorter than .........................: 1,000
Skip contigs longer than ..........................: 9,223,372,036,854,775,807
Perform hierarchical clustering of contigs? .......: False

Profile single-nucleotide variants (SNVs)? ........: False
Profile single-codon variants (SCVs/+SAAVs)? ......: False
Profile insertion/deletions (INDELs)? .............: False
Minimum coverage to calculate SNVs ................: 10
Report FULL variability data? .....................: False

WARNING

Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take into consideration anything below that. If you need to kill this an restart your analysis with another minimum contig length value, feel free to press CTRL+C.

Input BAM .........................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/2.sorted.bam
Output directory path .............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s2_sorted

Number of reads in the BAM file ...................: 166,750,315
Number of sequences in the contigs DB .............: 1,007,698
Number of contigs to be conisdered (after -M) .....: 1,007,698
Number of splits ..................................: 1,052,137
Number of nucleotides .............................: 4,248,216,033

Additional data added to the new profile DB .......: total_reads_mapped, total_reads_kept

✓ anvi-profile took 0:37:40.742844
2 done
3
Sorted BAM File ..............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/3.sorted.bam
BAM File Index ...............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/3.sorted.bam.bai
3 bam finished
Sample name set ...................................: s3_sorted
Description .......................................: None
Profile DB path ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s3_sorted/PROFILE.db
Contigs DB path ...................................: contigs.db
Contigs DB hash ...................................: hash18ad4680
Command line ......................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-profile -i 3.sorted.bam -c contigs.db -T 16 --skip-SNV-profiling -M 1000

Minimum percent identity of reads to be profiled ..: None
Fetch filter engaged ..............................: None

Is merged profile? ................................: False
Is blank profile? .................................: False
Skip contigs shorter than .........................: 1,000
Skip contigs longer than ..........................: 9,223,372,036,854,775,807
Perform hierarchical clustering of contigs? .......: False

Profile single-nucleotide variants (SNVs)? ........: False
Profile single-codon variants (SCVs/+SAAVs)? ......: False
Profile insertion/deletions (INDELs)? .............: False
Minimum coverage to calculate SNVs ................: 10
Report FULL variability data? .....................: False

WARNING

Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take into consideration anything below that. If you need to kill this an restart your analysis with another minimum contig length value, feel free to press CTRL+C.

Input BAM .........................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/3.sorted.bam
Output directory path .............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s3_sorted

Number of reads in the BAM file ...................: 226,980,673
Number of sequences in the contigs DB .............: 1,007,698
Number of contigs to be conisdered (after -M) .....: 1,007,698
Number of splits ..................................: 1,052,137
Number of nucleotides .............................: 4,248,216,033

Additional data added to the new profile DB .......: total_reads_mapped, total_reads_kept

✓ anvi-profile took 0:35:53.135805
3 done
4
Sorted BAM File ..............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/4.sorted.bam
BAM File Index ...............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/4.sorted.bam.bai
4 bam finished
Sample name set ...................................: s4_sorted
Description .......................................: None
Profile DB path ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s4_sorted/PROFILE.db
Contigs DB path ...................................: contigs.db
Contigs DB hash ...................................: hash18ad4680
Command line ......................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-profile -i 4.sorted.bam -c contigs.db -T 16 --skip-SNV-profiling -M 1000

Minimum percent identity of reads to be profiled ..: None
Fetch filter engaged ..............................: None

Is merged profile? ................................: False
Is blank profile? .................................: False
Skip contigs shorter than .........................: 1,000
Skip contigs longer than ..........................: 9,223,372,036,854,775,807
Perform hierarchical clustering of contigs? .......: False

Profile single-nucleotide variants (SNVs)? ........: False
Profile single-codon variants (SCVs/+SAAVs)? ......: False
Profile insertion/deletions (INDELs)? .............: False
Minimum coverage to calculate SNVs ................: 10
Report FULL variability data? .....................: False

WARNING

Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take into consideration anything below that. If you need to kill this an restart your analysis with another minimum contig length value, feel free to press CTRL+C.

Input BAM .........................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/4.sorted.bam
Output directory path .............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s4_sorted

Number of reads in the BAM file ...................: 241,437,744
Number of sequences in the contigs DB .............: 1,007,698
Number of contigs to be conisdered (after -M) .....: 1,007,698
Number of splits ..................................: 1,052,137
Number of nucleotides .............................: 4,248,216,033

Additional data added to the new profile DB .......: total_reads_mapped, total_reads_kept

✓ anvi-profile took 0:50:20.496689
4 done
5
Sorted BAM File ..............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/5.sorted.bam
BAM File Index ...............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/5.sorted.bam.bai
5 bam finished
Sample name set ...................................: s5_sorted
Description .......................................: None
Profile DB path ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s5_sorted/PROFILE.db
Contigs DB path ...................................: contigs.db
Contigs DB hash ...................................: hash18ad4680
Command line ......................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-profile -i 5.sorted.bam -c contigs.db -T 16 --skip-SNV-profiling -M 1000

Minimum percent identity of reads to be profiled ..: None
Fetch filter engaged ..............................: None

Is merged profile? ................................: False
Is blank profile? .................................: False
Skip contigs shorter than .........................: 1,000
Skip contigs longer than ..........................: 9,223,372,036,854,775,807
Perform hierarchical clustering of contigs? .......: False

Profile single-nucleotide variants (SNVs)? ........: False
Profile single-codon variants (SCVs/+SAAVs)? ......: False
Profile insertion/deletions (INDELs)? .............: False
Minimum coverage to calculate SNVs ................: 10
Report FULL variability data? .....................: False

WARNING

Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take into consideration anything below that. If you need to kill this an restart your analysis with another minimum contig length value, feel free to press CTRL+C.

Input BAM .........................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/5.sorted.bam
Output directory path .............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s5_sorted

Number of reads in the BAM file ...................: 164,637,598
Number of sequences in the contigs DB .............: 1,007,698
Number of contigs to be conisdered (after -M) .....: 1,007,698
Number of splits ..................................: 1,052,137
Number of nucleotides .............................: 4,248,216,033

Additional data added to the new profile DB .......: total_reads_mapped, total_reads_kept

✓ anvi-profile took 0:40:41.602470
5 done
6
Sorted BAM File ..............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/6.sorted.bam
BAM File Index ...............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/6.sorted.bam.bai
6 bam finished
Sample name set ...................................: s6_sorted
Description .......................................: None
Profile DB path ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s6_sorted/PROFILE.db
Contigs DB path ...................................: contigs.db
Contigs DB hash ...................................: hash18ad4680
Command line ......................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-profile -i 6.sorted.bam -c contigs.db -T 16 --skip-SNV-profiling -M 1000

Minimum percent identity of reads to be profiled ..: None
Fetch filter engaged ..............................: None

Is merged profile? ................................: False
Is blank profile? .................................: False
Skip contigs shorter than .........................: 1,000
Skip contigs longer than ..........................: 9,223,372,036,854,775,807
Perform hierarchical clustering of contigs? .......: False

Profile single-nucleotide variants (SNVs)? ........: False
Profile single-codon variants (SCVs/+SAAVs)? ......: False
Profile insertion/deletions (INDELs)? .............: False
Minimum coverage to calculate SNVs ................: 10
Report FULL variability data? .....................: False

WARNING

Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take into consideration anything below that. If you need to kill this an restart your analysis with another minimum contig length value, feel free to press CTRL+C.

Input BAM .........................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/6.sorted.bam
Output directory path .............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s6_sorted

Number of reads in the BAM file ...................: 163,144,391
Number of sequences in the contigs DB .............: 1,007,698
Number of contigs to be conisdered (after -M) .....: 1,007,698
Number of splits ..................................: 1,052,137
Number of nucleotides .............................: 4,248,216,033

Additional data added to the new profile DB .......: total_reads_mapped, total_reads_kept

✓ anvi-profile took 0:35:50.313355
6 done
7
Sorted BAM File ..............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/7.sorted.bam
BAM File Index ...............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/7.sorted.bam.bai
7 bam finished
Sample name set ...................................: s7_sorted
Description .......................................: None
Profile DB path ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s7_sorted/PROFILE.db
Contigs DB path ...................................: contigs.db
Contigs DB hash ...................................: hash18ad4680
Command line ......................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-profile -i 7.sorted.bam -c contigs.db -T 16 --skip-SNV-profiling -M 1000

Minimum percent identity of reads to be profiled ..: None
Fetch filter engaged ..............................: None

Is merged profile? ................................: False
Is blank profile? .................................: False
Skip contigs shorter than .........................: 1,000
Skip contigs longer than ..........................: 9,223,372,036,854,775,807
Perform hierarchical clustering of contigs? .......: False

Profile single-nucleotide variants (SNVs)? ........: False
Profile single-codon variants (SCVs/+SAAVs)? ......: False
Profile insertion/deletions (INDELs)? .............: False
Minimum coverage to calculate SNVs ................: 10
Report FULL variability data? .....................: False

WARNING

Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take into consideration anything below that. If you need to kill this an restart your analysis with another minimum contig length value, feel free to press CTRL+C.

Input BAM .........................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/7.sorted.bam
Output directory path .............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s7_sorted

Number of reads in the BAM file ...................: 202,208,261
Number of sequences in the contigs DB .............: 1,007,698
Number of contigs to be conisdered (after -M) .....: 1,007,698
Number of splits ..................................: 1,052,137
Number of nucleotides .............................: 4,248,216,033

Additional data added to the new profile DB .......: total_reads_mapped, total_reads_kept

✓ anvi-profile took 0:34:31.975499
7 done
8
Sorted BAM File ..............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/8.sorted.bam
BAM File Index ...............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/8.sorted.bam.bai
8 bam finished
Sample name set ...................................: s8_sorted
Description .......................................: None
Profile DB path ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s8_sorted/PROFILE.db
Contigs DB path ...................................: contigs.db
Contigs DB hash ...................................: hash18ad4680
Command line ......................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-profile -i 8.sorted.bam -c contigs.db -T 16 --skip-SNV-profiling -M 1000

Minimum percent identity of reads to be profiled ..: None
Fetch filter engaged ..............................: None

Is merged profile? ................................: False
Is blank profile? .................................: False
Skip contigs shorter than .........................: 1,000
Skip contigs longer than ..........................: 9,223,372,036,854,775,807
Perform hierarchical clustering of contigs? .......: False

Profile single-nucleotide variants (SNVs)? ........: False
Profile single-codon variants (SCVs/+SAAVs)? ......: False
Profile insertion/deletions (INDELs)? .............: False
Minimum coverage to calculate SNVs ................: 10
Report FULL variability data? .....................: False

WARNING

Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take into consideration anything below that. If you need to kill this an restart your analysis with another minimum contig length value, feel free to press CTRL+C.

Input BAM .........................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/8.sorted.bam Output directory path .............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s8_sorted

Number of reads in the BAM file ...................: 215,063,249 Number of sequences in the contigs DB .............: 1,007,698 Number of contigs to be conisdered (after -M) .....: 1,007,698 Number of splits ..................................: 1,052,137 Number of nucleotides .............................: 4,248,216,033

Additional data added to the new profile DB .......: total_reads_mapped, total_reads_kept

✓ anvi-profile took 0:33:58.446976 8 done 9 Sorted BAM File ..............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/9.sorted.bam BAM File Index ...............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/9.sorted.bam.bai 9 bam finished Sample name set ...................................: s9_sorted Description .......................................: None Profile DB path ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s9_sorted/PROFILE.db Contigs DB path ...................................: contigs.db Contigs DB hash ...................................: hash18ad4680 Command line ......................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-profile -i 9.sorted.bam -c contigs.db -T 16 --skip-SNV-profiling -M 1000

Minimum percent identity of reads to be profiled ..: None Fetch filter engaged ..............................: None

Is merged profile? ................................: False Is blank profile? .................................: False Skip contigs shorter than .........................: 1,000 Skip contigs longer than ..........................: 9,223,372,036,854,775,807 Perform hierarchical clustering of contigs? .......: False

Profile single-nucleotide variants (SNVs)? ........: False Profile single-codon variants (SCVs/+SAAVs)? ......: False Profile insertion/deletions (INDELs)? .............: False Minimum coverage to calculate SNVs ................: 10 Report FULL variability data? .....................: False

WARNING

Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take into consideration anything below that. If you need to kill this an restart your analysis with another minimum contig length value, feel free to press CTRL+C.

Input BAM .........................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/9.sorted.bam Output directory path .............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s9_sorted

Number of reads in the BAM file ...................: 174,011,014 Number of sequences in the contigs DB .............: 1,007,698 Number of contigs to be conisdered (after -M) .....: 1,007,698 Number of splits ..................................: 1,052,137 Number of nucleotides .............................: 4,248,216,033

Additional data added to the new profile DB .......: total_reads_mapped, total_reads_kept

✓ anvi-profile took 0:34:15.235434 9 done 10 Sorted BAM File ..............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/10.sorted.bam BAM File Index ...............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/10.sorted.bam.bai 10 bam finished Sample name set ...................................: s10_sorted Description .......................................: None Profile DB path ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s10_sorted/PROFILE.db Contigs DB path ...................................: contigs.db Contigs DB hash ...................................: hash18ad4680 Command line ......................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-profile -i 10.sorted.bam -c contigs.db -T 16 --skip-SNV-profiling -M 1000

Minimum percent identity of reads to be profiled ..: None Fetch filter engaged ..............................: None

Is merged profile? ................................: False Is blank profile? .................................: False Skip contigs shorter than .........................: 1,000 Skip contigs longer than ..........................: 9,223,372,036,854,775,807 Perform hierarchical clustering of contigs? .......: False

Profile single-nucleotide variants (SNVs)? ........: False Profile single-codon variants (SCVs/+SAAVs)? ......: False Profile insertion/deletions (INDELs)? .............: False Minimum coverage to calculate SNVs ................: 10 Report FULL variability data? .....................: False

WARNING

Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take into consideration anything below that. If you need to kill this an restart your analysis with another minimum contig length value, feel free to press CTRL+C.

Input BAM .........................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/10.sorted.bam Output directory path .............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s10_sorted

Number of reads in the BAM file ...................: 165,197,268 Number of sequences in the contigs DB .............: 1,007,698 Number of contigs to be conisdered (after -M) .....: 1,007,698 Number of splits ..................................: 1,052,137 Number of nucleotides .............................: 4,248,216,033

Additional data added to the new profile DB .......: total_reads_mapped, total_reads_kept

✓ anvi-profile took 0:34:40.290865 10 done 14 Sorted BAM File ..............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/14.sorted.bam BAM File Index ...............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/14.sorted.bam.bai 14 bam finished Sample name set ...................................: s14_sorted Description .......................................: None Profile DB path ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s14_sorted/PROFILE.db Contigs DB path ...................................: contigs.db Contigs DB hash ...................................: hash18ad4680 Command line ......................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-profile -i 14.sorted.bam -c contigs.db -T 16 --skip-SNV-profiling -M 1000

Minimum percent identity of reads to be profiled ..: None Fetch filter engaged ..............................: None

Is merged profile? ................................: False Is blank profile? .................................: False Skip contigs shorter than .........................: 1,000 Skip contigs longer than ..........................: 9,223,372,036,854,775,807 Perform hierarchical clustering of contigs? .......: False

Profile single-nucleotide variants (SNVs)? ........: False Profile single-codon variants (SCVs/+SAAVs)? ......: False Profile insertion/deletions (INDELs)? .............: False Minimum coverage to calculate SNVs ................: 10 Report FULL variability data? .....................: False

WARNING

Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take into consideration anything below that. If you need to kill this an restart your analysis with another minimum contig length value, feel free to press CTRL+C.

Input BAM .........................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/14.sorted.bam Output directory path .............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s14_sorted

Number of reads in the BAM file ...................: 267,799,958 Number of sequences in the contigs DB .............: 1,007,698 Number of contigs to be conisdered (after -M) .....: 1,007,698 Number of splits ..................................: 1,052,137 Number of nucleotides .............................: 4,248,216,033

Additional data added to the new profile DB .......: total_reads_mapped, total_reads_kept

✓ anvi-profile took 0:37:21.364256 14 done

16 Sorted BAM File ..............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/16.sorted.bam BAM File Index ...............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/16.sorted.bam.bai 16 bam finished Sample name set ...................................: s16_sorted Description .......................................: None Profile DB path ...................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s16_sorted/PROFILE.db Contigs DB path ...................................: contigs.db Contigs DB hash ...................................: hash18ad4680 Command line ......................................: /home/lsoojung/miniconda3/envs/anvio-7/bin/anvi-profile -i 16.sorted.bam -c contigs.db -T 16 --skip-SNV-profiling -M 1000

Minimum percent identity of reads to be profiled ..: None Fetch filter engaged ..............................: None

Is merged profile? ................................: False Is blank profile? .................................: False Skip contigs shorter than .........................: 1,000 Skip contigs longer than ..........................: 9,223,372,036,854,775,807 Perform hierarchical clustering of contigs? .......: False

Profile single-nucleotide variants (SNVs)? ........: False Profile single-codon variants (SCVs/+SAAVs)? ......: False Profile insertion/deletions (INDELs)? .............: False Minimum coverage to calculate SNVs ................: 10 Report FULL variability data? .....................: False

WARNING

Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take into consideration anything below that. If you need to kill this an restart your analysis with another minimum contig length value, feel free to press CTRL+C.

Input BAM .........................................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/16.sorted.bam Output directory path .............................: /gpfs/accounts/raskin_root/raskin1/lsoojung/Metaspades/anvio/s16_sorted

Number of reads in the BAM file ...................: 117,167,554 Number of sequences in the contigs DB .............: 1,007,698 Number of contigs to be conisdered (after -M) .....: 1,007,698 Number of splits ..................................: 1,052,137 Number of nucleotides .............................: 4,248,216,033

Additional data added to the new profile DB .......: total_reads_mapped, total_reads_kept

✓ anvi-profile took 0:45:29.501650 16 done

ivagljiva commented 17 hours ago

Ah, it was not an error with the profiling after all. Instead, your logs show that you ran anvi-profile with the flag --skip-SNV-profiling, and therefore the variability tables were not created in the database.

I was able to replicate the issue with the small test case included with the anvi'o codebase:

anvi-gen-contigs-database -f ~/software/anvio/anvio/tests/sandbox/contigs.fa -o TEST_CONTIGS.db

anvi-init-bam ~/software/anvio/anvio/tests/sandbox/SAMPLE-01-RAW.bam
mv ~/software/anvio/anvio/tests/sandbox/SAMPLE-01-RAW.bam-sorted.bam* .
anvi-profile -c TEST_CONTIGS.db -i SAMPLE-01-RAW.bam-sorted.bam -o TEST_PROFILE_01 --skip-SNV-profiling --skip-INDEL-profiling -T 4

anvi-init-bam ~/software/anvio/anvio/tests/sandbox/SAMPLE-02-RAW.bam
mv ~/software/anvio/anvio/tests/sandbox/SAMPLE-02-RAW.bam-sorted.bam* .
anvi-profile -c TEST_CONTIGS.db -i SAMPLE-02-RAW.bam-sorted.bam -o TEST_PROFILE_02 --skip-SNV-profiling --skip-INDEL-profiling -T 4

anvi-merge TEST_PROFILE*/PROFILE.db -c TEST_CONTIGS.db -o MERGED

anvi-script-add-default-collection -c TEST_CONTIGS.db -p MERGED/PROFILE.db
anvi-split -c TEST_CONTIGS.db -p MERGED/PROFILE.db -C DEFAULT -o TEST_SPLIT

And then I got the same error you did:

Config Error: The database at MERGED/PROFILE.db does not seem to have a table named `variability_splits`

The error does not occur if you run the same workflow without using the --skip-SNV-profiling flag for anvi-profile, for instance if you run with default flags, or if you skip only INDEL profiling (or SCV profiling). So SNV profiling is the only thing that determines whether or not the variability_* tables are created in the database, and whether anvi-split works downstream.
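If you want to check a profile database before running anvi-split, here is a minimal sketch that inspects it directly with Python's built-in sqlite3 module. It is not part of anvi'o. The function name `inspect_profile_db` is made up for illustration, and it assumes the `self` table stores `key`/`value` pairs with `SNVs_profiled` recorded as `1` or `0`.

```python
import sqlite3


def inspect_profile_db(db_path):
    """Return (snvs_profiled, variability_tables) for a profile database.

    Hypothetical helper, not an anvi'o function. Assumes the `self` table
    has `key` and `value` columns and stores SNVs_profiled as 1 or 0.
    snvs_profiled is None when the key (or the table) is missing.
    """
    conn = sqlite3.connect(db_path)
    try:
        # List every table the database knows about.
        rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
        tables = {row[0] for row in rows}
        variability_tables = sorted(t for t in tables if t.startswith("variability_"))

        snvs_profiled = None
        if "self" in tables:
            row = conn.execute(
                "SELECT value FROM self WHERE key = ?", ("SNVs_profiled",)
            ).fetchone()
            if row is not None:
                snvs_profiled = str(row[0]) == "1"
    finally:
        conn.close()
    return snvs_profiled, variability_tables
```

For a database profiled with --skip-SNV-profiling, I would expect `inspect_profile_db("MERGED/PROFILE.db")` to report `False` and an empty list of variability tables, but please treat that as an expectation rather than a guarantee.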

I think this should be considered a bug. anvi-split should be able to split profile databases even if you skipped the SNV profiling upstream. There are two ways to address this in the codebase:

  1. we change anvi-profile so that it adds empty variability_* tables to the database even when you skip variability profiling
  2. we change anvi-split so that it expects to find variability_* tables only if variability profiling was run, i.e., by checking whether SNVs_profiled is set to 0 in the profile self table.

Option (2) sounds like a better solution to me. I will try to implement it and will update here afterwards.
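To make the idea behind option (2) concrete, here is a rough sketch of the conditional logic. This is only an illustration and not the actual change in anvi'o: the function `tables_expected_for_split` and its arguments are hypothetical, and the table names in the usage note are taken from the error message above.

```python
def tables_expected_for_split(all_tables, snvs_profiled):
    """Return the tables a split step should expect to find.

    Hypothetical helper, not anvi'o code. When SNV profiling was skipped
    (snvs_profiled is False), variability_* tables are left out of the
    expected list instead of triggering a missing-table error.
    """
    if snvs_profiled:
        return list(all_tables)
    return [t for t in all_tables if not t.startswith("variability_")]
```

With `snvs_profiled=False`, a list such as `["self", "views", "variability_splits"]` is reduced to `["self", "views"]`, so the split would no longer look for a table that was never created.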

ivagljiva commented 17 hours ago

Okay, I think it is fixed now. At least, it works on my test case when I use profiles created with the --skip-SNV-profiling flag.

@soojunglee98, if you install the development version of anvi'o, you can use it to run anvi-split again and it should hopefully work.