metaGmetapop / metapop

A pipeline for the macro- and micro-diversity analyses and visualization of metagenomic-derived populations
MIT License
37 stars 10 forks source link

local variable 'contig' referenced before assignment #25

Closed wwood closed 1 year ago

wwood commented 1 year ago

Hi,

I'm looking to benchmark metapop against some other variant callers, and am running into some issues unfortunately.

Installation was pretty straightforward through conda, I'm using the 0.0.60 version on PyPI. Since all the dependencies are available on conda already, creating a proper package so you can conda install -c bioconda metapop I imagine wouldn't be hard either. But anyway, I have my install, thanks for making that relatively painless.

I'm using 1 sample and 1 genome. I also just made up a norm file:

$ echo sample1 1000000 |sed 's/ /\t/' >data/10/metapop/norm

Since there's only 1 sample I chose 1 million arbitrarily since I guess it doesn't matter?

But then ran into some issues running it:

$ metapop --input_samples data/10/metapop/bams --reference GCF_003697165.2_genomic.fna                                                                                                                                                                       Using: 128 threads.
Automatically generated normalization file already found!
Combined genomes file already found.
Genes will be predicted with prodigal and placed in Genomes and Genes.                                                                                                        
Error:  no input sequences to analyze.                                                                                                                                        
Traceback (most recent call last):                                                                                                                                              File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/bin/metapop", line 10, in <module>                sys.exit(main())                                                                                                                                                            File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/site-packages/metapop/metapop_main.py", line 251, in main
    mag_contig_dict, mag_length_dict = metapop.metapop_helper_functions.create_mag_log(reference_fasta, output_directory_base+"/MetaPop/00.Log_and_Parameters/mags.txt", treat_as_mag)
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/site-packages/metapop/metapop_helper_functions.py", line 114, in create_mag_log
    ref_files = dir_to_fastas(fasta_dir)
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/site-packages/metapop/metapop_helper_functions.py", line 8, in dir_to_fastas
    for file in os.listdir(directory):
NotADirectoryError: [Errno 20] Not a directory: 'GCF_003697165.2_genomic.fna'

I suppose I read the help message wrong, so making that into a directory of fasta files

$ cp GCF_003697165.2_genomic.fna data/10/metapop/genomes/GCF_003697165.2_genomic.fasta
$ metapop --input_samples data/10/metapop/bams --reference `pwd`/data/10/metapop/genomes --threads 1 --norm data/10/metapop/norm --no_macro
Using: 1 threads.
Combined genomes file already found.
Automatically predicted genes found. Skipping Prediction.
MetaPop Started at: 29/03/2023 13:26:41

Checking: sample1 for correctness... file passed.
Pre-formatting: sample1 started at: 29/03/2023 13:26:41... finished at 29/03/2023 13:26:47
Preparing percent identity for: sample1 starting at 29/03/2023 13:26:47
samtools calmd: Failed to open reference file './MetaPop/01.Genomes_and_Genes/all_genomes.fasta': No such file or directory
Percent identity prepared for: sample1 finished at 29/03/2023 13:26:47
Sorting: sample1 started at: 29/03/2023 13:26:47... finished at 29/03/2023 13:26:47
Filtering genomes. Started at: 29/03/2023 13:26:47
Filtering reads by length and percent identity for: sample1 starting at: 29/03/2023 13:26:47
Getting breadth and depth of coverage for: sample1 starting at: 29/03/2023 13:26:47
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/site-packages/metapop/metapop_filter.py", line 292, in parse_reads
    breadth_and_depth_output(current_depths, contig_length_dict[contig], contig, truncation, breadth_and_depth_fh)
UnboundLocalError: local variable 'contig' referenced before assignment
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/bin/metapop", line 10, in <module>
    sys.exit(main())
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/site-packages/metapop/metapop_main.py", line 297, in main
    metapop.metapop_filter.filt(original_bams, filter_command_base, bdb, threads, mag_contig_dict, mag_length_dict, joined_fastas)
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/site-packages/metapop/metapop_filter.py", line 167, in filt
    p.map(parse_reads, commands)
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
UnboundLocalError: local variable 'contig' referenced before assignment

I'm not sure how to fix that. I tried changing the extension of the fasta file (fna and fasta) but got the same error either way.

Any ideas?

Also, I'm interested in benchmarking runtime, and I don't need per-gene info. What is the fastest way to get a list of predicted SNVs out? It wasn't clear to me, what is the output file? I'm hoping to get a VCF to make comparison easy, so any advice on that would be helpful.

Thanks, ben

metaGmetapop commented 1 year ago

Hi Ben - Sorry for the late response. We're currently doing a major overhaul of the pipeline into Nextflow and have been bad at checking the Github.

Is the genome you're using all in one contig? This is probably messing up the all_genomes.fasta creation. We assume that the path will include mutliple fasta files that are MAGs or a single fasta file with a lot of contigs with each representing a different organisms.

For the SNV data creation, we automatically do it per gene. I don't think we have an easy flag to just get the SNVs at the genome level at this point.

-A

wwood commented 1 year ago

Hi Anne,

Thanks for the reply. I'm using GCF_003697165.2 which is a complete genome, but is a chromosome and plasmid, so 2 contigs.

I will try using another dummy sequence set in another fasta file and see if that works.

wwood commented 1 year ago

That seemed to help a bit, thanks for the tip. But then ran into this. Why is the sample being filtered out?

$ ls refs
GCF_003697165.2_genomic.fna  random.fna
$ metapop --input_samples data/coverage/10/metapop/bams --reference `pwd`/refs  --threads 8
       Using: 8 threads.
Automatically generated normalization file already found!
Combined genomes file already found.
Automatically predicted genes found. Skipping Prediction.
MetaPop Started at: 05/04/2023 07:28:12

Checking: sample1 for correctness...  file passed.
Pre-formatting: sample1 started at: 05/04/2023 07:28:12...  finished at 05/04/2023 07:28:13
Preparing percent identity for: sample1 starting at 05/04/2023 07:28:13
   Percent identity prepared for: sample1 finished at 05/04/2023 07:28:16
Sorting: sample1 started at: 05/04/2023 07:28:16 finished at 05/04/2023 07:28:17
Filtering genomes. Started at: 05/04/2023 07:28:17
Filtering reads by length and percent identity for: sample1 starting at: 05/04/2023 07:28:17
       Getting breadth and depth of coverage for: sample1 starting at: 05/04/2023 07:28:22
                               No genomes survived filtering for: sample1. There will be no further output for this file.
Completed preprocessing for: sample1 at: 05/04/2023 07:28:28
MetaPop preprocessing finished at: 05/04/2023 07:28:28

Traceback (most recent call last):
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/bin/metapop", line 10, in <module>
    sys.exit(main())
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/site-packages/metapop/metapop_main.py", line 313, in main
    joined_fastas, reference_genes = metapop.metapop_snp_call.call_variant_positions(output_directory_base, joined_fastas, min_obs, min_q, min_pct, threads, ref_samp, reference_genes)
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/site-packages/metapop/metapop_snp_call.py", line 76, in call_variant_positions
    pool = multiprocessing.Pool(min(threads, num_samps))
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/multiprocessing/pool.py", line 205, in __init__
    raise ValueError("Number of processes must be at least 1")
ValueError: Number of processes must be at least 1

There is enough breadth on the real genome I think, and imagine 9.9x coverage is enough, no?

$ coverm genome --genome-fasta-files refs/* -b data/coverage/10/metapop/bams/sample1.bam -m mean covered_fraction --output-format sparse
[2023-04-04T21:57:55Z INFO  coverm] CoverM version 0.6.1
[2023-04-04T21:57:55Z INFO  coverm] Using min-covered-fraction 10%
[2023-04-04T21:57:55Z INFO  coverm::genome] Of 6 reference IDs, 6 were assigned to a genome and 0 were not
[2023-04-04T21:57:55Z INFO  coverm::genome] In sample 'sample1', found 335560 reads mapped out of 335560 total (100.00%)

Sample  Genome  Mean    Covered Fraction
sample1 GCF_003697165.2_genomic 9.987179        0.99937
sample1 random  0       0
metaGmetapop commented 1 year ago

Hey Ben - I am going to pass this part over the Kenji. We'll get back to you soon

metaGmetapop commented 1 year ago

Hi the issues is that your genome has 9.9x coverage and would be a fail under our default conditions. 10x is the minimum we have set as the default.

wwood commented 1 year ago

OK, thanks, that would make sense. That seems quite conservative to me, especially if the reads do not come from the same set of samples as the reference (e.g. comparing a metagenome to a reference from GTDB). In my benchmarking the better tools can get recall and precision > 0.97 even when coverage is ~5x for a particular strain.

But OK. Thanks for the helpful responses. Good luck with the nextflow.

wwood commented 1 year ago

Sorry, I could have phrased that previous comment better. I meant

the better tools

in relation to others that are in the benchmark, not in relation to metapop.