Closed wwood closed 1 year ago
Hi Ben - Sorry for the late response. We're currently doing a major overhaul of the pipeline into Nextflow and have been bad at checking the Github.
Is the genome you're using all in one contig? This is probably messing up the all_genomes.fasta creation. We assume that the path will include mutliple fasta files that are MAGs or a single fasta file with a lot of contigs with each representing a different organisms.
For the SNV data creation, we automatically do it per gene. I don't think we have an easy flag to just get the SNVs at the genome level at this point.
-A
Hi Anne,
Thanks for the reply. I'm using GCF_003697165.2 which is a complete genome, but is a chromosome and plasmid, so 2 contigs.
I will try using another dummy sequence set in another fasta file and see if that works.
That seemed to help a bit, thanks for the tip. But then ran into this. Why is the sample being filtered out?
$ ls refs
GCF_003697165.2_genomic.fna random.fna
$ metapop --input_samples data/coverage/10/metapop/bams --reference `pwd`/refs --threads 8
Using: 8 threads.
Automatically generated normalization file already found!
Combined genomes file already found.
Automatically predicted genes found. Skipping Prediction.
MetaPop Started at: 05/04/2023 07:28:12
Checking: sample1 for correctness... file passed.
Pre-formatting: sample1 started at: 05/04/2023 07:28:12... finished at 05/04/2023 07:28:13
Preparing percent identity for: sample1 starting at 05/04/2023 07:28:13
Percent identity prepared for: sample1 finished at 05/04/2023 07:28:16
Sorting: sample1 started at: 05/04/2023 07:28:16 finished at 05/04/2023 07:28:17
Filtering genomes. Started at: 05/04/2023 07:28:17
Filtering reads by length and percent identity for: sample1 starting at: 05/04/2023 07:28:17
Getting breadth and depth of coverage for: sample1 starting at: 05/04/2023 07:28:22
No genomes survived filtering for: sample1. There will be no further output for this file.
Completed preprocessing for: sample1 at: 05/04/2023 07:28:28
MetaPop preprocessing finished at: 05/04/2023 07:28:28
Traceback (most recent call last):
File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/bin/metapop", line 10, in <module>
sys.exit(main())
File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/site-packages/metapop/metapop_main.py", line 313, in main
joined_fastas, reference_genes = metapop.metapop_snp_call.call_variant_positions(output_directory_base, joined_fastas, min_obs, min_q, min_pct, threads, ref_samp, reference_genes)
File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/site-packages/metapop/metapop_snp_call.py", line 76, in call_variant_positions
pool = multiprocessing.Pool(min(threads, num_samps))
File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/multiprocessing/context.py", line 119, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild,
File "/mnt/hpccs01/work/microbiome/lorikeet/mess/1_single_genome_benchmark/.snakemake/conda/da449939095916bafb4160c7e53b94b2_/lib/python3.10/multiprocessing/pool.py", line 205, in __init__
raise ValueError("Number of processes must be at least 1")
ValueError: Number of processes must be at least 1
There is enough breadth on the real genome I think, and imagine 9.9x coverage is enough, no?
$ coverm genome --genome-fasta-files refs/* -b data/coverage/10/metapop/bams/sample1.bam -m mean covered_fraction --output-format sparse
[2023-04-04T21:57:55Z INFO coverm] CoverM version 0.6.1
[2023-04-04T21:57:55Z INFO coverm] Using min-covered-fraction 10%
[2023-04-04T21:57:55Z INFO coverm::genome] Of 6 reference IDs, 6 were assigned to a genome and 0 were not
[2023-04-04T21:57:55Z INFO coverm::genome] In sample 'sample1', found 335560 reads mapped out of 335560 total (100.00%)
Sample Genome Mean Covered Fraction
sample1 GCF_003697165.2_genomic 9.987179 0.99937
sample1 random 0 0
Hey Ben - I am going to pass this part over the Kenji. We'll get back to you soon
Hi the issues is that your genome has 9.9x coverage and would be a fail under our default conditions. 10x is the minimum we have set as the default.
OK, thanks, that would make sense. That seems quite conservative to me, especially if the reads do not come from the same set of samples as the reference (e.g. comparing a metagenome to a reference from GTDB). In my benchmarking the better tools can get recall and precision > 0.97 even when coverage is ~5x for a particular strain.
But OK. Thanks for the helpful responses. Good luck with the nextflow.
Sorry, I could have phrased that previous comment better. I meant
the better tools
in relation to others that are in the benchmark, not in relation to metapop.
Hi,
I'm looking to benchmark metapop against some other variant callers, and am running into some issues unfortunately.
Installation was pretty straightforward through conda, I'm using the 0.0.60 version on PyPI. Since all the dependencies are available on conda already, creating a proper package so you can
conda install -c bioconda metapop
I imagine wouldn't be hard either. But anyway, I have my install, thanks for making that relatively painless.I'm using 1 sample and 1 genome. I also just made up a norm file:
Since there's only 1 sample I chose 1 million arbitrarily since I guess it doesn't matter?
But then ran into some issues running it:
I suppose I read the help message wrong, so making that into a directory of fasta files
I'm not sure how to fix that. I tried changing the extension of the fasta file (fna and fasta) but got the same error either way.
Any ideas?
Also, I'm interested in benchmarking runtime, and I don't need per-gene info. What is the fastest way to get a list of predicted SNVs out? It wasn't clear to me, what is the output file? I'm hoping to get a VCF to make comparison easy, so any advice on that would be helpful.
Thanks, ben