Closed dscheah closed 3 years ago
Hey @dscheah.
Yes, pplacer needs a lot of memory. I've read that the memory usage is roughly proportional to the number of threads used. Can you try decreasing the threads and increasing the memory to the maximum your cluster system allows?
The arguments can be defined in the config file with threads and large_mem.
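For example, a config.yaml adjusting those limits might look like the sketch below; the values are placeholders (fewer threads, as much memory as the node actually provides), not recommendations:

```yaml
# Placeholder values: decrease threads, raise large_mem to your node's maximum (GB)
threads: 4
large_mem: 250
```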
You said you are using atlas in a cloud environment. Are you using the submission options that snakemake offers?
Note to myself:
To the question : Is genecatalog_atlas extension a solution for pplacer memory issue?
The genecatalog extension will give you a gene catalog, more or less the same as atlas run genecatalog.
This workflow doesn't depend on pplacer. You can get some taxonomic information from the annotation, but not with the same precision as with the MAGs.
I suggest you try to get pplacer working.
Hi @SilasK, thanks for replying! Yes, I'd already changed the number of threads and memory before I carried out atlas run, but still received the insufficient memory error at the pplacer step.
threads: 8
mem: 35
large_mem: 35
large_threads: 8
assembly_threads: 8
assembly_memory: 35
No, I have not tried the Snakemake submission options, so I'll try them out to see if they fix the problem. However, if Snakemake execution in the cloud requires Kubernetes, wouldn't the atlas run command have ceased to work from the beginning?
Thanks very much, Darren
You only have 35 GB of memory on your cloud system? Hmm.
I fear it won't easily be possible to use the taxonomic classification of GTDB. https://github.com/Ecogenomics/GTDBTk/issues/58
What you can do is remove the gtdb annotation from the list of annotations in the config file. Then you should be able to run atlas until the end. You can add the taxonomy based on CheckM instead; it also uses pplacer, but with a smaller tree.
annotations:
- genes
- checkm_taxonomy
- checkm_tree
Thanks @SilasK, I've changed the config file accordingly and will let you know how it goes. Yes, unfortunately we only have 96gb of memory on our research group's cloud system, and half of that is already being utilised. On a side note, I read in the attached issue thread that --mmap-file was proposed as an option for limiting the amount of memory used by GTDB-tk; was that change able to be implemented?
The pipeline managed to work when taxonomy annotations were changed to checkm. When I checked the genomes/checkm/taxonomy.tsv for the taxonomic data, there were only 16 genomes annotated (taxonomy.txt). Is there another file generated that has the complete metagenome? Or will I need to rerun the workflow with altered parameters?
Thanks very much for your help.
Atlas is designed to assemble new genomes de novo from metagenomes where there are few available reference genomes (almost all metagenomes). Apparently, you only managed to assemble 16 genomes (species) with good quality for your dataset. You can check the genomes/genomes folder for the fasta files.
The genomes are quantified in all your samples and the results are in genomes/counts
What more would you like? Where do your samples come from?
I would like to determine the dominant archaea in the system as well. My samples are groundwater and crude oil samples from a subsurface petroleum reservoir, so I am trying to uncover the microbial players and dynamics within that reservoir system; the archaea are the methane producers in this anaerobic environment. I also used MG-RAST to annotate my sequences, and the output showed that archaea comprise less than 2% of total reads. I presume that such low proportions would not generate good-quality reads for annotation?
Thanks for your reply!
You won’t get enough data to get good genomes for the archaea. You can always look at Genecatalog/annotations/EggNOG.tsv; genes are annotated with a taxonomy. And you might also quantify the genes if you want using:
atlas run None "Genecatalog/counts/median_coverage.tsv.gz"
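As a rough illustration of combining those two outputs, the sketch below joins a gene-level taxonomy against per-sample coverage and computes the archaeal share of total coverage. The dictionaries, column semantics, and the helper domain_fraction are all hypothetical stand-ins; adapt them to the real EggNOG.tsv and median_coverage.tsv.gz columns.

```python
# Hypothetical stand-ins for Genecatalog/annotations/EggNOG.tsv (taxonomy)
# and Genecatalog/counts/median_coverage.tsv.gz (coverage in one sample);
# the real files have different headers and many more columns.
gene_taxonomy = {
    "Gene0001": "Bacteria",
    "Gene0002": "Archaea",
    "Gene0003": "Archaea",
}
median_coverage = {
    "Gene0001": 120.0,
    "Gene0002": 3.5,
    "Gene0003": 1.5,
}

def domain_fraction(taxonomy, coverage, domain):
    """Fraction of total coverage carried by genes assigned to one domain."""
    total = sum(coverage.values())
    hits = sum(cov for gene, cov in coverage.items()
               if taxonomy.get(gene) == domain)
    return hits / total if total else 0.0

# With the toy numbers above, archaeal genes carry 5.0 of 125.0 total coverage.
print(domain_fraction(gene_taxonomy, median_coverage, "Archaea"))  # 0.04
```

Even when archaea are too rare to assemble into MAGs, this kind of gene-level tally can still estimate their relative abundance.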
If I understand correctly, the genes that were annotated with the de novo assembled genomes provide the predicted functions for each of the characterised species (quality assembled genomes) in the sample. According to the Atlas documentation, these data are found in Genecatalog/gene_catalog.faa and Genecatalog/gene_catalog.fna? But those are essentially fasta files of gene sequences? And Genecatalog/annotations/EggNOG.tsv contains all genes predicted from the contigs, and hence does not have an accurate association with particular species' genomes?
Thanks again for your help!
It depends on your configuration.
If you have:
genecatalog:
  source: contigs
all genes from all contigs (including your archaea) will be clustered and annotated.
If you have:
genecatalog:
  source: genomes
only the genes on the 16 genomes are analysed.
In any case, the file genomes/annotations/gene2genome.tsv.gz tells you which genes are on which genome.
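For downstream work in Python (or as a template to translate to R), that gene-to-genome table can be loaded roughly like this; the column names gene_id and genome are assumptions, so check the real file's header first:

```python
import csv
import gzip

def load_gene2genome(path):
    """Parse a gzipped TSV mapping genes to genomes into a dict.

    Assumes columns named 'gene_id' and 'genome'; the actual
    gene2genome.tsv.gz header may differ.
    """
    mapping = {}
    with gzip.open(path, "rt") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            mapping[row["gene_id"]] = row["genome"]
    return mapping

# Tiny synthetic file standing in for genomes/annotations/gene2genome.tsv.gz
with gzip.open("gene2genome_example.tsv.gz", "wt") as out:
    out.write("gene_id\tgenome\nGene0001\tMAG01\nGene0002\tMAG02\n")

print(load_gene2genome("gene2genome_example.tsv.gz"))
```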
The source configuration for genecatalog was set to contigs, so I changed it to source: genomes and reran the command:
atlas run -w working/directory -c path/to/config.yaml genecatalog
However I received the following output:
Building DAG of jobs...
Updating job 5 (combine_egg_nogg_annotations).
Nothing to be done.
Complete log:
/data/work/darrenc/20181025_Metagenomes/Enfield/En_P1_CO/20200201_run/.snakemake/log/2020-02-13T034532.194439.snakemake.log
ATLAS finished
The last rule shows you the main output files
Would I need to rerun the whole pipeline to obtain the genome-specific gene catalog? Or do I create a new working directory?
Thanks very much!
You have the genome-specific gene catalog in genomes/annotations/gene2genome.tsv.gz.
What more do you want :-)? Do you have some Python experience?
Unfortunately I couldn't find the gene2genome.tsv.gz file; I also tried the find command within the working directory but nothing came up. Would it be under DC8-CD27KANXX-CAGATC-L008/annotation/predicted_genes? However, there is no gene2genome.tsv.gz there, only the following files:
DC8-CD27KANXX-CAGATC-L008.faa
DC8-CD27KANXX-CAGATC-L008.faa.bacteria.scg
DC8-CD27KANXX-CAGATC-L008.gff
DC8-CD27KANXX-CAGATC-L008.faa.archaea.scg
DC8-CD27KANXX-CAGATC-L008.fna
DC8-CD27KANXX-CAGATC-L008.tsv
So that's why I tried atlas run genecatalog: I thought the genome-specific annotated genes had not been produced because the source configuration for the catalog had not been set to genomes. However, since the output came out as "Nothing to be done", I assume it must already have been run, and yet I can't seem to find the file.
Unfortunately, my Python experience is quite limited. I'm more experienced with R and Unix.
Thanks!
If you run the latest version of atlas (on conda) with atlas run all, you should get workingdir/genomes/annotations/gene2genome.tsv.gz. The log should tell you that it was generated.
I am currently using Atlas v2.3.0 on Conda, which I think is the latest version? This is the log that was generated after I ran atlas run genecatalog. Perhaps I'm wrong, but it doesn't seem that the gene2genome file was generated?
You need to run atlas run all to get workingdir/genomes/annotations/gene2genome.tsv.gz. It has to be all because that target integrates the genomes and the genecatalog.
I reran atlas run all, but the file did not have any information. I'm not sure where I went wrong... I had already changed the config for genecatalog to source: genomes.
Thanks for your help!
Set it back to source: contigs, remove genomes/annotations/gene2genome.tsv.gz, and run atlas run all again.
Hello! I am using metagenome-atlas V2.3.0 on a cloud computing system. I recently ran into an issue where the pplacer package required more memory than was available in the cloud.
Error in rule classify:
    jobid: 16
    output: genomes/taxonomy/gtdb/classify
    log: logs/taxonomy/gtdbtk/classify.txt, genomes/taxonomy/gtdb/gtdbtk.log (check log file(s) for error message)
    conda-env: /data/work/darrenc/20181025_Metagenomes/Enfield/En_P1_CO/20200201_databases/conda_envs/83610501
    shell:
        $nomes/taxonomy/gtdb --out_dir genomes/taxonomy/gtdb --extension fasta --cpus 8 &> logs/taxonomy/gtdbtk/classify.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
[2020-02-06 10:32:08] INFO: GTDB-Tk v1.0.2
[2020-02-06 10:32:08] INFO: gtdbtk classify --genome_dir genomes/genomes --align_dir genomes/taxonomy/gtdb --out_dir genomes/taxonomy/gtdb --extension fasta --cpus 8
[2020-02-06 10:32:08] INFO: Using GTDB-Tk reference data version r89: /data/work/darrenc/20181025_Metagenomes/Enfield/En_P1_CO/20200201_databases/GTDB-TK
[2020-02-06 10:32:08] WARNING: pplacer requires ~113 GB of RAM to fully load the bacterial tree into memory. However, 49.45GB was detected. This may affect pplacer performance.
[2020-02-06 10:32:08] INFO: Placing 16 bacterial genomes into reference tree with pplacer (be patient).
[2020-02-06 10:44:49] ERROR: Controlled exit resulting from an unrecoverable error or warning.
I found that memory usage could be constrained by adding a --mmap-file flag to the pplacer command. Since I couldn't find any option for changing pplacer options in the config.yaml file, I decided to look into the genecatalog extension, where I noticed in the default_config.yaml file that memory usage can be constrained. I tried to execute a dry-run with the following code:
snakemake --use-conda -n --profile /home/darrenc/.config/snakemake/darrenc \
    --snakefile /home/darrenc/genecatalog_atlas/Snakefile \
    --directory /data/work/darrenc/20181025_Metagenomes/Enfield/En_P1_CO/20200201_run
But I have been receiving the following error as output:
KeyError in line 29 of /home/darrenc/genecatalog_atlas/Snakefile:
'database_dir'
  File "/home/darrenc/genecatalog_atlas/Snakefile", line 29, in <module>
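The KeyError suggests the genecatalog_atlas Snakefile looks up a database_dir entry that is missing from the config passed to it. Adding something like the following to that workflow's config may get past this particular error; the path shown is only a placeholder:

```yaml
# Placeholder path; point this at the directory holding your atlas databases
database_dir: /path/to/databases
```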
Any ideas for working around the pplacer memory issue? Am I going about it the right way by using the genecatalog extension? I did find some mention of pplacer using extensive memory in a previous issue (Test atlas #260), but I don't believe it was the main problem in that particular thread.
Any advice that you're able to give me will be greatly appreciated!
Thanks, Darren