metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

pplacer memory issue? #280

Closed dscheah closed 3 years ago

dscheah commented 4 years ago

Hello! I am using metagenome-atlas V2.3.0 on a cloud computing system. I recently ran into an issue where the pplacer package required more memory than was available in the cloud.

Error in rule classify:
    jobid: 16
    output: genomes/taxonomy/gtdb/classify
    log: logs/taxonomy/gtdbtk/classify.txt, genomes/taxonomy/gtdb/gtdbtk.log (check log file(s) for error message)
    conda-env: /data/work/darrenc/20181025_Metagenomes/Enfield/En_P1_CO/20200201_databases/conda_envs/83610501
    shell:
        $nomes/taxonomy/gtdb --out_dir genomes/taxonomy/gtdb --extension fasta --cpus 8 &> logs/taxonomy/gtdbtk/classify.txt
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

[2020-02-06 10:32:08] INFO: GTDB-Tk v1.0.2
[2020-02-06 10:32:08] INFO: gtdbtk classify --genome_dir genomes/genomes --align_dir genomes/taxonomy/gtdb --out_dir genomes/taxonomy/gtdb --extension fasta --cpus 8
[2020-02-06 10:32:08] INFO: Using GTDB-Tk reference data version r89: /data/work/darrenc/20181025_Metagenomes/Enfield/En_P1_CO/20200201_databases/GTDB-TK
[2020-02-06 10:32:08] WARNING: pplacer requires ~113 GB of RAM to fully load the bacterial tree into memory. However, 49.45GB was detected. This may affect pplacer performance.
[2020-02-06 10:32:08] INFO: Placing 16 bacterial genomes into reference tree with pplacer (be patient).
[2020-02-06 10:44:49] ERROR: Controlled exit resulting from an unrecoverable error or warning.

I found that memory usage could be constrained by adding a --mmap-file flag to the pplacer command. Since I couldn't find any option for changing pplacer options in the config.yaml file, I looked into the genecatalog extension, where I noticed in the default_config.yaml file that memory usage can be constrained. I tried to execute a dry run with the following command:

snakemake --use-conda -n \
    --profile /home/darrenc/.config/snakemake/darrenc \
    --snakefile /home/darrenc/genecatalog_atlas/Snakefile \
    --directory /data/work/darrenc/20181025_Metagenomes/Enfield/En_P1_CO/20200201_run

But I have been receiving the following error as output:

KeyError in line 29 of /home/darrenc/genecatalog_atlas/Snakefile: 'database_dir'
  File "/home/darrenc/genecatalog_atlas/Snakefile", line 29, in <module>

Any ideas for working around the pplacer memory issue? Am I going about it the right way by using the genecatalog extension? I did find some mention of pplacer using extensive memory in a previous issue (Test atlas #260), but I don't believe it was the main problem in that particular thread.

Any advice that you're able to give me will be greatly appreciated!

Thanks, Darren

SilasK commented 4 years ago

Hey @dscheah .

pplacer memory issue in atlas:

Yes, pplacer needs a lot of memory. I've read that the memory usage is roughly proportional to the number of threads used. Can you try to decrease the threads and increase the memory to the maximum your cluster system allows? The arguments can be defined in the config file with threads and large_mem; see the sketch below.
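For illustration, a minimal sketch of the relevant config.yaml keys; the values below are assumptions for a node with roughly 120 GB of RAM, not recommendations:

# fewer threads for the memory-hungry jobs, since pplacer's memory use scales with threads
large_threads: 2
# as much memory (GB) as a single node can offer
large_mem: 115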

Cloud execution:

You said you are using atlas in a cloud environment. Are you using the submission options that snakemake offers?
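A sketch of what that could look like, assuming that extra arguments to atlas run are handed through to snakemake (as the snakemake-based CLI suggests), that a site-specific snakemake profile named cluster exists, and that the scheduler is qsub-style; all three are assumptions:

# submit jobs through a snakemake profile
atlas run all --profile cluster

# or hand snakemake a cluster submission command directly
atlas run all --cluster "qsub -pe smp {threads} -l h_vmem={resources.mem}G"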

SilasK commented 4 years ago

Note to myself:

SilasK commented 4 years ago

To the question: Is the genecatalog_atlas extension a solution for the pplacer memory issue?

The genecatalog extension will give you a gene catalog, more or less the same as atlas run genecatalog. This workflow doesn't depend on pplacer. You can get some taxonomic information from the annotation, but not with the same precision as with the MAGs.

I suggest you try to get pplacer working.

dscheah commented 4 years ago

Hi @SilasK, thanks for replying! Yes, I'd already changed the number of threads and the memory before I carried out atlas run, but I still received the insufficient-memory error at the pplacer step.

Execution parameters:

# threads and memory (GB) for most jobs, especially from BBtools, which are memory demanding
threads: 8
mem: 35

# threads and memory for jobs needing a high amount of memory, e.g. GTDB-Tk, CheckM or assembly
large_mem: 35
large_threads: 8
assembly_threads: 8
assembly_memory: 35

No, I have not tried the snakemake submission options, so I'll try them out to see if that fixes the problem. However, if snakemake execution in the cloud requires Kubernetes, wouldn't the atlas run command have ceased to work from the beginning?

Thanks very much, Darren

SilasK commented 4 years ago

You only have 35 GB of memory on your cloud system? Hmm.

I fear it won't easily be possible to use the taxonomic classification of GTDB. https://github.com/Ecogenomics/GTDBTk/issues/58

What you can do is remove the gtdb annotation from the list of annotations in the config file; then you should be able to run atlas to the end. You can add the taxonomy based on CheckM instead; it also uses pplacer, but with a smaller tree:

annotations:
  - genes
  - checkm_taxonomy
  - checkm_tree
dscheah commented 4 years ago

Thanks @SilasK, I've changed the config file accordingly and will let you know how it goes. Yes, unfortunately we only have 96 GB of memory on our research group's cloud system, and half of that is already being utilised. On a side note, I read in the linked issue thread that --mmap-file was proposed as an option for limiting the amount of memory used by GTDB-Tk; was that change ever implemented?

dscheah commented 4 years ago

The pipeline managed to finish when the taxonomy annotations were changed to checkm. When I checked genomes/checkm/taxonomy.tsv for the taxonomic data, there were only 16 genomes annotated (taxonomy.txt). Is there another file generated that covers the complete metagenome? Or will I need to rerun the workflow with altered parameters?

Thanks very much for your help.

SilasK commented 4 years ago

Atlas is designed to assemble genomes de novo from metagenomes for which few reference genomes are available (almost all metagenomes). Apparently, you only managed to assemble 16 genomes (species) with good quality from your dataset. You can check the genomes/genomes folder for the fasta files.

The genomes are quantified in all your samples and the results are in genomes/counts
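For a first look at those outputs, a generic shell sketch; the exact file names inside genomes/counts depend on the atlas version, so median_coverage_genomes.tsv.gz below is an assumption:

# list the quantification tables
ls genomes/counts/

# preview the first rows and columns of one table (adjust the name to what ls shows)
zcat genomes/counts/median_coverage_genomes.tsv.gz | head -n 5 | cut -f 1-5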

What more would you like? Where do your samples come from?

dscheah commented 4 years ago

I would like to determine the dominant archaea in the system as well. My samples are groundwater and crude-oil samples from a subsurface petroleum reservoir, so I am trying to uncover the microbial players and dynamics within that reservoir system; the archaea are the methane producers in this anaerobic environment. I had also used MG-RAST to annotate my sequences, and the output showed that the archaea comprise less than 2% of total reads. I presume that such low proportions would not yield good-quality genomes for annotation?

Thanks for your reply!

SilasK commented 4 years ago

You won’t get enough data to get good genomes for the archaea. You can always look at Genecatalog/annotations/EggNOG.tsv; the genes are annotated with a taxonomy. And you might also quantify the genes, if you want, using:

atlas run None "Genecatalog/counts/median_coverage.tsv.gz"

dscheah commented 4 years ago

If I understand correctly, the genes that were annotated with the de novo assembled genomes provide the predicted functions for each of the characterised species (quality-assembled genomes) in the sample. According to the atlas documentation, these data are found in Genecatalog/gene_catalog.faa and Genecatalog/gene_catalog.fna? But those are essentially fasta files of gene sequences? And Genecatalog/annotations/EggNOG.tsv contains all genes predicted from the contigs, and hence these do not have an accurate association with particular species' genomes?

Thanks again for your help!

SilasK commented 4 years ago

It depends on your configuration:

If you have:

genecatalog:
  source: contigs

All genes from all contigs (including your Archaea) will be clustered and annotated.

If you have

genecatalog:
  source: genomes

Only the genes in the 16 genomes are analysed.

In any case, the file

genomes/annotations/gene2genome.tsv.gz

tells you which genes are on which genome.
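For example, a quick shell peek at that mapping (a generic sketch; the exact column layout may differ between atlas versions):

# show the first few gene-to-genome assignments
zcat genomes/annotations/gene2genome.tsv.gz | head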

dscheah commented 4 years ago

The source configuration for genecatalog was set to contigs, so I changed it to source: genomes and reran the command:

atlas run -w working/directory -c path/to/config.yaml genecatalog

However I received the following output:

Building DAG of jobs...
Updating job 5 (combine_egg_nogg_annotations).
Nothing to be done.
Complete log: /data/work/darrenc/20181025_Metagenomes/Enfield/En_P1_CO/20200201_run/.snakemake/log/2020-02-13T034532.194439.snakemake.log
ATLAS finished
The last rule shows you the main output files

Would I need to rerun the whole pipeline to obtain the genome-specific gene catalog? Or should I create a new working directory?

Thanks very much!

SilasK commented 4 years ago

You have the genome-specific gene catalog in genomes/annotations/gene2genome.tsv.gz

What more do you want :-)? Do you have some Python experience?

dscheah commented 4 years ago

Unfortunately, I couldn't find the gene2genome.tsv.gz file; I also tried the find command within the working directory, but nothing came up. Would it be under DC8-CD27KANXX-CAGATC-L008/annotation/predicted_genes? There is no gene2genome.tsv.gz there, only the following files:

DC8-CD27KANXX-CAGATC-L008.faa
DC8-CD27KANXX-CAGATC-L008.faa.bacteria.scg
DC8-CD27KANXX-CAGATC-L008.gff
DC8-CD27KANXX-CAGATC-L008.faa.archaea.scg
DC8-CD27KANXX-CAGATC-L008.fna
DC8-CD27KANXX-CAGATC-L008.tsv

So that's why I tried atlas run genecatalog: I thought the genome-specific annotated genes had not been produced because the source configuration for the gene catalog had not been set to genomes. However, since the output said Nothing to be done, I assume it must have already been run, yet I can't seem to find the file.

Unfortunately, my Python experience is quite limited. I'm more experienced with R and Unix.

Thanks!

SilasK commented 4 years ago

If you run the latest version of atlas (on conda) with atlas run all, you should get workingdir/genomes/annotations/gene2genome.tsv.gz. The log should tell you that it was generated.

dscheah commented 4 years ago

I am currently using atlas v2.3.0 on conda, which I think is the latest version? This is the log that was generated after I ran atlas run genecatalog. Perhaps I'm wrong, but it doesn't seem that the gene2genome file was generated?

SilasK commented 4 years ago

You need to run atlas run all to get workingdir/genomes/annotations/gene2genome.tsv.gz.

It has to be all because that target integrates the genomes and the gene catalog.

dscheah commented 4 years ago

I reran atlas run all, but the file did not contain any information? I'm not sure where I went wrong... I had already changed the genecatalog config to source: genomes.

Thanks for your help!

SilasK commented 4 years ago

Set it back to source: contigs, remove genomes/annotations/gene2genome.tsv.gz, and run atlas run all again.
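Spelled out as shell commands, a sketch that assumes you run it from the atlas working directory and that the config lives there in config.yaml:

# 1. revert the gene catalog source (or edit config.yaml by hand)
sed -i 's/source: genomes/source: contigs/' config.yaml

# 2. remove the stale output so snakemake regenerates it
rm genomes/annotations/gene2genome.tsv.gz

# 3. rerun the full workflow
atlas run -w . -c config.yaml all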