vmaffei / dada2_to_picrust

Experimental pipeline to perform de novo PICRUSt on de-noised amplicon sequence variants (ASV)
19 stars 1 forks source link

format_tree_and_trait_table.py memory issue #10

Closed nebfield closed 6 years ago

nebfield commented 6 years ago

Hi,

Thanks for making this pipeline, it looks great! I'm having trouble with the following step in the README:

# format kegg IMG data
format_tree_and_trait_table.py -t ./genome_prediction/study_tree.tree -i gg_ko_counts.tab -o ./genome_prediction/format/KEGG/

I have a machine with 32GB of RAM and 10GB of swap but the job keeps getting killed because it runs out of memory. As a guideline what sort of memory do I need to run this command? I'm using the 97_otus.fasta file and am only trying to run a few hundred sequences from dada2 through picrust.

Thanks, Ben

vmaffei commented 6 years ago

Hi nebfield, thanks for posting! I think the easiest way around this will be to predict ASV functions using the original PICRUSt KEGG reference table. This should reduce the memory requirement at the format_tree_and_trait_table.py step. Running the pipeline this way reduces Spearman rho accuracy only slightly. Frankly, it may make more sense to list this as the default run option given the diverse memory issues posted on the repo thus far. When you get a chance, replace Part 0 and 1 with the steps listed below, and let me know how it goes!

Run DADA2 -> PICRUSt using picrust starting files

Collect input files:

http://greengenes.secondgenome.com/downloads/database/13_5

Part 0

# move files to genome_prediction/
mv gg_13_5.fasta genome_prediction/
mv gg_13_5_img.txt genome_prediction/
# unzip starting files
cp picrust/tutorials/picrust_starting_files.zip genome_prediction/
cd genome_prediction/
unzip picrust_starting_files.zip

Part 1: Start with R

# Dependencies
library(ShortRead)
library(biom) # use Joey's biom latest dev version; library(devtools); install_github("joey711/biom")
library(dada2)
# 1) Make study db
# read in gg 13_5 db
ref <- readFasta("genome_prediction/gg_13_5.fasta")
# read in IMG to ko
img_ko <- read.table(file = "genome_prediction/picrust_starting_files/IMG_ko_counts.tab", sep = "\t", row.names = 1, header = TRUE, comment.char = '')
# read in gg id to IMG
id_img <- read.table(file = "genome_prediction/gg_13_5_img.txt", sep = "\t", header = TRUE, comment.char = '')
# load study seqs and table
load(file = "seqtab.nochim.robj")
# intersect gg id and ko list
img_ko_sub <- img_ko[rownames(img_ko) %in% id_img$img_genome_id, ]
id_img_sub <- id_img[id_img$img_genome_id %in% rownames(img_ko_sub), ]
# pull db ids
ids <- id(ref)
# subset db for gg w/ ko
ref_sub <- ref[as.character(ids) %in% id_img_sub$gg_id, ]
# build study db
ids_db <- as.character(id(ref_sub))
seqs_db <- as.character(sread(ref_sub))
# grab study seqs
seqs_study <- colnames(seqtab.nochim)
ids_study <- paste("study", 1:ncol(seqtab.nochim), sep = "_")
# merge db and study seqs
ids_out <- c(ids_db, ids_study)
seqs_out <- c(seqs_db, seqs_study)
fasta <- ShortRead(sread = DNAStringSet(seqs_out), id = BStringSet(ids_out))
writeFasta(fasta, file = "genome_prediction/gg_13_5_study_db.fasta")
# 2) output seq variant count data as biom;
seqtab_biom <- t(seqtab.nochim)
rownames(seqtab_biom) <- ids_study
biom_object <- biom::make_biom(data = seqtab_biom)
biom::write_biom(biom_object, biom_file = "genome_prediction/sample_counts.biom")

Continue with Parts 2-4 as usual

note: replace Part 3 w/ 5 if another memory error occurs

nebfield commented 6 years ago

Thanks for the help! I'm getting a different error now from the same command:

format_tree_and_trait_table.py -t study_tree.tree -i gg_16S_counts.tab -o $PWD
RuntimeError: filter_tree_tips_by_presence_in_table:  operation would remove all tips.  Is this due to a formatting error in inputs?
vmaffei commented 6 years ago

Hey nebfield, give this a shot instead. Replace both format_tree_and_trait_table steps in Part 3 with:

# format 16S copy number data
format_tree_and_trait_table.py -t ./genome_prediction/study_tree.tree -i ./genome_prediction/picrust_starting_files/IMG_16S_counts.tab -o ./genome_prediction/format/16S/ -m ./genome_prediction/gg_13_5_img.txt
# format kegg IMG data
format_tree_and_trait_table.py -t ./genome_prediction/study_tree.tree -i ./genome_prediction/picrust_starting_files/IMG_ko_counts.tab -o ./genome_prediction/format/KEGG/ -m ./genome_prediction/gg_13_5_img.txt
nebfield commented 6 years ago

Hi - still getting the same error.

format_tree_and_trait_table.py -t study_tree.tree -i IMG_16S_counts.tab -o $PWD -m gg_13_5_img.txt
RuntimeError: filter_tree_tips_by_presence_in_table:  operation would remove all tips.  Is this due to a formatting error in inputs?
vmaffei commented 6 years ago

Mind if I take a look at your study_tree.tree file? When you get a chance, gzip it and attach it in a post.

nebfield commented 6 years ago

Sorry for the delay. Thanks!

vmaffei commented 6 years ago

Thanks! It looks like the greengenes sequences aren't making it into study_tree.tree When you run the revised Part 0 - 1 above, do you run into any errors? When you get a chance, double check that gg_13_5_study_db.fasta consists of both study seqs and ~33,000 greengenes seqs. Let me know what you find!

nebfield commented 6 years ago

I found the problem:

ref_sub <- ref[as.character(ids) %in% id_img_sub$gg_id, ]

ref_sub was silently being set to null. The first character in my gg_13_5_img.txt file was a hash, which made R rename id_img_sub$gg_id to id_img_sub$X.gg_id.

I'm running through the rest of the pipeline with no issues so far. Thank you for all your help, I really appreciate it!