vmaffei / dada2_to_picrust

Experimental pipeline to perform de novo PICRUSt on de-noised amplicon sequence variants (ASV)
19 stars 1 forks source link

Input to part 3 (format_tree_and_trait_table.py) #14

Closed fanli-gcb closed 6 years ago

fanli-gcb commented 6 years ago

The documentation has this command at the start of Part 3:

# format 16S copy number data
format_tree_and_trait_table.py -t ./genome_prediction/study_tree.tree -i gg_16S_counts.tab -o ./genome_prediction/format/16S/

If I understand correctly, the input file is gg_16S_counts.tab, which only contains the Greengenes reference OTUs and not any from your specific study. I'm following through the rest of Part 3 and it seems like the trait_table.tab and output from predict_traits.py would not have any of the study-specific OTUs. Is this the intended result?

vmaffei commented 6 years ago

Hi fanli, thanks for posting! There are two input files for this command (although only one of them takes the -i option): both gg_16S_counts.tab and study_tree.tree. This script simply prepares the study-specific tree and 16S count table data for the copy number predictions that will (eventually) be made in the predict traits step. The same occurs in the subsequent step in Part 3 where format_tree_and_trait_table.py prepares the remaining KEGG count data and the study-specific tree for predictions on those genes.

vmaffei commented 6 years ago

To answer your question: the final 16S (and KEGG) trait tables 16S_precalculated.tab and ko_precalculated.tab, which come out of predict_traits.py will contain a mixture of both study ids and greengenes OTU ids (if the -l flag is omitted). However, the trait_table.tab, which comes out of format_tree_and_trait_table.py won't have study specific ids, but will have greengenes ids.

fanli-gcb commented 6 years ago

Thanks for getting back so quickly @vmaffei

That makes more sense now - I think I have a better understanding of the entire process now. This came about because I keep getting memory errors when running predict_traits for the 16S table. But the batched KEGG part seems to work fine. I'll keep playing around with it and update this thread.

Thanks again!

vmaffei commented 6 years ago

No problem! Please let me know if you have any more questions (or further problems!) Happy to help you if you run into any hang-ups.

fanli-gcb commented 6 years ago

Hmm, still getting a memory error. The machine has 128GB memory, and I was able to do the batched KEGG predict_traits just fine.

> predict_traits.py -i ${PROJECT_DIR}/format/16S/trait_table.tab \
-t ${PROJECT_DIR}/format/16S/reference_tree.newick \
-r ${PROJECT_DIR}/asr/16S_asr_counts.tab \
-o ${PROJECT_DIR}/predict_traits/16S_precalculated.tab \
-a -c ${PROJECT_DIR}/asr/asr_ci_16S.tab \
-l ${PROJECT_DIR}/sample_counts.tab

yields:

Traceback (most recent call last):
  File "/home/fanli/miniconda3/envs/python27/bin/predict_traits.py", line 451, in <module>
    main()
  File "/home/fanli/miniconda3/envs/python27/bin/predict_traits.py", line 280, in main
    trait_label = trait_label, verbose=opts.verbose)
  File "/home/fanli/miniconda3/envs/python27/lib/python2.7/site-packages/picrust/predict_traits.py", line 735, in calc_nearest_sequenced_taxon_index
    tree.tipToTipDistances(annot_plus_to_predict)
  File "build/bdist.linux-x86_64/egg/cogent/core/tree.py", line 2065, in tipToTipDistances
MemoryError

My environment looks like this:

> python --version
Python 2.7.12 :: Anaconda custom (64-bit)
> predict_traits.py --version
Version: predict_traits.py 1.1.1

As far as I can tell, the input files look correct. The first few lines of trait_table.tab:

taxon_oid       16S_rRNA_Count
370251  5.0
4473812 2.0
3209311 3.0
4473818 4.0

reference_tree.newick is a ~4.7MB file that appears to be in proper Newick format.

First few lines of 16S_asr_counts.tab:

nodes   16S_rRNA_Count
root    3.42829027427299
0.957   2.69738935893439
0.875   2.67872150777865
0.615   2.29705316752189
0.992   1.95908412344828

First few lines of asr_ci_16S.tab:

nodes   16S_rRNA_Count
root    3.2159|3.6407
0.957   2.4699|2.9249
0.875   2.4132|2.9442
0.615   1.9971|2.597
0.992   1.6474|2.2708

trait_table.tab, 16S_asr_counts.tab, and asr_ci_16S.tab all have the same number of lines (99322)

sample_counts.tab is the tab-delimited version of the BIOM file generated in Part 1 (mine has 4358 study-specific OTUs)

@vmaffei any suggestions?

Thanks!

vmaffei commented 6 years ago

Try rerunning predict_traits.py as follows:

predict_traits.py -i ${PROJECT_DIR}/format/16S/trait_table.tab \
-t ${PROJECT_DIR}/format/16S/reference_tree.newick \
-r ${PROJECT_DIR}/asr/16S_asr_counts.tab \
-o ${PROJECT_DIR}/predict_traits/16S_precalculated.tab \
-l ${PROJECT_DIR}/sample_counts.tab

This is the same as above minus the -a and -c options, which increase memory requirements quite a bit. Let me know if this works for you!

fanli-gcb commented 6 years ago

That worked (and ran in about 10 seconds)!

Any thoughts as to why the 16S version runs out of memory but the KEGG one doesn't? The two trait tables seem to have the same number of rows (99322 in my example), and if anything, even the batched KEGG version has 2000 columns whereas the 16S has 1.

vmaffei commented 6 years ago

Awesome! Glad to hear it! The 16S version that was causing the error was the one listed in Part 3. The -a and -c options are left on in Part 3 for both the KEGG and 16S steps (so likely both of these would have produced errors). Part 5 omits these options for both the 16S and KEGG predict_traits.py steps. The 16S command that worked is the one in Part 5 (not 3).

fanli-gcb commented 6 years ago

Oh got it. Didn't notice the differences. Thanks for all your help!