linnabrown / run_dbcan

Run_dbcan V4, using genomes/metagenomes/proteomes of any assembled organisms (prokaryotes, fungi, plants, animals, viruses) to search for CAZymes.
http://bcb.unl.edu/dbCAN2
GNU General Public License v3.0

How to calculate CAZyme TPM for metagenomic data? #184

Open libby-natola opened 1 month ago

libby-natola commented 1 month ago

Hello dbcan devs!

Thanks for developing the dbCAN command line tools, and thanks for providing such helpful and detailed documentation and tutorials!

I'm using dbCAN to annotate CAZymes/CGCs/substrates within some eDNA metagenomic samples. I assembled the metagenomes on my own and annotated the prokaryotic contigs using run_dbcan like so:

run_dbcan $dir/contigs5000.proks.fasta meta -c cluster --dbcan_thread 24 --tf_cpu 24 --stp_cpu 24 --hmm_cpu 24 --dia_cpu 24 --cgc_substrate --out_dir $dbcan_dir --db_dir /mnt/Genomics/Working/databases/dbCAN/db

My ultimate goal is to get normalized abundances, in TPM, for CAZymes, CGCs, and substrates. I'm trying to follow the steps in Module 3 of the metagenomic example in the user guide, hoping to end up at P13, which requires the depth.txt file. That file is generated from the CDS.bam/CDS.sam files, which are in turn generated from the .ffn file. However, I don't have the .ffn file specified in the read mapping step (P8). I suspect this is because I ran run_dbcan with 'meta' and never used prokka, so I can't follow the instructions as written. I understand that for TPM I need the depth and length of each gene and the read depth of each sample, but I'm struggling to calculate the CAZyme gene depths and lengths without the .ffn file.
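For reference, the TPM calculation I have in mind can be sketched as below. This is only a minimal illustration and not part of run_dbcan: the function name `tpm`, the input dictionaries, and the gene IDs and numbers are all hypothetical, and it assumes that per-gene mapped read counts and gene lengths (in bp) have already been obtained somehow, which is exactly the part I'm missing.

```python
def tpm(read_counts, gene_lengths_bp):
    """Return transcripts-per-million style abundances for one sample.

    read_counts:     dict mapping gene ID -> number of reads mapped to the gene
    gene_lengths_bp: dict mapping gene ID -> gene length in base pairs
    Both inputs are hypothetical; how they are produced is not shown here.
    """
    # Step 1: normalize each count by gene length (reads per kilobase).
    rpk = {
        gene: read_counts[gene] / (gene_lengths_bp[gene] / 1000.0)
        for gene in read_counts
    }

    # Step 2: sum the length-normalized values over all genes in the sample.
    total_rpk = sum(rpk.values())
    if total_rpk == 0:
        # No reads mapped anywhere; avoid dividing by zero.
        return {gene: 0.0 for gene in rpk}

    # Step 3: scale so that the values for the sample sum to one million.
    return {gene: value / total_rpk * 1e6 for gene, value in rpk.items()}


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    counts = {"geneA": 100, "geneB": 300, "geneC": 600}
    lengths = {"geneA": 1000, "geneB": 1500, "geneC": 3000}
    for gene, value in tpm(counts, lengths).items():
        print(gene, round(value, 2))
```

The scaling in the last step should be computed over all predicted genes in a sample, not only the CAZyme genes, so that the values stay comparable between samples. CAZyme, CGC, or substrate abundances would then be sums over the relevant gene IDs.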

How do you suggest I go about calculating TPM in this situation? Is there an alternate way to generate the .ffn file I need? Or perhaps I could manipulate some other output file to get the required data?

Thanks very much for any guidance you can provide!