waldronlab / curatedMetagenomicData

Curated Metagenomic Data of the Human Microbiome
https://waldronlab.io/curatedMetagenomicData
Artistic License 2.0

gene_families RPK or CPM? is it possible to use in regroup function? #256

Closed luzhang321 closed 3 years ago

luzhang321 commented 3 years ago

Hi :) I have a question about the gene_families files. For example, is "2021-03-31.AsnicarF_2017.gene_families" the file with each gene family in the community in reads per kilobase (RPK) units, or the file in "relative abundance" or "copies per million" (CPM) units [the one produced by the command humann_renorm_table --input demo_fastq/demo_genefamilies.tsv --output demo_fastq/demo_genefamilies-cpm.tsv --units cpm --update-snames]? I am also wondering whether I can regroup the gene_families file from cMD into other functional categories (e.g., ECs) [by using the humann_regroup_table function?].

And is there a specific reason that you use relative abundance rather than CPM in your result?

Thanks in advance!

schifferl commented 3 years ago

Hi @luzhang321, all gene_families data are in CPM units; see here. As for regrouping into other functional categories, you would have to experiment on your own. The curatedMetagenomicData R/Bioconductor package provides data that is highly processed in a specific way, and seeks to be ultra-consistent so users don't have to worry about the minutiae of data processing. You are always free to experiment with the Nextflow pipeline or the R pipeline on your own.
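
For readers unsure how RPK and CPM relate, here is a minimal sketch of the sum-normalization idea behind CPM: each sample's values are rescaled so that they total one million. It is not the HUMAnN implementation, and humann_renorm_table handles stratified rows and special features in its own way, which this sketch ignores. The function name rpk_to_cpm and the feature IDs are hypothetical, chosen only for illustration.

```python
def rpk_to_cpm(rpk_by_feature):
    """Rescale one sample's RPK values so that they sum to one million.

    Simplified illustration only: every row is treated as an ordinary,
    unstratified feature (an assumption, not HUMAnN's actual behavior).
    """
    total = sum(rpk_by_feature.values())
    if total == 0:
        # An all-zero sample cannot be normalized; return zeros unchanged.
        return {feature: 0.0 for feature in rpk_by_feature}
    return {
        feature: value * 1_000_000 / total
        for feature, value in rpk_by_feature.items()
    }


# Hypothetical single-sample table (feature ID -> RPK).
sample_rpk = {"UniRef90_A": 30.0, "UniRef90_B": 10.0, "UNMAPPED": 60.0}
sample_cpm = rpk_to_cpm(sample_rpk)

for feature, cpm in sample_cpm.items():
    print(feature, cpm)
```

Under this definition, CPM and relative abundance differ only by a constant factor: dividing each CPM value by one million gives proportions that sum to 1 within the sample.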