picrust / picrust2

Code, unit tests, and tutorials for running PICRUSt2
GNU General Public License v3.0

How to update the description and mapping file's latest version in picrust2? #240

Closed zzalzzu closed 2 years ago

zzalzzu commented 2 years ago

Hello @gavinmdouglas, thanks for providing PICRUSt2.

I found that the default files in PICRUSt2 (description files, mapping files, etc.) are not the latest versions. I want to check the functions and pathways (KO, KEGG module, KEGG pathway, EC, and MetaCyc) against our data, so I would like to know how to get the latest versions of these files and apply them to PICRUSt2.

Thanks for reading. I would appreciate an answer.

gavinmdouglas commented 2 years ago

Hey @zzalzzu,

The KEGG mapping files cannot be distributed except for the last open-source version, but you can get them from the KEGG API using commands like this:

wget http://rest.kegg.jp/list/pathway
mv pathway KEGG_pathway_descrip.tsv

wget http://rest.kegg.jp/link/pathway/genome
mv genome KEGG_genome_pathway_links.tsv

wget http://rest.kegg.jp/list/genome
mv genome KEGG_genome_descrip.tsv

wget http://rest.kegg.jp/list/module
mv module KEGG_module_descrip.tsv

wget http://rest.kegg.jp/link/module/genome
mv genome KEGG_genome_module_links.tsv

wget http://rest.kegg.jp/list/ko
mv ko KEGG_ko_descrip.tsv

wget http://rest.kegg.jp/link/module/ko
mv ko KEGG_ko_module_links.tsv

wget http://rest.kegg.jp/link/pathway/ko
mv ko KEGG_ko_pathway_links.tsv
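The same downloads can also be scripted so that each table is written straight to its final name. This is only a minimal sketch: the endpoint and filename pairs are the ones listed above, while the function name `fetch_kegg_tables` and the `KEGG_BASE` variable are hypothetical names used for illustration.

```shell
# Sketch: download each KEGG REST table directly to its final filename.
# fetch_kegg_tables and KEGG_BASE are illustrative names, not part of PICRUSt2.
KEGG_BASE="${KEGG_BASE:-http://rest.kegg.jp}"

fetch_kegg_tables() {
    local outdir="${1:-.}"
    local pair endpoint outfile
    mkdir -p "$outdir" || return 1
    for pair in \
        "list/pathway:KEGG_pathway_descrip.tsv" \
        "link/pathway/genome:KEGG_genome_pathway_links.tsv" \
        "list/genome:KEGG_genome_descrip.tsv" \
        "list/module:KEGG_module_descrip.tsv" \
        "link/module/genome:KEGG_genome_module_links.tsv" \
        "list/ko:KEGG_ko_descrip.tsv" \
        "link/module/ko:KEGG_ko_module_links.tsv" \
        "link/pathway/ko:KEGG_ko_pathway_links.tsv"
    do
        endpoint="${pair%%:*}"   # text before the first colon
        outfile="${pair##*:}"    # text after the last colon
        # -O writes to the target name, so no separate mv step is needed.
        wget -q -O "$outdir/$outfile" "$KEGG_BASE/$endpoint" || return 1
    done
}

# Usage (requires network access):
#   fetch_kegg_tables kegg_latest
```

Writing with `-O` avoids the situation where several endpoints save to the same default filename (`genome` or `ko`) and have to be renamed before the next download.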

I believe you would need to download the EC number information from here: https://enzyme.expasy.org. I don't know if there's an easier place to get it from.
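If you end up with the ENZYME flat file, a description table could be built from it roughly like this. This is a sketch under assumptions: it expects a local copy of the flat file (here called `enzyme.dat`) in which each entry has an `ID` line, one or more `DE` lines, and ends with `//`. The `EC:` prefix on the output IDs and the function name `enzyme_dat_to_tsv` are also assumptions, so compare the result with the default EC description file before using it.

```shell
# Hypothetical helper: turn an ENZYME-style flat file into "EC:<number><TAB><description>".
# Assumes entries look like:
#   ID   1.1.1.1
#   DE   Alcohol dehydrogenase.
#   //
enzyme_dat_to_tsv() {
    awk '
        /^ID / { id = $2; desc = "" }
        /^DE / {
            line = $0
            sub(/^DE +/, "", line)
            # Join continuation DE lines with a single space.
            desc = (desc == "" ? line : desc " " line)
        }
        /^\/\// {
            if (id != "") {
                sub(/\.$/, "", desc)      # drop the trailing period
                print "EC:" id "\t" desc
            }
            id = ""
        }
    ' "$1"
}

# Usage:
#   enzyme_dat_to_tsv enzyme.dat > ec_descrip_new.tsv
```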

Last, the MetaCyc information was taken from the parsed files created for HUMAnN2. I'm not sure what precise workflow they used to create the reaction to pathway mapping files, which makes it harder to create these with newer ones. However, you could check the latest version of HUMAnN3 for these files and/or look on the MetaCyc website (where you can definitely find pathway descriptions at least).

Cheers,

Gavin

zzalzzu commented 2 years ago

Thank you so much for your help!

zzalzzu commented 2 years ago

I replaced the description and mapping files with the latest versions, and when I ran PICRUSt2, this error occurred.

" Stopping, because no pathways were identified. This can especially happen when either a test input file with few gene families is input or when gene family regrouping is not done properly. "

I suspect the problem is that these three files do not match the updated versions:

prokaryotic/16S.txt.gz prokaryotic/ko.txt.gz prokaryotic/ec.txt.gz

Is there any way to get the latest version of these three files, or any other way to fix the above error?

Sorry for the repeated questions. I would appreciate another reply.

gavinmdouglas commented 2 years ago

Hi @zzalzzu,

Just to clarify - you updated the MetaCyc pathway mapfiles?

Could you paste the first few lines of the new mapfile if so?

You don't want to replace those three files you indicated unless you have a different genome database that you want to use, which would require changing all of the files, including the 16S alignment and tree file.

Cheers,

Gavin

zzalzzu commented 2 years ago

[Attached image: Screen Shot 2022-03-07 at 6 00 29 PM]

Hi @gavinmdouglas, thanks for replying to my question.

I am still looking for the MetaCyc map and description files, so I have not tried applying new MetaCyc files yet.

I obtained the KEGG module and pathway map files using the commands you provided. The downloaded map files were not sorted, so I sorted them to match the default PICRUSt2 files before applying them. The sorted file looks like the attached picture.

After sorting and applying the map file, the same error occurred:

" Stopping, because no pathways were identified. This can especially happen when either a test input file with few gene families is input or when gene family regrouping is not done properly. "

How can I solve this?

gavinmdouglas commented 2 years ago

Sorting the file shouldn't matter.

It looks like that mapfile should work. What command did you run? Make sure it matches the command in this FAQ post: https://github.com/picrust/picrust2/wiki/Frequently-Asked-Questions#how-can-i-determine-kegg-pathway-abundances-from-the-predicted-ko-abundances (including the --no_regroup option).
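If the mapfile was built from the raw KEGG link output, its layout is also worth a quick check. Here is a minimal sketch, assuming the mapfile is expected to have one row per pathway (the pathway ID followed by its KOs, tab-separated) without the `ko:` and `path:` prefixes. That layout is an assumption, so compare it against the default mapfile shipped with PICRUSt2. The function name `links_to_mapfile` is hypothetical.

```shell
# Hypothetical helper: collapse KEGG "link" output (one KO/pathway pair per line)
# into one row per pathway: <pathway_id> TAB <KO> TAB <KO> ...
# Assumes column 1 is the KO and column 2 is the pathway, as in link/pathway/ko.
links_to_mapfile() {
    awk -F'\t' '
        {
            ko = $1; pw = $2
            sub(/^[^:]*:/, "", ko)   # "ko:K00001"    -> "K00001"
            sub(/^[^:]*:/, "", pw)   # "path:ko00010" -> "ko00010"
            # Remember the first-seen order of pathways.
            if (!(pw in row)) { order[++n] = pw; row[pw] = pw }
            row[pw] = row[pw] "\t" ko
        }
        END { for (i = 1; i <= n; i++) print row[order[i]] }
    ' "$1"
}

# Usage:
#   links_to_mapfile KEGG_ko_pathway_links.tsv > KEGG_pathways_to_KO_new.tsv
```

If the downloaded links list each pathway under more than one ID prefix, the input can be filtered with `grep` first so that only one form is kept.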

Gavin

yakshiUPR commented 2 years ago

Dear Gavin,

I wanted to follow up on this discussion and ask something related. I would like to use the newest KEGG version because the latest update includes several KOs that might be important to my study system. I was able to download the newest files from the KEGG API, following your instructions above. However, I noticed that inside the prokaryotic folder, the file called "ko.txt" is a table that links the species to the KOs, and the newest KOs are not included there. Can I do anything to update this? If not, I don't see how updating the KEGG files would allow PICRUSt2 to generate output different from the output with the default files.

Thanks, Yakshi

gavinmdouglas commented 2 years ago

Hey @yakshiUPR,

There's no easy way to update those links without re-annotating the genomes yourself and producing a new file. Some pathways are totally missing from the older version of KEGG, so those would be picked up even without adding new KOs, but you're right that the missing KOs could definitely mean that certain pathways are less likely to be called as present. In addition, it's important to realize that some KO definitions change between versions. This should affect only a small minority, but mixing KEGG versions will definitely add some noise.
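To get a sense of how many KOs are affected, you could list the IDs that appear in the new KEGG description table but not in the trait table. This is a rough sketch under assumptions: the description table has the KO ID in its first column (possibly prefixed with `ko:`), and the gzipped trait table has KO IDs as column names on its first line. The function name `missing_kos` is hypothetical.

```shell
# Hypothetical helper: print KO IDs found in a KEGG description table ($1)
# but absent from the header line of a gzipped KO trait table ($2).
missing_kos() {
    local descrip="$1" table="$2"
    local new_ids old_ids
    new_ids=$(mktemp) || return 1
    old_ids=$(mktemp) || return 1
    # KO IDs from column 1 of the description table, prefix stripped.
    cut -f1 "$descrip" | sed 's/^ko://' | sort -u > "$new_ids"
    # KO IDs from the header row of the trait table (non-KO columns are dropped).
    gzip -dc "$table" | head -n 1 | tr '\t' '\n' | sed 's/^ko://' \
        | grep -E '^K[0-9]{5}$' | sort -u > "$old_ids"
    # Lines unique to the first (new) list.
    comm -23 "$new_ids" "$old_ids"
    rm -f "$new_ids" "$old_ids"
}

# Usage:
#   missing_kos KEGG_ko_descrip.tsv prokaryotic/ko.txt.gz | wc -l
```

KOs reported here cannot be predicted for any genome in the default database, so pathways that depend on them may be under-called.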

Sorry I can't be of more help!

All the best,

Gavin