raw-lab / MetaCerberus

Python code for versatile Functional Ontology Assignments for Metagenomes searching via Hidden Markov Model (HMM) with environmental focus of shotgun metaomics data
BSD 3-Clause "New" or "Revised" License

Class file and GAGE and PathView analysis #8

Closed: alegarritano closed this issue 6 months ago

alegarritano commented 6 months ago

Hi Richard,

Once again, thanks for developing the pipeline! Great, great work.

I have been trying to run it on a set of 5 genomes to check for possible pathway enrichments, but the pipeline finishes without generating these results.

This is the command that I am using: metacerberus.py --protein FAA_files --hmm COG --dir_out ./COG

And this is the stderr:

Starting MetaCerberus Pipeline

Starting MetaCerberus Pipeline

Checking for external dependencies:
fastqc /miniconda3/envs/metacerberus/bin/fastqc
flash2 /miniconda3/envs/metacerberus/bin/flash2
fastp /miniconda3/envs/metacerberus/bin/fastp
porechop /miniconda3/envs/metacerberus/bin/porechop
bbduk.sh /miniconda3/envs/metacerberus/bin/bbduk.sh
FragGeneScanRs /miniconda3/envs/metacerberus/lib/python3.10/site-packages/meta_cerberus/FGS/FragGeneScanRs
prodigal /miniconda3/envs/metacerberus/bin/prodigal
prodigal-gv NOT FOUND, must be defined in config file as EXE_PRODIGAL-GV:
phanotate.py NOT FOUND, must be defined in config file as EXE_PHANOTATE:
hmmsearch /miniconda3/envs/metacerberus/bin/hmmsearch
countAssembly.py /miniconda3/envs/metacerberus/bin/countAssembly.py
Initializing RAY
2024-03-06 07:30:44,777 INFO worker.py:1612 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
Started RAY single node
Running RAY on 1 node(s)
Using 14 CPUs per node

STEP 1: Loading sequence files:
Processing 0 fastq sequences
Processing 0 fasta sequences
Processing 6 protein sequences
Processing 0 rollup files

STEP 8: HMMER Search

STEP 8: Filtering HMMER results

STEP 9: Parse HMMER results

STEP 10: Creating Reports
Saving Statistics
Creating Rollup Tables
Creating Count Tables
PCA Analysis
Creating combined sunburst and bargraphs

Finished Pipeline

Finally, these are the files that the pipeline generates:

COG_Loading_Matrix.tsv COG_Loadings.tsv COG_PCA.html counts_COG.tsv img list.txt stats.html stats.tsv

If I understood the README correctly, I would need to provide a CLASS file in order to get the GAGE/pathview results, but what would be the structure of that file?

Thanks,

raw-lab commented 6 months ago

Thank you for using MetaCerberus. GAGE and Pathview (both R packages) require KEGG KOs and currently do not work with COGs alone. We are working on integrating our other tool, SBGNview R, into MetaCerberus. Please try the KEGG/FOAM KO databases; then we can check whether the class file is loading.

alegarritano commented 6 months ago

Hi,

I just ran it with the following parameters: metacerberus.py --protein FAA_files --hmm KOFam_all --dir_out ./KOFam

And these are the files that were generated:

FOAM_Loading_Matrix.tsv FOAM_Loadings.tsv FOAM_PCA.html KEGG_Loading_Matrix.tsv KEGG_Loadings.tsv KEGG_PCA.html counts_FOAM.tsv counts_KEGG.tsv img list.txt stats.html stats.tsv

No pathview folder was generated in the combined folder, nor could I find the class file anywhere. Am I missing something?

raw-lab commented 6 months ago

Thank you again for using MetaCerberus. You raise a good point: we need to include a better tutorial for DESeq2/edgeR, GAGE, and Pathview.

We are unable to automate comparisons because we cannot know which comparisons a researcher will want to make. The class file lists the sample names and the class (or grouping) of each sample, which defines the comparisons the researcher wants to make.
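To illustrate the idea of a class file, here is a minimal Python sketch that writes one. It assumes a two-column, tab-separated layout (sample name, then class); the header names, sample names, and groupings are hypothetical, and the real layout should be checked against the example KEGG_class.tsv in the repository.

```python
import csv
import io


def write_class_file(sample_classes, handle):
    """Write one 'sample<TAB>class' row per sample to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Hypothetical header row; the real file may use different column names.
    writer.writerow(["Sample", "Class"])
    for sample, group in sample_classes.items():
        writer.writerow([sample, group])


# Hypothetical sample names and groupings, for illustration only.
sample_classes = {
    "genome_A": "control",
    "genome_B": "control",
    "genome_C": "treatment",
    "genome_D": "treatment",
}

buffer = io.StringIO()
write_class_file(sample_classes, buffer)
class_text = buffer.getvalue()
print(class_text)
```

To write to disk instead, pass a handle from open("my_class.tsv", "w", newline="") in place of the in-memory buffer (the file name is also a placeholder).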

In a separate post, I will add examples from our results folder of a class file and a script for running the R-related code. We are in the process of converting the R code to Python and removing the requirement for internet access to retrieve the KEGG pathways.

raw-lab commented 6 months ago

example class file

github.com/raw-lab/metacerberus/results/rhizobium/23-06-01_rhizobium/step_10-visualizeData/combined/pathview/KEGG_class.tsv

example bash script for running R stats

github.com/raw-lab/metacerberus/results/rhizobium/23-06-01_rhizobium/step_10-visualizeData/combined/pathview/run_pathview.sh

Rscript to run pathview

bin/pathview-metacerberus.R

Let us know if this works for you, and if you have thoughts on making it more user-friendly. I will turn this into a tutorial; I think that will help.

alegarritano commented 6 months ago

Got it. I initially thought that the class.txt file was something else that was going to be generated by the pipeline, as I couldn't find its structure. All sorted, it's working like a charm. Thanks!

As a suggestion, I think the heatmaps would be easier to interpret if, instead of KO numbers, we got the name of the enzyme (e.g., "malate dehydrogenase" instead of K00027).
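A rough sketch of what I mean, in Python. This is not MetaCerberus code: the function name and the lookup table are hypothetical, and in practice the names would have to come from a KO annotation table. IDs without a known name are left unchanged.

```python
def relabel_kos(ko_ids, ko_names):
    """Return display labels like 'enzyme name (KO)' for known KOs."""
    labels = []
    for ko in ko_ids:
        if ko in ko_names:
            labels.append(f"{ko_names[ko]} ({ko})")
        else:
            # Unknown KO: keep the bare identifier so no row is lost.
            labels.append(ko)
    return labels


# Lookup table with the single example from this thread; a real table
# would be loaded from a KO-to-name annotation file.
ko_names = {"K00027": "malate dehydrogenase"}

# "K99999" is a made-up placeholder for a KO missing from the table.
labels = relabel_kos(["K00027", "K99999"], ko_names)
print(labels)
```

Keeping the KO number in parentheses means the heatmap rows stay traceable to the counts tables.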

raw-lab commented 6 months ago

That's a fair and good point. We will take a look. Thank you again for using MetaCerberus. Also, if you want us to include your custom HMMs, we can include them as a separate database and then add them to the new FunGene in the future. Just send us an email.