only counts of 1 - Githubissues

ereyred commented 2 months ago

Hello!

I got MetaCerberus to work and am using the third (annotation) part of the pipeline, with combined prokaryotic, eukaryotic and viral gene predictions from different programmes as input. The problem is that all counts outputs have values of 0 or 1. I want to use GAGE/pathview to see which pathways/groups are enriched, but my counts seem to be normalised or something?

I did: metacerberus.py --protein /predictions-faa/ --hmm ALL --dir_out /dir-out/ --meta --class class-file.tsv

And my outputs, eg KOFam_all_KEGG_counts.tsv, look like this:
ID 1_preds 2_preds 3_preds 4_preds 5_preds 6_preds J7_preds 8_preds 9_preds
K00001 1 1 1 1 1 1 1 1 1
K00002 0 0 0 0 0 0 1 1 0
K00003 1 1 1 1 1 1 1 1 1
K00004 1 1 1 1 1 1 0 1 1
K00007 1 1 1 0 1 0 1 1 0
K00008 0 1 1 0 1 0 0 1 0
K00009 0 0 1 0 1 0 0 0 0
K00010 1 1 1 1 1 1 1 1 1

Can you help? Do you know why my counts aren't more than 1? I did no clustering or dereplication steps before gene prediction. Thanks! E

ereyred commented 2 months ago

Sorry I was unclear, this counts of 1 issue happens only in the combined section of the output. When I look at the individual sample KOFam counts they have lots of different values. So it's an issue of when the counts are combined.

decrevi commented 1 month ago

Hello!

I am looking into this, which file are you looking at for the "individual sample KOFam counts" ?

There is a filtering step in the pipeline that removes overlapping hits and reports only the best hits, which leads me to think that this is normal expected behavior. Many genes will not have multiple domain hits, which is dependent on your data... If you can provide more details, and maybe the sample files that you are running through MetaCerberus, that would help me track down any possible issues.
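To make the filtering idea concrete, here is a minimal Python sketch of a generic greedy best-hit filter for overlapping domain hits. It is an illustration only and not the MetaCerberus implementation: the `Hit` fields, the overlap rule (sharing at least one residue) and the use of bit score for ranking are all assumptions.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    gene: str     # query protein ID (hypothetical field names)
    family: str   # HMM family, e.g. a KO identifier
    start: int    # alignment start on the query, inclusive
    end: int      # alignment end on the query, inclusive
    score: float  # bit score; higher is better


def overlaps(a: Hit, b: Hit) -> bool:
    """True when two hits sit on the same gene and share at least one residue."""
    return a.gene == b.gene and a.start <= b.end and b.start <= a.end


def best_nonoverlapping(hits: list[Hit]) -> list[Hit]:
    """Visit hits from best to worst score and keep a hit only if it
    does not overlap a hit that has already been kept."""
    kept: list[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return kept


hits = [
    Hit("gene1", "K00001", 1, 120, 95.0),
    Hit("gene1", "K00002", 10, 110, 60.0),   # overlaps the stronger K00001 hit
    Hit("gene1", "K00003", 200, 300, 40.0),  # separate region, so it is kept
    Hit("gene2", "K00001", 5, 130, 88.0),    # different gene, so it is kept
]
kept = best_nonoverlapping(hits)
```

In this toy input K00001 still ends up with two hits (one per gene), so a filter of this kind would not by itself cap a family's count at 1 in a sample that contains many genes.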

Thank you, -Jose

ereyred commented 1 month ago

Hi Jose! Thanks for getting back. I am using marine metatranscriptomic data, so there should be lots of hits to the same KOFams in each sample.

When I look at for example "/step_10-visualizeData/sample1/KOFam_all_KEGG_level-3.tsv" there are many hits to each KOFam. eg top 5 lines:
Name Count
Glycolysis / Gluconeogenesis [PATH:ko00010] 55
Pentose and glucuronate interconversions [PATH:ko00040] 37
Ascorbate and aldarate metabolism [PATH:ko00053] 21
Pyruvate metabolism [PATH:ko00620] 69
Glycerolipid metabolism [PATH:ko00561] 26

But when I look at the combined data, "/step_10-visualizeData/combined/counts_KOFam_all_KEGG.tsv" or "/step_10-visualizeData/pathview/KOFam_all_KEGG_counts.tsv", the counts are all values of 0 or 1 for all samples and all KOFams. eg:
ID sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9
K00001 0 0 0 0 0 1 0 1 1
K00002 1 0 0 0 0 0 0 0 0
K00003 1 1 1 1 1 1 1 1 1
K00004 1 1 1 1 1 1 1 1 1
K00007 0 0 0 0 0 0 0 0 0
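
A quick way to confirm whether a combined counts table holds only 0/1 values is to collect the distinct values it contains. The sketch below is a hypothetical helper, not part of MetaCerberus; it assumes a tab-separated file with a header row, IDs in the first column and integer counts elsewhere.

```python
import csv
import io


def distinct_counts(tsv_text: str) -> set[int]:
    """Return the distinct integer values in a counts table.
    Assumes the first row is a header and the first column holds IDs."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader)  # skip the header row
    values: set[int] = set()
    for row in reader:
        values.update(int(v) for v in row[1:])
    return values


def looks_binary(tsv_text: str) -> bool:
    """True when every count in the table is 0 or 1."""
    return distinct_counts(tsv_text) <= {0, 1}


# Small inline example; for a real file, pass open(path).read() instead.
example = "ID\tsample1\tsample2\nK00001\t0\t1\nK00003\t1\t1\n"
binary = looks_binary(example)
```

Running the same check on a per-sample table and on the combined table shows whether the larger values are lost only at the combining step.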

I also wonder if "/step_10-visualizeData/combined/KOFam_all_KEGG_PCA.html" is built on the incorrect (counts of 0 or 1) data? The PCA plot looks right to me but the counts.tsv file generated in the same folder is incorrect so I'm unsure about using the PCA output.

I've attached 2/9 samples (in .csv format because it wouldn't let me upload .faa): All_RA1_preds.csv All_RA2_preds.csv

mpdoane2 commented 1 month ago

Hi, I wanted to follow up on this post as well. I am interested in CAZy, KEGG, and FOAM genes and am seeing a similar result: there are only 1's and 0's in the output. The files I am looking at are in the step_10-visualizeData directory under combined: counts_CAZy.tsv, counts_KOFam_prokaryote_FOAM.tsv, and counts_KOFam_prokaryote_KEGG.tsv.

The command used was: metacerberus.py --prodigal contigs/ --hmm CAZy, KOFam_prokaryote --dir_out out

Thank you Mike

chandnisidhu commented 3 weeks ago

Hi,

I would like to report the very same issue. Is it binary data?

Best,

Chandni

raw-lab commented 3 weeks ago

We need to know more about the data overall. Are they single bacterial genomes? For single genomes we would expect only 1 hit for a lot of these KOs, not many. For reads or contigs from metagenomics or metatranscriptomics we would expect multiple hits.

raw-lab commented 3 weeks ago

Also, you can't supply the class file at the beginning. DESeq2/GAGE/Pathview run after MetaCerberus has finished processing the files. Please see the notes on how to run DESeq2/GAGE/Pathview.

chandnisidhu commented 3 weeks ago

Hi, those were bacterial MAGs (n=200). We expect, for example, multiple hits for each CAZy class within a single MAG. However, all counts files showed only 0 and 1.

raw-lab commented 3 weeks ago

We are working on a potential bug. Can you give us your operating system, the command you're using, which version of MetaCerberus you have, and some of your example data?

decrevi commented 3 weeks ago

Hello, I have been able to track down this bug and am fixing it. It is a result of updates to the filtering algorithm that conflicted with the parsing step from the initial MetaCerberus design. Some of the counting output files are also meant as rollup files that summarize pathway counts rather than the hit counts themselves, so I am also renaming some of these to make it clearer what each output file contains.
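
The distinction between hit counts and 0/1 values can be sketched as follows. This is a simplified illustration under assumed inputs (a mapping from sample name to the list of families assigned to its filtered hits) and does not reproduce the MetaCerberus parsing code or the actual bug; all function names are hypothetical.

```python
from collections import Counter


def count_hits(assignments: dict[str, list[str]]) -> dict[str, Counter]:
    """Per-sample hit counts: every filtered hit adds 1 to its family."""
    return {sample: Counter(fams) for sample, fams in assignments.items()}


def combine(per_sample: dict[str, Counter]) -> dict[str, dict[str, int]]:
    """Build a family-by-sample matrix, filling in 0 for absent families."""
    families = sorted(set().union(*per_sample.values()))
    return {
        fam: {sample: per_sample[sample][fam] for sample in per_sample}
        for fam in families
    }


def presence_absence(matrix: dict[str, dict[str, int]]) -> dict[str, dict[str, int]]:
    """Collapse counts to 0/1, which is what the reported tables looked like."""
    return {
        fam: {sample: int(n > 0) for sample, n in row.items()}
        for fam, row in matrix.items()
    }


# Hypothetical assignments: sample1 has two genes hitting K00001.
assignments = {
    "sample1": ["K00001", "K00001", "K00003"],
    "sample2": ["K00003"],
}
matrix = combine(count_hits(assignments))
binary = presence_absence(matrix)
```

Here `matrix` keeps the count of 2 for K00001 in sample1, while `binary` reduces it to 1; a combined table meant for GAGE or pathview should resemble the former.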

I will update once this is fixed. Thank you for bringing this to my attention and I apologize for any inconvenience from this issue. -Jose

chandnisidhu commented 3 weeks ago

Hi,

thank you for the information. I will wait for it.

Chandni

decrevi commented 1 week ago

Hello, MetaCerberus 1.4.0 is now live on Bioconda. It has some performance improvements as well as this counting issue fixed. I moved the output files to the "final" folder to hopefully make it easier to find them. I will close this for now, please feel free to reach out if there are any further issues or suggestions, and thank you for using MetaCerberus!

thank you, -Jose

raw-lab / MetaCerberus

only counts of 1 #20