jlw-ecoevo / gRodon2


Varying results from MMv2 #7

Open evetb opened 1 year ago

evetb commented 1 year ago

Hello there!

Firstly, thanks for a fantastic tool!

Following the vignette, I have annotated protein sequences (output from prokka) of metagenomic data on which I ran the predictGrowth function, as follows:

predictGrowth(genes, highly_expressed, mode = "metagenome_v2", depth_of_coverage = depth_of_coverage)

I noticed that the CUB calculations changed every time I ran this command on each metagenomic sample I have, while they stayed the same with MMv1. Upon looking in more detail at how MMv2 works, I assume this is because it randomly chooses 100 non-highly-expressed genes to calculate the background CUB, which thus affects the CUB calculations.

My question is, how do I choose which value represents the "true" CUB value of each metagenome? For example, should I run predictGrowth a certain number of times and take the median of all these values?

Please let me know if you need any information from me, and thank you in advance for your help!

jlw-ecoevo commented 1 year ago

Hi! Thanks for your interest in gRodon.

Your explanation is 100% correct. How much are they varying? In my experience these values should stay pretty stable. If you would like to further decrease the variance among runs, you can increase n_le in predictGrowth.

For example: predictGrowth(genes, highly_expressed, mode = "metagenome_v2", depth_of_coverage = depth_of_coverage, n_le = 500)

This will increase the runtime of gRodon, though. Sampling many times and taking the average would give a similar result.
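For anyone following along, the two options might look roughly like this (a sketch, assuming `genes`, `highly_expressed`, and `depth_of_coverage` are already loaded as in the vignette, and that `$d` is the predicted doubling time in the predictGrowth output list):

```r
library(gRodon)

## Option 1: a single run with a larger background sample (slower, lower variance)
p <- predictGrowth(genes, highly_expressed,
                   mode = "metagenome_v2",
                   depth_of_coverage = depth_of_coverage,
                   n_le = 500)

## Option 2: repeat the default run and summarize (median of doubling times)
runs <- replicate(10,
                  predictGrowth(genes, highly_expressed,
                                mode = "metagenome_v2",
                                depth_of_coverage = depth_of_coverage)$d)
median(runs)
```

The number of repeats (10 here) is arbitrary; either approach trades runtime for a more stable estimate.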

jlw-ecoevo commented 1 year ago

Out of curiosity, what is the range of ConsistencyHE values you get for these metagenomes?

evetb commented 1 year ago

Some of them are varying quite a lot, and others not as much. I haven't quantified this or done any statistics on it. I'm attaching a graph of the output for each metagenome over time; the error bars are the upper and lower CIs reported by gRodon. (Attached plots: 2_V2, 1_V2)

evetb commented 1 year ago

Regarding the range of ConsistencyHE values: I will plot this and get back to you! Thank you for your extremely quick response, by the way, and for your solution to my query. I'll try increasing n_le as you suggested.

evetb commented 1 year ago

It looks like the ConsistencyHE values are consistent. Here they are as plots: no difference between the different trial runs I did. (Attached plots: 1_CHE, 2_CHE)

evetb commented 1 year ago

I figured perhaps the raw data may be more informative - here it is! grodon_vals_1-2_v2.xlsx

jlw-ecoevo commented 1 year ago

Yes, these data suggest that the issue is probably n_le (notice in the data the wider variance in CUB for the low-expressed genes than in CUBHE for the highly expressed genes). I think the issue likely arises as a byproduct of high consistency values, meaning that the preferred codons of the organisms in your sample are very different from one another. This makes it harder to get a good background CUB (we need to sample more genes for it to stabilize). How many genes overall are there in each sample?

This is a good behavior to be aware of, thanks for bringing my attention to it. I'll set aside some time to investigate myself and update the guidance in the vignette (which needs an update anyway).

evetb commented 1 year ago

Thank you for your insight!

I'm honestly unsure how to check the number of genes in each sample. Would you have any recommendations?

jlw-ecoevo commented 1 year ago

Presumably you have a fasta file with all of the genes annotated for each sample? Then on the command line you could run:

grep ">" my_sample.fasta | wc -l
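For many samples at once, a small loop over the fasta files works too (a sketch; the `sample_*.fasta` glob is a placeholder for your actual filenames, and it assumes one `>` header line per gene):

```shell
# Count annotated genes in every sample (one ">" header per gene)
for f in sample_*.fasta; do
    printf '%s\t%s\n' "$f" "$(grep -c '>' "$f")"
done
```

`grep -c '>'` counts matching lines directly, which is equivalent to piping `grep ">"` into `wc -l`.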

Let me know also whether increasing n_le helps when you get around to it (you might have to go higher than 500; I was asking about the total number of genes to get a sense of how big you might have to go).

evetb commented 1 year ago

Thank you again for your help! I have attached here files that contain the number of genes per sample - each line is a different sample. genes_1.txt genes_2.txt

jlw-ecoevo commented 1 year ago

OK, so it's around 100k genes per sample, so you might need to push n_le pretty high if there's a lot of variance across these genes. I don't actually have a good sense of how performance scales with the number of genes sampled (I think it should be roughly linear, so n_le = 10000 should take around 100x as long as the default of n_le = 100, and n_le = 1000 about 10x as long). I'll need to do some experiments with this myself at some point, but my advice is to start out with 1000 and see how much the variance decreases and how much the runtime increases.

evetb commented 1 year ago

Great, I'll try that! How would I get a sense of the variance? I suppose I would need to run these samples many times?

jlw-ecoevo commented 1 year ago

Maybe pick one day and run it 5 times at each of n_le = 100, 500, and 1000, and see how much the variance decreases for just that sample; that should give you a good overall idea. Sample 2 day 11 and sample 1 day 7 look like good candidates, as these had high variance in dCUB at the default n_le = 100 in the data you sent me.
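That experiment might be scripted roughly as follows (a sketch for one sample's data, assuming `genes`, `highly_expressed`, and `depth_of_coverage` are loaded and that `$dCUB` is the relevant field in the predictGrowth output; the timing is simple wall-clock measurement):

```r
## 5 replicate runs at each n_le, recording dCUB and wall time
results <- do.call(rbind, lapply(c(100, 500, 1000), function(n) {
  do.call(rbind, lapply(1:5, function(i) {
    t0 <- Sys.time()
    p <- predictGrowth(genes, highly_expressed,
                       mode = "metagenome_v2",
                       depth_of_coverage = depth_of_coverage,
                       n_le = n)
    data.frame(n_le = n, rep = i, dCUB = p$dCUB,
               seconds = as.numeric(difftime(Sys.time(), t0, units = "secs")))
  }))
}))

## Variance of dCUB at each n_le level
aggregate(dCUB ~ n_le, data = results, FUN = var)
```

If the sampling argument above is right, the dCUB variance should shrink as n_le grows, at the cost of roughly linear growth in runtime.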

evetb commented 1 year ago

Hi again! I have done as you recommended, and here is the data. It seems the runtimes are rather variable, which could be partly due to whatever else I was running on my computer at the same time as gRodon, so I'm not sure these runtimes are a good measure of how long gRodon actually takes to run at different n_le levels for my samples. grodon_vals_1-2_trials_100-1k.xlsx

evetb commented 1 year ago

My bad, I mixed up the two samples. This is the correct spreadsheet. Also, runtime is in seconds. grodon_vals_1-2_trials_100-1k.xlsx