Open vijuvi opened 5 months ago
Thanks for the very detailed description of your experimental setup. One perhaps easier thing you could do that goes in the direction of what you describe would be to run a lasso regression, by setting the elastic net with --alpha 1. This will result in many (most?) variants getting a weight of 0, which is similar to the logic you were applying. Does that make sense to you?
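To illustrate why the lasso zeroes out so many weights: for standardized features, the lasso coordinate update is just a soft-threshold, so any variant whose effect estimate falls below the penalty collapses to exactly 0. A minimal sketch in plain Python (made-up effect sizes, not pyseer's actual solver):

```python
# Toy illustration of why lasso (--alpha 1) zeroes most weights.
# The coordinate update for a standardized feature is the soft-threshold
# operator: estimates below the penalty collapse to exactly 0.
# (Hypothetical numbers; not pyseer's actual fitting code.)

def soft_threshold(z, penalty):
    """Lasso proximal step: shrink towards zero, clip at zero."""
    if z > penalty:
        return z - penalty
    if z < -penalty:
        return z + penalty
    return 0.0

# Pretend univariate effect estimates for 10 variants
effects = [0.9, -0.05, 0.02, 0.6, -0.01, 0.03, -0.7, 0.04, 0.0, 0.02]
penalty = 0.1

weights = [soft_threshold(b, penalty) for b in effects]
nonzero = [w for w in weights if w != 0.0]
print(weights)
print(f"{len(nonzero)} of {len(weights)} variants keep a nonzero weight")
```

Only the few variants with effects clearly above the penalty survive; everything else is dropped from the model, which is the "selection" behaviour you were after.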
Thank you! I think the actual effect on the stress resilience phenotype is caused by bigger networks of genes, e.g. regulons and such. Will I not lose a lot of these low effect variants when using lasso regression?
You could try fitting a number of models varying the --alpha parameter so that your "lasso" has a variable size? To reduce the time it takes to run each command you can save the variants in the first pass using --save-vars and then use --load-vars for subsequent runs.
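In case it helps intuition, here is a toy sketch of how the mixing parameter trades sparsity against shrinkage in the elastic-net proximal step; higher --alpha puts more weight on the L1 term, so more coefficients become exactly zero (plain Python with invented effect sizes, not pyseer's optimizer):

```python
# Sketch of how --alpha controls sparsity in the elastic net:
# prox(z) = soft_threshold(z, alpha*lam) / (1 + lam*(1 - alpha)).
# Higher alpha -> stronger L1 component -> more exact zeros.
# (Invented effect sizes; not pyseer's actual optimizer.)

def enet_prox(z, lam, alpha):
    l1 = alpha * lam              # L1 (sparsity) part of the penalty
    l2 = lam * (1.0 - alpha)      # L2 (ridge) part of the penalty
    if z > l1:
        return (z - l1) / (1.0 + l2)
    if z < -l1:
        return (z + l1) / (1.0 + l2)
    return 0.0

effects = [0.9, -0.05, 0.02, 0.6, -0.01, 0.08, -0.7, 0.04, 0.15, 0.02]
lam = 0.1

sparsity = {}
for alpha in (0.0, 0.5, 1.0):
    weights = [enet_prox(b, lam, alpha) for b in effects]
    sparsity[alpha] = sum(1 for w in weights if w == 0.0)
    print(f"alpha={alpha}: {sparsity[alpha]} of {len(weights)} weights are zero")
```

So sweeping --alpha between 0 and 1 gives you models ranging from dense (ridge-like, all small effects kept) to sparse (lasso-like, only the strongest effects kept).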
Thanks, I hope I grasped the general idea! So I'll try to fit a model with a maximal alpha value so that the predictions are still accurate enough.
pyseer --kmers ALL_UNITIGS --phenotypes ALL_PHENOTYPES --wg enet \
--lineage-clusters ALL_LINEAGES --sequence-reweighting \
--save-vars SAVED_VARS --save-model ALL_EVERYTHING \
--alpha 1 > lasso_selected_unitigs
Should I then use ALL_UNITIGS or COMBINATION_OF_VARIANTS_IM_INTERESTED_IN to train the model I use to predict phenotypes? Also, is it the alpha value in this training part that I should vary?

pyseer --kmers WHICH_UNITIGS? --phenotypes TRAINING_SET_PHENOTYPES \
--wg enet --load-vars SAVED_VARS --alpha 1 --save-model TEST_LASSO \
--lineage-clusters ALL_LINEAGES --sequence-reweighting
And COMBINATION_OF_VARIANTS_IM_INTERESTED_IN to test whether predicting with just these unitigs hitting the interesting genes is enough to produce accurate results?

enet_predict --kmers COMBINATION_OF_VARIANTS_IM_INTERESTED_IN \
--lineage-clusters ALL_LINEAGES --true-values TEST_SET_PHENOTYPES \
TEST_LASSO.pkl TEST_SAMPLES > test_predictions.txt
Is this even close to correct usage?
Another question that just popped into my mind: why is --cor-filter 0 advised in the best-practices part of the documentation?
I would use all unitigs in your subsequent analysis, although many will not be used for the predictions anyway because they will have a weight of 0. I fear that providing a smaller variant set might throw some errors (I'm not actually sure though, would need to check the code and tests again). I would also follow the advice of not dropping correlated variants, unless you run into out-of-memory errors and models that take too long to train.
I tried to train the model with my data in both ways.
This works fine:
pyseer --kmers ALL_UNITIGS --phenotypes TRAINING_SET_PHENOTYPES \
--wg enet --load-vars SAVED_VARS --alpha 1 --save-model TEST_LASSO_ALL_UNITIGS \
--lineage-clusters ALL_LINEAGES --sequence-reweighting
This throws an error:
pyseer --kmers COMBINATION_OF_VARIANTS_IM_INTERESTED_IN --phenotypes TRAINING_SET_PHENOTYPES \
--wg enet --load-vars SAVED_VARS --alpha 1 --save-model TEST_LASSO_FEWER_UNITIGS \
--lineage-clusters ALL_LINEAGES --sequence-reweighting
Error message:
Traceback (most recent call last):
File "~/miniforge3/envs/pyseer/bin/pyseer", line 10, in <module>
sys.exit(main())
File "~/miniforge3/envs/pyseer/lib/python3.10/site-packages/pyseer/__main__.py", line 688, in main
print(format_output(x,
File "~/miniforge3/envs/pyseer/lib/python3.10/site-packages/pyseer/utils.py", line 59, in format_output
out += '\t' + '\t'.join(['%.2E' % Decimal(x)
File "~/miniforge3/envs/pyseer/lib/python3.10/site-packages/pyseer/utils.py", line 60, in <listcomp>
if np.isfinite(x)
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
But when I remove --load-vars, this command also works fine.
Now I have two models, TEST_LASSO_ALL_UNITIGS and TEST_LASSO_FEWER_UNITIGS. When I use these to predict the phenotypes of my test group
enet_predict_pyseer --kmers COMBINATION_OF_VARIANTS_IM_INTERESTED_IN \
--lineage-clusters ALL_LINEAGES --true-values TEST_SET_PHENOTYPES <model> TEST_SAMPLES
the R2-values are 0.7405 and 0.8494, respectively. So, the model works better with fewer unitigs, but I'm not so sure about the logic behind it. Am I predicting the right thing with the right model?
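(For reference, here is a minimal coefficient-of-determination calculation I used to sanity-check what an R2 of that form means; I'm assuming that is the statistic enet_predict reports, and the numbers below are invented, not my data.)

```python
# Minimal R^2 (coefficient of determination) check between true and
# predicted phenotypes: R^2 = 1 - SS_res / SS_tot.
# (Assumes enet_predict reports this statistic; invented numbers.)

def r_squared(true_vals, predicted):
    mean_true = sum(true_vals) / len(true_vals)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_vals, predicted))
    ss_tot = sum((t - mean_true) ** 2 for t in true_vals)
    return 1.0 - ss_res / ss_tot

truth = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_good = [1.1, 1.9, 3.2, 3.8, 5.1]
pred_bad = [3.0, 3.0, 3.0, 3.0, 3.0]   # always predicts the mean

print(round(r_squared(truth, pred_good), 4))
print(round(r_squared(truth, pred_bad), 4))   # 0.0 by construction
```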
I don't think there's any particular error in your workflow, apart from the fact that when you use the TEST_LASSO_ALL_UNITIGS model it only sees a subset of the unitigs, which I'm guessing would affect its performance, as it will probably (again, not sure without looking at the code) assume they are always missing. So perhaps a like-for-like comparison would be to test the TEST_LASSO_ALL_UNITIGS model loading all unitigs. Does that make sense?
Why do elastic net and LMM select different variants?
Well, the two processes work quite differently; the main difference being that the LMM tests one feature at a time, while the elastic net uses all variants at the same time. That by itself will lead to different variants being flagged as associated with the phenotype.
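A tiny invented example of that difference: two perfectly correlated variants each look strongly associated when tested one at a time (the marginal, LMM-style view), but once one of them is in a joint model the other explains no additional variance, so a sparse joint fit like the elastic net can keep only one of them. Plain-Python sketch, not a real association test:

```python
# Toy contrast between marginal testing (LMM-style: one variant at a
# time) and joint modelling (elastic-net-style: all variants at once).
# Two identical variants each correlate strongly with the phenotype on
# their own, but after fitting one, the other adds nothing.
# (Invented data; not a real LMM or elastic-net fit.)

def mean(xs):
    return sum(xs) / len(xs)

def correlation(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def residual_after(x, y):
    """Residual of y after ordinary least-squares regression on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

variant1 = [0, 0, 1, 1, 0, 1, 1, 0]
variant2 = list(variant1)                    # perfectly correlated copy
phenotype = [0.1, 0.2, 1.1, 0.9, 0.0, 1.0, 1.2, 0.1]

# Marginal view: both variants look strongly associated on their own.
marginal = [correlation(v, phenotype) for v in (variant1, variant2)]

# Joint view: once variant1 is fitted, variant2 explains nothing extra.
partial = correlation(variant2, residual_after(variant1, phenotype))

print("marginal correlations:", [round(c, 3) for c in marginal])
print("variant2 after variant1 is fitted:", round(partial, 3))
```

An LMM would flag both variants here, while a sparse joint model is free to drop one of them, which is one concrete way the two approaches end up selecting different variant sets.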
Hello!
I got to analysing after clearing the startup hiccups of #252.
A succinct summary of the original data set:
I anticipate that these stress resilience phenotypes are under much weaker selection and I expect numerous smaller effects rather than a few big ones. I thought that elastic nets could be used to find these smaller combinations or networks of variants that most adequately predict the phenotype of a test group.
My thought process so far:
The problem is that this is slow and the number of possible combinations and subcombinations is astronomical. One alternative would be to first detect biologically relevant combinations, but then I fear I would lose genes of yet unknown function in the process.
Can pyseer elastic nets be used like this? Does this make any sense?