Open vijuvi opened 5 months ago
Thanks for the very detailed description of your experimental setup. One perhaps easier thing you could do that goes in the direction of what you describe would be to run a lasso regression, by setting the elastic net with --alpha 1. This will result in many (most?) variants getting a weight of 0, which is similar to the logic you were applying. Does that make sense to you?
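To illustrate why the lasso zeroes out so many weights: for standardized features, the lasso coordinate update is just a soft-threshold, so any variant whose effect estimate falls below the penalty collapses to exactly 0. A minimal sketch in plain Python (made-up effect sizes, not pyseer's actual solver):

```python
# Toy illustration of why lasso (--alpha 1) zeroes most weights.
# The coordinate update for a standardized feature is the soft-threshold
# operator: estimates below the penalty collapse to exactly 0.
# (Hypothetical numbers; not pyseer's actual fitting code.)

def soft_threshold(z, penalty):
    """Lasso proximal step: shrink towards zero, clip at zero."""
    if z > penalty:
        return z - penalty
    if z < -penalty:
        return z + penalty
    return 0.0

# Pretend univariate effect estimates for 10 variants
effects = [0.9, -0.05, 0.02, 0.6, -0.01, 0.03, -0.7, 0.04, 0.0, 0.02]
penalty = 0.1

weights = [soft_threshold(b, penalty) for b in effects]
nonzero = [w for w in weights if w != 0.0]
print(weights)
print(f"{len(nonzero)} of {len(weights)} variants keep a nonzero weight")
```

Only the few variants with effects clearly above the penalty survive; everything else is dropped from the model, which is the "selection" behaviour you were after.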
Thank you! I think the actual effect on the stress resilience phenotype is caused by bigger networks of genes, e.g. regulons and such. Will I not lose a lot of these low effect variants when using lasso regression?
You could try fitting a number of models varying the --alpha parameter so that your "lasso" has a variable size? To reduce the time it takes to run each command you can save the variants in the first pass using --save-vars and then use --load-vars for subsequent runs.
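In case it helps intuition, here is a toy sketch of how the mixing parameter trades sparsity against shrinkage in the elastic-net proximal step; higher --alpha puts more weight on the L1 term, so more coefficients become exactly zero (plain Python with invented effect sizes, not pyseer's optimizer):

```python
# Sketch of how --alpha controls sparsity in the elastic net:
# prox(z) = soft_threshold(z, alpha*lam) / (1 + lam*(1 - alpha)).
# Higher alpha -> stronger L1 component -> more exact zeros.
# (Invented effect sizes; not pyseer's actual optimizer.)

def enet_prox(z, lam, alpha):
    l1 = alpha * lam              # L1 (sparsity) part of the penalty
    l2 = lam * (1.0 - alpha)      # L2 (ridge) part of the penalty
    if z > l1:
        return (z - l1) / (1.0 + l2)
    if z < -l1:
        return (z + l1) / (1.0 + l2)
    return 0.0

effects = [0.9, -0.05, 0.02, 0.6, -0.01, 0.08, -0.7, 0.04, 0.15, 0.02]
lam = 0.1

sparsity = {}
for alpha in (0.0, 0.5, 1.0):
    weights = [enet_prox(b, lam, alpha) for b in effects]
    sparsity[alpha] = sum(1 for w in weights if w == 0.0)
    print(f"alpha={alpha}: {sparsity[alpha]} of {len(weights)} weights are zero")
```

So sweeping --alpha between 0 and 1 gives you models ranging from dense (ridge-like, all small effects kept) to sparse (lasso-like, only the strongest effects kept).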
Thanks, I hope I grasped the general idea! So I'll try to fit a model with a maximal alpha value so that the predictions are still accurate enough.
pyseer --kmers ALL_UNITIGS --phenotypes ALL_PHENOTYPES --wg enet \
--lineage-clusters ALL_LINEAGES --sequence-reweighting \
--save-vars SAVED_VARS --save-model ALL_EVERYTHING \
--alpha 1 > lasso_selected_unitigs
Should I then use ALL_UNITIGS or COMBINATION_OF_VARIANTS_IM_INTERESTED_IN to train the model I use to predict phenotypes? Also, is it the alpha value in this training part that I should vary?

pyseer --kmers WHICH_UNITIGS? --phenotypes TRAINING_SET_PHENOTYPES \
--wg enet --load-vars SAVED_VARS --alpha 1 --save-model TEST_LASSO \
--lineage-clusters ALL_LINEAGES --sequence-reweighting
And COMBINATION_OF_VARIANTS_IM_INTERESTED_IN to test whether predicting with just these unitigs hitting the interesting genes is enough to produce accurate results?

enet_predict --kmers COMBINATION_OF_VARIANTS_IM_INTERESTED_IN \
--lineage-clusters ALL_LINEAGES --true-values TEST_SET_PHENOTYPES \
TEST_LASSO.pkl TEST_SAMPLES > test_predictions.txt
Is this even close to correct usage?
Another question that just popped into my mind: why is --cor-filter 0 advised in the best-practices part of the documentation?
I would use all unitigs in your subsequent analysis, although many will not be used for the predictions anyway because they will have a weight of 0. I fear that providing a smaller variant set might throw some errors (I'm not actually sure though, would need to check the code and tests again). I would also follow the advice of not dropping correlated variants, unless you run into out-of-memory errors and models that take too long to train.
I tried to train the model with my data in both ways.
This works fine:
pyseer --kmers ALL_UNITIGS --phenotypes TRAINING_SET_PHENOTYPES \
--wg enet --load-vars SAVED_VARS --alpha 1 --save-model TEST_LASSO_ALL_UNITIGS \
--lineage-clusters ALL_LINEAGES --sequence-reweighting
This throws an error:
pyseer --kmers COMBINATION_OF_VARIANTS_IM_INTERESTED_IN --phenotypes TRAINING_SET_PHENOTYPES \
--wg enet --load-vars SAVED_VARS --alpha 1 --save-model TEST_LASSO_FEWER_UNITIGS \
--lineage-clusters ALL_LINEAGES --sequence-reweighting
Error message:
Traceback (most recent call last):
File "~/miniforge3/envs/pyseer/bin/pyseer", line 10, in <module>
sys.exit(main())
File "~/miniforge3/envs/pyseer/lib/python3.10/site-packages/pyseer/__main__.py", line 688, in main
print(format_output(x,
File "~/miniforge3/envs/pyseer/lib/python3.10/site-packages/pyseer/utils.py", line 59, in format_output
out += '\t' + '\t'.join(['%.2E' % Decimal(x)
File "~/miniforge3/envs/pyseer/lib/python3.10/site-packages/pyseer/utils.py", line 60, in <listcomp>
if np.isfinite(x)
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
But when I remove --load-vars, this command also works fine.
Now I have two models, TEST_LASSO_ALL_UNITIGS and TEST_LASSO_FEWER_UNITIGS. When I use these to predict the phenotypes of my test group
enet_predict_pyseer --kmers COMBINATION_OF_VARIANTS_IM_INTERESTED_IN \
--lineage-clusters ALL_LINEAGES --true-values TEST_SET_PHENOTYPES <model> TEST_SAMPLES
the R2-values are 0.7405 and 0.8494, respectively. So, the model works better with fewer unitigs, but I'm not so sure about the logic behind it. Am I predicting the right thing with the right model?
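(For reference, here is a minimal coefficient-of-determination calculation I used to sanity-check what an R2 of that form means; I'm assuming that is the statistic enet_predict reports, and the numbers below are invented, not my data.)

```python
# Minimal R^2 (coefficient of determination) check between true and
# predicted phenotypes: R^2 = 1 - SS_res / SS_tot.
# (Assumes enet_predict reports this statistic; invented numbers.)

def r_squared(true_vals, predicted):
    mean_true = sum(true_vals) / len(true_vals)
    ss_res = sum((t - p) ** 2 for t, p in zip(true_vals, predicted))
    ss_tot = sum((t - mean_true) ** 2 for t in true_vals)
    return 1.0 - ss_res / ss_tot

truth = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_good = [1.1, 1.9, 3.2, 3.8, 5.1]
pred_bad = [3.0, 3.0, 3.0, 3.0, 3.0]   # always predicts the mean

print(round(r_squared(truth, pred_good), 4))
print(round(r_squared(truth, pred_bad), 4))   # 0.0 by construction
```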
I don't think there's any particular error in your workflow, apart from the fact that when you use the TEST_LASSO_ALL_UNITIGS model it only sees a subset of the unitigs, which I'm guessing would affect its performance, as it will probably (again, not sure without looking at the code) assume they are always missing. So perhaps a like-for-like comparison would be to test the TEST_LASSO_ALL_UNITIGS model loading all unitigs. Does that make sense?
Why do elastic net and LMM select different variants?
Well, the two processes work quite differently; the main difference being that the LMM tests one feature at a time, while the elastic net uses all variants at the same time. That by itself will lead to different variants being flagged as associated with the phenotype.
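A tiny invented example of that difference: two perfectly correlated variants each look strongly associated when tested one at a time (the marginal, LMM-style view), but once one of them is in a joint model the other explains no additional variance, so a sparse joint fit like the elastic net can keep only one of them. Plain-Python sketch, not a real association test:

```python
# Toy contrast between marginal testing (LMM-style: one variant at a
# time) and joint modelling (elastic-net-style: all variants at once).
# Two identical variants each correlate strongly with the phenotype on
# their own, but after fitting one, the other adds nothing.
# (Invented data; not a real LMM or elastic-net fit.)

def mean(xs):
    return sum(xs) / len(xs)

def correlation(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def residual_after(x, y):
    """Residual of y after ordinary least-squares regression on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

variant1 = [0, 0, 1, 1, 0, 1, 1, 0]
variant2 = list(variant1)                    # perfectly correlated copy
phenotype = [0.1, 0.2, 1.1, 0.9, 0.0, 1.0, 1.2, 0.1]

# Marginal view: both variants look strongly associated on their own.
marginal = [correlation(v, phenotype) for v in (variant1, variant2)]

# Joint view: once variant1 is fitted, variant2 explains nothing extra.
partial = correlation(variant2, residual_after(variant1, phenotype))

print("marginal correlations:", [round(c, 3) for c in marginal])
print("variant2 after variant1 is fitted:", round(partial, 3))
```

An LMM would flag both variants here, while a sparse joint model is free to drop one of them, which is one concrete way the two approaches end up selecting different variant sets.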
Hello!
I got to analysing after clearing the startup hiccups of #252.
A succinct summary of the original data set:
I anticipate that these stress resilience phenotypes are under much weaker selection and I expect numerous smaller effects rather than a few big ones. I thought that elastic nets could be used to find these smaller combinations or networks of variants that most adequately predict the phenotype of a test group.
My thought process so far:
The problem is that this is slow and the number of possible combinations and subcombinations is astronomical. One alternative would be to first detect biologically relevant combinations, but then I fear I would lose genes of yet unknown function in the process.
Can pyseer elastic nets be used like this? Does this make any sense?