Open eric-jm-lang opened 3 years ago
+1
I have the same confusion!
Same here!
Maybe one could hack together a training file with a bunch of neutral mutations:
mutation score
M1M 1.0
F12F;L30L 1.0
G89G 1.0
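A minimal sketch of how such a file could be generated, assuming a tab-separated layout with the `mutation` and `score` columns shown above and 1-indexed positions. The function names, the example sequence, and the choice of at most two sites per row are hypothetical and not part of ECNet. Whether training on such a file yields a meaningful model is not established here.

```python
import random


def neutral_mutations(sequence, n_rows, max_sites=2, score=1.0, seed=0):
    """Build rows of synonymous "mutations" (wild type -> same residue).

    `sequence` is the wild-type protein sequence (hypothetical input).
    Each row combines 1..max_sites positions, joined by ';'.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        k = rng.randint(1, max_sites)
        positions = rng.sample(range(len(sequence)), k)
        # Positions are written 1-indexed, e.g. M1M for the first residue.
        spec = ";".join(f"{sequence[p]}{p + 1}{sequence[p]}" for p in positions)
        rows.append((spec, score))
    return rows


def to_tsv(rows):
    """Render rows as tab-separated text with a `mutation`/`score` header."""
    lines = ["mutation\tscore"]
    lines += [f"{spec}\t{score}" for spec, score in rows]
    return "\n".join(lines) + "\n"


# Example with a made-up 10-residue sequence (assumption, not a real target).
example_rows = neutral_mutations("MKTAYIAKQR", n_rows=3)
print(to_tsv(example_rows))
```

The resulting text can be saved to a file and passed wherever the training TSV is expected.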
Looks like it worked. I tested it with a separate test file containing random mutations. I still need to validate it against experimental data.
Prediction from Training File with neutral mutations:
mutation score prediction
D36D;G142G 1.00000000 1.04933691
E145E;S128S 1.00000000 1.04933691
L19L;N152N 1.00000000 1.04933691
E237E;P12P 1.00000000 1.04933691
Prediction from Test File with random mutations:
mutation score prediction
A9D;T27L 1.00000000 0.51061106
A124A;I3T 1.00000000 0.98425829
V258L;A211L 1.00000000 -0.28957328
A276R;K252E 1.00000000 1.15801334
E175E;F14A 1.00000000 1.18147123
Ignore my previous naive attempt. I re-read the paper and recalled that @luoyunan used homologous sequences to train a bidirectional model on masked amino acid residues. I reviewed the ECNet and Dataset classes: the provided model can only process mutation-feature paired TSV files for training, so training on homologous sequences must be in a different code base.
Thank you for your input, @meehljd. I hope @luoyunan can provide more information on how to do this.
Hello, in the ECNet paper you built an unsupervised ECNet model that does not require DMS data for training. It uses the predicted probability of an amino acid at a position as a proxy for fitness. Is there specific code for this unsupervised model? Or is it a question of using the current ECNet code to generate an unsupervised model by using a different input for `--train`? Could you please provide more details on how to build such an unsupervised model? Many thanks in advance.
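To make the idea in the question concrete, here is a minimal sketch of turning per-position amino acid probabilities into a mutation score. It is not ECNet code. The probability table `probs` is a hypothetical stand-in for the output of some masked language model, and scoring by the log-ratio of mutant to wild-type probability, summed over sites, is an assumed convention rather than the paper's confirmed formula.

```python
import math


def parse_mutations(spec):
    """Split a spec like 'A9D;T27L' into (wild_type, position, mutant) tuples."""
    parsed = []
    for token in spec.split(";"):
        wild_type, position, mutant = token[0], int(token[1:-1]), token[-1]
        parsed.append((wild_type, position, mutant))
    return parsed


def unsupervised_score(spec, probs):
    """Sum log p(mutant) - log p(wild type) over all mutated sites.

    `probs` is a list with one dict per sequence position (0-indexed),
    mapping amino acid letters to predicted probabilities. Positions in
    `spec` are 1-indexed, as in the mutation strings in this thread.
    """
    total = 0.0
    for wild_type, position, mutant in parse_mutations(spec):
        site = probs[position - 1]
        total += math.log(site[mutant]) - math.log(site[wild_type])
    return total


# Made-up probabilities for a two-residue toy sequence "AK" (assumption).
toy_probs = [
    {"A": 0.50, "D": 0.25, "K": 0.25},
    {"A": 0.10, "D": 0.10, "K": 0.80},
]
print(unsupervised_score("A1D", toy_probs))      # negative: D is less likely than A
print(unsupervised_score("A1A;K2K", toy_probs))  # neutral mutations score 0.0
```

Under this convention a neutral mutation always scores zero, and a mutant residue the model considers less likely than the wild type scores below zero.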