idmjky / EvolvePro

PLM based active learning model for protein engineering

No improved functional mutants provided by the model #2

Closed jbsyngenes closed 1 month ago

jbsyngenes commented 1 month ago

round1_all_new.csv

Hi, I ran the model with the FASTA and input Excel file that you provided, and it worked! However, in the output file that I got, it looks like 7 of the top 10 performers were mutants that were already included in the sample dataset (I attached my output for reference). Was this actual functional data? What should be done in a situation like this, where the model doesn't suggest any better performers? I was expecting to see lots of new mutants with higher performance at the top of the list. Thanks!

idmjky commented 1 month ago

Hi, thanks for your interest. The issue you are describing is due to the nature of the Random Forest regressor: its predictions are averages of training values, so they can never exceed the maximum value in the input training data. What you need to do is simply take the top 10-16 untested variants in this list and go from there. I will update the code to directly output the untested variants in a later version. Hope this helps.
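To illustrate the capping behaviour described above, here is a minimal sketch using only the Python standard library. It is a toy stand-in for a random forest (one-split trees on bootstrap samples), not EvolvePro's code and not scikit-learn's implementation; the data and the names `fit_stump` and `predict_forest` are made up for illustration. Because every leaf predicts the mean of the training targets that landed in it, no prediction can exceed the largest training target, however far the query point lies outside the training range.

```python
import random
from statistics import mean


def fit_stump(xs, ys, rng):
    """Fit a one-split regression tree on a bootstrap sample of (xs, ys)."""
    n = len(xs)
    idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
    bx = [xs[i] for i in idx]
    by = [ys[i] for i in idx]
    threshold = rng.uniform(min(bx), max(bx))   # random split point
    left = [y for x, y in zip(bx, by) if x <= threshold]
    right = [y for x, y in zip(bx, by) if x > threshold]
    fallback = mean(by)                         # used if one side is empty
    # Each leaf predicts the mean of the training targets that fell into it.
    left_value = mean(left) if left else fallback
    right_value = mean(right) if right else fallback
    return threshold, left_value, right_value


def predict_forest(stumps, x):
    """Average the per-tree predictions, as a random forest regressor does."""
    return mean(r if x > t else l for t, l, r in stumps)


rng = random.Random(0)
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]   # hypothetical 1-D feature
train_y = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical activity, rising with x

forest = [fit_stump(train_x, train_y, rng) for _ in range(200)]

# Query far outside the training range: the trend suggests a large value,
# but the forest can only return an average of values it has already seen.
far_prediction = predict_forest(forest, 100.0)
print(far_prediction <= max(train_y))
```

A model that extrapolates, such as linear regression, would predict a much larger value at `x = 100`; the tree ensemble cannot, which is why already-measured top performers can rank above every untested variant.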

idmjky commented 1 month ago

toplayer.py has now been updated to output only the variants that still need to be tested.
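For anyone on an older version, the filtering step can be done by hand. The following is a hedged sketch of that idea, not the actual code in toplayer.py; the function name `top_untested`, the variant names, and the scores are all hypothetical.

```python
def top_untested(ranked, tested, n=10):
    """Return the first n (variant, score) pairs whose variant is not in tested.

    ranked must already be sorted from highest to lowest predicted score.
    """
    tested = set(tested)
    picks = []
    for variant, score in ranked:
        if variant in tested:
            continue  # already measured in an earlier round, skip it
        picks.append((variant, score))
        if len(picks) == n:
            break
    return picks


# Hypothetical predicted scores for five variants.
predictions = {"A12V": 0.95, "G45S": 0.88, "K3R": 0.91, "L7P": 0.80, "T9M": 0.60}
# Hypothetical variants that were already in the training data.
tested = {"A12V", "K3R"}

ranked = sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)
next_round = top_untested(ranked, tested, n=2)
print(next_round)
```

In practice `n` would be set to the 10-16 variants suggested above, and `ranked` would be read from the model's output file.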