Open rhysnewell opened 1 year ago
Hi @rhysnewell,
Again, I really apologize for this, but we haven't yet been able to recover the ordering of the GO annotation vector or retrain the model. If you have the time to do some investigation, you could feed the model many protein sequences whose correct GO annotations you already know, and try to work out which output bit corresponds to which GO annotation. I'm not 100% sure it will work, but it may be worth trying before resorting to retraining the whole thing.
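Not something from the repo, but a rough numpy sketch of that probing idea: collect the model's outputs for proteins whose GO annotations you already know, then match each output column to the known-annotation column it correlates with best. The function name and both arguments (`pred_probs`, `true_labels`) are hypothetical.

```python
import numpy as np

def recover_annotation_order(pred_probs, true_labels):
    """Guess which output column corresponds to which known GO annotation.

    pred_probs:  (n_proteins, n_go) model outputs, column order unknown.
    true_labels: (n_proteins, n_go) binary matrix in the known GO order.
    Returns, for each known GO column, the index of the best-matching
    output column.
    """
    # Standardize each column, then correlate every predicted column
    # with every known column.
    p = (pred_probs - pred_probs.mean(0)) / (pred_probs.std(0) + 1e-8)
    t = (true_labels - true_labels.mean(0)) / (true_labels.std(0) + 1e-8)
    corr = t.T @ p / len(pred_probs)   # (n_go, n_go) correlation matrix
    return corr.argmax(axis=1)         # output column per known annotation
```

With ~8,000 GO terms this is a big but cheap matrix product; the harder part, as noted above, is assembling enough proteins with known annotations that every column is identifiable.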
You should think about these 0-1 numbers as probabilities. A value of 0.1 would indicate that the model assigns a 10% probability to the protein having that annotation. Studying these values was not part of the paper, so I can't recommend any specific cutoff (I think it depends on what you use it for).
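A minimal sketch of that interpretation: treat each output as a per-annotation probability and keep the terms above a cutoff of your choosing. The helper name is hypothetical, and the 0.5 default is an arbitrary illustration, not a value from the paper.

```python
import numpy as np

def predicted_annotations(probs, go_terms, cutoff=0.5):
    """Return the GO terms whose predicted probability exceeds `cutoff`.

    probs:    sequence of model outputs in [0, 1], one per GO term.
    go_terms: GO term identifiers, aligned with `probs`.
    cutoff:   arbitrary example threshold; tune for your own use case.
    """
    probs = np.asarray(probs)
    return [term for term, p in zip(go_terms, probs) if p > cutoff]
```

Lowering the cutoff trades precision for recall, which is why no single value suits every application.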
What do you mean when you say that the model is stringent?
I hope that helps at least somewhat (and again, I'm really sorry about this loss of data).
Thanks for clarifying. Unfortunately, I don't think that would work as there are just too many annotations to go over and determine one by one.
So in the paper how did you decide whether the model correctly predicted a GO annotation then? Did you not have a cutoff that was used (even if arbitrary like 95% or something)?
The stringency comment was just a mistake on my part I think. I passed it some more genes and it was providing much more sensible probabilities (i.e. > 99%).
Also, regarding the loss of data: are you or your collaborators not using the model to make predictions of your own? Maybe someone you work with still has access to it, or has re-generated the model and kept the GO annotation vector.
Cheers, Rhys
Hi @nadavbra !
If it's any help, I'm considering some retraining myself and have access to my university's HPC cluster (with some very powerful GPUs) that could help speed up the process. Do you know if your co-author Dan has an updated version he'd like to retrain?
I'm happy to use my university's HPC resources to do so (which might speed up the training). This might help recover the GO vector and also open the door to some slightly larger models (which you hypothesized in the paper could further increase its power).
If you could put me in touch with Dan, I'd love to help out!
Hi @TheLostLambda We'd love to retrain proteinBert with improvements: a larger model, another convolution layer, ideally removing the GO annotation as an input (while keeping it as an output), and using an updated UniProt/UniRef90 dump. Nadav and I don't have a ton of capacity, at least for improving the input format (I can do the model hyperparameter/architecture tweaks easily, though). Would you be interested in collaborating on that?
Hi @ddofer ! I'll admit I'm somewhat new to machine learning as a field, but I do have a pretty solid computing background overall, so I'd be happy to give it a try! I can work on implementing the improvements you've both already thought up and train the model using my university's resources.
I'm definitely interested in collaborating and helping out however I can!
@ddofer @nadavbra Let me know if you'd like a meeting at some point to set things in motion!
@nadavbra Any progress with recovering the vector or retraining the model?
Hi @nadavbra,
Cool model, I just have a few questions. The first being, did you ever recover the GO annotation vector that was lost in https://github.com/nadavbra/protein_bert/issues/6? I can't seem to find it anywhere, and I unfortunately do not have a month nor a powerful enough GPU to retrain the model myself.
My second question is about the `global_representations` vector returned when using the model to predict the function of a protein sequence: it is an array of floating-point values between 0 and 1. Do you have any intuition on how to interpret these values? I get that values closer to one represent a higher probability of a given protein belonging to that GO annotation, but what is the cutoff? Is a value greater than 0.1 sufficient? Should it be higher, or can you go lower? What values did you use in your paper? The model seems fairly stringent; is this by design?

Cheers, Rhys