nadavbra / protein_bert

GO Annotation Vector #38

Open · rhysnewell opened 1 year ago

rhysnewell commented 1 year ago

Hi @nadavbra,

Cool model, I just have a few questions. The first: did you ever recover the GO annotation vector that was lost in https://github.com/nadavbra/protein_bert/issues/6? I can't seem to find it anywhere, and unfortunately I have neither a month of training time nor a powerful enough GPU to retrain the model myself.

My second question concerns the global_representations vector returned when using the model to predict a protein sequence's function: it is an array of floating-point values between 0 and 1. Do you have any intuition on how to interpret these values? I understand that values closer to one indicate a higher probability that a given protein carries that GO annotation, but what is the cutoff? Is a value greater than 0.1 sufficient? Should it be higher, or can you go lower? What values did you use in your paper? The model seems fairly stringent; is that by design?

Cheers, Rhys

nadavbra commented 1 year ago

Hi @rhysnewell,

Again, I really apologize for this, but we haven't yet been able to recover the ordering of the GO annotation vector or retrain the model. If you have the time to do some investigation, you could feed the model many protein sequences whose correct GO annotations you know and try to recover which bit corresponds to which GO annotation (I'm not 100% sure it will work, but it may be worth trying before resorting to retraining the whole thing).
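
If someone wants to attempt this, here is a minimal sketch of the matching idea: run many proteins with known annotations through the model, then assign each output bit to the known GO term whose presence pattern it correlates with best. All names and the data preparation here are placeholders, not the actual protein_bert API.

```python
import numpy as np

# probs: (n_proteins, n_bits) model outputs for proteins with known annotations
# truth: (n_proteins, n_terms) binary matrix of the known GO annotations
# Building these two matrices is left to the reader; this only does the matching.
def match_bits_to_terms(probs: np.ndarray, truth: np.ndarray) -> np.ndarray:
    # Center both matrices so the dot product behaves like a covariance
    p = probs - probs.mean(axis=0)
    t = truth - truth.mean(axis=0)
    # Correlation-like score for every (bit, term) pair: shape (n_bits, n_terms)
    scores = p.T @ t
    norms = np.linalg.norm(p, axis=0)[:, None] * np.linalg.norm(t, axis=0)[None, :]
    scores /= norms + 1e-9  # avoid division by zero for constant columns
    return scores.argmax(axis=1)  # index of the best-matching GO term per bit
```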

You should think of these 0-1 numbers as probabilities: a value of 0.1 indicates that the model assigns a 10% probability to the protein having that annotation. Studying these values was not part of the paper, so I can't recommend any specific cutoff (I think it depends mostly on what you use it for).
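
As a trivial example of what applying a cutoff would look like (the 0.5 is arbitrary, and `go_probs` stands in for the GO slice of global_representations):

```python
import numpy as np

go_probs = np.array([0.02, 0.35, 0.97, 0.61])  # placeholder probabilities

threshold = 0.5  # arbitrary; tune it for your own precision/recall trade-off
predicted = np.where(go_probs >= threshold)[0]
print(predicted)  # indices of the GO bits called "present" -> [2 3]
```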

What do you mean when you say that the model is stringent?

I hope that at least somewhat helps (and again, I'm really sorry about this loss of data).

rhysnewell commented 1 year ago

Thanks for clarifying. Unfortunately, I don't think that would work, as there are just too many annotations to go over and determine one by one.

So how did you decide in the paper whether the model correctly predicted a GO annotation? Did you not use a cutoff (even an arbitrary one like 95%)?

The stringency comment was just a mistake on my part, I think. I passed the model some more genes and it produced much more sensible probabilities (i.e., > 99%).

Also, regarding the loss of data: are you or your collaborators not using the model to make your own predictions? Maybe someone you work with still has access to the GO annotation vector, or has regenerated the model and kept it.

Cheers, Rhys

nadavbra commented 1 year ago
  1. Just to clarify, I didn't mean going over the GO annotations by hand, but rather writing a program to do it (though that could still be a lot of work, so I totally get you).
  2. We never directly assessed the predicted GO annotations in the paper. We only looked at the loss (cross entropy), which uses the continuous probabilities, so we never had to choose any cutoff (see the sketch after this list).
  3. My co-author Dan is planning to retrain an updated version of the model at some point, but I'm not sure when it's going to happen.
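
For reference, a minimal sketch of that loss in plain NumPy (not the actual training code):

```python
import numpy as np

# Multi-label binary cross-entropy over the raw probabilities; no cutoff needed.
def binary_cross_entropy(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    eps = 1e-7  # guard against log(0)
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))
```
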
TheLostLambda commented 1 year ago

Hi @nadavbra !

If it's any help, I'm considering some retraining myself and have access to my university's HPC cluster (with some very powerful GPUs), which could help speed up the process. Do you know if your co-author Dan has an updated version he'd like to retrain?

Using those HPC resources might speed up the training, help recover the GO vector, and also open the door to some slightly larger models (which you hypothesized in the paper could further increase the model's power).

If you could put me in touch with Dan, I'd love to help out!

ddofer commented 1 year ago

Hi @TheLostLambda, we'd love to retrain ProteinBERT with improvements: a larger model, another convolution layer, ideally removing the GO annotation as an input (while keeping it as an output), and an updated UniProt/UniRef90 dump. Nadav and I don't have a ton of capacity, at least for improving the input format (though I can do the model/hyperparameter/architecture tweaks easily). Would you be interested in collaborating on that?
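
To make the input/output change concrete, here is a rough Keras sketch of the shape of the proposal: GO annotations removed from the inputs but kept as a sigmoid output head, with an extra convolution layer. All layer types, sizes, and names are illustrative only, not the actual ProteinBERT architecture.

```python
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN, VOCAB, N_GO_TERMS = 512, 26, 8943  # illustrative sizes only

# Sequence tokens are now the only input; no GO annotation input vector.
seq_input = layers.Input(shape=(SEQ_LEN,), dtype='int32', name='seq_tokens')
x = layers.Embedding(input_dim=VOCAB, output_dim=128)(seq_input)
x = layers.Conv1D(256, kernel_size=9, padding='same', activation='relu')(x)
x = layers.Conv1D(256, kernel_size=9, padding='same', activation='relu')(x)  # the extra conv layer
pooled = layers.GlobalAveragePooling1D()(x)

# GO annotations kept as a multi-label sigmoid output head.
go_output = layers.Dense(N_GO_TERMS, activation='sigmoid', name='go_annotations')(pooled)

model = keras.Model(inputs=seq_input, outputs=go_output)
model.compile(optimizer='adam', loss='binary_crossentropy')
```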

TheLostLambda commented 1 year ago

Hi @ddofer! I'll admit I'm somewhat new to machine learning as a field, but I have a pretty solid computing background overall, so I'd be happy to take a swing at it. I'm glad to implement the improvements you've both already thought up and to train the model using my university's resources!

I'm definitely interested in collaborating and helping out however I can!

TheLostLambda commented 1 year ago

@ddofer @nadavbra Let me know if you'd like a meeting at some point to set things in motion!

a-ill commented 8 months ago

@nadavbra Any progress with recovering the vector or retraining the model?