Closed 1511878618 closed 1 year ago
Ah I see. Thanks for posting this. _x is created when only one sequence has a particular EC number. Since we need to sample positive sequences other than the anchor sequence, we mutated the anchor sequence and used the mutated sequences as positive sequences. I might have overlooked the implementation of this part in this repo. Thanks for pointing it out ^^
OK, by the way, how do you generate the mutated sequences used as positive sequences? It doesn't seem to be mentioned in the original paper, or maybe I didn't notice it, lol. Thanks for replying~
@1511878618 Ahh, this was a mistake while cleaning up our development code. We have a script for mutating the sequence for the EC numbers with only one sequence (to enable gradient computation for the contrastive loss). This script should go before the training and after the retrieval of the ESM embeddings. The inference shouldn't be affected because we provided the embeddings for each EC number (70.pt and 100.pt), but you are right about running into errors during training. I will create a PR to fix this, thanks for bringing this up!
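For anyone following along, here is a minimal sketch of what such a mutation step could look like. It is not the script from the repo: the function names, the random point-substitution strategy, the 10% mutation rate, and the number of copies per sequence are all assumptions for illustration. Only the idea of naming the mutated copies with an _x suffix comes from this thread.

```python
import random

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_sequence(seq, rate=0.1, rng=None):
    """Return a copy of seq with random point substitutions.

    Hypothetical helper: `rate` is an assumed fraction of positions to
    change, not a value taken from the paper or the repo.
    """
    rng = rng or random.Random()
    # Mutate at least one position, but never more than the sequence has.
    n_mut = min(len(seq), max(1, round(len(seq) * rate)))
    chars = list(seq)
    for pos in rng.sample(range(len(seq)), n_mut):
        # Always pick a residue different from the current one.
        choices = [aa for aa in AMINO_ACIDS if aa != chars[pos]]
        chars[pos] = rng.choice(choices)
    return "".join(chars)


def make_mutated_positives(ec_id, id_seq, n_copies=10, rate=0.1, seed=0):
    """Create '<id>_<i>' mutated copies for EC numbers with a single sequence.

    ec_id:  dict mapping EC number -> collection of protein IDs (assumed layout)
    id_seq: dict mapping protein ID -> amino-acid sequence (assumed layout)
    """
    rng = random.Random(seed)
    mutated = {}
    for ec, ids in ec_id.items():
        if len(ids) != 1:
            continue  # this EC already has other sequences to sample from
        (pid,) = ids
        for i in range(n_copies):
            mutated[f"{pid}_{i}"] = mutate_sequence(id_seq[pid], rate, rng)
    return mutated


if __name__ == "__main__":
    # Placeholder sequence, NOT the real Q9TTH8 sequence.
    ec_id = {"EC:3.4.22.54": ["Q9TTH8"]}
    id_seq = {"Q9TTH8": "MKTAYIAKQRQISFVKSHFSRQ"}
    for name, seq in make_mutated_positives(ec_id, id_seq, n_copies=3).items():
        print(name, seq)
```

The mutated sequences would then need their own ESM embeddings before training, which matches the ordering described above (after retrieving the original embeddings, before training).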
@1511878618 PR #15
There is a bug in the random_positive function in dataloader. It shows up when an EC number has only one protein ID, like EC:3.4.22.54, which only has ['Q9TTH8'] in it. So I think the problem is here.
Everything in the comments is my opinion, and I will raise a PR soon.
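To make the single-ID case concrete, here is a rough sketch of how positive sampling could guard against it. This is not the repo's random_positive: the signature, the id_ec / ec_id dictionary layout, and the available_ids argument are assumptions for illustration. The idea is to fall back to mutated copies named <id>_x only when they actually exist, and to fail with a clear message otherwise.

```python
import random


def random_positive(anchor_id, id_ec, ec_id, available_ids, rng=random):
    """Pick a positive sample for anchor_id (hypothetical sketch).

    id_ec:         dict protein ID -> list of EC numbers (assumed layout)
    ec_id:         dict EC number -> collection of protein IDs (assumed layout)
    available_ids: set of IDs that have an embedding on disk (assumed)
    """
    pos_ec = rng.choice(id_ec[anchor_id])
    # Other proteins sharing the chosen EC number.
    candidates = [p for p in ec_id[pos_ec] if p != anchor_id]
    if candidates:
        return rng.choice(candidates)

    # Only the anchor has this EC number: fall back to its mutated copies.
    prefix = anchor_id + "_"
    mutated = sorted(p for p in available_ids if p.startswith(prefix))
    if not mutated:
        raise KeyError(
            f"{pos_ec} has only {anchor_id!r} and no mutated copies were found; "
            "generate them before training"
        )
    return rng.choice(mutated)


if __name__ == "__main__":
    id_ec = {"Q9TTH8": ["EC:3.4.22.54"]}
    ec_id = {"EC:3.4.22.54": ["Q9TTH8"]}
    print(random_positive("Q9TTH8", id_ec, ec_id, {"Q9TTH8_0", "Q9TTH8_1"}))
```

Raising early with the EC number in the message makes a missing mutation step visible at sampling time instead of surfacing later as a missing-embedding error.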