jasonkyuyim / multiflow

https://arxiv.org/abs/2402.04997
MIT License
106 stars, 4 forks

Can this be a problem when choosing the same token id for both mask token and unknown residues? #8

Open smiles724 opened 2 hours ago

smiles724 commented 2 hours ago

Hi, Jason,

Thanks for providing the code. I have a question about your choice of MASK_TOKEN_INDEX, which is defined as MASK_TOKEN_INDEX = residue_constants.restypes_with_x.index('X'). That is, the mask token shares its index with the unknown amino acid X.

However, this may cause a problem when computing the loss. As I understand it, you force the predicted probability of the unknown residue (X) to be close to 0.


However, when the data is processed, every unknown residue is assigned X's index.

Therefore, it seems odd to me that the ground-truth label for those unknown residues is X's index while their predicted probability of X is pushed to 0. No loss is imposed in this case, though, so perhaps it is acceptable?
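To make the concern concrete, here is a minimal pure-Python sketch, not the repository's actual code. It assumes the AlphaFold-style residue ordering for restypes_with_x (20 standard residues followed by 'X'), and it assumes the suppression is done by overwriting the X logit with a large negative constant. The helper names (suppress_mask_logit, cross_entropy) and the fill value -1e9 are hypothetical.

```python
import math

# Assumed AlphaFold-style ordering: 20 standard residues, then 'X'.
RESTYPES_WITH_X = list("ARNDCQEGHILKMFPSTWYV") + ["X"]
MASK_TOKEN_INDEX = RESTYPES_WITH_X.index("X")  # same index as unknown residues


def suppress_mask_logit(logits, mask_index=MASK_TOKEN_INDEX, fill=-1e9):
    """Hypothetical helper: return a copy with the mask/X logit pushed very low."""
    out = list(logits)
    out[mask_index] = fill
    return out


def log_softmax(logits):
    """Numerically stable log-softmax over a flat list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_entropy(logits, target):
    """Negative log-likelihood of the target class."""
    return -log_softmax(logits)[target]


# Uniform logits, then suppress the shared mask/X entry.
suppressed = suppress_mask_logit([0.0] * len(RESTYPES_WITH_X))

prob_x = math.exp(log_softmax(suppressed)[MASK_TOKEN_INDEX])
loss_on_x = cross_entropy(suppressed, MASK_TOKEN_INDEX)            # target is an unknown residue
loss_on_ala = cross_entropy(suppressed, RESTYPES_WITH_X.index("A"))  # target is alanine

# If positions whose label is X get zero loss weight, the huge term never
# reaches the total loss.
weights = [0.0, 1.0]  # [X-labelled position, alanine-labelled position]
total = sum(w * l for w, l in zip(weights, [loss_on_x, loss_on_ala]))

print(MASK_TOKEN_INDEX, prob_x, loss_on_x, loss_on_ala, total)
```

Under these assumptions, a position labelled X would receive an enormous cross-entropy term, so training only stays well-behaved if such positions are excluded from the loss (for example by a zero weight), which matches the "no loss imposed" observation above.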

smiles724 commented 2 hours ago

P.S. Why not use a separate, independent token index (e.g. 22) for the mask token?
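A rough sketch of what a separate mask index could look like, again in pure Python and under the same assumed residue ordering. The names UNK_INDEX, MASK_INDEX and VOCAB_SIZE are hypothetical, and the concrete value chosen here (the first free slot after a 21-entry, zero-indexed vocabulary) is an assumption of this sketch rather than something taken from the repository.

```python
import math

# Assumed AlphaFold-style ordering: 20 standard residues, then 'X'.
RESTYPES_WITH_X = list("ARNDCQEGHILKMFPSTWYV") + ["X"]

UNK_INDEX = RESTYPES_WITH_X.index("X")  # unknown residue keeps its own index
MASK_INDEX = len(RESTYPES_WITH_X)       # hypothetical: first unused index
VOCAB_SIZE = MASK_INDEX + 1             # residues + X + dedicated mask token


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Suppress only the dedicated mask token; X is left untouched.
logits = [0.0] * VOCAB_SIZE
logits[MASK_INDEX] = -1e9
probs = softmax(logits)

print(UNK_INDEX, MASK_INDEX, probs[UNK_INDEX], probs[MASK_INDEX])
```

With a dedicated index, the model can still assign probability to X for genuinely unknown residues while never predicting the mask token, at the cost of one extra output class.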