lbcb-sci / RiNALMo

RiboNucleic Acid (RNA) Language Model
https://sikic-lab.github.io/
Apache License 2.0

Same RNA, different representation #6

Closed ylzdmm closed 2 months ago

ylzdmm commented 2 months ago

Hello, is this normal? My test.py:

```python
import torch
from rinalmo.pretrained import get_pretrained_model

DEVICE = "cuda:0"

model, alphabet = get_pretrained_model(model_name="rinalmo_giga_pretrained")
model = model.to(device=DEVICE)
seqs = ["CCCGGU", "CCCGGU"]

tokens = torch.tensor(alphabet.batch_tokenize(seqs), dtype=torch.int64, device=DEVICE)
with torch.no_grad(), torch.cuda.amp.autocast():
    outputs = model(tokens)

print(outputs["representation"])
```

But the output for the two identical sequences is different:

```
$ python test.py
tensor([[[ 0.0209, -0.3792, -0.9592,  ..., -0.3661, -0.4986, -1.0630],
         [-0.1543, -0.8713, -0.6534,  ..., -0.7442, -0.4688,  0.0491],
         [-0.1923, -1.0140, -1.3560,  ..., -2.0971, -1.1946, -0.7145],
         ...,
         [ 1.5532, -1.9415, -1.3395,  ..., -1.3404, -1.1100,  0.9047],
         [-0.1968, -0.5992,  0.3608,  ..., -1.4525, -0.8330,  0.4122],
         [-1.1677,  0.0836, -0.1704,  ..., -0.8856, -0.8993, -0.1143]],

        [[ 0.1576, -0.0849, -1.1658,  ..., -0.1120, -0.8494, -0.4571],
         [ 0.2583, -0.0431, -0.1226,  ..., -1.9443, -0.7913,  0.4501],
         [ 0.0782, -0.8882, -0.7555,  ..., -0.7302, -1.6658,  0.0445],
         ...,
         [ 1.3045, -1.9552, -2.3737,  ..., -0.5877, -1.6685,  0.6632],
         [ 0.5900, -0.9660, -0.0392,  ..., -1.1003, -2.0937,  1.4232],
         [-0.7117, -0.8371, -0.3525,  ..., -1.1058, -1.0734, -0.6338]]],
       device='cuda:0')
```

The pre-trained weights were downloaded from https://zenodo.org/records/10725749/files/rinalmo_giga_pretrained.pt

RJPenic commented 2 months ago

Hello, RiNALMo uses dropout layers, so by default the output is non-deterministic. If you want to switch off dropout, put the model into evaluation mode (`model.eval()`). That way you'll always get the same output for the same sequence.
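
For example, a minimal sketch reusing `model` and `tokens` from your script above:

```python
model.eval()  # switch off dropout (and other train-only behaviour) for inference
with torch.no_grad(), torch.cuda.amp.autocast():
    outputs = model(tokens)
# the two rows of outputs["representation"] should now be identical,
# since the two input sequences are identical
```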

ylzdmm commented 2 months ago

Hello, thank you for your reply. That solved it, but now I have another question. I modified my test.py as follows:

```python
import torch
from rinalmo.pretrained import get_pretrained_model

DEVICE = "cuda:0"

model, alphabet = get_pretrained_model(model_name="rinalmo_giga_pretrained")
model.eval()
model = model.to(device=DEVICE)
seqs = ["CCCGGU"]

tokens = torch.tensor(alphabet.batch_tokenize(seqs), dtype=torch.int64, device=DEVICE)
with torch.no_grad(), torch.cuda.amp.autocast():
    outputs = model(tokens)

for rep in outputs["representation"]:
    print(rep.shape)
```

Output:

```
torch.Size([8, 1280])
```

If `seqs = ["ACUUUGGCCA"]`, the output is:

```
torch.Size([12, 1280])
```

If `seqs = ["ACUUUGGCCA", "CCCGGU"]`, the output is:

```
torch.Size([12, 1280])
torch.Size([12, 1280])
```

It seems that the output length is determined by the maximum sequence length in the input batch (every sequence begins with a [CLS] token and ends with an [EOS] token), and the excess positions are filled with padding according to your rules. Can I take it that each 1280-dimensional vector represents one base?
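
For example, would something like this be the right way to recover one vector per base? (I'm assuming here that position 0 is the [CLS] token and that everything after the bases is [EOS]/padding; please correct me if that's wrong.)

```python
# tentative sketch, assuming the layout [CLS] n1 ... nL [EOS] [PAD] ...
for seq, rep in zip(seqs, outputs["representation"]):
    per_base = rep[1 : 1 + len(seq)]  # skip [CLS], keep one vector per nucleotide
    print(seq, per_base.shape)        # e.g. ('ACUUUGGCCA', torch.Size([10, 1280]))
```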

But according to your paper, an RNA sequence is tokenized and turned into 1280-dimensional vectors using a learned input embedding model.

How should I understand this output, and how can I obtain fixed-size sequence representations for my downstream tasks, such as predicting interactions between RNAs?
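
For example, would mean-pooling the per-base vectors be a reasonable way to get one fixed-size 1280-dimensional embedding per sequence? (This is just my guess, not something I found in the paper.)

```python
import torch

# tentative sketch: average each sequence's nucleotide vectors
# (excluding [CLS]/[EOS]/padding) to get a fixed-size embedding
seq_embeddings = []
for seq, rep in zip(seqs, outputs["representation"]):
    per_base = rep[1 : 1 + len(seq)]            # nucleotide positions only
    seq_embeddings.append(per_base.mean(dim=0)) # (1280,)
seq_embeddings = torch.stack(seq_embeddings)    # (num_sequences, 1280)
```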