oxpig / CaLM

Protein language model trained on coding DNA
BSD 3-Clause "New" or "Revised" License

AlphaBet #8

Open ZhiyuanChen opened 6 months ago

ZhiyuanChen commented 6 months ago

Hi,

Thank you for your work.

I noticed that the model is trained on cDNA data, while the tokeniser seems to use an RNA vocabulary (https://github.com/oxpig/CaLM/blob/main/calm/alphabet.py).

Can you please clarify the data preprocessing pipeline?

Cassie818 commented 2 months ago

Hi,

I think you can look at this script: https://github.com/oxpig/CaLM/blob/main/calm/sequence.py. It defines a class, CodonSequence, that replaces 'T' with 'U'.
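
To make the idea concrete, here is a minimal sketch of that kind of preprocessing. It is not the actual code from calm/sequence.py: the function name `to_rna_codons` is hypothetical, and the upper-casing and length check are my own assumptions about what a reasonable pipeline would do.

```python
from typing import List


def to_rna_codons(seq: str) -> List[str]:
    """Hypothetical helper, not part of CaLM.

    Upper-cases a coding DNA sequence, replaces 'T' with 'U' so it
    matches an RNA vocabulary, and splits it into 3-letter codons.
    """
    rna = seq.strip().upper().replace("T", "U")
    # Assumption: a coding sequence should contain whole codons only.
    if len(rna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]


print(to_rna_codons("ATGGCTTAA"))  # ['AUG', 'GCU', 'UAA']
```

So a cDNA input and an RNA codon vocabulary are compatible as long as the T to U substitution happens before tokenisation.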