Open ZhiyuanChen opened 6 months ago
Hi,
Thank you for your work.
I noticed the model is trained on cDNA data, while the tokeniser seems to use an RNA vocab (https://github.com/oxpig/CaLM/blob/main/calm/alphabet.py).
Can you please clarify the data preprocessing pipeline?
I think you can look at this script: https://github.com/oxpig/CaLM/blob/main/calm/sequence.py. It defines a class, CodonSequence, that replaces 'T' with 'U'.
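To make the idea concrete, here is a rough sketch of what that kind of preprocessing could look like. This is not the actual CaLM code: the function name `cdna_to_rna_codons` is made up for illustration, and splitting into 3-letter codons and rejecting lengths that are not a multiple of 3 are my assumptions based on the class name, so please check sequence.py for the real behaviour.

```python
from typing import List


def cdna_to_rna_codons(seq: str) -> List[str]:
    """Hypothetical helper (not part of CaLM): map a cDNA string to RNA codon tokens."""
    # Normalise case, then swap thymine for uracil so the sequence
    # matches an RNA alphabet (A, C, G, U).
    rna = seq.upper().replace("T", "U")

    # Assumption: the input is a coding sequence, so its length is a multiple of 3.
    if len(rna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")

    # Split into non-overlapping 3-letter codons.
    return [rna[i:i + 3] for i in range(0, len(rna), 3)]


print(cdna_to_rna_codons("ATGGCTTAA"))  # ['AUG', 'GCU', 'UAA']
```

If the real pipeline works this way, cDNA input and an RNA vocab are compatible, since the 'T' to 'U' swap happens before tokenisation.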