oxpig / CaLM

Protein language model trained on coding DNA
BSD 3-Clause "New" or "Revised" License

The issue of data usage #7

Open yyly6 opened 6 months ago

yyly6 commented 6 months ago

Hello! I noticed that in the data you provided, some sequences do not begin with "ATG", for example, 'TTGAAAAGAAAAGCCAGTATCATGTTTGTCCATCAAGACAAGTACGAAGAATACAAACAGCGGCATGATGACATTTGGCCTGAGATGGCAGAAGCACTCAAAGCTCATGGAGCACACCATTATTCCATTTTTCTAGACGAGGAAACAGGCAGGCTTTTTGCATATTTAGAAATAGAGGATGAAGAGAAATGGAGAAAGATGGCGGACACGGAAGTTTGCCAAAGATGGTGGAAATCGATGGCGCCATTAATGAAAACAAATTCGGATTTCAGTCCTGTTGCGATAGATCTAAAGGAAGTTTTTTATTTGGATTGA'. When tokenizing, should I discard the part before ATG and start from ATG, or should I just use the entire sequence as it is? Similarly, when translating it into an amino acid sequence, should I translate the entire sequence directly or start translating from ATG?
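To make the two options in the question concrete, here is a minimal sketch in plain Python. It does not use CaLM's tokenizer or API, and it does not say which option the CaLM authors intend; that is for the maintainers to confirm. The helper names (`split_codons`, `translate`, `trim_to_first_atg`) and the short toy sequence are hypothetical, chosen for illustration. One relevant fact: TTG is a recognized alternative start codon in the bacterial genetic code (NCBI translation table 11), so a coding sequence that begins with TTG is not necessarily truncated. In the sequence quoted above, the first ATG is the eighth codon and is in frame with the leading TTG.

```python
# Standard genetic code, built in the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(
        ((x, y, z) for x in BASES for y in BASES for z in BASES), AMINO
    )
}


def split_codons(seq: str) -> list[str]:
    """Split a coding sequence into codons, reading from position 0."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length is not a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def translate(seq: str) -> str:
    """Translate every codon with the standard table ('*' marks a stop)."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(seq))


def trim_to_first_atg(seq: str) -> str:
    """Drop everything before the first in-frame ATG (option 1 in the question)."""
    codons = split_codons(seq)
    for i, codon in enumerate(codons):
        if codon == "ATG":
            return "".join(codons[i:])
    raise ValueError("no in-frame ATG found")


# Hypothetical toy sequence: TTG start, an internal in-frame ATG, TGA stop.
toy = "TTGAAAATGGCATGA"

full_protein = translate(toy)                        # option 2: whole sequence
trimmed_protein = translate(trim_to_first_atg(toy))  # option 1: start at ATG
print(full_protein, trimmed_protein)
```

Note that translating the whole sequence with the standard table renders the leading TTG as leucine, whereas an initiating TTG is read as methionine in vivo. Trimming to the first ATG instead discards the codons in between, so the two options give different proteins and different token counts.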