lightonai / RITA

RITA is a family of autoregressive protein models, developed by LightOn in collaboration with the OATML group at Oxford and the Debora Marks Lab at Harvard.
MIT License

What is the training input data format? #10

Closed · yzhang-github-pub closed 1 year ago

yzhang-github-pub commented 1 year ago

Dear Author,

I am fine-tuning your pretrained RITA on protein family data, using the run_clm.py script from Hugging Face. I tried this format, where seq1 and seq2 are protein sequences 1 and 2 without whitespace:

seq1 <|endoftext|> seq2 <|endoftext|> ...

and also this format: seq1 seq2 ...

Training seemed to be successful. However, sequences generated by the fine-tuned model contain a lot of '' tokens.

Please advise. Thanks.
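(For concreteness, the first format above could be produced with a small script like the one below. This is only a sketch: the sequences and the output file name are made-up placeholders, not from the actual dataset.)

sequences = ["MKTAYIAKQRQISFVK", "MSILVTRPSPAGEELV"]  # placeholder protein sequences

with open("train.txt", "w") as f:  # hypothetical file passed to run_clm.py via --train_file
    for seq in sequences:
        f.write(seq + "<|endoftext|>")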

DanielHesslow commented 1 year ago

I'm not particularly familiar with the Hugging Face code base, and I do not currently have the time to read up on the specifics.

The format used during training is:

seq1
<EOS>
seq2
<EOS>
...

The issue does, however, seem unrelated to the input format; the version without <EOS> should also work fine. It outputting '' is strange, since this token is not part of our vocabulary. I'd make sure that you train using our tokenizer:

tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")
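As a quick sanity check (a minimal sketch, with a made-up example sequence), encoding and decoding through that tokenizer should round-trip without introducing tokens outside its vocabulary:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")

example = "MKTAYIAKQR"                 # made-up protein sequence
ids = tokenizer(example)["input_ids"]
print(ids)                             # integer ids from the RITA vocabulary
print(tokenizer.decode(ids))           # should reproduce the sequence (up to special tokens)
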
pzhang84 commented 1 year ago

Hi @DanielHesslow, could you please advise which tokenizer you used for training RITA, and what the vocabulary size is? Thanks!

DanielHesslow commented 1 year ago

tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s") is indeed the correct tokenizer. The vocab size is 26.
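
A quick way to confirm that number locally (a small sketch, assuming the standard transformers tokenizer API):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")
print(tokenizer.vocab_size)   # base vocabulary size
print(len(tokenizer))         # total size including any added special tokens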