Closed: yzhang-github-pub closed this issue 1 year ago
I'm not particularly familiar with the huggingface code base, and I do not currently have the time to read up on the specifics.
The format used during training is:
seq1
<EOS>
seq2
<EOS>
...
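The layout above can be reproduced with a few lines of Python; the sequences and the separator handling below are a hypothetical sketch, not taken from the RITA training pipeline:

```python
# Build a training file in the format described above:
# one sequence per line, with an <EOS> line after each.
sequences = ["MKTAYIAKQR", "MVLSPADKTN"]  # hypothetical example sequences

training_text = "\n<EOS>\n".join(sequences) + "\n<EOS>"
print(training_text)
```

Writing `training_text` to a plain-text file then gives one sequence per line, each followed by an `<EOS>` line.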
The issue does, however, seem unrelated to the input format; the version without <EOS> should also work fine. It outputting "<|endoftext|>" is strange, since this token is not part of our vocabulary. I'd make sure that you train using our tokenizer:
tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")
Hi @DanielHesslow, could you please advise which tokenizer you used for training RITA, and what the vocabulary size is? Thanks!
tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s")
Is indeed the correct tokenizer. The vocab size is 26.
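With a 26-token, character-level vocabulary, a multi-character marker like <|endoftext|> cannot map to a single id, which is consistent with the maintainer's point that it is not part of the vocabulary. A toy character-level sketch (the 20-letter amino-acid alphabet below is an illustrative assumption, not RITA's actual token list) shows per-residue tokenization:

```python
# Toy character-level tokenizer: one id per amino-acid letter.
# The exact symbol set is an assumption for illustration only;
# RITA's real 26-token vocabulary also includes special tokens.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
vocab = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq):
    # Each residue becomes exactly one token id.
    return [vocab[aa] for aa in seq]

ids = encode("MKTA")
print(ids)  # → [10, 8, 16, 0]
```

A string such as "<|endoftext|>" contains characters with no entry in such a vocabulary, so a character-level protein tokenizer cannot represent it as one token.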
Dear Author,
I am fine-tuning your pretrained RITA on protein family data, using the run_clm.py script from Hugging Face. I tried this format, where seq1 and seq2 are protein sequences 1 and 2 without white space:
seq1 <|endoftext|> seq2 <|endoftext|> ...
and also this format: seq1 seq2 ...
Training seemed to be successful. However, sequences generated by the fine-tuned model contain a lot of '<|endoftext|>' tokens.
Please advise. Thanks.
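As a practical workaround while the tokenizer mismatch is sorted out, the stray separator strings can be stripped from generated output. A minimal sketch, assuming "<|endoftext|>" is the literal marker appearing in the generations (the sample text is hypothetical):

```python
# Strip literal separator markers from generated sequences.
# "<|endoftext|>" is assumed to be the stray token seen in the output.
def clean(generated: str) -> str:
    return generated.replace("<|endoftext|>", "").strip()

sample = "MKTA<|endoftext|>YIAK<|endoftext|>"
print(clean(sample))  # → MKTAYIAK
```

This only hides the symptom; the underlying fix is to tokenize the training data with the RITA tokenizer, as suggested above.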