wjy979769265 opened 5 years ago
Same here. I have ~40k+ text files (~500 MB). What would be the best approach for data preprocessing? Should I add <|startoftext|> and <|endoftext|> tokens at the beginning and end of each file? I use the .npz encoder, so what would be the best value for the combine parameter? And how do I decrease the loss?
I added <|startoftext|> and <|endoftext|> tokens to all of the files. It seems this improved the accuracy (i.e., decreased the loss) a little. I didn't change the combine parameter because I thought the <|startoftext|> and <|endoftext|> tokens did the job.
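For reference, a minimal sketch of what that preprocessing step can look like, assuming a flat directory of plain-text files and the delimiter tokens above; the directory and output names here are hypothetical, not part of any repo script:

```python
import glob
import os

START = "<|startoftext|>"
END = "<|endoftext|>"

# Hypothetical input/output locations; adjust to your own layout.
input_dir = "corpus"
output_path = "corpus_tagged.txt"

with open(output_path, "w", encoding="utf-8") as out:
    for path in sorted(glob.glob(os.path.join(input_dir, "*.txt"))):
        with open(path, encoding="utf-8") as f:
            text = f.read().strip()
        # Wrap each document so the model sees where a sample begins and ends.
        out.write(f"{START}\n{text}\n{END}\n")
```

The combined file can then be passed to the encoder as usual.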
My dataset has no end marker that would let generation stop after each sentence. I added a start_token and end_token to each row, but after fine-tuning the model generated fairly unreadable text... Is that because the BPE encoding (or the model itself) treats those tokens as normal text during training? To get the same effect I then added a "." to each row instead, which gave more readable text, but it's still not perfect, and I don't know how to decrease the loss... Anyone who has an idea, please help me. Thank you very much.
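If the goal is just to stop generation at the end of each sample, one common workaround is to append <|endoftext|> to every row before encoding and then cut the generated text at the first occurrence of that token. A minimal sketch, assuming the rows live in a hypothetical rows.txt and that the generated string is already available as a Python variable:

```python
END = "<|endoftext|>"

# 1) Mark the end of every training row so the model sees a consistent stop signal.
with open("rows.txt", encoding="utf-8") as f, \
     open("rows_tagged.txt", "w", encoding="utf-8") as out:
    for row in f:
        row = row.strip()
        if row:
            out.write(f"{row} {END}\n")

# 2) After fine-tuning, truncate each generated sample at the first stop token.
def truncate_at_end_token(generated: str) -> str:
    return generated.split(END, 1)[0].strip()
```

This only controls where you cut the output; whether the model reliably emits the token still depends on how consistently it appears in the training data.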
Overall, it comes down to three questions: