kingoflolz / mesh-transformer-jax

Model parallel transformers in JAX and Haiku
Apache License 2.0
6.29k stars, 892 forks

Fine Tuning Dataset Format #193

Open teamnetsol opened 2 years ago

teamnetsol commented 2 years ago

Hi there, I am trying to fine-tune the GPT-J-6B model on a custom dataset, and I am trying to figure out the correct format for that dataset. So far, when generating tfrecords, I have tried the following formats:

MY_CATEGORY: ASSOCIATED_TEXT
MY_CATEGORY: ASSOCIATED_TEXT#####
MY_CATEGORY: ASSOCIATED_TEXT<|endoftext|>
<|endoftext|>MY_CATEGORY: ASSOCIATED_TEXT<|endoftext|>
"<|endoftext|>MY_CATEGORY: ASSOCIATED_TEXT<|endoftext|>"

I applied these formats to the whole dataset. The outputs of the resulting model suggested that it did not recognize either "<|endoftext|>" or "#####" as a separator token.
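To make the question concrete, here is a minimal sketch of the layout I am assuming the training data should end up in: each document is tokenized on its own, and the separator is appended as a token id rather than typed into the text. This is an assumption, not the repo's code. `pack_documents`, `chunk`, and `toy_tokenize` are hypothetical names, and the one-id-per-character tokenizer is only a stand-in for a real BPE tokenizer. The one fixed fact is that the GPT-2 vocabulary, which GPT-J reuses, assigns id 50256 to `<|endoftext|>`.

```python
from typing import Callable, Iterable, List

# Id of <|endoftext|> in the GPT-2 vocabulary (also used by GPT-J).
EOT_ID = 50256


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in for a real BPE tokenizer: one id per character."""
    return [ord(ch) for ch in text]


def pack_documents(
    docs: Iterable[str],
    tokenize: Callable[[str], List[int]],
    eot_id: int = EOT_ID,
) -> List[int]:
    """Tokenize each document separately and append the separator id after it."""
    packed: List[int] = []
    for doc in docs:
        packed.extend(tokenize(doc))
        packed.append(eot_id)  # separator added as an id, not as literal text
    return packed


def chunk(ids: List[int], size: int) -> List[List[int]]:
    """Split a token stream into fixed-size sequences, dropping the short tail."""
    return [ids[i:i + size] for i in range(0, len(ids) - size + 1, size)]


docs = ["MY_CATEGORY: first text", "MY_CATEGORY: second text"]
packed = pack_documents(docs, toy_tokenize)
sequences = chunk(packed, 8)
```

If this picture is right, a custom marker such as "#####" would just be tokenized as ordinary text, which might explain why the model does not treat it as a separator, but I have not confirmed that.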

Any information on this would be helpful. Thank you!

jingrongchen commented 2 years ago

Hi, I am also trying to figure out the correct format. Did you get it working? I want to fine-tune the model to build a chatbot, but I don't know what format my data should be in.