mosaicml / examples

Fast and flexible reference benchmarks
Apache License 2.0

Train BERT on own data #381

Closed: mscherrmann closed this issue 1 year ago

mscherrmann commented 1 year ago

Hi,

is there any documentation on how to proceed if one would like to train the mosaic-bert model on one's own data? If I understand correctly, the only thing you mention on this point is:

Alternatively, feel free to substitute our dataloader with one of your own in the script main.py.

Can you be more precise?

Thank you very much in advance

dakinggg commented 1 year ago

Hi, we don't have a specific example for training BERT on your own data, but all that is required is for you to replace our dataloader with your own normal PyTorch dataloader. See https://github.com/mosaicml/examples/blob/efbcadf774b339ed0ffaa6e655d75d64c12e2564/examples/benchmarks/bert/main.py#L132-L137 for where we construct the dataloader, which ends up creating a StreamingTextDataset. You are certainly not required to use our streaming library if you would prefer to just provide your own torch dataset/dataloader, but you can also check out https://github.com/mosaicml/examples/blob/main/examples/benchmarks/bert/src/convert_dataset.py for how we construct a streaming dataset out of C4.
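For anyone wanting a starting point, here is a minimal sketch of what "your own normal PyTorch dataloader" could look like for masked language modeling. Everything in it is an assumption for illustration rather than code from this repo: `TextLineDataset`, `mlm_collate`, the toy whitespace vocabulary, and the special token IDs are hypothetical, and it assumes the model consumes batches shaped like Hugging Face inputs (`input_ids`, `attention_mask`, `labels`). A real run would use the tokenizer that matches the model, and the masking here is simplified (every selected token becomes the mask token).

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

# Hypothetical special-token IDs for the toy vocabulary below.
PAD_ID, MASK_ID, UNK_ID = 0, 1, 2


class TextLineDataset(Dataset):
    """Map-style dataset over an in-memory list of texts (illustrative helper)."""

    def __init__(self, texts, vocab, max_seq_len=16):
        self.texts = texts
        self.vocab = vocab
        self.max_seq_len = max_seq_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        # Toy whitespace tokenization; swap in the model's real tokenizer.
        tokens = self.texts[idx].lower().split()[: self.max_seq_len]
        ids = [self.vocab.get(tok, UNK_ID) for tok in tokens]
        return torch.tensor(ids, dtype=torch.long)


def mlm_collate(samples, mlm_probability=0.15):
    """Pad variable-length samples and apply simplified MLM masking."""
    input_ids = pad_sequence(samples, batch_first=True, padding_value=PAD_ID)
    attention_mask = (input_ids != PAD_ID).long()
    labels = input_ids.clone()
    # Pick positions to mask among real (non-padding) tokens only.
    probs = torch.full(input_ids.shape, mlm_probability)
    masked = torch.bernoulli(probs).bool() & attention_mask.bool()
    labels[~masked] = -100  # default ignore_index of torch.nn.CrossEntropyLoss
    input_ids = input_ids.masked_fill(masked, MASK_ID)
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


texts = ["the cat sat on the mat", "dogs bark", "a third short example sentence"]
words = sorted({tok for text in texts for tok in text.lower().split()})
vocab = {tok: i + 3 for i, tok in enumerate(words)}  # IDs 0-2 are reserved above

train_loader = DataLoader(
    TextLineDataset(texts, vocab), batch_size=2, shuffle=True, collate_fn=mlm_collate
)
batch = next(iter(train_loader))
```

A loader like `train_loader` would then stand in for the one constructed at the linked lines of main.py; whether the rest of the script needs further changes (for example the eval loader or sequence-length settings) depends on your config.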

jacobfulano commented 1 year ago

Depending on your use case, you might consider simply converting your data to the streaming dataset format.

mscherrmann commented 1 year ago

Thank you, I'll just convert the data to the streaming dataset format.