mscherrmann closed this 1 year ago
Hi, we don't have a specific example for training BERT on your own data, but all you need to do is replace our dataloader with your own standard PyTorch dataloader. See https://github.com/mosaicml/examples/blob/efbcadf774b339ed0ffaa6e655d75d64c12e2564/examples/benchmarks/bert/main.py#L132-L137 for where we construct the dataloader, which ends up creating a StreamingTextDataset. You are certainly not required to use our streaming library if you would prefer to just provide your own torch dataset/dataloader, but you can also check out https://github.com/mosaicml/examples/blob/main/examples/benchmarks/bert/src/convert_dataset.py for how we construct a streaming dataset out of C4.
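To make the "bring your own dataloader" option concrete, here is a minimal sketch of a plain PyTorch dataset and dataloader. It is an illustration only, not code from the repo: the names `ToyMLMDataset` and `build_my_dataloader`, the token ids, and the batch keys (`input_ids`, `attention_mask`) are all assumptions, so check what the model in `main.py` actually expects. Masking for the MLM objective is not shown here.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyMLMDataset(Dataset):
    """Hypothetical dataset wrapping already-tokenized sequences."""

    def __init__(self, token_ids, max_len=8, pad_id=0):
        self.token_ids = token_ids  # list of lists of ints
        self.max_len = max_len
        self.pad_id = pad_id        # assumed padding id

    def __len__(self):
        return len(self.token_ids)

    def __getitem__(self, idx):
        # Truncate to max_len, then right-pad to a fixed length.
        ids = self.token_ids[idx][: self.max_len]
        pad = self.max_len - len(ids)
        return {
            "input_ids": torch.tensor(ids + [self.pad_id] * pad, dtype=torch.long),
            "attention_mask": torch.tensor([1] * len(ids) + [0] * pad, dtype=torch.long),
        }


def build_my_dataloader(token_ids, batch_size=2):
    # The default collate function stacks the dict entries into batched tensors.
    return DataLoader(ToyMLMDataset(token_ids), batch_size=batch_size, shuffle=False)


# Placeholder token ids; in practice these come from your own tokenizer.
sequences = [[101, 7592, 102], [101, 2023, 2003, 1037, 3231, 102], [101, 102]]
loader = build_my_dataloader(sequences)
batch = next(iter(loader))
print(batch["input_ids"].shape)
```

A loader like this would be passed in place of the one built at the linked lines in `main.py`; the rest of the training script should not need to know where the batches come from.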
Depending on your use case, you might consider simply converting your data to the streaming dataset format.
Thank you, I'll just convert the data to the streaming dataset format.
Hi,
is there documentation on how to proceed if one would like to train the mosaic-bert model on one's own data? If I understand correctly, the only thing you mention to that end is:
Can you be more precise?
Thank you very much in advance.