microsoft / AzureML-BERT

End-to-End recipes for pre-training and fine-tuning BERT using Azure Machine Learning Service
https://azure.microsoft.com/en-us/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/
MIT License

Refactor and fix splitting to shards #56

Closed · dlazesz closed this pull request 4 years ago

dlazesz commented 4 years ago

There were multiple problems with the previous implementation:

  1. The document separator definition was not used at all
  2. Major bug: The last shard was not written (just buffered) when there was no empty line at the end of the input file.
  3. No encoding was specified when opening files
  4. There were a few unnecessary operations
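A minimal sketch (not the code in this PR) of a splitter that addresses the first three points; the function name `split_into_shards`, its parameters, and the choice of UTF-8 are illustrative assumptions:

```python
def split_into_shards(input_path, shard_prefix, docs_per_shard, doc_separator=''):
    """Split a corpus of separator-delimited documents into shards."""
    shard_idx = 0
    buffered_docs = []
    current_doc = []

    def flush_shard():
        nonlocal shard_idx, buffered_docs
        if not buffered_docs:
            return
        shard_path = '{0}_{1}.txt'.format(shard_prefix, shard_idx)
        # Explicit encoding so the result does not depend on the platform default (issue 3)
        with open(shard_path, 'w', encoding='utf-8') as out:
            # Separator only *between* documents, never at the start or end of a shard
            out.write(('\n' + doc_separator + '\n').join('\n'.join(doc) for doc in buffered_docs))
            out.write('\n')
        shard_idx += 1
        buffered_docs = []

    with open(input_path, 'r', encoding='utf-8') as inp:
        for line in inp:
            line = line.rstrip('\n')
            if line == doc_separator:  # actually use the separator definition (issue 1)
                if current_doc:
                    buffered_docs.append(current_doc)
                    current_doc = []
                if len(buffered_docs) >= docs_per_shard:
                    flush_shard()
            else:
                current_doc.append(line)

    # Flush whatever is still buffered, even if the file had no trailing separator (issue 2)
    if current_doc:
        buffered_docs.append(current_doc)
    flush_shard()
```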
msftclas commented 4 years ago

CLA assistant check
All CLA requirements met.

dlazesz commented 4 years ago

I have updated the patch to be liberal in what it accepts and strict in what it produces: it does not matter whether the input file ends with a doc_separator or not, and the output contains no doc_separator at either the beginning or the end.
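
A hypothetical check (building on the `split_into_shards` sketch above, not on the code in this PR) of the invariant described: with or without a trailing separator line in the input, the resulting shard is identical and never starts or ends with a separator:

```python
import glob
import os
import tempfile

def _write(path, text):
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

with tempfile.TemporaryDirectory() as tmp:
    with_sep = os.path.join(tmp, 'with_sep.txt')
    without_sep = os.path.join(tmp, 'without_sep.txt')
    _write(with_sep, 'doc1 line1\ndoc1 line2\n\ndoc2 line1\n\n')   # trailing separator line
    _write(without_sep, 'doc1 line1\ndoc1 line2\n\ndoc2 line1\n')  # no trailing separator

    split_into_shards(with_sep, os.path.join(tmp, 'a'), docs_per_shard=10)
    split_into_shards(without_sep, os.path.join(tmp, 'b'), docs_per_shard=10)

    shard_a = open(glob.glob(os.path.join(tmp, 'a_*.txt'))[0], encoding='utf-8').read()
    shard_b = open(glob.glob(os.path.join(tmp, 'b_*.txt'))[0], encoding='utf-8').read()

    assert shard_a == shard_b              # liberal in what it accepts
    assert not shard_a.startswith('\n')    # strict in what it produces: no leading separator
    assert not shard_a.endswith('\n\n')    # ... and no trailing separator
```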

@xiaoyongzhu @aashnamsft

Please review this pull request, if possible.

sassbalint commented 4 years ago

@xiaoyongzhu @aashnamsft

Please review this pull request and merge it, if possible. Thank you!

dlazesz commented 4 years ago

Updated PR according to the review.