microsoft / AzureML-BERT

End-to-End recipes for pre-training and fine-tuning BERT using Azure Machine Learning Service
https://azure.microsoft.com/en-us/blog/microsoft-makes-it-easier-to-build-popular-language-representation-model-bert-at-large-scale/
MIT License

Refactor and fix splitting to shards #56

Closed · dlazesz closed this pull request 4 years ago

dlazesz commented 4 years ago

There were multiple problems with the previous implementation:

  1. The document separator definition was not used at all
  2. Major bug: The last shard was not written (just buffered) when there was no empty line at the end of the input file.
  3. No encoding was specified when opening files
  4. There were a few unnecessary operations
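A minimal sketch (not the code in this PR) of a splitter that addresses the first three points; the function name `split_into_shards`, its parameters, and the choice of UTF-8 are illustrative assumptions:

```python
def split_into_shards(input_path, shard_prefix, docs_per_shard, doc_separator=''):
    """Split a corpus of separator-delimited documents into shards."""
    shard_idx = 0
    buffered_docs = []
    current_doc = []

    def flush_shard():
        nonlocal shard_idx, buffered_docs
        if not buffered_docs:
            return
        shard_path = '{0}_{1}.txt'.format(shard_prefix, shard_idx)
        # Explicit encoding so the result does not depend on the platform default (issue 3)
        with open(shard_path, 'w', encoding='utf-8') as out:
            # Separator only *between* documents, never at the start or end of a shard
            out.write(('\n' + doc_separator + '\n').join('\n'.join(doc) for doc in buffered_docs))
            out.write('\n')
        shard_idx += 1
        buffered_docs = []

    with open(input_path, 'r', encoding='utf-8') as inp:
        for line in inp:
            line = line.rstrip('\n')
            if line == doc_separator:  # actually use the separator definition (issue 1)
                if current_doc:
                    buffered_docs.append(current_doc)
                    current_doc = []
                if len(buffered_docs) >= docs_per_shard:
                    flush_shard()
            else:
                current_doc.append(line)

    # Flush whatever is still buffered, even if the file had no trailing separator (issue 2)
    if current_doc:
        buffered_docs.append(current_doc)
    flush_shard()
```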
msftclas commented 4 years ago

CLA assistant check
All CLA requirements met.

dlazesz commented 4 years ago

I have updated the patch to be liberal in what it accepts and strict in what it produces: it does not matter whether the input file ends with a doc_separator or not, and the output contains no doc_separator at either the beginning or the end.
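
A hypothetical check (building on the `split_into_shards` sketch above, not on the code in this PR) of the invariant described: with or without a trailing separator line in the input, the resulting shard is identical and never starts or ends with a separator:

```python
import glob
import os
import tempfile

def _write(path, text):
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

with tempfile.TemporaryDirectory() as tmp:
    with_sep = os.path.join(tmp, 'with_sep.txt')
    without_sep = os.path.join(tmp, 'without_sep.txt')
    _write(with_sep, 'doc1 line1\ndoc1 line2\n\ndoc2 line1\n\n')   # trailing separator line
    _write(without_sep, 'doc1 line1\ndoc1 line2\n\ndoc2 line1\n')  # no trailing separator

    split_into_shards(with_sep, os.path.join(tmp, 'a'), docs_per_shard=10)
    split_into_shards(without_sep, os.path.join(tmp, 'b'), docs_per_shard=10)

    shard_a = open(glob.glob(os.path.join(tmp, 'a_*.txt'))[0], encoding='utf-8').read()
    shard_b = open(glob.glob(os.path.join(tmp, 'b_*.txt'))[0], encoding='utf-8').read()

    assert shard_a == shard_b              # liberal in what it accepts
    assert not shard_a.startswith('\n')    # strict in what it produces: no leading separator
    assert not shard_a.endswith('\n\n')    # ... and no trailing separator
```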

@xiaoyongzhu @aashnamsft

Please review this pull request, if possible.

sassbalint commented 4 years ago

@xiaoyongzhu @aashnamsft

Please review this pull request and merge it, if possible. Thank you!

dlazesz commented 4 years ago

Updated PR according to the review.