sanchit-ahuja closed this pull request 2 years ago
@sanchit-ahuja Thank you for the pull request, and sorry for the long delay in reviewing this! Before we merge this in, may I make a few requests?
There are a few changes I'd like to suggest that I think might improve the notebook and make it easier for people to use:
Could you add a cell with
!pip install datasets
to install the necessary libraries? Anyone who runs the current version in Google Colab without much Colab experience would not know what to do with the errors.
Currently the format looks like this when you print it out:
So for example dataset["train"][0] (the first item of the train split) is in eng, and dataset["train"][1] is in epo. Whereas on HuggingFace Tatoeba, dataset["train"][0] would contain both languages, see this:
If we make those changes I think it will make a good demo of how to load text files into the same format as the other datasets on HuggingFace! Then people will be able to freely use either text files or HuggingFace datasets as is convenient for them.
What do you think? I'm open to being persuaded otherwise!
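To make the suggested format change concrete, here is a minimal sketch of pairing per-language lines into one record per sentence pair. It assumes the source and target sentences are available as two aligned lists, and that the target layout is an id plus a translation dict keyed by language code, which is my understanding of how translation datasets on HuggingFace are laid out. The function name pair_sentences and the sample sentences are hypothetical, for illustration only, and are not taken from the notebook.

```python
from typing import Dict, List


def pair_sentences(
    src_lines: List[str],
    tgt_lines: List[str],
    src_lang: str = "eng",
    tgt_lang: str = "epo",
) -> List[Dict]:
    """Combine two aligned lists of sentences into one record per pair.

    Assumption: line i of src_lines is the translation of line i of tgt_lines.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target must have the same number of lines")
    return [
        {
            "id": str(i),
            # Both languages live in the same record, keyed by language code.
            "translation": {src_lang: src.strip(), tgt_lang: tgt.strip()},
        }
        for i, (src, tgt) in enumerate(zip(src_lines, tgt_lines))
    ]


# Illustrative sample data (not from the Tatoeba challenge files).
records = pair_sentences(["Hello.\n", "Thank you.\n"], ["Saluton.\n", "Dankon.\n"])
print(records[0])
# {'id': '0', 'translation': {'eng': 'Hello.', 'epo': 'Saluton.'}}
```

With records shaped like this, records[0] carries both languages at once, which is the behaviour described above for HuggingFace Tatoeba. A list like this could then be turned into a dataset object, for example by regrouping it into columns for datasets.Dataset.from_dict.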
@cdleong I have added a custom script to load the data into a custom HuggingFace object. You can have a look at this for more info. Added various comments in the script as well. Please have a look.
@cdleong please take a final look before merging :)
Looks good to me!
Thanks for the reminder!
I have added the notebook to convert a dataset from the Tatoeba challenge to a HuggingFace dataset object. Please review!