masakhane-io / masakhane-mt

Machine Translation for Africa
MIT License
278 stars 206 forks

Added the notebook to transfer data to a huggingface dataset object #201

Closed: sanchit-ahuja closed this 2 years ago

sanchit-ahuja commented 3 years ago

I have added the notebook to convert a dataset from the Tatoeba challenge to a huggingface dataset object. Please review!

review-notebook-app[bot] commented 3 years ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

cdleong commented 2 years ago

@sanchit-ahuja Thank you for the pull request, and sorry for the long delay in reviewing this! Before we merge this in, may I make a few requests?

There are a few changes I'd like to suggest that I think might improve the notebook and make it easier for people to use:

  1. Can we add a cell at the top with !pip install datasets to install the necessary libraries? People who are not experienced with Colab would not know what to do with the errors they get if they run the current version in Google Colab.
  2. Could we rework it to use the datasets.Translation format (see here)? This is the same format that other datasets on HuggingFace, such as Tatoeba, use (see https://huggingface.co/datasets/tatoeba).

Currently the format looks like this when you print it out: [image]. So, for example, dataset["train"][0] (the first item of the train split) is in eng, and dataset["train"][1] is in epo. On HuggingFace Tatoeba, by contrast, dataset["train"][0] contains both languages, see this: [image].
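To make the difference concrete, here is a minimal sketch of regrouping one-sentence-per-row data into one record per sentence pair. It assumes the rows strictly alternate between source and target, and the `lang` and `text` field names and the `pair_alternating_rows` helper are hypothetical, since the notebook's actual column names are not shown in this thread.

```python
from typing import Dict, List


def pair_alternating_rows(rows: List[Dict[str, str]]) -> List[Dict[str, Dict[str, str]]]:
    """Merge consecutive (source, target) rows into one translation record each."""
    if len(rows) % 2 != 0:
        raise ValueError("expected an even number of rows (source/target pairs)")
    paired = []
    for i in range(0, len(rows), 2):
        src, tgt = rows[i], rows[i + 1]
        # One record now carries both languages, keyed by language code.
        paired.append({"translation": {src["lang"]: src["text"], tgt["lang"]: tgt["text"]}})
    return paired


# Hypothetical rows: one sentence per row, alternating eng / epo.
rows = [
    {"lang": "eng", "text": "Hello."},
    {"lang": "epo", "text": "Saluton."},
    {"lang": "eng", "text": "Thank you."},
    {"lang": "epo", "text": "Dankon."},
]

paired = pair_alternating_rows(rows)
print(paired[0])  # {'translation': {'eng': 'Hello.', 'epo': 'Saluton.'}}
```

With this layout, the first item holds the eng sentence and its epo translation together, which is the shape requested in point 2.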

If we make those changes, I think it will make a good demo of how to load text files into the same format as the other datasets on HuggingFace! Then people will be able to freely use either text files or HuggingFace datasets, whichever is more convenient for them.
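As a rough illustration of that kind of demo (a sketch, not the notebook's actual code), the following reads two line-aligned text files and wraps the pairs in a `datasets.Translation` feature. The file names `train.eng` and `train.epo` and the helpers `read_parallel` and `to_dataset` are assumptions made for this example. It also assumes each file holds one sentence per line in matching order.

```python
import tempfile
from pathlib import Path


def read_parallel(src_path, tgt_path, src_lang, tgt_lang):
    """Read two line-aligned files into a list of {src_lang: ..., tgt_lang: ...} dicts."""
    src_lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    tgt_lines = Path(tgt_path).read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files must have the same number of lines")
    return [{src_lang: s, tgt_lang: t} for s, t in zip(src_lines, tgt_lines)]


def to_dataset(pairs, src_lang, tgt_lang):
    """Wrap the pairs in a HuggingFace Dataset with a Translation feature."""
    # Imported here so the file-reading part runs without the library;
    # this step needs `pip install datasets`.
    from datasets import Dataset, Features, Translation

    features = Features({"translation": Translation(languages=[src_lang, tgt_lang])})
    return Dataset.from_dict({"translation": pairs}, features=features)


# Demo with two tiny, made-up files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src_file = Path(tmp) / "train.eng"  # hypothetical file name
    tgt_file = Path(tmp) / "train.epo"  # hypothetical file name
    src_file.write_text("Hello.\nThank you.\n", encoding="utf-8")
    tgt_file.write_text("Saluton.\nDankon.\n", encoding="utf-8")
    pairs = read_parallel(src_file, tgt_file, "eng", "epo")

print(pairs[0])
```

With the library installed, `to_dataset(pairs, "eng", "epo")` should give a dataset whose items each hold both languages under the `translation` key.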

What do you think? I'm open to being persuaded otherwise!

sanchit-ahuja commented 2 years ago

  1. I will add that code snippet.
  2. Sure, that makes sense. Will make the required changes for it as well!

sanchit-ahuja commented 2 years ago

@cdleong I have added a custom script that loads the data into a custom HuggingFace dataset object. You can have a look at this for more info. I have also added comments throughout the script. Please take a look.

juliakreutzer commented 2 years ago

@cdleong please take a final look before merging :)

cdleong commented 2 years ago

Looks good to me!

cdleong commented 2 years ago

Thanks for the reminder!