masakhane-io / masakhane-mt

Machine Translation for Africa
MIT License
Create new notebooks that do not rely on JW300 #200

Open cdleong opened 3 years ago

cdleong commented 3 years ago

Slack discussion: https://masakhane-nlp.slack.com/archives/C01JAP67HRV/p1634844082006400

https://github.com/joeynmt/joeynmt/blob/master/joey_demo.ipynb is the Tatoeba example.

cdleong commented 3 years ago

One suggestion from the Slack discussion is to split the new notebook code into two parts:

* One notebook that takes in a HuggingFace dataset at the top, and proceeds from there to train a JoeyNMT model. This might make things a lot easier on people. If they can get data into the HuggingFace Dataset format, we can show them how to train.

* One notebook that shows people how to get their data into that format: it loads data from various file types or sources (.csv, paired text files, or directly from the HuggingFace Hub) into the HuggingFace Dataset format: https://huggingface.co/docs/datasets/loading_datasets.html

See this slack discussion: https://masakhane-nlp.slack.com/archives/C01GF5XJ0TF/p1634863777007500?thread_ts=1634844471.007300&cid=C01GF5XJ0TF
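
To make the second notebook idea concrete, here is a minimal, standard-library-only sketch of reading a CSV file or a pair of line-aligned text files into one common record layout. The function names, the column names, and the nested `{"translation": {...}}` layout are assumptions chosen for illustration (the layout mirrors how many translation datasets on the Hub appear to be structured); none of this is taken from an existing Masakhane notebook.

```python
import csv
from pathlib import Path


def records_from_csv(path, src_col, trg_col, src_lang, trg_lang):
    """Read a CSV with one sentence pair per row into translation records."""
    records = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            src, trg = row[src_col].strip(), row[trg_col].strip()
            if src and trg:  # skip rows where either side is empty
                records.append({"translation": {src_lang: src, trg_lang: trg}})
    return records


def records_from_paired_files(src_path, trg_path, src_lang, trg_lang):
    """Read two line-aligned text files into translation records."""
    src_lines = Path(src_path).read_text(encoding="utf-8").splitlines()
    trg_lines = Path(trg_path).read_text(encoding="utf-8").splitlines()
    if len(src_lines) != len(trg_lines):
        raise ValueError("source and target files must have the same number of lines")
    return [
        {"translation": {src_lang: s.strip(), trg_lang: t.strip()}}
        for s, t in zip(src_lines, trg_lines)
        if s.strip() and t.strip()  # drop pairs with an empty side
    ]
```

In a real notebook the `datasets` library could probably replace most of this: `load_dataset("csv", data_files=...)` reads CSV files directly, and in-memory Python data can be wrapped in a `Dataset`. The sketch only shows the shape of the data that the training notebook would start from.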

cdleong commented 3 years ago

https://colab.research.google.com/drive/1RWOle7RHy_wq0uGWxmAq1ZfmEQIFsCHj#scrollTo=h1Ddy4_AOKdm could make for a starting point. I think this notebook shows how to download a HuggingFace dataset and write it out to files in the format JoeyNMT expects.
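
As a rough sketch of that "write it out to files" step (not the code from the linked notebook): the function below takes translation records and writes two line-aligned plain-text files that share a prefix and differ only by language suffix, which, as far as I know, is what JoeyNMT's plain-text data loading reads. The function name, the `{"translation": {...}}` record layout, and the file naming are assumptions for illustration.

```python
from pathlib import Path


def write_parallel_files(records, out_prefix, src_lang, trg_lang):
    """Write records to <out_prefix>.<src_lang> and <out_prefix>.<trg_lang>.

    Returns the number of sentence pairs written.
    """
    out_prefix = Path(out_prefix)
    out_prefix.parent.mkdir(parents=True, exist_ok=True)
    src_path = out_prefix.with_name(f"{out_prefix.name}.{src_lang}")
    trg_path = out_prefix.with_name(f"{out_prefix.name}.{trg_lang}")
    written = 0
    with open(src_path, "w", encoding="utf-8") as fs, \
            open(trg_path, "w", encoding="utf-8") as ft:
        for rec in records:
            pair = rec["translation"]
            # Collapse internal whitespace and newlines so each pair stays on one line.
            src = " ".join(pair[src_lang].split())
            trg = " ".join(pair[trg_lang].split())
            if src and trg:  # skip pairs with an empty side to keep files aligned
                fs.write(src + "\n")
                ft.write(trg + "\n")
                written += 1
    return written
```

Since iterating over a HuggingFace `Dataset` yields one dict per row, passing a dataset split with a `translation` column as `records` should work as well, though that is untested here. The resulting prefix (for example `data/train`) and the two language codes are what would go into the data section of a JoeyNMT config.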

pixelsandpointers commented 1 year ago

@cdleong if this is still relevant, I would like to work on it.

cdleong commented 1 year ago

I think it is still relevant, yes. I also just finished my semester, so I might have more free time after the holidays.

pixelsandpointers commented 1 year ago

Alright, so I have started with the notebook and will be done by the end of next week. I have to prepare for an exam next Wednesday, but I will be wrapping up the notebook.

/self-assign

smyja commented 1 year ago

> Alright, so I have started with the notebook and will be done by the end of next week. I have to prepare for an exam next Wednesday, but I will be wrapping up the notebook.
>
> /self-assign

Any update?