e-maud opened 4 days ago
Here is a suggestion for how I would do it for the impresso LID notebook @flipz357
This notebook provides a hands-on demonstration of language identification (LID) using our Impresso LID model from Hugging Face. We will explore how to download and use this model to predict the language of Impresso-like text inputs. The notebook walks through the necessary steps to set up dependencies, load the model, and apply it to practical language identification tasks.
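A minimal sketch of that flow, assuming a fastText-style floret model hosted on the Hub. The `repo_id` and `filename` below are placeholders, not the real Impresso model coordinates; check the model card for the actual values.

```python
def clean_label(raw_label: str) -> str:
    """Strip the '__label__' prefix that floret/fastText put on predictions."""
    return raw_label.removeprefix("__label__")


def predict_language(model, text: str):
    """Return the top predicted language code and its probability.

    `model` is anything with a fastText/floret-style .predict(text, k)
    that returns (labels, probabilities)."""
    labels, probs = model.predict(text, k=1)
    return clean_label(labels[0]), float(probs[0])


def load_impresso_lid():
    """Download the model file and load it with floret (requires network).

    floret and huggingface_hub are imported lazily here so the pure helpers
    above also work without those packages installed."""
    import floret  # pip install floret
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    model_path = hf_hub_download(
        repo_id="impresso-project/impresso-LID",  # placeholder repo id
        filename="LID-model.bin",                 # placeholder filename
    )
    return floret.load_model(model_path)
```

Usage would then be `model = load_impresso_lid()` followed by `predict_language(model, "La Suisse est un pays multilingue.")`.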
By the end of this notebook, you will:

… (floret and huggingface_hub).

Yes, I had something similar in mind, but the "What you will learn" section is a great idea. Thanks, Simon!
Just for point 4: do you mean something specific here, or did you want to say "Have gained insight", sort of as a summary of points 1-3?
This was just redacted ChatGPT output. Yes, point 4 probably is just a summary, but it could also refer to a brief explanation of how fastText language identification works (https://fasttext.cc/docs/en/language-identification.html). We just use floret because its models are not as large as the unquantized fastText ones, and floret is actually maintained on PyPI.
@article{joulin2016fasttext,
title={FastText.zip: Compressing text classification models},
author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
journal={arXiv preprint arXiv:1612.03651},
year={2016}
}
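If we do add that explanation, the core idea behind fastText/floret LID can be illustrated without the model: a text is represented as a bag of character n-grams (with word-boundary markers), each n-gram is hashed into a fixed-size bucket table, and the classifier averages the bucket embeddings before a linear softmax. A toy sketch of the first, representation step (the bucket count and n-gram range are illustrative defaults, not the Impresso model's settings):

```python
def char_ngrams(word: str, n_min: int = 2, n_max: int = 4):
    """Character n-grams of a word, padded with boundary markers '<' and '>'.

    e.g. 'chat' -> '<c', 'ch', ..., 'at>', 'chat', ..."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i : i + n])
    return grams


def hashed_features(text: str, n_buckets: int = 2_000_000):
    """Map every n-gram of every word to a bucket index.

    These indices are what a fastText-style model looks up and averages
    before its linear classification layer."""
    return [hash(g) % n_buckets for w in text.split() for g in char_ngrams(w)]
```

Character n-grams are what make the approach robust to OCR noise and inflection, which matters for Impresso-like newspaper text.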
Feel free to adapt what we actually cover in the improved version :-) The dominant language is just a single-label prediction. You mentioned yesterday that looking at the distribution of labels could be helpful for truly multilingual articles (they do exist). A good example would be needed there to showcase it.
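One way such an example could look, as a sketch: predict a language per sentence with any floret-style `.predict` model and inspect the shares, flagging the article when a second language is frequent enough. Function names and the threshold are illustrative, not from the notebook.

```python
from collections import Counter


def language_distribution(model, sentences, k=1):
    """Predict one language per sentence and return each label's share.

    `model` is any object with a fastText/floret-style .predict(text, k)
    returning (labels, probabilities)."""
    counts = Counter()
    for sent in sentences:
        labels, _ = model.predict(sent, k)
        counts[labels[0].removeprefix("__label__")] += 1
    total = sum(counts.values())
    return {lang: c / total for lang, c in counts.items()}


def is_multilingual(distribution, threshold=0.2):
    """Flag an article whose second most frequent language exceeds `threshold`."""
    shares = sorted(distribution.values(), reverse=True)
    return len(shares) > 1 and shares[1] >= threshold
```

For a genuinely mixed French/German article this would return something like `{"fr": 0.75, "de": 0.25}` instead of collapsing everything to a single dominant label.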
Mainly an introductory section; points to include: