impresso / impresso-datalab-notebooks

Collection of notebooks for NER tasks
GNU Affero General Public License v3.0

Generic basis for notebook #15

Open e-maud opened 4 days ago

e-maud commented 4 days ago

Mainly an introductory section; points to include:

simon-clematide commented 4 days ago

Here is a suggestion for how I would do it for the impresso LID notebook @flipz357

What is this notebook about?

This notebook provides a hands-on demonstration of language identification (LID) using our Impresso LID model from Hugging Face. We will explore how to download and utilize this model to predict the language of Impresso-like text inputs. This notebook walks through the necessary steps to set up dependencies, load the model, and implement it for practical language identification tasks.

What will you learn in this notebook?

By the end of this notebook, you will:

  1. Understand how to install and configure the required libraries (floret and huggingface_hub).
  2. Learn to load our trained Floret language identification model from Hugging Face.
  3. Run the model to predict the dominant language (or the mix of languages) of a given text input.
  4. Gain insight into the core functionality of language identification using machine learning models.
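To make the flow of steps 1–3 concrete, here is a minimal sketch of downloading the model with `huggingface_hub` and running it with `floret`. The repo id and filename are placeholders, not the real coordinates — check the Impresso LID model card on Hugging Face for the actual values:

```python
def strip_label(label: str) -> str:
    """floret/fastText prefix labels with '__label__'; return the bare code."""
    return label[len("__label__"):] if label.startswith("__label__") else label

def identify_language(text: str, k: int = 3):
    """Predict the top-k languages for `text` with the Impresso floret model.

    repo_id and filename below are placeholders; see the model card for
    the real values. Imports are deferred so the helper above stays usable
    without floret / huggingface_hub installed.
    """
    from huggingface_hub import hf_hub_download
    import floret

    model_path = hf_hub_download(
        repo_id="impresso-project/impresso-lid",  # placeholder repo id
        filename="model.bin",                     # placeholder filename
    )
    model = floret.load_model(model_path)
    labels, probs = model.predict(text, k=k)
    return [(strip_label(label), float(p)) for label, p in zip(labels, probs)]
```

The `predict(text, k=k)` call mirrors the fastText Python API, which floret's bindings follow.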
flipz357 commented 4 days ago

Yes, I had something similar in mind, but the "What you will learn" is a great idea. Thanks, Simon!

Just for point 4., do you mean something specific here? Or did you want to say "Have gained insight", sort of as a summary of 1-3.

simon-clematide commented 4 days ago

This was just lightly edited ChatGPT output. Yes, 4 is probably just a summary, but it could also refer to a brief explanation of how fastText language identification works (https://fasttext.cc/docs/en/language-identification.html). We just use floret because the unquantized models are not as large, and floret is actually maintained on PyPI.
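The two ingredients behind that fastText/floret approach — character n-grams with boundary markers, and hashing n-grams into a fixed-size table to keep the model small — can be sketched in a few lines (function names here are illustrative, not the library's API):

```python
import zlib

def char_ngrams(word: str, n: int):
    """Character n-grams with '<'/'>' boundary markers, as in fastText."""
    padded = f"<{word}>"
    return [padded[i : i + n] for i in range(len(padded) - n + 1)]

def bucket(ngram: str, num_buckets: int = 2 ** 10) -> int:
    """Hash an n-gram into a fixed-size table row, floret-style.

    CRC32 stands in for floret's actual hash function; the point is that
    many n-grams share rows, which is what keeps the table compact."""
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets

# e.g. char_ngrams("the", 3) -> ['<th', 'the', 'he>']
```

The classifier then averages the embeddings of these (hashed) n-grams and feeds them to a linear layer over the language labels.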

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

Feel free to adapt what we actually cover in the improved version :-) The dominant language is just a single-label prediction. You mentioned yesterday that looking at the distribution of labels could be helpful for truly multilingual articles (they do exist). A nice example would be needed there to showcase it.
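One way to show the multilingual case in the notebook: predict per sentence and average the label probabilities into an article-level mix. A sketch, where `predict` is a stand-in callable (in the notebook it would wrap the floret model's output), not part of any library:

```python
from collections import defaultdict

def language_mix(sentences, predict):
    """Average per-sentence label probabilities into an article-level mix.

    `predict` is any callable mapping a sentence to a {lang: prob} dict."""
    totals = defaultdict(float)
    for sentence in sentences:
        for lang, prob in predict(sentence).items():
            totals[lang] += prob
    return {lang: total / len(sentences) for lang, total in totals.items()}

# Toy stand-in predictor: one German and one French sentence.
toy = {
    "Guten Tag.": {"de": 0.9, "fr": 0.1},
    "Bonjour tout le monde.": {"fr": 0.95, "de": 0.05},
}
mix = language_mix(list(toy), toy.get)
# mix is roughly {"de": 0.475, "fr": 0.525} -- a genuine mix, not one label
```

An article where the mix is close to uniform over two languages would be a good showcase example.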