This PR adds a script that preprocesses the Arabic text of 'The Kuwaiti Encyclopaedia of Islamic Jurisprudence' by removing diacritics, so that the text works better with Vectara models. The goal is to improve semantic search over the encyclopaedia.
Following Waleed's feedback, it also adds clear instructions to the README.md on how to reproduce the data. This keeps the dataset maintainable and makes it possible to reconstruct it if necessary.
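
For reviewers unfamiliar with the preprocessing step, here is a minimal sketch of what diacritic removal could look like. It is illustrative only and not necessarily how the script in this PR does it: the function name `strip_diacritics` is hypothetical, and the choice of characters to strip (the tashkeel marks U+064B to U+0652 plus the superscript alef U+0670) is an assumption.

```python
import re

# Assumed set of marks to remove: fathatan through sukun (U+064B-U+0652)
# and the superscript alef (U+0670). Base letters are left untouched.
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return `text` with the Arabic diacritic marks above removed."""
    return ARABIC_DIACRITICS.sub("", text)


if __name__ == "__main__":
    # "al-fiqhu" written with full vowel marks.
    vocalized = "\u0627\u0644\u0652\u0641\u0650\u0642\u0652\u0647\u064F"
    print(strip_diacritics(vocalized))
```

An explicit character range is used here rather than stripping every Unicode combining mark, so that combining hamza and maddah signs are not removed by accident.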