waleedkadous / ansari-backend

Ansari is a helper for you to become a better Muslim

Add script for Arabic diacritic stripping, and a README #28

Closed abdullah-alnahas closed 3 months ago

abdullah-alnahas commented 3 months ago

This PR introduces a script to preprocess Arabic text from 'The Kuwaiti Encyclopaedia of Islamic Jurisprudence', removing diacritics to optimize its use with Vectara models. This change aims to enhance semantic search capabilities.
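For readers unfamiliar with the technique, diacritic stripping can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the function name `strip_diacritics` and the exact set of code points removed (the tanwin, short-vowel, shadda and sukun marks U+064B to U+0652, plus the superscript alef U+0670) are assumptions, and the actual script may handle a different set.

```python
import re

# Assumed set of marks to drop: tanwin, fatha/damma/kasra, shadda, sukun
# (U+064B..U+0652) and the superscript alef (U+0670). Base letters,
# including hamza-carrying forms such as U+0623, are left untouched.
_ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return `text` with the Arabic diacritics listed above removed."""
    return _ARABIC_DIACRITICS.sub("", text)


if __name__ == "__main__":
    # Vocalized input; the printed result keeps only the base letters.
    print(strip_diacritics("بِسْمِ اللَّهِ"))
```

The sketch deliberately avoids Unicode decomposition (for example NFD followed by removing all combining marks), because that would also split precomposed letters such as alef with hamza and strip the hamza, which changes the spelling rather than just the vocalization.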

Following Waleed's feedback, it also includes clear instructions in the README.md on how to reproduce the data. This keeps the dataset maintainable and makes it possible to reconstruct it if necessary.

waleedkadous commented 3 months ago

Great work! Lgtm.