kaisdukes / quranic-corpus

The Quranic Arabic Corpus, an invaluable linguistic resource, is due for a revamp. We're calling on Linguistics, AI, and Tech volunteers to join us in this exciting journey. 🚀
https://qurancorpus.app
GNU General Public License v3.0

Treebank: use surahapp to complete #64

Closed mustafa0x closed 1 year ago

mustafa0x commented 1 year ago

Al-i’rāb al-mufassal is good, but surahapp is word by word and quite thorough, so I assume it will be a lot easier to parse. They also have complete sarf.

Sarf

https://web.surahapp.com/ar/quran?surah=2&view=reading&page=3&word=9&aya=7&use-quran-app=true&tab=6&filter=articulation&content-info-model=false&stats-type=cols&aya-counting-key=adad_ayat-sowar_fn&keep-word-change=true&d-aya=true


iraab

https://web.surahapp.com/ar/quran?surah=2&view=reading&page=3&word=9&aya=7&use-quran-app=true&tab=6&filter=tasreef&content-info-model=false&stats-type=cols&aya-counting-key=adad_ayat-sowar_fn&keep-word-change=true&d-aya=true
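For anyone scripting against these links, the two URLs above differ only in their `filter` query parameter, and the word position appears to be carried by `surah`, `aya`, and `word`. The sketch below reads those values with Python's standard library and swaps the filter. It assumes the parameters mean what their names suggest, which has not been confirmed with surahapp; `word_location` and `with_filter` are hypothetical helper names, not part of any surahapp API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# The first link from this comment, copied verbatim.
EXAMPLE_URL = (
    "https://web.surahapp.com/ar/quran?surah=2&view=reading&page=3&word=9&aya=7"
    "&use-quran-app=true&tab=6&filter=articulation&content-info-model=false"
    "&stats-type=cols&aya-counting-key=adad_ayat-sowar_fn&keep-word-change=true"
    "&d-aya=true"
)


def word_location(url):
    """Return the surah/aya/word numbers and the filter named in the URL.

    Assumption: these query parameters identify a single word; this is
    inferred from the links in this thread, not from documentation.
    """
    params = dict(parse_qsl(urlsplit(url).query))
    return {
        "surah": int(params["surah"]),
        "aya": int(params["aya"]),
        "word": int(params["word"]),
        "filter": params["filter"],
    }


def with_filter(url, filter_name):
    """Return the same URL with only the `filter` parameter replaced."""
    parts = urlsplit(url)
    pairs = [
        (key, filter_name if key == "filter" else value)
        for key, value in parse_qsl(parts.query)
    ]
    return urlunsplit(parts._replace(query=urlencode(pairs)))


if __name__ == "__main__":
    print(word_location(EXAMPLE_URL))
    print(with_filter(EXAMPLE_URL, "tasreef"))
```

This only manipulates the links. It does not fetch or parse any page content, which would depend on the site's markup and on permission from surahapp.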


danglingneuron commented 1 year ago

Is it available as individual txt/JSON/XML files, or does it have to be scraped from the website? Al-i’rāb al-mufassal was parsed from HTML files.

mustafa0x commented 1 year ago

No, it's not available, but I can reach out and see whether they're interested in providing it.

kaisdukes commented 1 year ago

@mustafa0x. We will never say no to more data! 😃

The treebank currently covers 50% of the Quran and we have used al-i’rāb al-mufassal as the main reference work for this. I think it would be an excellent idea to have an additional reference work, and we could compare both. For attribution, transparency, and to ensure we can trace back to the original author, it would be great to know the primary source for this additional linguistic analysis.

If you are able to reach them, it would be good to confirm the original publication so we can cite this, as reliable citations are key to the project. Even better would be an extract of their database. As you rightly point out, having syntactic roles at word level is invaluable.

Another reason to know the original source, beyond just reliable citations, is to ensure compliance with copyright and fair use.

Having the AI read both reference works could really speed up completion of the treebank.