Closed mmahalwy closed 5 years ago
What do you mean by "parsed"? Extracting the text? Converting it to XML/Docbook format? Converting it to JSON? It seems like a huge amount of manual work to me.
Probably JSON would be best, or a Postgres DB.
Not manual, just need to parse the PDF by searching the titles of the surahs and the individual titles of each section within the surah section.
As far as I know, converting PDF to plain text is not 100% reliable (there will always be things that need to be fixed manually). Once this is done, it should not be very difficult to parse the text and split it into sections and subsections. But the first task seems difficult to me (unless you know some way of parsing a PDF file directly).
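The splitting step described above can be sketched as follows, assuming the PDF has already been converted to plain text. The heading pattern (`Surah <number>: <name>`) and the function name `split_by_surah` are hypothetical; the actual headings in the PDF may look different and would need their own pattern.

```python
import re

# Hypothetical heading pattern: a line such as "Surah 1: Al-Fatihah".
# The real PDF's headings may differ and would need a different pattern.
HEADING = re.compile(r"^Surah\s+(\d+)\s*:\s*(.+)$", re.MULTILINE)


def split_by_surah(text):
    """Return {surah_number: body_text} for each heading found in text."""
    matches = list(HEADING.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs from the end of its heading line
        # to the start of the next heading (or the end of the text).
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[int(match.group(1))] = text[start:end].strip()
    return sections


sample = "Surah 1: Al-Fatihah\nfirst body\nSurah 2: Al-Baqarah\nsecond body\n"
sections = split_by_surah(sample)
```

The same approach could be repeated inside each section with a second pattern for the subsection titles.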
can we do pdf to html? It's easy to go from there
Salams. I reached out to alim.org to see if they'd be interested in providing their source/db: http://www.alim.org/library/quran/AlQuran-tafsir/TIK/1/0
Initial reply is that they don't offer source or DB. I've asked for a phone meeting to discuss further. Should we suggest our own scraping effort? (Is that considered rude?) Should we embed their content directly in our app? I imagine that would increase their server load significantly. Any suggestions on the best ways to collaborate with them?
@reshadn I think we can suggest scraping for the sake of turning it into a DB that we can share with them? That might work? Or any help they can give us - for example, HTML files would make scraping efforts easier?
I'll scrape it, guys, and index it by PAGE number based on the 604-page mushaf. I will share the database file when I'm done.
Maybe I can also build the index table that maps each "sura + ayah" number to its page number, but I think you already have that? (I noticed the "line" for each new page in read mode.)
@ATouhou thanks for that! That'd be amazing. We do have the surah/ayah/line/page/position relationship.
As a follow-up, the one person I spoke with didn't offer anything. I'm not sure if that's because they're not clear on who we are or what we're doing, or possibly because they don't have the bandwidth to follow up.
@reshadn I have tried long and hard and explained everything to them in detail. The source code for their webpage (alim) is even on GitHub, and there are still several exploits left in it (just check the version numbers on the live webpage against some common exploits...) that they have not fixed! As for qtafsir.com and recitequran.com (which are related), they do not want to share any data either.
So this led me to finally make a scraper which indexes the English Tafseer Ibn Kathir.
I made some split tests and everything looks good :-) You can grab the database here: https://drive.google.com/open?id=0B4VHPzkqEV3KY2tGV0lwaW9talE
I might index their other 3 tafaseer (all Arabic: Ibn Kathir, Saidi and Baghawi) at some point. Also, I might at some point correlate the ayah chosen (for search) to the right "section" on the particular "page" (some/most pages have several "sections", as is the structure in Tafseer Ibn Kathir). However, I am extremely busy with work these days and actually behind deadline on some projects, so it will most likely be at some point later, in sha Allah...
This was just a quick side jump, as I think it will be very beneficial for the Ummah! Wal Hamdoullilah. Anyway, let me know if you have any questions about this and/or whether everything works as it should :-)
You can remove the columns "uniqueid" and "tafsirid"; these were for internal purposes, for when/if I decide to grab their other tafaseer :p
The 3 Arabic tafaseer will only take a few minutes of work on the scraper; give me a heads-up if you want / need them :-)
Almost forgot, I can also scrape the audio resources for word-by-word Arabic recitation (and the "2 connected words tajweed rules" sounds) and index them. I already scraped the data files way back :p
@ATouhou where would the audio files come from?
Wisam Sharif at Recitequran.com. They also have English word-by-word, but that is computer generated (if I remember correctly). Actually, we could also use the word-by-word transliteration that the brother Ozair has. (Microsoft guy)
@ATouhou I do have the word-by-word Arabic from Wisam Sharif, but we don't have permission
@mmahalwy You asked for permission to use Wisam Sharieff's stuff and he didn't give? Or couldn't get in touch?
Permissions should be granted by "Share Islam"
Number (800)409-5044 or email contact@shareislam.com
I tried livechat, email (on qtafsir and shareislam), the contact form on recitequran, and recitequran.com for over a year. It came down to this: the decision belongs to Yusuf Estes.
But I did not contact Yusuf Estes at the email provided. Please feel free to do so, and see if you manage to get some permissions :-)
Ah, recitequran.com used to be allahsquran.com/learn, so makes sense it goes back to Sh. Yusuf Estes.
What are we looking to get from it?
Just permission to use their data (audio data); they actually don't need to provide us anything if it would be too much of a hassle for them. We have the Tafseer Ibn Kathir now, but let us see if Mohammad wants anything else.
@mmahalwy Which tafaseer do you have indexed at the moment? I provided the Tafsir Ibn Kathir (English); do you need anything else? The Arabic tafaseer indexed, or maybe parts of the English Tafsir As-Saidi (I think only 1-2 juz are available in English)?
@ATouhou the problem right now is that we have tafsir indexed per ayah, and Ibn Kathir is not per ayah. I don't remember which ones we have indexed, but I can check?
The Tafsir Ibn Kathir is indexed per page, 1-604.
If you don't already have it (I don't think I was given access to the quran.com database), then we could add a table "AyatToPage" with a column for the page and a column for the ayah. Then, when looking up tafseer from Ibn Kathir, we look up the ayah, say 1:5, in "AyatToPage", which tells us that ayah 1:5 is on page 1, and from that we look up the tafseer.
And the HTML includes data on what section to start on at what ayah
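The "AyatToPage" lookup proposed above could look roughly like this. It is a minimal sketch using an in-memory SQLite database: only the table name "AyatToPage" comes from the thread, while the `TafsirByPage` table, the column names, the split of the ayah reference into `surah` and `ayah` columns, and the placeholder rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- "AyatToPage" is the table name proposed in the thread;
    -- the column layout here is an assumption.
    CREATE TABLE AyatToPage (
        surah INTEGER, ayah INTEGER, page INTEGER,
        PRIMARY KEY (surah, ayah)
    );
    -- Hypothetical table holding the page-indexed tafsir (pages 1-604).
    CREATE TABLE TafsirByPage (page INTEGER PRIMARY KEY, html TEXT);
""")

# Placeholder rows, not real tafsir content.
conn.executemany("INSERT INTO AyatToPage VALUES (?, ?, ?)",
                 [(1, 5, 1), (2, 1, 2)])
conn.executemany("INSERT INTO TafsirByPage VALUES (?, ?)",
                 [(1, "<p>placeholder tafsir for page 1</p>"),
                  (2, "<p>placeholder tafsir for page 2</p>")])


def tafsir_for_ayah(conn, surah, ayah):
    """Map an ayah to its page, then return that page's tafsir (or None)."""
    row = conn.execute(
        """SELECT t.html
           FROM AyatToPage AS a
           JOIN TafsirByPage AS t ON t.page = a.page
           WHERE a.surah = ? AND a.ayah = ?""",
        (surah, ayah),
    ).fetchone()
    return row[0] if row else None


page_one_html = tafsir_for_ayah(conn, 1, 5)
```

A join keeps the page-indexed tafsir untouched, so the per-ayah view is derived at query time rather than stored twice.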
CC @naveed-ahmad ^^ since he is rebuilding the db from scratch
@ATouhou can you provide the tafsir in JSON format? It'll be really easy for me to import JSON and add it to v3. The following format would be a lot of help:
{
  "1": {
    "1": "tafsir",
    "2": "tafsir"
  },
  "2": {
  }
}
or
{
  "surah_number1": {
    "ayah1": "tafsir",
    "ayah2": "tafsir"
  },
  "surah2": {
  }
}
thanks
Also, please remove div and other HTML tags; we need to add and index only the content, no HTML :)
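One way to produce the requested shape (surah number, then ayah number, then plain text with the HTML removed) is sketched below using only the Python standard library. The input rows and the helper names `strip_html` and `to_nested_json` are hypothetical; the scraped database's real schema is not shown in this thread, and since its tafsir is stored per page it would first need to be mapped to ayat.

```python
import json
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects only the text nodes and drops every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text content of an HTML fragment with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def to_nested_json(rows):
    """Turn (surah, ayah, html) rows into {"surah": {"ayah": "text"}} JSON."""
    nested = {}
    for surah, ayah, html in rows:
        nested.setdefault(str(surah), {})[str(ayah)] = strip_html(html)
    return json.dumps(nested, ensure_ascii=False, indent=2)


# Placeholder rows, not real tafsir content.
rows = [
    (1, 1, "<div><p>first <b>entry</b></p></div>"),
    (1, 2, "<div>second entry</div>"),
    (2, 1, "<p>third entry</p>"),
]
output = to_nested_json(rows)
```

`ensure_ascii=False` keeps any Arabic text readable in the output file instead of escaping it.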
@ATouhou just emailed Sheikh Yusuf. Will update you. CC @ahmedre
@naveed-ahmad I just saw your message 1.5 years late :P Do you still need the data file in another format? I was just following up on some 4000 tabs in "OneTab" that have been piling up over the last 2 years.
@naveed-ahmad @mmahalwy This website https://quranenc.com/en/home has many quran Tafsir/Translations which are downloadable in a verse by verse XML format. I hope it can be imported into Quran.com and Quran Android.
@AbdullahObaid we've already imported their translations, all of em :)
@naveed-ahmad @AbdullahObaid just curious, how is the content being validated? Not doubting the source, but is there a mechanism in place to make sure the source is reliable?
Also Jazakallah khair for the source, saved it.
The quranenc team works with islamhouse, a popular Islamic site, and their translators are transcribing and validating the data. from what i understand, they also have scholars on board to help validate the authenticity of the tafaseer, etc.
Awesome.
Asalamou Alaikoum
Just curious, is there a reason my request to contribute to this repo is not being acknowledged? I sent an email asking Mohamed and Naveed, whom my friends here actually know (sheikh Hossam Helal and Mohamed Al-Hassan), if I could use the database for my own use. Then I realized this was not allowed, so I just got my own translation from tanzil. Regardless, I would like to contribute. With my 10 years of engineering in Toronto, I hope to bring some value, fix some issues, and get the reward from Allah (SWT) in the akhira. Wallahi you guys have a huge blessing with thousands reading it online, and I'm extremely jealous I don't have a share of this barakah. You guys will be rewarded even after you have passed from this world, because people will benefit from quran.com. Everyone speaks highly of Mohamed here in Hamilton, and I want to be part of this network of Muslim developers.
I will jump on a Skype call if need be, but you have my word and my trust by Allah (SWT) that I will only use the resources and intellectual property strictly for the benefit and use of Quran.com
Jazakallah khair I am eager to help out.
wa3laikum alsalam, everyone is free to open PRs and the team tries to be quick about reviewing and merging whatever PRs people open - I checked and don't see any PRs open from you, but I might have missed them. In any case, please also realize we're volunteers - this isn't a full time job for us - so we're unfortunately not always as fast as we'd like to be.
you're more than welcome to open a PR for anything and the team would be more than happy to look at it, give feedback, and accept changes as such. walsalam 3alaikum.
Jazakallah khair akhi kareem, I do not have a PR because I assume to actually work and test the frontend I would need an api key to pull the quran.com database from. I definitely understand that everyone is busy and trying their best to contribute.
waAlakim o salam brother @nasir-scalestack. Just double checked and didn't see any email from you. When did you send it?
Anyway, you're always welcome to contribute. There are lots of issues we need to work on; please send your email and I'll invite you to Slack.
You don't need any key to access the API; it's free and open for everyone. Here are the API docs: https://quran.api-docs.io/
Shukran brother Naveed, my apologies maybe it ended up in the spam folder, I didn't mean to accuse you guys of ignoring me. My email is nasir@scalestack.io for slack.
Since no api key is required I'll check everything out tonight insha'allah and see where I can contribute.
Allahuma barak feek.
Yeah, it was in spam :( invited you to slack channel.
I need an English version of the Quranic Tafsir database. Can anyone help?
Can anyone share Quran Tafsir Ibn Kaseer in JSON? We need it in JSON.
@mimrank Hi, I found this repo, but I did not check the data: https://github.com/GreentechApps/Al-Quran/tree/master/dbs . Otherwise you could scrape websites or APIs. I don't really understand why they refuse to share the database and have one source that everyone could contribute to.
http://www.islamwb.com/books/Tafsir%20Ibn%20Kathir%2010%20Volumes.pdf
Does anyone have a parsed version of this or would like to take it on as a project? :)