quran / quran.com-images

images using fonts from King Fahed Complex / qurancomplex.org
http://quran.com

How to derive the actual number of words per line for each chapter? #36

Open AlGantori opened 4 years ago

AlGantori commented 4 years ago

If I understand the main page description correctly, these "scripts" render an image of each page from a font and then build the bounding rectangle for each word (glyph) generated. Is that correct?

Do they also build the bitmap one line at a time for the Madina mushaf, which has 15 lines (rows) per page?

It sounds overly complex if all I want is the word count per line for each chapter, something like the following (showing the word counts for Fatiha and Baqara):

   "1": [4,5,4,4,4,5,3],
   "2": [7,5,4,8,6,6,
        9,9,7,9,8,8,9, 8,9,10,9,10,8,7, 7, ..... ],
ahmedre commented 4 years ago

yes, your understanding is correct. and yes, it builds one line at a time. for what you want to do, i'd download and get the database from this repo and then get this data with a query (or with a script that just does this for each page). if i recall, the table here should contain the line information as well.

AlGantori commented 4 years ago

Are you sure this kind of data is not already available in some XML/JSON resource?

I have done some Node.js-based development, indirectly, but I don't recognize the commands in the installation notes, such as the following:

ppm install dmake
ppm install dbd-mysql
ppm install yaml

Are these meant to be executed inside some CLI, or on some Linux distro? Thanks for helping out, because at this point I am clueless. I am running Windows 7.

ahmedre commented 4 years ago

you don't need to do any of those commands nor run this script itself - just download the database and import it and write a script yourself.

AlGantori commented 4 years ago

By database, do you mean I should download the sql folder in this repo?

I have MySQL Workbench, but it's a beast and I never got acquainted with all of its terms (Open Model, ???). It seems oriented toward opening databases over a network connection, and I am having a hard time making it open a local file. It managed to open schema.mwb, but that throws me into the EER diagram mode, whereas I want to see the tables and data.

Which of these files should I be attempting to open?

image

Would you suggest a better tool than MySQL Workbench 5.2.44 CE?

By me writing scripts, do you mean writing SQL queries to retrieve the info, perhaps from the glyph_page_line table?

Thank you for holding my hand through this.

AlGantori commented 4 years ago

image

Will tajweed markings (e.g. small meem, etc.) appear as separate rows in this table, or will they be lumped with the previous word (as a single glyph)?

I feel like this is terrible: I would have to query and group-count on ayah_number, then subtract one for the ayah-number glyph (hindi thingy), to get my word count???

I have a feeling I am going about this the wrong/difficult way.

AlGantori commented 4 years ago

image

This would be its data, matching the 7 words + 2 tajweed markers + 1 verse-number = 10 tokens

image

How can I detect that glyph_id = 264 is a verse number, which I do not want to count?

AlGantori commented 4 years ago

Specifically, for page 2, this database describes this particular layout:

image

Matching query:

SELECT line_number, COUNT(*) AS token_count FROM `glyph_page_line` WHERE page_number = 2 GROUP BY line_number ORDER BY line_number;

The raw/net count of tokens per line follows:

image
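For anyone who would rather run this from a script than from Workbench, here is a minimal sketch of the same per-line count in Python. It assumes the data has been loaded into some SQL database; an in-memory SQLite table with made-up rows stands in for the real `glyph_page_line` table here, and the helper name `tokens_per_line` is hypothetical. Only the columns named in this thread (`glyph_id`, `page_number`, `line_number`) are used.

```python
import sqlite3

# Toy stand-in for the real table: only the columns the query touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE glyph_page_line ("
    "glyph_id INTEGER, page_number INTEGER, line_number INTEGER)"
)
# Made-up rows (not real data): page 2 has 3 tokens on line 1, 2 on line 2.
rows = [(1, 2, 1), (2, 2, 1), (3, 2, 1), (4, 2, 2), (5, 2, 2), (6, 3, 1)]
conn.executemany("INSERT INTO glyph_page_line VALUES (?, ?, ?)", rows)


def tokens_per_line(conn, page_number):
    """Return [(line_number, token_count), ...] for one page, in line order."""
    cur = conn.execute(
        "SELECT line_number, COUNT(*) FROM glyph_page_line "
        "WHERE page_number = ? GROUP BY line_number ORDER BY line_number",
        (page_number,),
    )
    return cur.fetchall()


page2_counts = tokens_per_line(conn, 2)
print(page2_counts)
```

These counts are still raw token counts, so verse numbers and other markers are included at this stage.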

I happen to be working with the Tajweed version, where page 2 is a bit different. That's alright, I will handle that.

Again, my current roadblock is detecting whether a token is a verse number.

ahmedre commented 4 years ago

the glyph table will tell you what "type" the glyph is - so you can exclude the ayah markers that way.

AlGantori commented 4 years ago

Wow, I can't believe I am doing a 3-way join to get this. It appears that all verse numbers are typed as "end".

image

image
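Since the screenshots do not survive as text, here is a rough sketch of what such a three-way join could look like. Only `glyph_page_line`, its `glyph_id`, `page_number` and `line_number` columns, and the type label "end" come from this thread. The `glyph` and `glyph_type` table layouts, the `glyph_type_id` and `name` columns, and the "word" label are assumptions made for illustration, so check them against the actual schema. SQLite in memory with made-up rows stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Assumed layout: glyph points at glyph_type through glyph_type_id.
    CREATE TABLE glyph_type (glyph_type_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE glyph (glyph_id INTEGER PRIMARY KEY, glyph_type_id INTEGER);
    CREATE TABLE glyph_page_line (
        glyph_id INTEGER, page_number INTEGER, line_number INTEGER
    );
    -- Made-up rows: glyph 12 is a verse-number marker typed 'end'.
    INSERT INTO glyph_type VALUES (1, 'word'), (2, 'end');
    INSERT INTO glyph VALUES (10, 1), (11, 1), (12, 2), (13, 1);
    INSERT INTO glyph_page_line VALUES
        (10, 2, 1), (11, 2, 1), (12, 2, 1), (13, 2, 2);
    """
)

# Per-line count for one page, skipping every glyph typed 'end'.
NON_END_PER_LINE = """
    SELECT gpl.line_number, COUNT(*)
    FROM glyph_page_line AS gpl
    JOIN glyph AS g ON g.glyph_id = gpl.glyph_id
    JOIN glyph_type AS gt ON gt.glyph_type_id = g.glyph_type_id
    WHERE gpl.page_number = ? AND gt.name <> 'end'
    GROUP BY gpl.line_number
    ORDER BY gpl.line_number
"""

counts = conn.execute(NON_END_PER_LINE, (2,)).fetchall()
print(counts)
```

Line 1 holds three glyphs in the toy data, but only the two that are not typed "end" are counted.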

Mission almost accomplished !!! ALLAHU AKBAR !!!

ahmedre commented 4 years ago

awesome al7amdulillah! make sure to not include other things like pauses (so just include words).
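to make that concrete, here is a small sketch of turning joined rows into the per-chapter shape asked for at the top of this issue, keeping only words. it assumes each row has already been fetched as a (sura, page, line, type name) tuple; the function name, the tuple layout, and the type labels "word" and "pause" are assumptions for illustration ("end" is the label reported in this thread), and the sample rows are made up.

```python
import json


def words_per_line_by_sura(rows):
    """rows: iterable of (sura, page, line, type_name), one tuple per glyph.

    Returns {"<sura>": [word count of each line, in page/line order]}.
    Lines that contain no word glyphs at all are omitted.
    """
    counts = {}
    for sura, page, line, type_name in rows:
        if type_name != "word":  # drop verse numbers, pauses, etc.
            continue
        key = (sura, page, line)  # line numbers restart on every page
        counts[key] = counts.get(key, 0) + 1

    result = {}
    for (sura, _page, _line), n in sorted(counts.items()):
        result.setdefault(str(sura), []).append(n)
    return result


# Made-up sample rows, not real mushaf data.
sample = [
    (1, 1, 2, "word"), (1, 1, 2, "word"), (1, 1, 2, "end"),
    (1, 1, 3, "word"), (1, 1, 3, "pause"),
    (2, 2, 2, "word"), (2, 2, 2, "word"), (2, 2, 2, "word"),
]
summary = words_per_line_by_sura(sample)
print(json.dumps(summary))
```

filtering on an allow-list ("word") rather than excluding "end" means any other non-word type is dropped automatically.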

AlGantori commented 4 years ago