highlight blocks in PDF

drdhaval2785 commented 8 years ago

Triggerred by #5 In case a user clicks on PDF page from a dictionary entry, let us highlight the corresponding column in PDF. Let us divide the pdf pages into 6 / 9 parts and display highlight on that part. In a sense, not strictly word highlight, but just approximate highlight to help user reach headword fast.

I know it is technologically a bit daunting, but I miss this feature a lot when I do correction submission. Majority of our dicts have page marks and some column marks and some even are encoded according to lines. This should help us a bit.

gasyoun commented 8 years ago

Let us divide the pdf pages into 6 / 9 parts and display highlight on that part.

I looked for a script, but could not find one.

In a sense, not strictly word highlight, but just approximate highlight to help user reach headword fast.

For not strictly just column highlighted would not help right? We know how many words are per column, maybe count (automatically) which is the word in the column (how many words above and how many words belllow) and tell you before you click it or on the same page where to look at?

funderburkjim commented 8 years ago

See this link 'semi-digitized' display for Wilson from the deprecated section of the Sanskrit-lexicon home page.

Some of the ideas there might well be applicable here. Essentially, a database was developed that had scan-page position information for each headword. And javascript code uses this information to highlight a section of the scanned image.

Is this the kind of thing you were thinking ?

A similar approach was done for STC.

gasyoun commented 8 years ago

Is this the kind of thing you were thinking ?

Why yes! Forgot about it!

aja

funderburkjim commented 8 years ago

In thinking about Jonathan's work on adding Greek to MW72 (a big task), I wondered if finding the place on the scanned page where the Greek text is shown might be one of the bottlenecks (in terms of number of minutes required to solve a case). If so, this kind of auto-highlighting might be a good solution.

gasyoun commented 8 years ago

might be one of the bottlenecks (in terms of number of minutes required to solve a case)

Only @jlreeder or @jmigliori could tell us.

jsonreeder commented 8 years ago

I think that this sort of highlighting could certainly benefit users.

When I was transcribing the Arabic, finding the word on the scanned page was one of the more time-consuming sub-tasks. When I realized that Jim had included the column number of the headword it vastly sped up the work.

I imagine that highlighting would help users locate the relevant definition in the same way that those column numbers helped me.

funderburkjim commented 8 years ago

@jlreeder Thanks for that info. From that, I conclude that mw72 should be the first dictionary to which we attempt to develop a highlighted scan, since there are many hundreds of Greek text snippets remaining to be done.

jmigliori commented 8 years ago

Seconded. Highlighting the headword would make a big difference in time spent looking. Would it be possible to even highlight the subheading, when applicable? Some of these entries run for pages.

funderburkjim commented 8 years ago

@jmigliori Hi, Jonathan - Does 'highlight the subheading' refer to the scanned image or to the GitHub listing?

gasyoun commented 8 years ago

Would it be possible to even highlight the subheading, when applicable? Some of these entries run for pages.

Would be too good to be true, @jmigliori

drdhaval2785 commented 8 years ago

On 1 Nov 2016 01:08, "funderburkjim" notifications@github.com wrote:

See this link from the deprecated section of the Sanskrit-lexicon home page.

Some of the ideas there might well be applicable here. Essentially, a database was developed that had scan-page position information for each headword. And javascript code uses this information to highlight a section of the scanned image.

Is this the kind of thing you were thinking ?

I will kill you Jim. I never knew we had something like this in 2008 display. And in 2014 display we retrogressed??? Please bring it back for whatever dictionaries we can. I would gain at least 100% speed up in correction submission.

jmigliori commented 8 years ago

Scanned image

On Oct 31, 2016, at 6:12 PM, funderburkjim notifications@github.com wrote:

Hi, Jonathan - Does 'highlight the subheading' refer to the scanned image or to the GitHub listing?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

gasyoun commented 8 years ago

I will kill you Jim.

You made my morning, Dhaval.

I would gain at least 100% speed up in correction submission.

That's too quick. Jim's afraid he will not manage to implement the changes :yin_yang:

funderburkjim commented 8 years ago

I found an interesting Mozilla Javascript library, call pdf.js.
This renders a pdf in javascript.

The interesting thing, from our point of view, is that it is possible to do other things with Javascript to the rendered image. The rendering is done in a Canvas, so a program can draw other things in the canvas. For instance, one can draw a (transparent) rectangle anywhere in the canvas.

Assuming the above comment is right, then from our point of view the problem boils down to deciding the coordinates of that rectangle within the canvas that is rendering the pdf page.

Will focus on the specific problem of drawing that rectangle on a specific instance of Greek text in mw72. If a useable solution can be found there, then maybe it can extend to other useful situations that Dhaval is envisioning. Fingers crossed!

gasyoun commented 8 years ago

Fingers crossed!

:zap:

gasyoun commented 7 years ago

maybe it can extend to other useful situations that Dhaval is envisioning

We believe in you, Jim.

drdhaval2785 commented 3 years ago

I want this integrated well in as many dictionaries as possible, @funderburkjim .

funderburkjim commented 3 years ago

@drdhaval2785 Will let you tackle this.

One method I tried before, to help Jonathan do Greek in mw72, is illustrated by this url https://www.sanskrit-lexicon.uni-koeln.de/scans/MW72Scan/2014/pywork/correctionwork/webline/index.php?linenum=091461

The way this is supposed to work as that a line number (in this case 091461) is associated with an approximate position on a particular page of the scans for mw72. An attempt is made to draw a 'green' rectangle on the loaded image at this position on the page, and also the window showing the image is scrolled to that position.

The number in this case is a particular line-number in mw72.txt.

This is working in my Chrome browser. See https://github.com/sanskrit-lexicon/GreekInSanskrit/issues/16 for a discussion. In this discussion, there was mentioned some performance variation among browsers.

Where to scroll and draw the green rectangle are determined by a file pagelines.txt. One line of this file is: 0401-a:91426,91503 which says Page401, col. a of mw72 scans corresponds to lines 91426-91503 of mw72.txt. Our line number 91461 is in this range, so we get an estimate of the location on the page. A program pageslines.py precomputed pagelines.txt by some sort of estimation based on character counts in pages of mw72.txt

Although the image you see at the above link is a png format when you do a 'save-image-as', the page starts out as the usual pdf format used by mw72 images. There is a lot of Javascript manipulation from some 3rd party sources that do the heavy lifting.

A conjecture is that a process similar to this for mw72 could help you achieve what you're aiming for. For example in place of 'pagelines.txt', you could have pagelnums.txt, where page 401-a would correspond to a certain range of L codes.

'servepdf.php' programs could be adapted to use such an alternate display.

This would likely be a technically challenging task.

gasyoun commented 10 months ago

@drdhaval2785 remains still up to you. Nothing I can help here, other than announcing Sanskrit volunteer coders wanted at Auroville.

sanskrit-lexicon / COLOGNE

highlight blocks in PDF #90