koreader / koreader

An ebook reader application supporting PDF, DjVu, EPUB, FB2 and many more formats, running on Cervantes, Kindle, Kobo, PocketBook and Android devices
http://koreader.rocks/
GNU Affero General Public License v3.0
16.81k stars 1.26k forks source link

No spaces in highlights with Greek text. #7928

Closed ichnilatis-gr closed 3 years ago

ichnilatis-gr commented 3 years ago

Issue

The text of the highlights containing Greek text has no spaces between the words. Could this be fixed, please?

Steps to reproduce

When I highlight a Greek text, even in an epub file, this text, in highlight display, hasn't any space between the words.

Thank you!

poire-z commented 3 years ago

Might depend on your book (and I can't guess what's happening), but no issue for me on some sample greek text:

image

image

image

You might want to hit on some text selection that ends up bad: "View HTML", and switch to debug>unicode debug view (hit that multiple times) to see what kind of characters you have for spaces: image

These <U+20> are the expected correct code for a normal space: image

ichnilatis-gr commented 3 years ago

My fault. The problem seems to exist only in pdf files.

poire-z commented 3 years ago

OK :) Then, it depends on how the PDF is made:

I'm sure you'll get it to work with some other kind of Greek PDFs. We can't do much with crappy PDFs.

ichnilatis-gr commented 3 years ago

So, I can do nothing with this, it depends on the file (if I understand correctly)...

Thank you!

poire-z commented 3 years ago

You understand correctly. (May be we could have much clever code for text extraction from PDF, but nobody has worked on this side of the code for years :/)

mergen3107 commented 3 years ago

@ichnilatis-gr You can try force OCR. See here: https://github.com/koreader/koreader/wiki/Dictionary-support#dictionary-lookups-in-scanned-pages You need to add Greek (grc I suppose) to the deafults.lua file as shown, and download grc tessdata for it. Then in KOReader its in the bottom menu, far right tab.

OCR will try to get the text layer from the image.

ichnilatis-gr commented 3 years ago

I've already done it. Thank you a lot. However, even with force OCR, the problem remains. Βesides, it is not always accurate, when I try to chose a word or a phrase.

ichnilatis-gr commented 3 years ago

I can't understand why in the same pdf the text of the highlights with English text has spaces, while the one with Greek text doesn't have.

poire-z commented 3 years ago

And do you have other greek PDFs where it works ? Coulld be that in this particular PDF, the english parts have some text layer, while the greek parts don't.

ichnilatis-gr commented 3 years ago

It's a scanned pdf with Greek and English OCR, that I've done. When I highlight a text both with Greek and English words, the Greek words don't have spaces between them, while the English ones have. The same happens with other pdf files too, either scanned or generated.

poire-z commented 3 years ago

The same happens with other pdf files too, either scanned or generated.

My question was: does the same not happens with at least one other pdf :) To see if there's something document specific, or if it's a global issue.

Could be related to this: https://github.com/koreader/koreader/blob/f411035d7dff530428beb2e5abda14b531e9cd34/frontend/document/koptinterface.lua#L886-L898 You could try on line 896 to replace and "" or " " with and " " or " " (add a apace in the empty "" after and. (But that's some code from 7 years ago, and nobody will dare touching it and be held responsible and be forced to adopt this abandonned code :)

ichnilatis-gr commented 3 years ago

The same goes for all other pdf files I've tested so far, either scanned or generated.

poire-z commented 3 years ago

Just curious: do these PDF with greek text have lines "justified" ? I guess that PDF, for justification, must use a box for each word, and position each box correctly to get text justification. But it may not have to do that for left aligned lines. Do you have some PDFs where some bit of the text is not justified (headings/titles should not, but if you find some paragraphs, even better), and how does it work on these ?

ichnilatis-gr commented 3 years ago

Usually the generated pdf files, I've tested, are not "justified". Some of these files contain both Greek an d English words. In those the problem exists only for the Greek text. The scanned pdf files are usually justified.

I haven't tried yet your proposal, that you mentioned above, since I don't have the device with me right now.

ichnilatis-gr commented 3 years ago

@ poire-z Finally, I made the replacement you proposed above and it worked! But I notice that there is still no space after a letter with accent. Any idea for this case?

Thank you for your help!

May I edit this replacement in KOReader?

poire-z commented 3 years ago

May I edit this replacement in KOReader?

No ! :) This was just a test to see if this bit of code is really where things should be happening. It would need more thinking.

poire-z commented 3 years ago

Got some issue too with french text: words ending with é would be stuck to the next one. Rather than trying to guess if boxes are words and if the last char is a latin or greek char or some punctuation that at end of word would be followed by a space (unlike Chinese ideograms), we may try to trust the spacing between boxes: if boxes are not stuck together, assume it means there's some spacing, and add a space. Got some issues also with end of lines and hyphenation (which is sometimes detected as a real unicode hyphen and not an ascii minus).

Came up with this:

--- a/frontend/document/koptinterface.lua
+++ b/frontend/document/koptinterface.lua
@@ -888,18 +888,42 @@ function KoptInterface:getTextFromBoxes(boxes, pos0, pos1)
         -- insert line words
         local j0 = i > i_start and 1 or j_start
         local j1 = i < i_stop and #boxes[i] or j_stop
+        local line_first_word_seen = false
+        local prev_word_end_x
         for j = j0, j1 do
             local word = boxes[i][j].word
             if word then
-                -- if last character of this word is an ascii char then append a space
-                local space = (word:match("[%z\194-\244][\128-\191]*$") or j == j1)
-                               and "" or " "
-                line_text = line_text..word..space
+                if not line_first_word_seen then
+                    line_first_word_seen = true
+                    if #line_text > 0 then
+                        if line_text:sub(-1) == "-" then
+                            -- Previous line ended with a minus.
+                            -- Assume it's some hyphenation and discard it.
+                            line_text = line_text:sub(1, -2)
+                        elseif line_text:sub(-2, -1) == "\xC2\xAD" then
+                            -- Previous line ended with a hyphen.
+                            -- Assume it's some hyphenation and discard it.
+                            line_text = line_text:sub(1, -3)
+                        else
+                            -- No hyphenation, add a space (might be not welcome
+                            -- with CJK text, but well...)
+                            line_text = line_text .. " "
+                        end
+                    end
+                end
+                local box = boxes[i][j]
+                if prev_word_end_x then
+                    -- If the space between previous word box and this word box
+                    -- is larger than 2% of box height, assume there is some
+                    -- space between these boxes, and add a space
+                    if box.x0 - prev_word_end_x > (box.y1 - box.y0) * 0.02 then
+                        word = " " .. word
+                    end
+                end
+                line_text = line_text .. word
+                prev_word_end_x = box.x1
             end
         end
-        -- append a space at the end of the line unless its a hyphenated word
-        line_text = line_text .. " "
-        line_text = line_text:gsub("- $", "")
         -- insert line box
         local lb = boxes[i]
         if i > i_start and i < i_stop then

Tested with French and Chinese, looks OK. Thought it would work with some Arabic PDF, but the boxes are all shuffled with arabic that we can't really do anything nice... @ichnilatis-gr : can you replace your frontend/document/koptinterface.lua with this one koptinterface.lua.txt and see how it works with various kind of PDFs?

ichnilatis-gr commented 3 years ago

But I notice that there is still no space after a letter with accent.

Sorry, I was wrong about that. It seems that this problem exist in a specific scanned pdf, where the some spaces are too small and the OCR engine didn't recognize them. I tested your first solution on other pdf files and it worked, even after letters with accent.

I will test the new koptinterface.lua file and I will inform you about it. Thank you!

poire-z commented 3 years ago

Updated version: koptinterface.lua.txt

Was a bit afraid I would have messed it up for Chinese/Japanese, so added some specific stuff. Dunno how this could mess some other cases, and up to which point we expect the text from past highlights to stay the same (when comparing highlights, deleting... or stuff like that)...

--- a/frontend/document/koptinterface.lua
+++ b/frontend/document/koptinterface.lua
@@ -875,6 +875,7 @@ Get text and text boxes between `pos0` and `pos1`.
 --]]
 function KoptInterface:getTextFromBoxes(boxes, pos0, pos1)
     if not pos0 or not pos1 or #boxes == 0 then return {} end
+    local isCJKChar = require("util").isCJKChar
     local line_text = ""
     local line_boxes = {}
     local i_start, j_start = getWordBoxIndices(boxes, pos0)
@@ -888,18 +889,62 @@ function KoptInterface:getTextFromBoxes(boxes, pos0, pos1)
         -- insert line words
         local j0 = i > i_start and 1 or j_start
         local j1 = i < i_stop and #boxes[i] or j_stop
+        local line_first_word_seen = false
+        local prev_word
+        local prev_word_end_x
         for j = j0, j1 do
             local word = boxes[i][j].word
             if word then
-                -- if last character of this word is an ascii char then append a space
-                local space = (word:match("[%z\194-\244][\128-\191]*$") or j == j1)
-                               and "" or " "
-                line_text = line_text..word..space
+                if not line_first_word_seen then
+                    line_first_word_seen = true
+                    if #line_text > 0 then
+                        if line_text:sub(-1) == "-" then
+                            -- Previous line ended with a minus.
+                            -- Assume it's some hyphenation and discard it.
+                            line_text = line_text:sub(1, -2)
+                        elseif line_text:sub(-2, -1) == "\xC2\xAD" then
+                            -- Previous line ended with a hyphen.
+                            -- Assume it's some hyphenation and discard it.
+                            line_text = line_text:sub(1, -3)
+                        else
+                            -- No hyphenation, add a space (might be not welcome
+                            -- with CJK text, but well...)
+                            line_text = line_text .. " "
+                        end
+                    end
+                end
+                local box = boxes[i][j]
+                if prev_word then
+                    -- A box should have been made for each word, so assume
+                    -- we want a space between them, with some exceptions
+                    local add_space = true
+                    local box_height = box.y1 - box.y0
+                    local dist_from_prev_word = box.x0 - prev_word_end_x
+                    if prev_word:sub(-1, -1) == " " or word:sub(1, 1) == " " then
+                        -- Already a space between these words
+                        add_space = false
+                    elseif dist_from_prev_word < box_height * 0.05 then
+                        -- If the space between previous word box and this word box
+                        -- is smaller than 5% of box height, assume these boxes
+                        -- should be stuck
+                        add_space = false
+                    elseif dist_from_prev_word < box_height * 0.8 then
+                        if isCJKChar(prev_word:sub(-3, -1)) and isCJKChar(word:sub(1, 3)) then
+                            -- Two CJK chars whose spacing is not large enough
+                            -- (we checked the 3 UTF8 bytes that CJK chars must be,
+                            -- no need to split into unicode codepoints)
+                            add_space = false
+                        end
+                    end
+                    if add_space then
+                        word = " " .. word
+                    end
+                end
+                line_text = line_text .. word
+                prev_word = word
+                prev_word_end_x = box.x1
             end
         end
-        -- append a space at the end of the line unless its a hyphenated word
-        line_text = line_text .. " "
-        line_text = line_text:gsub("- $", "")
         -- insert line box
         local lb = boxes[i]
         if i > i_start and i < i_stop then
ichnilatis-gr commented 3 years ago

Dear poire-z, I tested the first koptinterface.lua and it seem that it works properly. But there is a problem that I don't know if it has any relation with this. If I delete a highlight, when I reopen the file, the text is still highlighted, but it isn't displayed in the highlights list. I had noticed this problem other times too. But now it happens with all the files in which I tested the highlighting for the purposes of this issue

ichnilatis-gr commented 3 years ago

I've just tested the new koptinterface.lua file. The text of the highlights is rendered correctly (with spaces), but the deleted highlights appear again when I reopen the file, without being listed in the highlights list.

poire-z commented 3 years ago

If I delete a highlight, when I reopen the file, the text is still highlighted, but it isn't displayed in the highlights list. I had noticed this problem other times too. But now it happens with all the files in which I tested the highlighting for the purposes of this issue

I guess this answers my Dunno how this could mess some other cases, and up to which point we expect the text from past highlights to stay the same (when comparing highlights, deleting... or stuff like that)... :/

ichnilatis-gr commented 3 years ago

It seems so... 😞

poire-z commented 3 years ago

So: A: how do you delete the highlight : from the bookmark lists, or by tap on the highlight? B: is its drawing removed, and it's only on re-open that it's there again ? C: do you have "Document > Write highlight into document (PDF)" checked ? Are these the issue, or not? D: you can check in a metadata.epub.lua for the highlights: there should be in 2 slots: "bookmarks" and "highlight" is it really there that there are not deleted in one of these 2 slots ?

poire-z commented 3 years ago

Do you have/get this WARN message in your crash.log when you happen to half-delete one ? : https://github.com/koreader/koreader/blob/bfda4bfc8d3ee8235dd106b3c563ada8cd84a233/frontend/apps/reader/modules/readerhighlight.lua#L1323-L1328

ichnilatis-gr commented 3 years ago

A: Usually by tap on the highlight, but I also tested it deleting them from the bookmarks list, with the same result. B: Yes, exactly. C: Yes, it is checked. D: I guess that you mean the metadata.pdf.lua of the file. In this, the value for highlights is 0.

I do not know if it matters to mention that, when I open the file with a pdf program on my PC, the highlights are displayed.

poire-z commented 3 years ago

OK, so can you check if the issue happens with Write highlights into PDF > [x] Disable. Might be a bit tedious, but you'd need to get the old version, do some highlights, then put back my koptinterface.lua, and delete again. Just so we get to understand where is the issue(s).

ichnilatis-gr commented 3 years ago

When Write highlights into PDF > [x] Disable , the highlights are deleted in both cases, without and with your koptinterface.lua.

poire-z commented 3 years ago

(ok, no worry :)

the highlights are deleted in both cases, without and with your koptinterface.lua.

But even "across" (highlights made without, so with some spaces missing in their text - when putting back my koptinterface.lua, do they get deleted) ?

poire-z commented 3 years ago

Can't reproduce with Write highlights, but all this might be really document dependant...

Anyway, another patched file frontend/apps/reader/modules/readerhighlight.lua readerhighlight.lua.txt

--- a/frontend/apps/reader/modules/readerhighlight.lua
+++ b/frontend/apps/reader/modules/readerhighlight.lua
@@ -1292,11 +1292,14 @@ function ReaderHighlight:onUnhighlight(bookmark_item)
             -- pos0 are tables and can't be compared directly, except when from
             -- DictQuickLookup where these are the same object.
             -- If bookmark_item provided, just check datetime
-            if highlight.text == sel_text and (
-                    (datetime == nil and highlight.pos0 == sel_pos0) or
-                    (datetime ~= nil and highlight.datetime == datetime)) then
-                idx = index
-                break
+            if ( (datetime == nil and highlight.pos0 == sel_pos0) or
+                 (datetime ~= nil and highlight.datetime == datetime) ) then
+                -- As we may have fixed space handling in extracted text over
+                -- the years, check text identities with spaces removed
+                if highlight.text:gsub(" ", "") == sel_text:gsub(" ", "") then
+                    idx = index
+                    break
+                end
             end
         end
     else -- page is a xpointer
ichnilatis-gr commented 3 years ago

But even "across" (highlights made without, so with some spaces missing in their text - when putting back my koptinterface.lua, do they get deleted) ?

I made a highlight with the old koptinterface.lua, with Write highlights into PDF > [x] Disable, then I closed the file, I put your koptinterface.lua, I deleted the highlight, I closed the file and, when I reopened it, the highlight was deleted.

ichnilatis-gr commented 3 years ago

With your readerhighlight.lua and Write highlights into PDF > Always the problem remains, the highlights are not deleted from the text.

poire-z commented 3 years ago

You haven't tell me if you see some WARN unhighlight: bookmark_item not found among highlights in your crash.log.

I don't really get why you'd still get the "written into the pdf itself" highlight not removed: they are just referenced by the coordinates of the boxes, and don't care about the text (that we may have changed by adding spaces)...

Can you try to find a free PDF that you can share, make 2 highlights (a few words on a single line, no hyphenation) with the old code - put the new code, then delete one of these 2 highlights - and see if you get the problem. If yes, can you upload here a zip with the PDF + it's .sdr/metadata.pdf.lua - so I can understand what's really the state of things ?

ichnilatis-gr commented 3 years ago

You haven't tell me if you see some WARN unhighlight: bookmark_item not found among highlights in your crash.log.

I can't find something like that.

ichnilatis-gr commented 3 years ago

Can you try to find a free PDF that you can share, make 2 highlights (a few words on a single line, no hyphenation) with the old code - put the new code, then delete one of these 2 highlights - and see if you get the problem.

Write highlights into PDF disable or allways? With the old or the new readerhighlight.lua?

poire-z commented 3 years ago

With Write highlights into PDF> always. (As I understand there is zero problem when it is set to Disable, right?) With the old readerhighlight.lua.

ichnilatis-gr commented 3 years ago

Can you try to find a free PDF that you can share, make 2 highlights (a few words on a single line, no hyphenation) with the old code - put the new code, then delete one of these 2 highlights - and see if you get the problem.

There is no problem. The highlight is deleted.

poire-z commented 3 years ago

So, you can't reproduce what you get with your real books with real highlights ? :/ Can you share a not too old, not too full of highlights, book of yours where you can reproduce the problem ?

ichnilatis-gr commented 3 years ago

The problem about the non-deleted highlights exists with the new koptinterface.lua. I haven't a book with such highlights. I've done only the tests you asked me to do. May I give you some pages with Greek text to make your tests?

poire-z commented 3 years ago

May I give you some pages with Greek text to make your tests?

Yes.

The problem about the non-deleted highlights exists with the new koptinterface.lua. I haven't a book with such highlights. I've done only the tests you asked me to do.

The test was: new free PDF, old koptinterface, highlight 2 bits of text, quit. Update to new koptinterface.lua, delete one of the 2 highlights, quit. And see if you still see the deleted highlight. If yes, it's reproducible, and I'd like that pdf and its medatada.pdf.lua. If no, it's not 100% reproducible and harded to undestand, and may depend on the document.

ichnilatis-gr commented 3 years ago

Can you try to find a free PDF that you can share, make 2 highlights (a few words on a single line, no hyphenation) with the old code - put the new code, then delete one of these 2 highlights - and see if you get the problem.

There is no problem. The highlight is deleted.

I meant that I did the test and the highlight was deleted.

ichnilatis-gr commented 3 years ago

If no, it's not 100% reproducible and harded to undestand, and may depend on the document.

During this discussion, I've made the various tests you asked me to do in various pdf files and not in a specific file only.

ichnilatis-gr commented 3 years ago

May I give you some pages with Greek text to make your tests?

Yes.

Extracted pages.pdf

Extracted pages from Ντοστογιέβσκι.pdf

poire-z commented 3 years ago

Can't reproduce the issue on your 2 PDFs doing this: old koptinterface, highlight some bit of text (single line, multiple lines), quit. Update to new koptinterface.lua, delete highlight, quit.

The only thing I notice (on our colorfull emulator) is that when I delete the highlight, the gray (drawn by us) highlighting is removed, but some yellow (drawn by MuPDF) stays. If I quit and reload the book, this yellow highlight is no longer there, and nothing is highlighted. Is that what you were noticing all along ? It happened before the koptinterface changes too. And somehow it makes sense: we only write into the PDF on exit, so MuPDF still think the internal highlight is there, and draws it. If that's an issue to fix, it's a totally other one to fix elsewhere :)

ichnilatis-gr commented 3 years ago

Something that I had in my mind to tell you: I have the impression that it is as if two highlights (two layers) are being created. (I don't know what exaclty MuPDF is... 😃). It has happened to me other times to delete a highlight and, nevertheless, remain in the file. I then had to delete these (yellow) highlights with a pdf program on my PC...

poire-z commented 3 years ago

I have the impression that it is as if two highlights (two layers) are being created. (I don't know what exaclty MuPDF is.)

And your impression is right :) That's what is happening. MuPDF is the PDF rendering engine, the thing that reads the PDF file and draws the page. And there is our "koreader frontend" code, that draws some little bits on the original PDF page (it obtained from MuPDF), like highlights and the bookmark dogear.

When you "write highilghts" into PDF, the highlights become part of the PDF - and MuPDF, on the next re-opening will draw them (what I see in yellow). Then, our frontend will still draw its own idea of the highlights in gray (so, possibly over the yellow, making it 2 layers).

When you later delete a highlight, KOReader will not draw its gray highlight - but MuPDF will still render the highlights in yellow. When you quit the document, we write/delete the highlights from the PDF. On next re-opening, MuPDF will not see the highlights, and won't draw them in yellow.

Is this really what you've been talking about when considering the problem with deleted highlight ? (This behaviour is indeed a bit bad/strange.) Or is there still something else that we need to understand ? I thought the PDF/Yellow highlights were really never ever deleted - but they may still be deleted on next re-opening, right ?

If you had in the past yellow highlights that stuck (and needed to be deleted on a PC), then, this is another issue that may not be caused by my koptinterface.lua changes - or do my changes make the issue worse ?

ichnilatis-gr commented 3 years ago

In the past the problem (non-deleted highlights from the text, but only from the bookmark list) only occurred a few times and I can't reproduce it. I never understood why it happened. I only can assume that it was due to quick actions, taps or something like that... Now, with your koptinterface.lua (that solves the problem with the spaces), the problem occurres always, as you can see.

poire-z commented 3 years ago

the problem occurres always, as you can see.

The only thing I can see is that the yellow stuff stays, but only until the next re-opening. And I can see that this happens too with the original koptinterface.lua .

poire-z commented 3 years ago

the yellow stuff stays, but only until the next re-opening.

Actually, the writen-into-the-pdf highlights are deleted immediately. But they stay on the screen because we cached the PDF page bitmap with the yellow highlights, and we re-use that cache. If I switch Zoom to fit width/height, the yellow highlights are not there on the newly drawn pages - but they are still on the cached page I get when switching back to the original zoom.

NiLuJe commented 3 years ago

On mobile right now, but that makes sense, and I don't remember off the top of my head if we have a method to invalidate a specific page cache...