honyomi recognize single letter rather than single word (Vertical text pdf)

dinosauria123 commented 7 years ago

Sorry, I am back again.

I found that the text in some vertical-text PDFs is not recognized correctly.

My own honyomi server is here:

http://104.197.98.173/

I have added two vertical Japanese text pdfs, tate.pdf and tate2.pdf.

tate.pdf was generated from Google Cloud Vision OCR output, processed with gcv2hocr and hocr-tools (both on GitHub).

tate2.pdf was generated from PowerPoint with "Save as PDF".

In Adobe Reader, searching for "テキスト" works correctly in both PDFs.

But on the honyomi server, searching for "テキスト" finds nothing in tate.pdf. I could find it only by entering the reversed string "トスキテ".

In the case of tate2.pdf, honyomi cannot find "テキスト" as a whole word, but it can find single letters such as "テ" and "キ".

If you can find the cause of this problem, please fix it.

dinosauria123 commented 7 years ago

I have added a third vertical-text example, made with ABBYY FineReader for ScanSnap.

The text is recognized incorrectly, but note that the top letter "縦" ends up as the last letter in honyomi.

There seems to be a pattern, but the only one I could find is that the letter order is reversed when \n is included in the recognized text.

I hope this helps somehow...

Maybe this problem can be worked around by also searching for the reversed query word, e.g. "カメラ" OR "ラメカ", on my web page, because old Japanese horizontal text is sometimes written right to left too.

ongaeshi commented 7 years ago

Honyomi uses pdftotext to extract text. Maybe pdftotext can't parse all vertical text. (You can see the parsed text with Honyomi's Text button.)

I will add a feature to honyomi for editing a page's text. Please give me some time.

# It is a draft
$ honyomi edit 1 -p 2 "Change page 2 text on book 1."

dinosauria123 commented 7 years ago

Thank you for your reply.

I have some additional questions about slightly modifying honyomi to fit WWII-era Japanese advertisements. I am not familiar with Ruby, so please give me suggestions.

As stated above, WWII-era Japanese horizontal text mixes left-to-right and right-to-left writing.

I want to query with (keyword OR keyword.reverse). How should I modify the code?

 q = Query.new(params[:query])
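
One possible way to build such a query is sketched below. It is only an illustration: `with_reversed_terms` is a hypothetical helper, not part of honyomi, and it assumes that the search backend behind `Query` accepts `OR` and parentheses inside the query string, which this thread does not confirm.

```ruby
# Hypothetical helper (not part of honyomi): expand each search term into
# "(term OR reversed-term)" so that right-to-left text is matched as well.
def with_reversed_terms(query)
  query.split.map { |term|
    reversed = term.reverse # String#reverse reverses by character
    # A term that reads the same both ways needs no OR clause.
    reversed == term ? term : "(#{term} OR #{reversed})"
  }.join(" ")
end

puts with_reversed_terms("カメラ") # prints "(カメラ OR ラメカ)"

# Assumed call site, mirroring the line quoted above:
#   q = Query.new(with_reversed_terms(params[:query]))
```

Reversing by character is enough for kana and kanji, but a term containing combining marks would need extra care.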

I want to sort the results in alphabetical order by title (not by file name). How should I modify the code?

     if @params[:b] == '1'
      @header_info = %Q|<a href="/">#{@database.books.size}</a> books, <strong>#{@database.bookmarks.size}</strong> bookmarks.|
      render_bookmarks(@database.bookmarks, [{key: "timestamp", order: "descending"}])
    else
      @header_info = %Q|<strong>#{@database.books.size}</strong> books, <a href="/?b=1">#{@database.bookmarks.size}</a> bookmarks.|
      r = @database.books.map { |book|
        <<EOF
<li>#{book.id}: <a href="/v/#{book.id}">#{book.title}</a> (#{book.page_num}P)</li>
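
Sorting by title could look roughly like the sketch below. The `Book` struct and the sample titles are made up for illustration and stand in for honyomi's real book records, so whether `@database.books` can be sorted the same way is an assumption.

```ruby
# Hypothetical stand-in for honyomi's book records; only the fields used
# in the listing above are modelled. The titles are invented sample data.
Book = Struct.new(:id, :title, :page_num)

books = [
  Book.new(1, "Zeiss Ikon", 12),
  Book.new(2, "asahi camera", 30),
  Book.new(3, "Mamiya Six", 8),
]

# Sort by title, ignoring case; the id is a tie-breaker for equal titles.
sorted = books.sort_by { |book| [book.title.downcase, book.id] }

sorted.each do |book|
  puts %Q|<li>#{book.id}: <a href="/v/#{book.id}">#{book.title}</a> (#{book.page_num}P)</li>|
end
```

Note that Japanese titles sorted this way come out in code point order, not in reading order.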

Thank you again for all your efforts to improve honyomi. I'm looking forward to your new version.

dinosauria123 commented 6 years ago

I fixed this problem myself.

I used the -raw option of pdftotext, which gives the output text in the correct order for vertical text.
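
For readers who want to try the option in isolation, here is a minimal sketch of calling pdftotext with -raw for a single page. The helper names are hypothetical, and unlike pdf.rb it reads the text from stdout (output file "-") instead of a temporary file, so it is not the code used in honyomi.

```ruby
require "open3"

# Hypothetical helper: build the pdftotext argument list for one page.
# "-" as the output file asks pdftotext to write the text to stdout.
def pdftotext_raw_command(pdf_path, page_no)
  ["pdftotext", "-raw", "-f", page_no.to_s, "-l", page_no.to_s, pdf_path, "-"]
end

# Returns the page text, or nil if pdftotext exits with an error.
# Passing the arguments as a list avoids shell quoting problems in file names.
def extract_page_raw(pdf_path, page_no)
  out, _err, status = Open3.capture3(*pdftotext_raw_command(pdf_path, page_no))
  return nil unless status.success?
  # Replace any invalid byte sequences so later string handling cannot fail.
  out.force_encoding(Encoding::UTF_8).scrub("?")
end
```

With pdftotext installed, `extract_page_raw("tate.pdf", 1)` would return the raw-order text of the first page.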

The output for vertical text includes many CRs and spaces, so I removed them.

The code of the modified pdf.rb is here.

          o, e, s = Open3.capture3("pdftotext -raw -f #{page_no} -l #{page_no} $
          break if s.exitstatus != 0

          text = File.read(outfile, encoding: Encoding::UTF_8)

          if String.method_defined? :scrub
            text = text.scrub('?')
          end

          result << text.gsub(/(\s)/,"")
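
One detail worth knowing about the last line: in Ruby, `\s` matches only ASCII whitespace, so a full-width space (U+3000) would survive the `gsub`. The sketch below uses the POSIX class `[[:space:]]`, which also matches Unicode whitespace. `strip_all_whitespace` is a hypothetical name, and whether pdftotext emits full-width spaces for these PDFs is an assumption.

```ruby
# Hypothetical helper: remove every whitespace character, including
# Unicode whitespace such as the full-width space U+3000.
def strip_all_whitespace(text)
  text.gsub(/[[:space:]]/, "")
end

sample = "テ\nキ \r\nス\u3000ト\f"

puts strip_all_whitespace(sample)             # prints "テキスト"
puts sample.gsub(/\s/, "").include?("\u3000") # prints "true"
```

Removing all whitespace also joins Latin words together, so this suits pages that contain Japanese text only.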

I have also solved the OR keyword search and sorting by book title.

http://japanese-ww2-camera-ad.tk/

Thank you for your good software!

ongaeshi commented 6 years ago

Thank you for the report and the patch. (I'm sorry I did not reply sooner.)

This is a very good solution. I did not know about the raw option, so I learned a lot.

Thank you for the good solution and for a good Honyomi use case!

ongaeshi / honyomi-web

honyomi recognize single letter rather than single word (Vertical text pdf) #3