Open dinosauria123 opened 7 years ago
I have added a third vertical-text example, made with ABBYY FineReader for ScanSnap.
The text is recognized incorrectly, but note that the top character "縦" ends up as the last character in Honyomi.
There seems to be a pattern, but the only one I could find is that the character order is reversed when \n is included in the recognized text.
I hope this helps in some way.
Maybe this problem could be solved on my web page by also querying the reversed keyword, e.g. "カメラ" OR "ラメカ", because old Japanese horizontal text was also written right to left.
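The reversed-keyword idea above could be sketched in Ruby roughly as follows. This is only a sketch under assumptions: `with_reversed` is a hypothetical helper, not part of Honyomi, and it assumes the search backend accepts `OR` between double-quoted phrases. It does not escape double quotes inside the keyword.

```ruby
# Hypothetical helper (not part of Honyomi): builds a query string that
# matches a keyword written in either direction.
def with_reversed(keyword)
  # Reverse by grapheme cluster so that a base character and a separate
  # combining mark (such as a combining dakuten) stay together.
  # String#grapheme_clusters needs Ruby 2.5 or later.
  reversed = keyword.grapheme_clusters.reverse.join

  # A single character or a palindrome needs no OR clause.
  return %Q("#{keyword}") if reversed == keyword

  %Q("#{keyword}" OR "#{reversed}")
end

puts with_reversed("カメラ")
```

The result would then be passed wherever the raw query string is used today. Whether that place accepts this syntax unchanged is an assumption to verify against the search backend.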
Honyomi uses pdftotext to extract text. Maybe pdftotext can't parse all vertical text. (The parsed text can be seen with Honyomi's Text button.)
I will add a feature to Honyomi for editing a page's text. Please give me some time.
# It is a draft
$ honyomi edit 1 -p 2 "Change page 2 text on book 1."
Thank you for your reply.
I have some additional questions about slightly modifying Honyomi to suit WWII-era Japanese advertisements. I am not familiar with Ruby, so please give me some suggestions.
I want to query a keyword as (keyword OR keyword.reverse). How should I modify the code?
q = Query.new(params[:query])
if @params[:b] == '1'
@header_info = %Q|<a href="/">#{@database.books.size}</a> books, <strong>#{@database.bookmarks.size}</strong> bookmarks.|
render_bookmarks(@database.bookmarks, [{key: "timestamp", order: "descending"}])
else
@header_info = %Q|<strong>#{@database.books.size}</strong> books, <a href="/?b=1">#{@database.bookmarks.size}</a> bookmarks.|
r = @database.books.map { |book|
<<EOF
<li>#{book.id}: <a href="/v/#{book.id}">#{book.title}</a> (#{book.page_num}P)</li>
Thank you again for all your efforts to improve Honyomi. I'm looking forward to your new version.
I fixed this problem myself.
I used the -raw option of pdftotext and got the output text in the correct order for vertical text.
The output for vertical text includes many CRs and spaces, so I removed them.
The code of the modified pdf.rb is here.
o, e, s = Open3.capture3("pdftotext -raw -f #{page_no} -l #{page_no} $
break if s.exitstatus != 0
text = File.read(outfile, encoding: Encoding::UTF_8)
if String.method_defined? :scrub
text = text.scrub('?')
end
result << text.gsub(/(\s)/,"")
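For readers who want to try the same approach outside Honyomi, here is a minimal standalone sketch. It is not the actual pdf.rb: the method names are hypothetical, it reads the pdftotext output from stdout (passing `-` as the output file) instead of a temporary file, and it uses `[[:space:]]` rather than `\s` so that full-width spaces (U+3000) are removed as well.

```ruby
require "open3"

# Hypothetical helper: remove every whitespace character, including CR, LF
# and the full-width space U+3000, which \s does not match in Ruby.
def normalize_vertical_text(text)
  text.gsub(/[[:space:]]/, "")
end

# Hypothetical helper: extract one page in content-stream order (-raw) and
# normalize it. Returns nil when pdftotext exits with a non-zero status.
def extract_page_text(pdf_path, page_no)
  out, _err, status = Open3.capture3(
    "pdftotext", "-raw", "-enc", "UTF-8",
    "-f", page_no.to_s, "-l", page_no.to_s,
    pdf_path, "-" # "-" sends the extracted text to stdout
  )
  return nil unless status.success?

  # Replace invalid byte sequences, as the original code does with scrub.
  text = out.force_encoding(Encoding::UTF_8).scrub("?")
  normalize_vertical_text(text)
end

# Example call (requires pdftotext on the PATH and a real PDF file):
#   puts extract_page_text("tate.pdf", 1)
```

Removing all whitespace suits Japanese text, which does not separate words with spaces, but it would also join words in any Latin-script passages on the same page.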
I have also solved the OR keyword search and sorting by book title.
http://japanese-ww2-camera-ad.tk/
Thank you for your good software!
Thank you for the report and the patch. (I'm sorry I did not reply sooner.)
This is a very good solution. I did not know about the raw option, so I learned a lot.
Thank you for the good solution and for such a good use case for Honyomi!
Sorry, I am back again.
I found that the text in some vertical-text PDFs is not recognized correctly.
My own Honyomi server is here.
http://104.197.98.173/
I have added two vertical Japanese text PDFs, tate.pdf and tate2.pdf.
tate.pdf was generated from Google Cloud Vision OCR output, processed with gcv2hocr and hocr-tools from GitHub.
tate2.pdf was generated from PowerPoint using "Save as PDF".
In Adobe Reader, searching for "テキスト" works correctly in both PDFs.
But on the Honyomi server, it cannot find "テキスト" in tate.pdf. I could find it by entering "トスキテ".
In the case of tate2.pdf, Honyomi cannot find "テキスト" as a whole word, but it can find single characters such as "テ" and "キ".
If you can find the cause of this problem, please fix it.
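One way to narrow down symptoms like these is to check the extracted page text (as shown by Honyomi's Text button) against the keyword in a few variants. The sketch below is a hypothetical diagnostic, not part of Honyomi, and the two causes it tests for (reversed character order, and characters separated by whitespace) are guesses based on the behaviour described above.

```ruby
# Hypothetical diagnostic: reports how a keyword can be found in a page's
# extracted text, trying progressively looser comparisons.
def diagnose_match(text, keyword)
  reversed = keyword.reverse
  # Same text with every whitespace character removed (including U+3000).
  compact = text.gsub(/[[:space:]]/, "")

  return :direct if text.include?(keyword)
  return :reversed if text.include?(reversed)
  return :after_stripping_whitespace if compact.include?(keyword)
  return :reversed_after_stripping_whitespace if compact.include?(reversed)

  :not_found
end

# Characters split onto separate lines only match once whitespace is removed.
p diagnose_match("テ\nキ\nス\nト", "テキスト")
```

A result of `:reversed` would match the tate.pdf symptom, and `:after_stripping_whitespace` would be consistent with single characters matching while the whole word does not, as with tate2.pdf.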