Issues with extracted text - extra spaces, incorrect characters etc.

nasheqlbrm commented 1 year ago

Take doii-rsb-0001-100-01.txt as an example. The text extracted using PyTesseract is as follows:

ォ ョ ン が 行 は 刀 英 貸 が 全 岡 的 に ボイラ コ ッ ト さ れ 、 輝 生 も 反 政府 的 ス ョ ー ガ ン を 掃 げ 大 示威 軍 動
を 展開 し 、 ボ ー ス 了 料 放 を 叫ぶ に 至 つ た 。

系 く 大 上 の 反 英 象 運 の 謝 ま る と 共に 七 月 六 日 回 民 會 議 委員 命 は 五 日 に 豆 る 全 識 の 結果 印度 問
題 を 解決 する に は 完全 旨 立 の 外 方 途 な き 問 を 決議 し 、 中 央 立法 議 儲 員 が 完全 に 信任 する 印度 負
立 他 政 府 佑 織 を 必要 と し た 。

成 委員 命 の 決 難 は 七 月 二 十 七 日 プ ー ナ 1 に 開か れる 購 殿 合議 大 剣 に 於 て 可決 せら れる で あら
うぅ う 。 然 し | 方 に 終 て 英 財 は 印度 人 の 兵士 、 技 術 家 、 人 夫 働 者 の 張 制 役 集 を 行 、、 足 が 反 圭 を 強 礎
する 箇 全 園 に 互 り 示 捕 家宅 揚 索 が 行 は れ 、 本 印 を 不安 に 了 路 て ゐる 。

論 今 後 の 情 益 は 験 言 を 計 さ な い が 、 英 剛 に し て 印度 の 完全 手 立 を 認め な い 限 り 前 六 の 如く
武力 革命 の 可能 性 も ある わけ で ある 、 只 ベ 問題 は 今日 の 印度 人 が 何等 武器 を 有 し な ぁ い 牙 で ある が
世界 情勢 の 髪 化 に 令 つ て は 印度 も 武器 を 有 し 得る の で ある 。 釣 ほ 現在 西北 園 境 州 の トラ イプ の
み は 武 竣 解除 を 受け て 居 ち けず 今 表 大 隊 急 勤 後 各地 に 反 英 抗 当 を 行 つ て わる が 、 是 等 と 連絡 す
る 医 に 依 つ て も 武器 を 獲得 し 得る で あら ろう 。

However, if we use Google Translate on the image, the extracted text is:

ションが行は英貨が全国的にボイコットされ、生も反政府的スローガンを掲げ大示威運動 を展開し、ボース釋放を叫ぶに至つた。 斯く大衆の反英氣運の高まると共に七月六日國民會議委員会は五日に亘る會議の結果印度問 題を解決するには完全獨立の外方途なき事を決議し、 中央立法議員が完全に信任する印度獨 立政府組織を必要とした。 此委員會の決議は七月二十七日プーナーに開かれる國民會議大会に於て可決せられるであら う。然し一方に於て英國は印度人の兵士、技術家勞働者の強制徴集を行い、是が反對を強歴 1 する爲全國に亙り逮捕家宅捜索が行はれ、全印を不安に陥れてゐる。 -182- 勿論今後の情勢は豫言を許さないが、英國にして印度の完全獨立を認めない限り前述の如く 武力革命の可能性もあるわけである、唯問題は今日の印度人が何等武器を有しない點であるが 世界情勢の變化に依っては印度も武器を有し得るのである。現在西北国境州のトライプの みは武装解除を受けて居らず今次大戦勃發後各地に反英抗争を行つてゐるが、是等と連絡す る事に依っても武器を獲得し得るであらう。

We can see that the first set of text has:

extra spaces between characters
extraction errors. For example, compare the first three characters of each (they should be ション, not ォョン)

These lead to a substantial drop in the quality of the translation. Pasting the first bit of text into Google Translate results in gibberish, namely:

In the early days of the war, swordsmen were lent to the entire Oka area, and Teruo also launched a great military demonstration to wipe out the anti-government gangs.
He then shouted, "Freedom from Bose."

The results of the omniscience that life is born in the fifth day, along with the gratitude of the great anti-British fortunes of the people of July 6th.
In order to solve these problems, we must pass resolutions on extremely difficult questions that are entirely on the point of view, and have the full confidence of the members of the central legislative body in India.
The other government needed Yuori.

The decision of the members of the committee will be passed at the Great Sword Conference to be held on July 27th, Pune 1.
Woohoo. However, in the end, the British Empire established a system for Indian soldiers, engineers, and laborers, and their feet strengthened the anti-Kei movement.
All the gardens were lined with ropes showing signs, and people were anxiously making their way to the final seal.

The argument is that the future interests will not be considered as an experiment, but as long as Yinggang does not accept India's complete solution, it will be the same as in the previous 6.
There is a possibility of an armed revolution, but the only problem is that today's Indians do not have any weapons.
Given the changing world situation, India could also possess weapons. Currently fishing for tripe in the Seihoku region.
My army has been dismantled, and I am currently on duty in the front battalion and carrying out anti-British resistance in various places, but I am contacting you.
Depending on the doctor, he could also acquire weapons, but he would not be able to do so.

Contrast this with the results when we paste in the second Japanese text (I have slightly altered the translation by adding line breaks):

During this period, British currency was boycotted nationwide, and students raised anti-government slogans and staged a grand demonstration movement, calling for the release of Boris.
As anti-British sentiment among the masses grew, the National Assembly Committee on July 6th, after a five-day meeting, resolved that the only way to solve the Indian problem was to leave India completely alone. They needed an independent government organization in India that they trusted.
The resolution of this committee will be passed at the National Assembly meeting to be held in Pune on July 27th. However, on the other hand, the British forcibly conscripted Indian soldiers and engineers, who were forced to rebel. .
-182-
Of course, the future situation does not allow any criticism, but unless the British recognize India's complete independence, there is a possibility of an armed revolution as mentioned above.The only problem is that today's Indians do not have any weapons at all. The point is that depending on changes in the world situation, India may also have weapons. Currently, only the girl in Tripe in the northwestern border province has not been disarmed and is carrying out anti-British protests in various places after the outbreak of the current war, but she has been able to acquire weapons even though she has contacted Koreto. I hope to get it.

There are some issues here too (Boris instead of Bose), but the second translation reads much better.

nasheqlbrm commented 1 year ago

Safari -> Develop -> Show Web Inspector
Then search for some uncommon word that appears in the translation; I used "conference".

nasheqlbrm commented 1 year ago

Let me try running 0_preprocess.ipynb with a dpi of 220 (instead of 200) to see if a higher-dpi image helps fix the issue with the first three characters here.
I also need to think about how to remove the extra spaces.
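
One way to handle the extra spaces is to post-process the extracted text with a regular expression that removes whitespace only when it sits between two Japanese characters. This is a minimal sketch, not something validated on the full corpus: the name remove_cjk_spaces and the chosen Unicode ranges are assumptions for illustration, and line breaks are left alone, so sentences split across lines stay split. Tesseract also has a preserve_interword_spaces config variable that might be worth trying separately; whether it helps on these pages is untested.

```python
import re

# Character ranges treated as "Japanese" for this sketch (an assumption):
# CJK punctuation, hiragana, katakana, CJK ideographs, full-width forms.
_CJK = r"\u3001-\u303f\u3040-\u309f\u30a0-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef"

# Spaces or tabs (ASCII or ideographic) with a CJK character on both sides.
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t\u3000]+(?=[{_CJK}])")


def remove_cjk_spaces(text: str) -> str:
    """Remove whitespace that sits between two CJK characters.

    Spaces next to ASCII letters, digits, or symbols are kept, and
    newlines are never touched.
    """
    return _SPACE_BETWEEN_CJK.sub("", text)


if __name__ == "__main__":
    print(remove_cjk_spaces("を 展開 し 、 ボ ー ス"))
```

Because the lookarounds are zero-width, every gap in a run such as "が 行 は" is removed in a single pass, while a stray ASCII character like the "1" in "プ ー ナ 1 に" keeps its surrounding spaces and stays easy to spot.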

nasheqlbrm commented 1 year ago

A higher dpi (220) did not fix the incorrect first character in the example above. I will try an experiment where I zoom into the image a bit more to see if that helps PyTesseract.

nasheqlbrm commented 1 year ago

I tried zooming into the image before text extraction, and the results were mixed.

Later I realized that maybe I should create the images as .png rather than .jpg. I redid the text extraction based on .png images in c3571e1.

For the example in this issue, we now get the following text:

1 深
e

マト コン な トー

人

三
っ
-
四
】

ショ ン が 行 は 刀 英 貸 が 全 賠 的 に ポイ ョ ツ ト さ れい 民生 も 反 政 府 的 ス ョ ー ガ ン を 提げ 大 未成 天 和
を 展開 し 、 ボ ー ス 料 放 を 昌 ぶ に 至 つ た 。

背く 大 の 反 英介 運 の 謝 ま る と 共に 七 月 六 日 周 民 合議 委員 合 は 五 日 に 豆 る 合議 の 結果 印度 問
由 を 解決 する に は 完全 尾 立 の 外 方 途 な き 導 を 決議 し 、 中 央 立法 護 合 員 が 完全 に 信任 する 印度 放
立 人 政 府 知 織 を 必要 と し た 。

皮 委 員 全 の 決 難 は 七 月 二 十 七 日 プ ー ナ 1 に 開か れる 周 兵 合議 大 剣 に 於 て 可決 せら れる で あら
う 然し | 方 に 災 て 英 享 は 印度 人 の 兵士 、 技 術 家 、 符 働 者 の 張 制 後 集 を 行い 、 大 が 反 針 を 居 放
する 絡 全 園 に 互 め 示 捕 家宅 揚 索 が 行 は れ 、 公 印 を 不安 に 路 れ て わる 。

勿論 今後 の 情 益 は 験 言 を 計 さ な い が 、 英 剛 に し て 印度 の 完全 指 立 を 認め な い 限 り 前 送 の 如く
武力 革命 の 可能 性 も ある わけ で ある 、 只 問 題 は 今日 の 印度 人 が 何等 武器 を 有 し な い 臣 で ある が
世 困 情勢 の 愛 化 に 信 つ て は 印度 も 武器 を 有 し 得る の で ある 。 准 ほ 現在 西北 賠 考 用 の トラ イプ の
み は 武 二 解 除 を 受け て 居 ら ちず 今 表 大 隊 該 勤 北 後 各地 に 反 英 抗 を 行 つ て わる が 、 是 等 と 連絡 す
る 攻 に 依 つ て も 武器 を 獲得 し 得る で あら う 。

So we see that:

spaces still seem to be an issue
the first 13 lines seem to be junk
on a positive note, the issue with the first character appears to be fixed now.
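
The junk lines at the top could be dropped with a simple heuristic: skip everything before the first line that has enough non-whitespace characters to look like body text. A rough sketch, assuming the junk always comes before the body and is made of short fragments; drop_leading_junk and the min_chars threshold of 10 are hypothetical and would need tuning against more pages.

```python
def drop_leading_junk(text: str, min_chars: int = 10) -> str:
    """Drop lines before the first line that looks like body text.

    A line counts as body text when it has at least `min_chars`
    non-whitespace characters. If no line qualifies, the text is
    returned unchanged rather than emptied.
    """
    lines = text.splitlines()
    for index, line in enumerate(lines):
        non_space = "".join(line.split())
        if len(non_space) >= min_chars:
            return "\n".join(lines[index:])
    return text


if __name__ == "__main__":
    sample = "1 深\ne\n\nマト コン な トー\n\n人\n\nショ ン が 行 は 刀 英 貸 が 全 賠 的 に\nを 展開 し"
    print(drop_leading_junk(sample))
```

Only the leading lines are filtered, so a short but legitimate line later in the page (such as the tail of a paragraph) is kept.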

nasheqlbrm commented 1 year ago

Taking a deeper look using Simon Willison's tokenizer notebook, we can further see that the text extraction is not quite correct.

Correct:

Incorrect: here I manually removed spaces to make the problem obvious. The first line is the incorrect text and the second line is the correct text.
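
The same comparison can be done programmatically instead of by eye. The sketch below strips all whitespace from both strings and uses difflib from the standard library to list the character spans that differ; char_diffs and similarity are hypothetical helper names, and treating the Google Translate text as the reference is an assumption, since it has errors of its own.

```python
import difflib


def strip_ws(text: str) -> str:
    """Remove all whitespace, including the spaces inserted by OCR."""
    return "".join(text.split())


def char_diffs(ocr_text: str, reference: str) -> list[tuple[str, str, str]]:
    """Return (operation, ocr_span, reference_span) for every mismatch."""
    a, b = strip_ws(ocr_text), strip_ws(reference)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def similarity(ocr_text: str, reference: str) -> float:
    """Similarity ratio in [0, 1] between the whitespace-stripped texts."""
    a, b = strip_ws(ocr_text), strip_ws(reference)
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    for diff in char_diffs("英 剛 に し て 印度 の", "英國にして印度の"):
        print(diff)
```

A single ratio per page makes it easier to tell whether a change such as a different dpi or image format actually improved the extraction.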

nasheqlbrm commented 1 year ago

[TO TEST]: It's possible that the slight skew of each page is degrading the extracted text.
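
One way to test the skew hypothesis is to estimate each page's skew angle with a projection profile: rotate the dark-pixel coordinates through a range of small angles and keep the angle at which they pile up into the sharpest rows. This is a rough pure-Python sketch that assumes horizontal text lines. The name estimate_skew, its parameters, and the assumption that dark pixels have already been extracted as (x, y) pairs are all illustrative, and the sign convention should be checked against whatever image library performs the actual rotation.

```python
import math
from collections import Counter


def estimate_skew(points, max_angle: float = 5.0, step: float = 0.25) -> float:
    """Estimate the rotation (degrees) that best aligns points into rows.

    `points` is a sequence of (x, y) coordinates of dark pixels. For each
    candidate angle the points are rotated, binned by rounded row, and
    scored by the sum of squared row counts; sharper rows score higher.
    """
    points = list(points)
    best_angle, best_score = 0.0, -1
    steps = int(round(max_angle / step))
    for k in range(-steps, steps + 1):
        angle = k * step
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        # Row coordinate of each point after rotating by `angle`.
        rows = Counter(round(x * sin_t + y * cos_t) for x, y in points)
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


if __name__ == "__main__":
    slope = math.tan(math.radians(2.0))
    # Three synthetic "text lines" tilted by 2 degrees.
    skewed = [(x, y0 + x * slope) for y0 in (0, 40, 80) for x in range(400)]
    print(estimate_skew(skewed))
```

If most pages come back with an angle close to zero, skew is unlikely to be the main cause of the extraction errors; if not, rotating each page by its estimated angle before OCR would be the next experiment.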

nasheqlbrm / jubilant-chainsaw