pdp11 / camexec


Textual transcript via OCR #1

Open · leovinus2001 opened 2 months ago

leovinus2001 commented 2 months ago

Nice images, but could someone elaborate on what the plan is here? Probably something like (1) make the images more readable, (2) convert the images to text, (3) run something on a simulator. Which directory first? For (1), as an example, I attached a cleaner version of the top part of https://github.com/pdp11/camexec/blob/master/photos/ddt/20240331_091745.jpg

larsbrinkhoff commented 2 months ago

Yes, that would be the plan. I asked @rcornwell about OCR, so I'm curious to see what results he can get from these photos. Maybe a special purpose Tesseract model could be another option.

The camexec directory should probably be the first, since it's the operating system. Next ddt.

leovinus2001 commented 2 months ago

> Yes, that would be the plan. I asked @rcornwell about OCR, so I'm curious to see what results he can get from these photos. Maybe a special purpose Tesseract model could be another option.

That makes sense as actually @rcornwell pointed this topic out to me so there we are.

> The camexec directory should probably be the first, since it's the operating system. Next ddt.

Will have a look and keep you posted.

larsbrinkhoff commented 2 months ago

Oh, so you are Rich's OCR friend? Thanks so much for taking a look! I'm curious what you can come up with.

leovinus2001 commented 2 months ago

While looking at five pages picked at random, I can see an issue straight away. The paper was sometimes folded up, which leads to subtle but non-linear distortions that complicate the OCR. One of the five is out of focus as well. Is there any chance the photos/scans could be taken again as proper "flat" paper scans? It would save work ;) In the meantime, I'll try a few things.

larsbrinkhoff commented 2 months ago

The pages with a crease can be typed in manually. Only the first page in each listing has been folded, so it should not be a big job. Those that were out of focus have been photographed again:
https://github.com/pdp11/camexec/commit/864de35d9e39b481e054baef41da7fcb670206a3

Please let me know if you see any other pages that are too blurry.

leovinus2001 commented 2 months ago

Ok. The question had to be asked :) It is a fun challenge for a hobby. To set expectations: reasonable text can be extracted here, but it will not be perfect due to the quality of the inputs.

In any case, I am working on those five test pages at the moment to gauge what transcription quality I can get out. That involves image enhancement, producing high-contrast PNGs for OCR, transcribing the ground truth, and retraining a model on this type of data. It is progressing but will take a few days.

Then I will attach the OCR input PNGs, the recognition results, and the ground truth. You can then check the PNGs against the recognition results and ground truth and see whether that matches expectations. After all, the goal here is to produce a compilable text file and to decide which process is the easiest way to achieve that.

PS: The top part of 20240331_085227 is blurry as well.

larsbrinkhoff commented 2 months ago

Thanks!

20240331_085227 should be the same as camexec-003-1.jpg

leovinus2001 commented 2 months ago

Attached is a tgz file with OCR on 8 different pages: seven from camex/ and one from ddt/ (the original example in post #1).

Please have a look at whether you like the quality of the "recognized text" files. A side-by-side comparison of reference vs. recognized should show the effects. I also included data/camex/error.pattern.apr23.png as one example of the error patterns.

apr23.tgz

As for the process, I took your JPG files like 20240331_085227, split each page into two pages (_1 and _2), did image enhancement via ImageMagick, and then ran an OCR tool. The zero characters here are quite different :) The low-contrast original JPGs also lead to image artifacts in the black/white/gray PNGs, but the OCR can handle most of it. Due to the non-linear "bendy" deformations I had to slice some pages into 2 or 3 vertical or horizontal slices, like _1_1, _1_2, _1_3, and then combine them again.
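For readers unfamiliar with this kind of preprocessing, the split-and-enhance step can be sketched roughly as below. This is a hedged illustration, not the exact pipeline used here: it assumes ImageMagick 7's `magick` CLI, and the crop geometry and threshold parameters (`-lat 25x25-5%`) are guesses to be tuned on a sample page.

```python
def split_page_cmds(src: str, left: str, right: str) -> list[list[str]]:
    """ImageMagick commands to split a two-page photo into _1 and _2 halves."""
    return [
        ["magick", src, "-gravity", "West", "-crop", "50%x100%+0+0", left],
        ["magick", src, "-gravity", "East", "-crop", "50%x100%+0+0", right],
    ]

def enhance_cmd(src: str, dst: str) -> list[str]:
    """ImageMagick command turning a low-contrast JPG into a
    high-contrast grayscale PNG for OCR (parameters are illustrative)."""
    return [
        "magick", src,
        "-colorspace", "Gray",   # drop color information
        "-normalize",            # stretch contrast to the full range
        "-lat", "25x25-5%",      # local adaptive threshold for uneven lighting
        dst,
    ]

if __name__ == "__main__":
    cmds = split_page_cmds("20240331_085227.jpg",
                           "20240331_085227_1.png", "20240331_085227_2.png")
    cmds.append(enhance_cmd("20240331_085227_1.png",
                            "20240331_085227_1_ocr.png"))
    for cmd in cmds:
        # Print instead of executing; once the parameters look right on a
        # sample page, run each with subprocess.run(cmd, check=True).
        print(" ".join(cmd))
```

A local adaptive threshold (rather than a single global one) matters here because the creased, unevenly lit photos have brightness that varies across the page.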

If you know of a better image enhancement, that would help as well :)

On the four pages of training data, the per-character error rate is less than ~1%. Most issues are with characters like ':' and 'W', which need more examples. Other pages have a somewhat higher error rate due to layout issues and unknown characters (like a blob for '#'). Accuracy can be improved with a 20-minute retraining. This is via my own tools, as Tesseract does not like this data.
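For reference, a per-character error rate like the one quoted above is commonly computed as the Levenshtein edit distance between the reference and recognized text, divided by the reference length. A stdlib-only sketch (function names are mine, not from the tooling used in this thread):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions needed to turn ref into hyp (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, rc in enumerate(ref, 1):
        cur = [i]
        for j, hc in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rc != hc)))   # substitution
        prev = cur
    return prev[-1]

def cer(reference: str, recognized: str) -> float:
    """Character error rate: edit distance over reference length."""
    return edit_distance(reference, recognized) / max(len(reference), 1)

# Two OCR-style confusions ('O' vs '0', ',' vs '.') in a 9-character line:
print(f"{cer('MOV R0,R1', 'M0V R0.R1'):.2f}")  # prints 0.22
```

At a 1% character error rate, a page of roughly 3000 characters would still leave on the order of 30 characters to fix by hand, which is why the proofreading pass mentioned below remains necessary.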

Before doing any more work, I'd like to check with you whether this is useful to you.

It does require manual error checking of every page afterwards to get text like the "references/pdp11/camex/" files, which can ultimately be compiled. On the good side, sometimes the OCR is better than me and can tell me whether a character is a '=' or '-'. On the bad side, proofreading all the text will require time from you.

If you like this, I'll convert another 5 or 10 pages next week to verify the process, and then we take it from there. There are only 123 JPGs in camex, so it's not the end of the world.

PS: The tar ball has "per image" something like

larsbrinkhoff commented 2 months ago

Thanks! I have downloaded the tarball and will take a look.

larsbrinkhoff commented 2 months ago

I checked, and this looks very useful indeed. Thanks a lot. Of course, some pages fare worse than others, but those can be fixed manually or typed in from scratch. The majority seem good and only need minor corrections.

If it's useful for training, I can take a set of pages, do the corrections and send them back to you.

leovinus2001 commented 2 months ago

Yeah, it seems a useful transcript can be produced with a mix of some manual work plus mostly automatic transcription.

A few more pages (5?) with reference transcriptions would be useful, especially if you select pages with more "rare" symbols such as >, <, ;, W, and Y, which I can use to retrain and then apply to everything again.

larsbrinkhoff commented 1 month ago

Here are five more pages. I tried to select those that have the glyphs you requested.

samples.zip

leovinus2001 commented 1 month ago

> Here are five more pages. I tried to select those that have the glyphs you requested.

Cool! Thanks for those. I will integrate these into the training materials, and then we can start processing the other pages.

Am in the middle of a move but I'll see what I can do.

larsbrinkhoff commented 1 month ago

Thanks! No urgency, take your time. For me this is more a long-term back-burner project.

larsbrinkhoff commented 1 month ago

This may be of interest to you: https://mzucker.github.io/2016/08/15/page-dewarping.html

leovinus2001 commented 1 week ago

Just a quick note that I have not forgotten about this. Will have another go in a few weeks.

larsbrinkhoff commented 6 days ago

Thanks, sounds great!