Correct transcription of Mueller vol 2a pages 1-5 (contents) and index pages 1-5, 124-129

nathangibson commented 4 years ago

See https://github.com/usaybia/usaybia-data/wiki/Transcription-of-Sources-from-Page-Images

After page 5 you can clean up the layout but wait for us to retrain the model before transcribing more.

nathangibson commented 4 years ago

@RobinSchmahl I was confused yesterday and forgot that we need both the contents and the index. What I showed you yesterday was the table of contents. In addition to that, please also do the following in the document "IU-mueller-1884-indices-ids":

pages 1-5
pages 124-129 (because this is a different font it will need more training)

Between the name and the number please put a tab, and between numbers please also put a tab. You will also see a hyphen (-) used as a cross-reference to other entries. The letter ب is used to indicate vol. 2. There are also asterisks (*). All of these can be transcribed as you see them.

You can post any questions about the task in this issue.

nathangibson commented 4 years ago

Also for the index that starts on page scan 124, please use 5 underscores (ـــــ) for the blanks that show where a name is repeated from the previous line. E.g., on page scan 125, line 1-19:

ـــــ ـــــ بن زهر الحفيد ب 67 67 الى 74 78 80

Note the tabs (marked here with ↤):

ـــــ ـــــ بن زهر الحفيد↤ب↤67↤67 الى 74↤78↤80
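For anyone checking pasted index lines outside Transkribus, here is a minimal Python sketch of how a line in this format could be split on tabs. The function name, the returned dictionary, and the way the example line is rebuilt are illustrative assumptions, not part of the project's tooling.

```python
BLANK = "ـــــ"  # the five-character blank exactly as written in the comment above

def split_index_line(line):
    """Split one tab-separated index line into its name and reference fields."""
    fields = line.split("\t")
    name, refs = fields[0], fields[1:]
    # Count how many leading words of the name are blanks repeated from the previous line.
    repeated = 0
    for word in name.split(" "):
        if word != BLANK:
            break
        repeated += 1
    return {"name": name, "repeated_words": repeated, "refs": refs}

# The example from page scan 125, rebuilt with real tab characters.
name = BLANK + " " + BLANK + " بن زهر الحفيد"
example = "\t".join([name, "ب", "67", "67 الى 74", "78", "80"])
result = split_index_line(example)
print(result["repeated_words"], result["refs"])
```

A range such as "67 الى 74" stays in one field because it contains spaces rather than tabs.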

nathangibson commented 4 years ago

@RobinSchmahl How is this going? I see that you've done some of the table of contents.

RobinSchmahl commented 4 years ago

It's going, but slowly. The software is a bit difficult to handle especially when working with Arabic text and numbers (!). Have you checked the index and whether I've worked correctly so far?

nathangibson commented 4 years ago

It's looking very good overall. There are some minor points to pay attention to:

When I copy and paste the text into a text editor, I see that many of the numbers are on the wrong side of the name. E.g. on image 2 lines 1-4 to 1-8 are correct, the lines after that are not. It should be number, then tab, then name. I know this can be hard to see and edit in Transkribus. One possibility is to copy it from Transkribus into your preferred text editor for right-to-left text, correct it, then paste it back in.
Please see the "Cleaning up the layout" section of the wiki page about text regions, page numbers, and catchwords. There should be a separate text region (green box) for page numbers, for catchwords, and for the main text. In other words, anything outside the printed double black box should have its own text region, be transcribed, and have its structure labeled. I've done this on image 2 (page 1) as an example. Anything at the bottom of the page that is not a catchword can be labeled as a "book-binding" structure.
Please make extra sure the numbers are correct. Mostly this is fine, but image 2, line 1-9 has "2" instead of "23".
Some lines have extra spaces in the transcription.
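Some of the points above (number, then tab, then name; no extra spaces) can be checked mechanically once the text is pasted into an editor. The following is a rough Python sketch under that assumption. The function name and the sample entry "الباب الاول" are hypothetical, and a check like this cannot catch a wrong digit such as "2" for "23".

```python
def check_contents_line(line):
    """Return a list of format problems found in one table-of-contents line."""
    problems = []
    if line != line.strip(" "):
        problems.append("leading or trailing space")
    if "  " in line:
        problems.append("double space")
    fields = line.strip(" ").split("\t")
    if len(fields) < 2:
        problems.append("no tab between number and name")
    elif not fields[0].strip().isdigit():
        # str.isdigit() is also True for Arabic-Indic digits such as "٢٣".
        problems.append("first field is not a number")
    return problems

print(check_contents_line("23\tالباب الاول"))   # number first
print(check_contents_line("الباب الاول\t23"))   # name first
```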

Hope this helps. Let me know if you have questions!

RobinSchmahl commented 4 years ago

I've just finished the first five pages in the first document. Maybe you should check whether I've done everything correctly, but it should be fine now. (I've double-checked in Word.) Do I have to assign a special structure to the main text field?

Thanks!

nathangibson commented 4 years ago

This looks really great. The numbers are now on the correct side.

You don't need to assign a structure to the main text field.

There were only two types of small things, which I corrected. One is that some of the numbers still had extra spaces at the beginning of the line. I just deleted these. The other is that sometimes the page number or catchword did not "belong" to the correct text region. (I'm learning this too.) You can see that by using the layout tab in the left sidebar. (See screenshot.) Sometimes even when you adjust the text regions on the page, the lines still belong to the old text region, and you have to drag them in the layout tab to the correct text region (as they are in the screenshot). You can also see this in the line numbers: "1-1" means first text region, first line. So you'll normally have 1-1 (page number), 2-1, 2-2, ... (text), 3-1 (catchword).
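To illustrate the "region-line" numbering described above, here is a small Python sketch that counts lines per text region from a list of line numbers. The function name and the sample list are assumptions for illustration; Transkribus itself is not involved.

```python
from collections import Counter

def region_line_counts(line_ids):
    """Count how many lines belong to each text region, given IDs like "2-14"."""
    counts = Counter()
    for line_id in line_ids:
        region, _line = line_id.split("-")
        counts[int(region)] += 1
    return dict(counts)

# A page laid out as described: page number, main text, catchword.
counts = region_line_counts(["1-1", "2-1", "2-2", "2-3", "3-1"])
print(counts)
```

If the first or last region holds more than one line, a page number or catchword line may still belong to the wrong text region.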

nathangibson commented 4 years ago

You could move on now to pages 1-5 and 124-129 in the index document (see the instructions above). I did this line on image 2 as an example: آدم عليه السلام ٩ ١٦ ٧٢ ٧٣ ٢٠٠ ٢٤٨ ب ١٣٠

In this text it can be very hard to see the difference between the digits "0" and "5". You may need to check this occasionally.
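One possible way to spot a misread "0" or "5" is to look for page numbers that run backwards within an entry. This Python sketch assumes the numbers in one entry (within one volume) are listed in ascending order, as they are in the example line above; that ordering is an assumption, not a rule stated for the index, and the function name is hypothetical.

```python
def out_of_order(numbers):
    """Return positions where a page number is smaller than the one before it."""
    # int() accepts Arabic-Indic digits, e.g. int("٢٤٨") is 248.
    values = [int(n) for n in numbers]
    return [i for i in range(1, len(values)) if values[i] < values[i - 1]]

good = out_of_order(["٩", "١٦", "٧٢", "٧٣", "٢٠٠", "٢٤٨"])
bad = out_of_order(["٩", "١٦", "٧٢", "٧٣", "٢٥٠", "٢٤٨"])  # "٢٠٠" misread as "٢٥٠"
print(good, bad)
```

A flagged position only means the line deserves a second look at the page image.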

Thanks!

nathangibson commented 4 years ago

@RobinSchmahl How is it going with this? It looks like you've worked through image 5?

Not sure if you're done correcting image 5, but I see that there are some issues with the character ب. It is used in the index to indicate volume 2. It looks like sometimes it is being recognized as a hyphen (-), sometimes without a space separating it, or sometimes not at all. See lines 2-1, 2-22, 2-24, 2-26 on image 5. Thanks!
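A quick check for a ب that is not standing alone in its own tab-separated field might look like the Python sketch below. It assumes the tab format requested earlier in this issue and a hypothetical function name. It cannot detect a ب that was recognized as a hyphen or dropped entirely, since a hyphen is also a legitimate cross-reference mark.

```python
VOLUME_2_MARK = "ب"

def glued_volume_marks(line):
    """Return reference fields where ب appears but is not alone in its field."""
    refs = line.split("\t")[1:]  # skip the name, which may itself contain ب
    return [f for f in refs if VOLUME_2_MARK in f and f.strip() != VOLUME_2_MARK]

ok = glued_volume_marks("آدم عليه السلام\t٢٤٨\tب\t١٣٠")
glued = glued_volume_marks("آدم عليه السلام\t٢٤٨\tب١٣٠")
print(ok, glued)
```

A field such as "ب ١٣٠" (space instead of tab) is flagged as well.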

nathangibson commented 4 years ago

@RobinSchmahl Congrats on finishing this! I'm closing the issue.

I've started training the model on the second index. Preliminary results are quite good (see image 130, where mostly there are only small issues except for line 2-4). So I'm hopeful we'll be able to use the text recognition on this section as well, pasting it into the spreadsheet.

For future reference: The spreadsheet we're pasting the OCR indices into is here: https://docs.google.com/spreadsheets/d/1qm4SUeYHBjCn43Wxms1Btn8SbhVEngiiG8h8NxTUjpc/edit#gid=1096270910
