usaybia / usaybia-data

Data for interreligious interaction in Near Eastern texts
MIT License

Correct transcription of Mueller vol 2a pages 1-5 (contents) and index pages 1-5, 124-129 #16

Closed by nathangibson 4 years ago

nathangibson commented 4 years ago

See https://github.com/usaybia/usaybia-data/wiki/Transcription-of-Sources-from-Page-Images

After page 5 you can clean up the layout, but please wait for us to retrain the model before transcribing more.

nathangibson commented 4 years ago

@RobinSchmahl I was confused yesterday and forgot that we need both the contents and the index. What I showed you yesterday was the table of contents. In addition to that, please also do the following in the document "IU-mueller-1884-indices-ids":

Please put a tab between the name and the number, and also between numbers. You will also see a hyphen (-) used as a cross-reference to other entries. The letter ب is used to indicate vol. 2. There are also asterisks (*). All of these can be transcribed as you see them.

You can post any questions about the task in this issue.

nathangibson commented 4 years ago

Also, for the index that starts on page scan 124, please use 5 underscores (ـــــ) for the blanks that show where a name is repeated from the previous line. E.g., on page scan 125, line 1-19:

ـــــ ـــــ بن زهر الحفيد ب 67 67 الى 74 78 80

Note the tabs (shown here as ↤):

ـــــ ـــــ بن زهر الحفيد↤ب↤67↤67 الى 74↤78↤80
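
As a rough illustration of the line format described above, the following Python sketch splits one transcribed index line on tabs and counts the leading repeat markers. The names `PLACEHOLDER`, `IndexEntry`, and `parse_index_line` are hypothetical, and the marker is assumed to be five U+0640 (tatweel) characters; adjust the constant if the transcription uses a different character.

```python
from dataclasses import dataclass
from typing import List

# Assumption: the repeat marker is five tatweel characters (U+0640).
PLACEHOLDER = "\u0640" * 5


@dataclass
class IndexEntry:
    repeats: int       # number of leading repeat markers
    name: str          # the part of the name written out on this line
    fields: List[str]  # tab-separated values after the name


def parse_index_line(line: str) -> IndexEntry:
    """Split a tab-separated index line into name and reference fields."""
    name_part, *fields = line.rstrip("\n").split("\t")
    words = name_part.split()
    repeats = 0
    # Count how many markers stand at the start of the name.
    while repeats < len(words) and words[repeats] == PLACEHOLDER:
        repeats += 1
    name = " ".join(words[repeats:])
    # Drop empty fields caused by doubled or trailing tabs.
    return IndexEntry(repeats, name, [f.strip() for f in fields if f.strip()])
```

A line with no tabs at all comes back with an empty `fields` list, which may be a quick way to spot lines where the tabs were forgotten.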

nathangibson commented 4 years ago

@RobinSchmahl How is this going? I see that you've done some of the table of contents.

RobinSchmahl commented 4 years ago

It's going, but slowly. The software is a bit difficult to handle, especially when working with Arabic text and numbers (!). Have you checked the index and whether I've worked correctly so far?

nathangibson commented 4 years ago

It's looking very good overall. There are some minor points to pay attention to

Hope this helps. Let me know if you have questions!

RobinSchmahl commented 4 years ago

I've just finished the first five pages in the first document. Maybe you should check whether I've done everything correctly, but it should be fine now. (I've double-checked in Word.) Do I have to assign a special structure to the main text field?

Thanks!

nathangibson commented 4 years ago

This looks really great. The numbers are now on the correct side.

You don't need to assign a structure to the main text field.

There were only two types of small things, which I corrected. One is that some of the numbers still had extra spaces at the beginning of the line. I just deleted these. The other is that sometimes the page number or catchword did not "belong" to the correct text region. (I'm learning this too.) You can see that by using the layout tab in the left sidebar. (See screenshot.) Sometimes even when you adjust the text regions on the page, the lines still belong to the old text region and you have to drag them in the layout tab to the correct text region (as they are in the screenshot). You can also see this in the line numbers -- "1-1" means first text region, first line. So you'll normally have 1-1 (page number), 2... (text), 3-1 (catchword).
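
The line numbering described here ("1-1" means first text region, first line) can be checked mechanically once the line IDs are available outside the tool. A minimal sketch, assuming the IDs can be obtained as plain strings; `parse_line_id` and `lines_per_region` are hypothetical helper names, not part of the transcription software:

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def parse_line_id(line_id: str) -> Tuple[int, int]:
    """Split an ID such as "2-22" into (text region, line within region)."""
    region, line = line_id.strip().split("-")
    return int(region), int(line)


def lines_per_region(line_ids: Iterable[str]) -> Dict[int, int]:
    """Count how many lines belong to each text region."""
    return dict(Counter(parse_line_id(i)[0] for i in line_ids))
```

On a page with a page number, a text block, and a catchword, regions 1 and 3 would be expected to hold one line each; a larger count may point to lines that are still attached to the wrong region.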

[Screenshot, 2019-11-08 14:04:50]

nathangibson commented 4 years ago

You could move on now to pages 1-5 and 124-129 in the index document (see the instructions above). I did this line on image 2 as an example:

آدم عليه السلام ٩ ١٦ ٧٢ ٧٣ ٢٠٠ ٢٤٨ ب ١٣٠

In this text it can be very hard to see the difference between the digits "0" and "5". You may need to check this occasionally.
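
One way to make such checks easier is to flag every number that contains one of the two confusable digits, so it can be compared against the page image. The sketch below assumes the index uses Arabic-Indic digits (U+0660 to U+0669), as in the example line; `to_ascii_digits` and `needs_digit_check` are hypothetical names.

```python
# Arabic-Indic digits zero through nine, built from their code points.
ARABIC_INDIC = "".join(chr(0x0660 + i) for i in range(10))
TO_ASCII = str.maketrans(ARABIC_INDIC, "0123456789")


def to_ascii_digits(text: str) -> str:
    """Replace Arabic-Indic digits with ASCII digits; other characters stay."""
    return text.translate(TO_ASCII)


def needs_digit_check(number: str) -> bool:
    """True if the number contains a 0 or a 5 in either digit system."""
    ascii_number = to_ascii_digits(number)
    return any(digit in ascii_number for digit in "05")
```

This only narrows down which numbers to look at; it cannot tell whether a given 0 or 5 was recognized correctly.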

Thanks!

nathangibson commented 4 years ago

@RobinSchmahl How is it going with this? It looks like you've gotten through image 5?

Not sure if you're done correcting image 5, but I see that there are some issues with the character ب. It is used in the index to indicate volume 2. It looks like sometimes it is being recognized as a hyphen (-), sometimes without a space separating it, or sometimes not at all. See lines 2-1, 2-22, 2-24, 2-26 on image 5. Thanks!
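
Two of these three problems can be searched for in the tab-separated fields of a line; a marker that is missing entirely cannot be found this way. The following is only a sketch: `volume_marker_warnings` and the message constants are hypothetical, and because the hyphen is also a legitimate cross-reference sign in this index, a lone hyphen is flagged for review only.

```python
import re
from typing import List, Tuple

VOLUME_MARKER = "\u0628"  # the letter ب, used in the index for vol. 2

# The marker directly next to a digit, with no separator between them.
# In Python 3, \d also matches Arabic-Indic digits in str patterns.
FUSED = re.compile(f"{VOLUME_MARKER}\\d|\\d{VOLUME_MARKER}")

MSG_FUSED = "volume marker not separated from number"
MSG_HYPHEN = "hyphen: cross-reference or misread volume marker?"


def volume_marker_warnings(fields: List[str]) -> List[Tuple[int, str]]:
    """Return (field position, message) pairs for fields worth rechecking."""
    warnings = []
    for position, field in enumerate(fields):
        if FUSED.search(field):
            warnings.append((position, MSG_FUSED))
        elif field.strip() == "-":
            warnings.append((position, MSG_HYPHEN))
    return warnings
```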

nathangibson commented 4 years ago

@RobinSchmahl Congrats on finishing this! I'm closing the issue.

I've started training the model on the second index. Preliminary results are quite good (see image 130, where mostly there are only small issues except for line 2-4). So I'm hopeful we'll be able to use the text recognition on this section as well, pasting it into the spreadsheet.

For future reference: The spreadsheet we're pasting the OCR indices into is here: https://docs.google.com/spreadsheets/d/1qm4SUeYHBjCn43Wxms1Btn8SbhVEngiiG8h8NxTUjpc/edit#gid=1096270910