Closed drdhaval2785 closed 7 years ago
One more observation. There are more than one entries encoded in one line in rare cases. They need systematic investigation and should be added to the XML properly as separate entries or sub-entries. See:
.{#aka#}100{#aka#}^1¦ 1. Unheil , Ka1s4i1kh. 24 , 17. [Schµ25] |{#aka#} ºAdj. = {%akutsita%} , S II , 236 , 2 (Ko. 14 v.u.). -- {%na1sti…kam2…s4iro…yes2a1m2…te…'ka1h2%} , H 43 , 288; º= {%kut2ilaga1min%} , H 43 , 349. [Schµ26] €5
[Schµ25] and [Schµ26] refer to two separate entries. Currently this feature is not incorporated into the XML.
Current web display
The first entry is missing from SCH XML file and subsequent products.
Great catch.
There are more than one entries encoded in one line in rare cases.
Yes, indeed. What I wonder is: if there is a '1. aka', why is there no '2. aka', but just 'aka' as a different part of speech?
the first entry missing
This is a bug in the first step of the headword detection process (hw0.py). The program skips several lines at the beginning of the file, until it runs into the first [Page] reference. Since this page reference is buried in the line (with the first form of 'a'), it erroneously skips that whole line also.
This would normally be easy to fix. However, the obvious program change would alter all the L-numbers of SCH, which would flow through to sanhw2 as thousands of changes.
I'll try to think of a way to make the correction without changing all the L-numbers.
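To make the header-skipping problem concrete, here is a minimal sketch. It is not the actual hw0.py code: the function names, the sample lines, and the `[Page...]` marker pattern are all assumptions made for illustration.

```python
import re

# Assumed shape of a page reference; the real marker in sch.txt may differ.
PAGE_RE = re.compile(r"\[Page[^\]]*\]")


def skip_header_buggy(lines):
    """Skip up to AND including the first line that contains a page marker."""
    it = iter(lines)
    for line in it:
        if PAGE_RE.search(line):
            break  # the marker line itself is consumed and lost
    return list(it)


def skip_header_keep_entry(lines):
    """Skip the header, but keep the marker line if it also carries an entry."""
    for i, line in enumerate(lines):
        if PAGE_RE.search(line):
            # Drop the line only when the marker is the sole content on it.
            start = i + 1 if PAGE_RE.fullmatch(line.strip()) else i
            return lines[start:]
    return []


# Placeholder lines, not real sch.txt content.
sample = [
    "; preface line",
    "; another preface line",
    "first entry text [Page001-1] more text",
    "second entry text",
]
print(len(skip_header_buggy(sample)))       # first entry is dropped
print(len(skip_header_keep_entry(sample)))  # first entry is kept
```

Note that the second variant yields one more record than the first, which is exactly why a naive fix would shift every subsequent L-number.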
more than one entries encoded in one line in rare cases
From the example, it may be that such cases are recognized by having more than one instance of [SchµXX] in a line of sch.txt. If so, there are 366 such cases.
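A count like this could be obtained with something along the following lines. This is only a sketch: the function name is made up, and it assumes that XX in [SchµXX] is always a run of digits.

```python
import re

# Assumes the sequence number inside [Schµ...] is purely numeric.
SCH_MARK = re.compile(r"\[Schµ\d+\]")


def multi_entry_lines(lines):
    """Return (line number, line) pairs that carry more than one [SchµNN] marker."""
    return [
        (n, line)
        for n, line in enumerate(lines, start=1)
        if len(SCH_MARK.findall(line)) > 1
    ]


# Abbreviated versions of the 'aka' line, for illustration only.
demo = [
    "first part [Schµ25] |second part [Schµ26]",
    "ordinary line [Schµ27]",
]
hits = multi_entry_lines(demo)
print(len(hits))
```

For the real file, the list of lines would come from reading sch.txt with the encoding it is actually stored in.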
It further appears that there is a vertical bar which identifies the end of the first part:
.{#aka#}100{#aka#}^1¦ .... [Schµ25] |{#aka#} ....
In fact, Thomas has this note in sch.txt:
Homophones marked by | have been added to the preceding entry
The [SchµXX] designation was added by Thomas as a sequence number for the 'outdented' (non-indented) lines in the print.
The case of 'aka' is an instance of an unnumbered homophone, because (this is speculation) not all the entries start with a number; namely, the 2nd entry for 'aka' in Sch. does not start with a number.
The case of 'a' is an instance of a numbered homophone.
The distinction is that the first character of each entry is a digit for the numbered homophones.
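The digit test can be sketched as below, using the two halves of the 'aka' line as input. The helper names are hypothetical, and the sketch assumes that the entry text starts right after the broken bar (¦), or right after the leading {#...#} when there is no broken bar.

```python
import re


def entry_body(segment):
    """Return the entry text that follows the headword markup."""
    seg = segment.strip()
    if "¦" in seg:
        return seg.split("¦", 1)[1].strip()
    # Segment after a vertical bar: drop the leading {#headword#}.
    return re.sub(r"^\{#[^#]*#\}", "", seg).strip()


def is_numbered(segment):
    """True when the first character of the entry text is an ASCII digit."""
    first = entry_body(segment)[:1]
    return first.isascii() and first.isdigit()


line = (".{#aka#}100{#aka#}^1¦ 1. Unheil , Ka1s4i1kh. 24 , 17. [Schµ25] "
        "|{#aka#} ºAdj. = {%akutsita%} , S II , 236 , 2 (Ko. 14 v.u.). [Schµ26]")
segments = line.split("|")  # assumes '|' occurs only as the homophone separator
flags = [is_numbered(s) for s in segments]
print(flags)
```

Here the first segment begins with a digit and the second does not, which is what makes 'aka' an unnumbered case under this reading.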
In the numbered homophone case, the numbers refer to the PW homophone number. The example of 'a' in PW confirms this:
In sch.txt, there are two kinds of homophone coding. One of them is like 'a', where there is a separate record for each homophone entry. The other is like 'aka', where the two homophone entries are coded together in one line of the digitization.
From the examples I've looked at, there seems to be no good reason for the 'aka'-type coding. I think it would be better to have separate records of the sch.txt digitization for each homophone entry.
The implication for changes to the sch.txt coding would be
The detailed recoding of our prototypical single-record homophone case would look like:
Current coding, all as one line of sch.txt. Our current L-number is L=24:
.{#aka#}100{#aka#}^1¦ 1. Unheil , Ka1s4i1kh. 24 , 17. [Schµ25] |{#aka#} ºAdj. = {%akutsita%} , S II , 236 , 2 (Ko. 14 v.u.). -- {%na1sti…kam2…s4iro…yes2a1m2…te…'ka1h2%} , H 43 , 288; º= {%kut2ilaga1min%} , H 43 , 349. [Schµ26] €5
Possible revised coding. Use two lines, breaking at the vertical bar:
first record, L=24
.{#aka#}100{#aka#}^1¦ 1. Unheil , Ka1s4i1kh. 24 , 17. [Schµ25]
second record, L=24.1
.{#aka#}100{#aka#}¦ ºAdj. = {%akutsita%} , S II , 236 , 2 (Ko. 14 v.u.). -- {%na1sti…kam2…s4iro…yes2a1m2…te…'ka1h2%} , H 43 , 288; º= {%kut2ilaga1min%} , H 43 , 349. [Schµ26] €5
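The split shown above could be done mechanically roughly as follows. This is a sketch under assumptions, not the program that was eventually written: the function name and regex are hypothetical, the separator is taken to be " |", and the .1, .2 suffix scheme for further homophones is extrapolated from the single L=24.1 example.

```python
import re

# Head of a record: .{#key#}NNN{#key#}, optionally followed by ^N, then ¦
HEAD_RE = re.compile(r"^(\.\{#(?P<key>[^#]+)#\}\d+\{#(?P=key)#\})(?:\^\d+)?¦")


def split_homophones(line, lnum):
    """Split a multi-homophone line at ' |' into (L-number, record) pairs."""
    parts = line.split(" |")
    m = HEAD_RE.match(parts[0])
    if m is None or len(parts) == 1:
        return [(str(lnum), line)]  # nothing to split
    head, key = m.group(1), m.group("key")
    prefix = "{#%s#}" % key
    records = [(str(lnum), parts[0])]
    for i, part in enumerate(parts[1:], start=1):
        # Replace the repeated {#key#} with the full head, minus the ^N number.
        body = part[len(prefix):] if part.startswith(prefix) else part
        records.append(("%s.%d" % (lnum, i), head + "¦" + body))
    return records


old = (".{#aka#}100{#aka#}^1¦ 1. Unheil , Ka1s4i1kh. 24 , 17. [Schµ25] "
       "|{#aka#} ºAdj. = {%akutsita%} , S II , 236 , 2 (Ko. 14 v.u.). [Schµ26] €5")
new = split_homophones(old, 24)
for lnum, rec in new:
    print(lnum, rec)
```

Keeping the original L-number on the first record and giving the others decimal suffixes leaves all existing L-numbers untouched.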
I'll wait for others to comment on this before developing programs to carry out the agenda described above.
I agree with proposed methodology.
no good reason for the 'aka'-type coding.
Agree.
The detailed recoding of our prototypical single-record homophone case would look like:
Perfect, as usual.
Thanks for feedback. I'll get started with implementation.
These changes have now been made, and are reflected in the Basic and related displays.
For comparison with the previous coding, the previous version is still available in the list-0.2.html display.
In addition to the headwords above, some other good headwords to compare are:
I request others to do some random checking and provide feedback. When all looks OK, I'll finish installing. Meanwhile, I'll work on documenting the changes.
ac [Cologne record id=666] [printed page link 010-1]
I would go for ac [Cologne record ID=666] [Printed book page 010-1]
Looks very good.
@gasyoun Thank you for formatting suggestion. It is now implemented for sch.
The documentation of changes to the sch digitization has been posted as an issue comment in the SCH repository. I think this current work with SCH can be considered finished.
When @drdhaval2785 concurs, I'll go ahead and generate the full installation, including S3 backups and update for list-0.2.html.
I concur.
Everything now installed.
This issue can be closed.
Hurray!
First two lines of sch.txt
Whereas the sch.xml has only one line.
The first entry is missing from SCH XML file and subsequent products.