SCH - Entries missed in conversion from txt to XML

drdhaval2785 commented 7 years ago

First two lines of sch.txt

.{#a#}100{#a°#}^2¦ , {%asvaptum%} Ta1n2d2ya-Br. 10 , 4 , 4. [Page001.1] [Schµ1] €1
.{#a#}100{#a#}^4¦ m. º= {%sarvajn5o…'rhan%} , S I , 53 , 3. -- Vis2n2u , H 31 , 9; Va1s. 113 , 1. [Schµ2]* €2

Whereas the sch.xml has only one line.

<H1><h><key1>a</key1><key2>a</key2><hom>4</hom></h><body>m. º= <i>sarvajn5o…'rhan</i> , S I , 53 , 3. -- Vis2n2u , H 31 , 9; Va1s. 113 , 1. [Schµ2]* €2</body><tail><L>1</L><pc>001-1</pc></tail></H1>

The first entry is missing from SCH XML file and subsequent products.

drdhaval2785 commented 7 years ago

One more observation. In rare cases, more than one entry is encoded in a single line. These cases need systematic investigation, and the entries should be added to the XML properly as separate entries or sub-entries. See:

.{#aka#}100{#aka#}^1¦ 1. Unheil , Ka1s4i1kh. 24 , 17. [Schµ25] |{#aka#} ºAdj. = {%akutsita%} , S II , 236 , 2 (Ko. 14 v.u.). -- {%na1sti…kam2…s4iro…yes2a1m2…te…'ka1h2%} , H 43 , 288; º= {%kut2ilaga1min%} , H 43 , 349. [Schµ26] €5

[Schµ25] and [Schµ26] refer to two separate entries. Currently this feature is not incorporated into XML.

[Image: screenshot from 2017-04-15 22-40-19]

[Image: current web display, screenshot from 2017-04-15 22-41-01]

gasyoun commented 7 years ago

The first entry is missing from SCH XML file and subsequent products.

Great catch.

There are more than one entries encoded in one line in rare cases.

Yes, indeed. What I wonder is: if there is a 1. aka, why is there no 2. aka, only an unnumbered aka as a different part of speech?

funderburkjim commented 7 years ago

the first entry missing

This is a bug in the first step of the headword detection process (hw0.py). The program skips several lines at the beginning of the file, until it runs into the first [Page] reference. Since this page reference is buried in the line (with the first form of 'a'), it erroneously skips that whole line also.

This would normally be easy to fix. However, doing the obvious program change would change all the L-numbers of SCH --- which would flow through to sanhw2 having thousands of changes.

I'll try to think of a way to make the correction without changing all the L-numbers.
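The skipping behaviour described above can be illustrated with a minimal sketch. This is not the actual hw0.py code, which is not shown in this thread; the function name `body_lines`, the two regular expressions, and the shape of the preamble line in the usage example are assumptions made for illustration. The idea is to stop skipping at the first line containing a [Page...] reference, but to keep that line when it is itself a headword line (one starting with `.{#`).

```python
import re

# Assumed shapes, based only on the sample lines quoted in this thread.
PAGE_RE = re.compile(r'\[Page[^\]]*\]')   # e.g. [Page001.1]
HEADWORD_RE = re.compile(r'^\.\{#')       # e.g. .{#a#}100{#a°#}^2¦ ...

def body_lines(lines):
    """Yield the lines that follow the preamble of the digitization.

    The preamble ends at the first line containing a [Page...] reference.
    If that line is also a headword line, it is kept instead of skipped.
    """
    started = False
    for line in lines:
        if not started:
            if not PAGE_RE.search(line):
                continue              # still in the preamble
            started = True
            if not HEADWORD_RE.match(line):
                continue              # stand-alone page marker: skip it
        yield line
```

For example, with a hypothetical one-line preamble followed by the first two records of sch.txt, `list(body_lines(lines))` returns both records, including the one that carries [Page001.1]. Note that keeping this record shifts every later sequence number by one, which is exactly the L-number concern raised above.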

funderburkjim commented 7 years ago

more than one entries encoded in one line in rare cases

From the example, it may be that such cases are recognized by having more than one instance of [SchµXX] in a line of sch.txt. If so, there are 366 such cases.

It further appears that there is a vertical bar which identifies the end of the first part:

.{#aka#}100{#aka#}^1¦   ....   [Schµ25] |{#aka#}  ....

In fact, Thomas has this note in sch.txt:

Homophones marked by | have been added to the preceding entry
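A count like the one mentioned above could be obtained with a short script along these lines. This is only a sketch: the function name `multi_entry_lines` and the regular expression for the [SchµXX] tag are assumptions based on the sample lines in this thread, not the program actually used.

```python
import re

# Assumed form of the sequence tag: [Schµ] with a decimal number inside.
SCH_RE = re.compile(r'\[Schµ\d+\]')

def multi_entry_lines(lines):
    """Return the lines that contain more than one [SchµXX] tag."""
    return [line for line in lines if len(SCH_RE.findall(line)) > 1]
```

Applied to the lines of sch.txt, `len(multi_entry_lines(lines))` would give the number of candidate single-record homophone cases, which can then be checked against the presence of the vertical bar.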

funderburkjim commented 7 years ago

[SchµXX] markup

The [SchµXX] designation was added by Thomas as a sequence number for the 'outdented' (non-indented) lines in the print.

funderburkjim commented 7 years ago

numbered and un-numbered homophones

The case of 'aka' is an instance of an unnumbered homophone, because (this is speculation) not all the entries start with a number; namely, the 2nd entry for 'aka' in Sch. does not start with a number.

The case of 'a' is an instance of a numbered homophone.

The distinction is that the first character of each entry is a digit for the numbered homophones.
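In the sch.txt lines quoted in this thread, a homophone number also appears as a `^N` marker just before the `¦` separator (`^2` and `^4` for 'a', `^1` for the first part of 'aka'). Assuming that pattern holds throughout the file, which has not been verified here, a record could be classified with a small helper like the hypothetical `homophone_number` below.

```python
import re

# Assumption: a numbered homophone carries ^N immediately before the ¦ separator.
HOM_RE = re.compile(r'\^(\d+)¦')

def homophone_number(record):
    """Return the homophone number of a record as an int, or None if unnumbered."""
    m = HOM_RE.search(record)
    return int(m.group(1)) if m else None
```

A record for which this returns None would be treated as an unnumbered homophone under this assumption.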

In the numbered homophone case, the number refers to the PW homophone number. The example of 'a' in PW confirms this.

funderburkjim commented 7 years ago

sch.txt coding of numbered and unnumbered homophones

In sch.txt, there are two kinds of homophone coding. One of them is like 'a', where there is a separate record for each homophone entry. The other is like 'aka', where the two homophone entries are coded together in one line of the digitization.

From the examples I've looked at, there seems to be no good reason for the 'aka'-type coding. I think it would be better to have separate records of the sch.txt digitization for each homophone entry.

recoding implication

The implications for the sch.txt coding would be:

- Leave the multiple-record homophone cases unchanged.
  - The 'a' case diverges from this. Rather than change all the L-numbers, use L=1 and L=1.1 for the two records of 'a'.
- Change the single-record homophone cases (like aka) to multiple records, and use decimal numbers for the additional L-numbers.

The detailed recoding of our prototypical single-record homophone case would look like:

current coding all as one line of sch.txt. Our current L-number is L=24
.{#aka#}100{#aka#}^1¦ 1. Unheil , Ka1s4i1kh. 24 , 17. [Schµ25] |{#aka#} ºAdj. = {%akutsita%} , S II , 236 , 2 (Ko. 14 v.u.). -- {%na1sti…kam2…s4iro…yes2a1m2…te…'ka1h2%} , H 43 , 288; º= {%kut2ilaga1min%} , H 43 , 349. [Schµ26] €5

Possible revised coding. Use two lines, breaking at the vertical bar
first record, L=24
.{#aka#}100{#aka#}^1¦ 1. Unheil , Ka1s4i1kh. 24 , 17. [Schµ25] 
second record, L=24.1
.{#aka#}100{#aka#}¦  ºAdj. = {%akutsita%} , S II , 236 , 2 (Ko. 14 v.u.). -- {%na1sti…kam2…s4iro…yes2a1m2…te…'ka1h2%} , H 43 , 288; º= {%kut2ilaga1min%} , H 43 , 349. [Schµ26] €5
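
The recoding shown above can be sketched as a small function. This is an illustration under stated assumptions, not the program that was eventually written: the name `split_record` is hypothetical, the split is made only where a vertical bar is immediately followed by `{#`, and the second record reuses the headword prefix of the first one with the `^N` homophone number removed.

```python
import re

def split_record(line, lnum):
    """Split a single-record homophone line at '|' into (L-number, text) pairs.

    The first part keeps the original L-number; each further part gets
    a decimal L-number (lnum.1, lnum.2, ...).
    """
    # Assumption: '|' separates homophones only when followed by '{#'.
    parts = re.split(r'\s*\|(?=\{#)', line)
    # Prefix of the first part up to the ¦ separator, e.g. '.{#aka#}100{#aka#}^1'
    head, sep, _ = parts[0].partition('¦')
    plain_head = re.sub(r'\^\d+$', '', head) + sep   # drop the ^N marker
    records = [(str(lnum), parts[0].rstrip())]
    for i, part in enumerate(parts[1:], start=1):
        # Remove the repeated headword, e.g. '{#aka#} ', from the later part.
        body = re.sub(r'^\{#[^#]*#\}\s*', '', part)
        records.append((f'{lnum}.{i}', f'{plain_head} {body}'))
    return records
```

Called with the 'aka' line and `lnum=24`, this returns one pair labelled `'24'` and one labelled `'24.1'`, matching the two records of the proposed coding (apart from spacing after `¦`). A line without a vertical bar comes back as a single unchanged record.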

funderburkjim commented 7 years ago

I'll wait for others to comment on this before developing programs to carry out the agenda described above.

drdhaval2785 commented 7 years ago

I agree with proposed methodology.

gasyoun commented 7 years ago

no good reason for the 'aka'-type coding.

Agree.

The detailed recoding of our prototypical single-record homophone case would look like:

Perfect, as usual.

funderburkjim commented 7 years ago

Thanks for feedback. I'll get started with implementation.

funderburkjim commented 7 years ago

These changes have now been made, and are reflected in the Basic and related displays.

For comparison with the previous coding, the previous version is still available in list-02.html display.

In addition to the headwords above, some other good headwords to compare are:

a (first record now visible)
ac (shows presence of <div> in new displays)
aNku (shows proper placement of º before key2)
U - not there in new display (as mentioned in another issue)
sch contains only modern IAST, no AS coding (probably not visible in displays)
L-numbers are the same as before, except for the 350+ new entries involved in splitting the homonyms.

Request others to do some random checking and provide feedback. When all looks ok, I'll finish installing. Meanwhile, I'll work on documenting the changes.

gasyoun commented 7 years ago

ac [Cologne record id=666] [printed page link 010-1]

I would go for ac [Cologne record ID=666] [Printed book page 010-1]

It looks very good.

funderburkjim commented 7 years ago

@gasyoun Thank you for the formatting suggestion. It is now implemented for sch.

funderburkjim commented 7 years ago

The documentation of changes to the sch digitization has been posted as an issue comment in the SCH repository. I think this current work with SCH can be considered finished.

funderburkjim commented 7 years ago

When @drdhaval2785 concurs, I'll go ahead and generate the full installation, including S3 backups and update for list-0.2.html.

drdhaval2785 commented 7 years ago

I concur.

funderburkjim commented 7 years ago

Everything now installed.

This issue can be closed.

gasyoun commented 7 years ago

Hurray!

sanskrit-lexicon / CORRECTIONS