The tag `<UL/>` in pwg - Githubissues

The element UL is – according to pwg.dtd – undocumented. <UL/> occurs six times in pwg.xml, always with the path /pwg/H1/body/UL, and it is passed through to the interface as <UL> with a preceding line break:

AkarRay: ...</s><UL/><s>...

KagaYja: ...</ls> <UL/></body><tail>...

Gaw: ...<divm type="n" n="1">1)</divm> <UL/><divm type="e" n="a">...

Jara: ...</s> hinzu; vgl. <UL/> <ls>...

wakkara: ...Musik verstan</i>- <UL/> -<i>den und am...

baka: ...von Besonnenheit, aber <UL/>auch von Schelmerei...

The occurrences are all at a column break at the end of the first column, but only where the break coincides with the bottom line of the page and the text does not continue at the top of the page, because the end of another letter section precedes it (see UL_pwg_baka.jpg below). I didn’t verify whether <UL/> occurs in all such situations. There is no further pattern in relation to the other markup:

[Image: ul_pwg_baka]

in AkarRay, it basically cuts an <s> element in half.

in KagaYja, it is oddly placed, as the entry doesn’t actually continue and <UL/> occurs directly before the closing </body> tag

in Gaw, it is positioned between two <divm> notes

in Jara, it is placed somewhere inside the running text of <body>

in wakkara, it cuts an <i>-element – and in fact a word (verstanden ‘understood’) in it – into two parts

in baka, it is again located somewhere in the running text of <body>

<UL/> thus simply indicates the end of the first column, and a continuation of the text somewhere below the top of the page. While it doesn’t relate to anything in the <body> element in particular, it can effectively separate what would be one element into two, as is the case with the <i> elements in wakkara. The current treatment in the interface is probably not optimal, and removing it (and the preceding line break) would be a simple fix that improves the situation.

(relevant XPATH to the entries: /pwg/H1/h/key1[ancestor::H1/body/UL])
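A minimal Python sketch of the same lookup, for anyone without an XPath tool at hand. It assumes only the element layout visible in this thread (H1 > h > key1, H1 > body > UL); the sample data and the function name `headwords_with_ul` are made up for illustration, and the `<L>` values are placeholders. The standard-library ElementTree does not support the `ancestor::` axis, so the sketch loops over the entries instead of using the XPath above verbatim.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for pwg.xml (hypothetical data; the real file is far larger).
SAMPLE = """<pwg>
<H1><h><key1>baka</key1><key2>baka/</key2></h><body>aber <UL/>auch</body><tail><L>1</L></tail></H1>
<H1><h><key1>Palahaka</key1><key2>Palahaka</key2></h><body><ls>KATHA10S. 52, 328.</ls></body><tail><L>2</L></tail></H1>
</pwg>"""


def headwords_with_ul(root):
    """Return key1 of every H1 entry whose body has a direct UL child."""
    hits = []
    for entry in root.iter("H1"):
        # 'body/UL' matches the path /pwg/H1/body/UL relative to this entry
        if entry.find("body/UL") is not None:
            hits.append(entry.findtext("h/key1"))
    return hits


root = ET.fromstring(SAMPLE)
print(headwords_with_ul(root))
```

For the full file, `ET.parse("pwg.xml").getroot()` would replace `ET.fromstring(SAMPLE)`, assuming the file parses cleanly with ElementTree.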

This <UL/> element in pwg.xml comes from the pseudo-xml form <UL> in the pwg.txt digitization. In other words, it was part of the digitization, so that's why it was included.

It was marked 'undocumented' since I didn't understand what it was. It was just waiting for someone (you!) to investigate.

Based on your investigation, I think I agree that <UL> actually serves no useful purpose and should be removed.

I am not sure about removing the preceding line break. Will have to look at the pwg.txt instances first. Will post them here so we can all take a look. Then, will plan to generate corrections.

BTW: Nice presentation of the situation.

Here are the 6 cases, as presented in pwg.txt (the digitization upon which pwg.xml is based). When corrections are made, they are made to pwg.txt

The line numbers below are per pwg.txt. I've presented a context line (the previous line) in these cases. And a suggested change.

In all cases, the <UL> is removed in the new line (= line after correction).

In all but case 5, the change is just to the line containing the <UL>.

In case 5, it seemed better to 'merge' the <UL> line with the prior line, and replace the <UL> line with a blank line.
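The correction rule described above can be sketched in Python as follows. This is only an illustration of the rule, not the script actually used for the corrections; the function names `strip_ul` and `merge_hyphenated` are hypothetical, and the merge condition is modelled solely on the wakkara case.

```python
import re

# <UL> plus at most one following whitespace character
UL = re.compile(r"<UL>\s?")


def strip_ul(line):
    """Drop the first <UL> marker (and one following space, if any)."""
    return UL.sub("", line, count=1)


def merge_hyphenated(prev_line, ul_line):
    """Case-5 style fix: join a word split as '...verstan%}-' / '<UL> -{%den ...'.

    Returns (new_prev_line, new_ul_line). When a merge happens, the second
    element is an empty string so the number of lines stays unchanged.
    """
    rest = strip_ul(ul_line)
    if prev_line.endswith("%}-") and rest.startswith("-{%"):
        # remove the closing '%}-' and the reopening '-{%' around the break
        return prev_line[:-3] + rest[3:], ""
    return prev_line, rest
```

Applied to the wakkara lines, `merge_hyphenated` joins `verstan` and `den` into `verstanden` inside a single italic span and leaves a blank second line; for the other five cases `strip_ul` alone reproduces the proposed new lines.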

Case 1. AkarRay
141665 old <H1>000{AkarRay}1{AkarRay}¦ mit {#upa#} {%hören, vernehmen%}: {#ityudDavAdupAkarRya suhfdAM duHsahaM#}
141666 old <UL> {#vaDam#} ¯{¤BHA10G. P. 3, 4, 23. 4, 8, 25. 10, 20, 2. 23, 13.¤}

141665 new <H1>000{AkarRay}1{AkarRay}¦ mit {#upa#} {%hören, vernehmen%}: {#ityudDavAdupAkarRya suhfdAM duHsahaM#}
141666 new {#vaDam#} ¯{¤BHA10G. P. 3, 4, 23. 4, 8, 25. 10, 20, 2. 23, 13.¤}

Case 2. KagaYja

154764 old <H1>000{KagaYja}1{KagaYja}¦ •m. N. pr. des Vaters von †{Gokarn2ec2vara} ¯{¤WILSON, Sel. Works 2, 16.¤}
154765 old 
154766 old <UL>

154764 new <H1>000{KagaYja}1{KagaYja}¦ •m. N. pr. des Vaters von †{Gokarn2ec2vara} ¯{¤WILSON, Sel. Works 2, 16.¤}
154765 new 
154766 new 

Case 3. Gaw
157118 old <H1>000{Gaw}1{Gaw}¦ ³1) {#utkaRWAGawamAnazawpadaGawA#} ¯{¤Spr. 2580.¤} -- ³2) {%Jmd%} (loc.) {%zu Theil werden, zufallen%}: {#BEmI kilAsmAsu Gawizyate (= yogaM yAsyati#} ¯{¤Schol.)¤} {#'sO#} ¯{¤NAISH. 10, 47.¤} -- ³3) {%gerathen, gelingen%} ¯{¤Spr. 5042. KATHA10S. 124, 139.¤} {%passen, am Platze sein%} ¯{¤SARVADARC2ANAS. 11, 20. 62, 14. 110, 12. 141, 12. 161, 17. NAISH. 7, 10. 9, 11. 11, 20. BHA10G. P. 10, 57, 31. 87, 31. Z. 4 lies 9, 44 st. 9, 4.¤} -- ³4) {%zusammenkommen --, sich verbinden mit%} (instr.): {#mahato ye 'vamanyante Gawante ca vimAnitEH#} ¯{¤Spr. 2139. MA10LATI10M. 38, 9.¤} -- caus. ³1)
157119 old <UL>²a) {#kAryaM suGawitaM kvApi maDye viGawate yataH#} ¯{¤Spr. 3517.¤} {#DarmipratiyogiGawita (Beda)#} {%verbunden mit%} ¯{¤SARVADARC2ANAS. 62, 2.¤} -- ²d) {#tfRaGawitaH kapawapuruzaH#} ¯{¤Spr. 3757. NAISH. 11, 20. KATHA10S. 60, 239. 90, 45. 94, 104.¤} {#Gawayati viDiraBimatamaBimuKIBUtaH#} ¯{¤Spr. 1281. KATHA10S. 104, 195.¤} {#ityupAyena GawayantyaBIzwaM budDiSAlinaH#} ¯{¤60, 250.¤} {#yaH priyamutkawaM Gawayate jantoH#} {%erweisen, thun%} ¯{¤Spr. 1238.¤} -- ²g) ¯{¤MBH. 12, 5363 und 6, 2894¤} liest die ed. Bomb. richtig {#Gawwa°#} .

157118 new <H1>000{Gaw}1{Gaw}¦ ³1) {#utkaRWAGawamAnazawpadaGawA#} ¯{¤Spr. 2580.¤} -- ³2) {%Jmd%} (loc.) {%zu Theil werden, zufallen%}: {#BEmI kilAsmAsu Gawizyate (= yogaM yAsyati#} ¯{¤Schol.)¤} {#'sO#} ¯{¤NAISH. 10, 47.¤} -- ³3) {%gerathen, gelingen%} ¯{¤Spr. 5042. KATHA10S. 124, 139.¤} {%passen, am Platze sein%} ¯{¤SARVADARC2ANAS. 11, 20. 62, 14. 110, 12. 141, 12. 161, 17. NAISH. 7, 10. 9, 11. 11, 20. BHA10G. P. 10, 57, 31. 87, 31. Z. 4 lies 9, 44 st. 9, 4.¤} -- ³4) {%zusammenkommen --, sich verbinden mit%} (instr.): {#mahato ye 'vamanyante Gawante ca vimAnitEH#} ¯{¤Spr. 2139. MA10LATI10M. 38, 9.¤} -- caus. ³1)
157119 new ²a) {#kAryaM suGawitaM kvApi maDye viGawate yataH#} ¯{¤Spr. 3517.¤} {#DarmipratiyogiGawita (Beda)#} {%verbunden mit%} ¯{¤SARVADARC2ANAS. 62, 2.¤} -- ²d) {#tfRaGawitaH kapawapuruzaH#} ¯{¤Spr. 3757. NAISH. 11, 20. KATHA10S. 60, 239. 90, 45. 94, 104.¤} {#Gawayati viDiraBimatamaBimuKIBUtaH#} ¯{¤Spr. 1281. KATHA10S. 104, 195.¤} {#ityupAyena GawayantyaBIzwaM budDiSAlinaH#} ¯{¤60, 250.¤} {#yaH priyamutkawaM Gawayate jantoH#} {%erweisen, thun%} ¯{¤Spr. 1238.¤} -- ²g) ¯{¤MBH. 12, 5363 und 6, 2894¤} liest die ed. Bomb. richtig {#Gawwa°#} .

Case 4: Jara

160218 old <H1>000{Jara}1{Jara}¦ ¯{¤Z. 2¤} streiche {#kallolinyoH#} und füge am Ende {#SElAH#} hinzu; vgl.
160219 old <UL> ¯{¤Spr. 2828¤} (v. l. {#JarA)#} .

160218 new <H1>000{Jara}1{Jara}¦ ¯{¤Z. 2¤} streiche {#kallolinyoH#} und füge am Ende {#SElAH#} hinzu; vgl.
160219 new ¯{¤Spr. 2828¤} (v. l. {#JarA)#} .

Case 5: wakkara

160241 old <H1>000{wakkara}1{wakkara},¦ ¯{¤RA10G4A-TAR. 5, 417¤} übersetzen wir: {%seine ersten Minister waren Leute, die sich auf das Grunzen und auf andere ähnliche Musik verstan%}-
160242 old <UL> -{%den und am Hofe%} (wie gemeine Sclaven) {%die Köpfe gegen den Boden schlugen, dass es klang.%}

160241 new <H1>000{wakkara}1{wakkara},¦ ¯{¤RA10G4A-TAR. 5, 417¤} übersetzen wir: {%seine ersten Minister waren Leute, die sich auf das Grunzen und auf andere ähnliche Musik verstanden und am Hofe%} (wie gemeine Sclaven) {%die Köpfe gegen den Boden schlugen, dass es klang.%}
; in this case, it seems better to merge the italic text and use
; verstanden as un-hyphenated.  Felix Agree? That's why 160242 is a blank line.
; we try to keep the number of lines in pwg.txt unchanged. An extra blank
; line has no significance.
160242 new 

Case 6: baka

171152 old <H1>000{baka}1{baka/}¦ (ved.) und {#ba/ka#} ¯{¤C2A10NT. 1, 14.¤} ³1) •m. ²a) {%eine Reiherart, Ardea nivea%} ¯{¤AK. 2, 5, 22. TRIK. 3, 3, 35. H. 1332. an. 2, 12. MED. k. 29. HALA10J. 2, 95. 5, 21. M. 5, 14. 11, 135. 12, 66. JA10G4N4. 1, 173. MBH. 3, 1208. 11579. 17315. 5, 1911. R. GORR. 2, 65, 14. SUC2R. 1, 205, 12. Spr. 740. 2008. 4072. KATHA10S. 60, 78. fgg. LA. (II) 49, 9. PAN4K4AT. 98, 9. HIT. 111, 15. fgg. BHA10G. P. 3, 10, 23¤} {#(vawa#} ed. Bomb.). ¯{¤8, 10, 10.¤} {#°SabdajYAna#} ¯{¤Verz. d. Oxf. H. 92,b,41.¤} {#vakavat - rAjan tava yaSo BAti#} ¯{¤HAEB. Anth. 483, C2l. 1.¤} {#na vyApAraSatenApi SukavatpAWyate vakaH#} ¯{¤Spr. 1528. 314.¤} {#BuNkte mOnI vakastimim#} ¯{¤4131.¤} {#haMsamaDye vako yaTA (na SoBate)#} ¯{¤2170.¤} {#bakAlInaH#} ¯{¤MBH. 12, 5309.¤} ein Ausbund von Besonnenheit, aber
171153 old <UL>auch von Schelmerei und Heuchelei: {#vakavaccintayedarTAn#} ¯{¤Spr. 2695.¤} {#vako DyAnavAn#} ¯{¤4723.¤} {#viSvastAYjalacAriRaH prakawitaDyAno 'pi BuNkte vakaH#} ¯{¤4132.¤} {#sarvendriyARi saMyamya vakavatpaRqito janaH . kAladeSopapannAni sarvakAryARi sADayet ..#} ¯{¤3218.¤} {#vakAdekam (Sikzet)#} ¯{¤3252.¤} {#vake vakavratam#} ¯{¤1357.¤} so v. a. {%Heuchler, Betrüger%}: {#AsTAnIvakEH#} ¯{¤v. l.¤} für {#AsTAnIDUrtakEH#} ¯{¤PRAB. 102, 10.¤} hierher vielleicht auch ¯{¤Verz. d. Oxf. H. 46,a,9.¤} {#vakapaYcaka#} ¯{¤87,b,5.¤} -- ²b) {%eine best. Pflanze%} ¯{¤AK. 2, 4, 2, 62. TRIK. H. an. MED. R. 5, 95, 8.¤} -- ²c) {%ein best. Apparat zum Calciniren oder Sublimiren von Metallen%} ¯{¤C2ABDAK4. im C2KDR.¤} {#kAcavakayantra#} {%Glasretorte%} ¯{¤WILS.¤} -- ²d) N. pr. eines [Page05.1641] Weisen mit dem patron. †{Da10lbhi} oder †{Da10lbhja} ¯{¤KA10T2H. 10, 6. K4HA10ND. UP. 1, 2, 13. MBH. 2, 106. 3, 968. 9, 2317.¤} -- ²e) N. pr. eines von †{Bhi10masena} besiegten †{Ra10kshasa} ¯{¤H. an. MED. MBH. 1, 2258. 3825. 6207. fgg. 3, 407. 7, 4076. 8006.¤} eines von †{Kr2shn2a} besiegten †{Asura}, der die Gestalt eines {%Reihers%} angenommen hatte, ¯{¤BHA10G. P. 10,11,47. 12,14. Verz. d. Oxf. H. 26,b,37. PAN4K4AR.3,14,29.¤} -- ²f) pl. N. pr. eines Volkes ¯{¤MBH. 6, 369.¤} {#vyUkAH kokabakAH#} ed. Bomb. st. {#bakAH kokarakAH#} der ed. Calc. -- ²g) Bein. †{Kubera's} ¯{¤H. an. MED.¤} -- ²h) N. pr. eines Fürsten ¯{¤RA10G4A-TAR. 1, 331.¤} -- ³2) •f. {#I#} ¯{¤BHA10G. P. 3, 2, 23. 10, 12, 14¤} nach dem Comm. = {#pUtanA#} . -- Vgl. {#gobaka#} .

171152 new <H1>000{baka}1{baka/}¦ (ved.) und {#ba/ka#} ¯{¤C2A10NT. 1, 14.¤} ³1) •m. ²a) {%eine Reiherart, Ardea nivea%} ¯{¤AK. 2, 5, 22. TRIK. 3, 3, 35. H. 1332. an. 2, 12. MED. k. 29. HALA10J. 2, 95. 5, 21. M. 5, 14. 11, 135. 12, 66. JA10G4N4. 1, 173. MBH. 3, 1208. 11579. 17315. 5, 1911. R. GORR. 2, 65, 14. SUC2R. 1, 205, 12. Spr. 740. 2008. 4072. KATHA10S. 60, 78. fgg. LA. (II) 49, 9. PAN4K4AT. 98, 9. HIT. 111, 15. fgg. BHA10G. P. 3, 10, 23¤} {#(vawa#} ed. Bomb.). ¯{¤8, 10, 10.¤} {#°SabdajYAna#} ¯{¤Verz. d. Oxf. H. 92,b,41.¤} {#vakavat - rAjan tava yaSo BAti#} ¯{¤HAEB. Anth. 483, C2l. 1.¤} {#na vyApAraSatenApi SukavatpAWyate vakaH#} ¯{¤Spr. 1528. 314.¤} {#BuNkte mOnI vakastimim#} ¯{¤4131.¤} {#haMsamaDye vako yaTA (na SoBate)#} ¯{¤2170.¤} {#bakAlInaH#} ¯{¤MBH. 12, 5309.¤} ein Ausbund von Besonnenheit, aber

171153 new auch von Schelmerei und Heuchelei: {#vakavaccintayedarTAn#} ¯{¤Spr. 2695.¤} {#vako DyAnavAn#} ¯{¤4723.¤} {#viSvastAYjalacAriRaH prakawitaDyAno 'pi BuNkte vakaH#} ¯{¤4132.¤} {#sarvendriyARi saMyamya vakavatpaRqito janaH . kAladeSopapannAni sarvakAryARi sADayet ..#} ¯{¤3218.¤} {#vakAdekam (Sikzet)#} ¯{¤3252.¤} {#vake vakavratam#} ¯{¤1357.¤} so v. a. {%Heuchler, Betrüger%}: {#AsTAnIvakEH#} ¯{¤v. l.¤} für {#AsTAnIDUrtakEH#} ¯{¤PRAB. 102, 10.¤} hierher vielleicht auch ¯{¤Verz. d. Oxf. H. 46,a,9.¤} {#vakapaYcaka#} ¯{¤87,b,5.¤} -- ²b) {%eine best. Pflanze%} ¯{¤AK. 2, 4, 2, 62. TRIK. H. an. MED. R. 5, 95, 8.¤} -- ²c) {%ein best. Apparat zum Calciniren oder Sublimiren von Metallen%} ¯{¤C2ABDAK4. im C2KDR.¤} {#kAcavakayantra#} {%Glasretorte%} ¯{¤WILS.¤} -- ²d) N. pr. eines [Page05.1641] Weisen mit dem patron. †{Da10lbhi} oder †{Da10lbhja} ¯{¤KA10T2H. 10, 6. K4HA10ND. UP. 1, 2, 13. MBH. 2, 106. 3, 968. 9, 2317.¤} -- ²e) N. pr. eines von †{Bhi10masena} besiegten †{Ra10kshasa} ¯{¤H. an. MED. MBH. 1, 2258. 3825. 6207. fgg. 3, 407. 7, 4076. 8006.¤} eines von †{Kr2shn2a} besiegten †{Asura}, der die Gestalt eines {%Reihers%} angenommen hatte, ¯{¤BHA10G. P. 10,11,47. 12,14. Verz. d. Oxf. H. 26,b,37. PAN4K4AR.3,14,29.¤} -- ²f) pl. N. pr. eines Volkes ¯{¤MBH. 6, 369.¤} {#vyUkAH kokabakAH#} ed. Bomb. st. {#bakAH kokarakAH#} der ed. Calc. -- ²g) Bein. †{Kubera's} ¯{¤H. an. MED.¤} -- ²h) N. pr. eines Fürsten ¯{¤RA10G4A-TAR. 1, 331.¤} -- ³2) •f. {#I#} ¯{¤BHA10G. P. 3, 2, 23. 10, 12, 14¤} nach dem Comm. = {#pUtanA#} . -- Vgl. {#gobaka#} .

pwg.txt digitization does not code the actual line breaks in the scan. That's part of the reason why some of the above lines are long.

@fxru - If you concur with the indicated changes, I'll go ahead and install them.

@funderburkjim that looks like a reasonable solution. In the case of wakkara it is particularly intrusive without any useful informational content, so joining them is probably the most helpful for the user (and consistent with the treatment of other cases of hyphenation).

Long Answer:

I was (and still am) not sure whether <UL/> encodes any important information, so I took a quick look at how column breaks are dealt with in pwg in general. The last of the <UL/> examples, in the entry baka, is actually one of three column breaks on that page (see picture at the end of the comment) and illustrates the situation quite nicely. In the end, all three break from [Page05.1639] to [Page05.1640], but they are encoded differently:

1) [Page05.1640]

-- <divm type="n" n="4">4)</divm> <i>hinundherschwanken</i> (vom Geiste)
<ls>UTTARARA10MAK4. 114, 15 (155, 9).</ls> [Page05.1640]</body><tail>
<L>79736</L><pc>5-1638</pc></tail></H1>

2) nothing

<H1><h><key1>Palahaka</key1><key2>Palahaka</key2></h><body><ls>KATHA10S. 52, 328. 334.</ls></body><tail><L>79757</L><pc>5-1640</pc></tail></H1>
<H1><h><key1>PalahI</key1><key2>PalahI</key2></h><body><gram n="f">f.</gram> <i>Baumwollenstaude</i> <ls>HA10LA 166. 363. fg.</ls></body><tail><L>79758</L><pc>5-1640</pc></tail></H1>

3) <UL/>

<s>bakAlInaH</s> <ls>MBH. 12, 5309.</ls> ein Ausbund von Besonnenheit, aber <UL/>auch 
von Schelmerei und Heuchelei: <s>vakavaccintayedarTAn</s> <ls>Spr. 2695.</ls>

My guess is that the whole situation is the result of an underlying assumption during digitization (or the initial processing of the digitization?) that there is only one column break per page.

Anyways, I’m still not sure what to make of <UL/>. The first marking type ([Page05.1640]) is the general one and to me also the most intuitive, most informative. However, the fact that type (2) exists makes the whole <UL/> marking seem rather pointless.

I guess, the one other solution would be to replace <UL/> by the first type of marking ([Page05.1640]), if that doesn’t mess up other parts of the system! In any case, that still leaves the unmarked cases, and I have no idea how to catch those (not even sure there is more than one).

[Image: pwg5-1639]

My guess is that the whole situation is the result of an underlying assumption during digitization (or the initial processing of the digitization?) that there is only one column break per page.

Sounds reasonable, @fxru.

Do I understand you right that you are thinking of either:

to kill
to replace?

Yes, I think <UL/> does not encode anything different than a column break. Ideally, all column breaks would be treated identically, but right now there are three ways.

Type (1) is the ideal case. Most column breaks are encoded as PageXX.XXXX and that is also the most helpful way for the user (I guess). However, only @funderburkjim knows whether any parts of the code work on the assumption that there is only one column break per page. If we replace <UL/> manually by PageXX.XXXX we might get side effects.

I have no idea how to find type (2) (i.e. not encoded) column breaks, else I would suggest the same treatment for them.

If we can’t replace them, then deleting them would at least remove them from the interface.
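The two options discussed here (replace or delete) can be sketched as plain string operations on an entry body. This is a hedged illustration only: the function names are hypothetical, the page marker must be supplied by hand since nothing in the data says which page a given <UL/> belongs to, and, as noted above, it is unknown whether other code tolerates a second page marker in an entry.

```python
import re

UL_TAG = re.compile(r"<UL/>")


def replace_ul(body, page_marker):
    """Option A: swap <UL/> for a type-(1) marker such as '[Page05.1640]'."""
    # a function is used as replacement so the marker is inserted literally
    return UL_TAG.sub(lambda m: page_marker + " ", body)


def delete_ul(body):
    """Option B: simply remove <UL/> so it no longer reaches the interface."""
    return UL_TAG.sub("", body)


text = "ein Ausbund von Besonnenheit, aber <UL/>auch von Schelmerei"
print(replace_ul(text, "[Page05.1640]"))
print(delete_ul(text))
```

Neither function can help with the unmarked type (2) breaks, since there is nothing in the text to match on.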

Pages with multiple sections (due to new letters), like you show for letter 'P' and 'b' above, are handled irregularly among the various dictionary digitizations. From your investigation, it is possible that the <UL> served this purpose in PWG; but your example Palahaka also shows this was not uniformly done in the digitization.

You are right that there can be differences between the digitization (pwg.txt) and the xml file (pwg.xml). This is because the xml file is created by a program using (a) pwg.txt and (b) pwghw2.txt (headwords with associated line-number ranges in pwg.txt and with a (single) page-col number).
The displays are based on the xml form. Thus, an oddity in the display could be due to an oddity in the digitization or due to an oddity in the derived xml or due to an oddity in the program which generates the html of the display from the xml.

You could possibly gain some additional insight into the page-column question by examining the digitization pwg.txt. This is available via the txt download on the PWG download page.

My view is that it is probably not worthwhile to spend much time on this - but that's just an opinion. Here are some observations that come to mind in this regard:

The only real use of the page-number information is so we can have a link to the scanned image from the displays. The intent is that the link go to the page where the headword starts.
The multi-letter pages (P,b above) are rare; it would require some complicated way to encode these special cases to accurately reflect the original page layout. This is not important (my opinion).
PWG digitization was done ca. 2006. At that time, Thomas had not yet developed the notion that the digitization should accurately encode line-breaks. Thus, the correlation in PWG between lines in the digitization and lines in the printed edition is a loose correlation. It would be better if lines in the digitization represented lines (in a column) in the printed edition. But adding this feature to the markup is a daunting task, which we do not have the manpower to attempt.

I'm installing the changes as indicated above. @fxru You might want to recheck the displays for the 6 cases, just to convince yourself that the changes have been made.

I think this issue can be closed now.

@fxru Will let you do the honors of closing, since you opened the issue.

P.S. Changed pwg.dtd and also pwg-meta.txt regarding 'UL'.

Yes, I totally agree and it looks much more coherent for the user.

My view is that it is probably not worthwhile to spend much time on this - but that's just an opinion.

Agree.

The multi-letter pages (P,b above) are rare; it would require some complicated way to encode these special cases to accurately reflect the original page layout. This is not important (my opinion).

Agree, not worth it.

PWG digitization was done ca. 2006.

I remember getting a CD with the scans from Thomas. Those were the days...

It would be better if lines in the digitization represented lines (in a column) in the printed edition. But adding this feature to the markup is a daunting task, which we do not have the manpower to attempt.

Indeed.

sanskrit-lexicon / PWG

The tag `<UL/>` in pwg #21
