PW meta-line conversion

funderburkjim commented 7 years ago

This issue is devoted to the meta-line conversion of the PW dictionary.

The markup of the Cologne digitization of this dictionary is quite complex. In addition to adapting the form of the digitization to the meta-line form, attention will be given to making the markup less idiosyncratic, while maintaining informational equivalence with the original markup that Thomas Malten devised.

I anticipate this will be a rather lengthy process. I will aim to indicate in this issue (and perhaps related issues) all of the changes and choices made in the process of markup conversion.

It will likely be relevant to consider the markup of PWG dictionary during this process, as Thomas did the digitizations of both PWG and PW at about the same time, ca. 2005/6 and used many of the same coding conventions in both dictionaries.

gasyoun commented 7 years ago

same coding conventions in both dictionaries

So there is good news after all as well.

funderburkjim commented 7 years ago

Change Page break coding (minor change)

The page breaks in PW are of the form ƒPage2.176-2ƒ. These are changed to [Page2.176-2], so the delimiters are square brackets. This is the form of other dictionaries.
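
A minimal sketch of this substitution, assuming every page break follows the digit pattern of the example above (volume, period, page, hyphen, column); the function name is made up for illustration:

```python
import re

# Old page-break form, e.g. ƒPage2.176-2ƒ. The digit layout is an assumption
# based on the single example given in this comment.
OLD_PAGE = re.compile(r"ƒ(Page\d+\.\d+-\d+)ƒ")

def convert_page_breaks(line: str) -> str:
    """Swap the ƒ...ƒ delimiters for square brackets; other text is untouched."""
    return OLD_PAGE.sub(r"[\1]", line)

print(convert_page_breaks("text ƒPage2.176-2ƒ more text"))
```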

funderburkjim commented 7 years ago

unexpected data in homonym area

The usual coding for a homonym number in the prior form of the digitization is ^1 in this example:

<H1>000{a}1{a}^1¦ ‹Pron. der 3ten Person. Davon›    ETC
GENERAL FORM
<H1>XXX{KEY1}1{KEY2}^H¦   [rest of line]

However, in 13 cases, we have either (?) or (!) instead of ^H :

<H1>100{gOtamasa}1{gOtamasa}(?)¦ •Adj. ‹mit› #{arka} •m. ‹Name verschiedener †Sa7man.› PW37110
<H1>100{catu}1{catu}(!)¦ •Adj. {%der vierte%} ¯TAITT.A7R.1,8,4. PW38575
<H1>100{cIraRIya}1{cIraRIya}(?)¦ {%ein best. Spiel%} ¯Ind.St.15,419.  PW40368
<H1>100{jaNgapUga}1{*jaNgapUga}(!)¦ •m. {%wickedness , sin.%} PW41343
<H1>100{jalambala}1{*jalambala}(!)¦ •n. ²1) {%a stream.%} ²2) {%Collyrium.%} PW42139
<H1>100{jinendraBUti}1{jinendraBUti}(?)¦ •m. = #{jinendrabudDi}. PW42840
<H1>100{jEhmAkani}1{jEhmAkani}(!)¦ •m. ‹desgl.› •Pl. {%sein Geschlecht.%} PW43275
<H1>000{qambura}1{qambura}(?)¦ ¯HEMA7DRI.4.1,638,10. PW43914
<H1>100{tAndana}1{*tAndana}(!)¦ •m. {%Wind.%} PW45207
<H1>100{tigala}1{tigala}(?)¦ •m. ‹N.pr. eines Mannes.› PW45672
<H1>100{tilakanija}1{tilakanija}(!)¦ •m. •Pl. ‹N.pr. eines Volkes.› PW45921
<H1>100{viWaNka}1{*viWaNka}(!)¦ •Adj. {%bad , vile.%} PW102347
<H1>100{viTUtistotra}1{viTUtistotra}(!)¦ •n. ‹Titel eines †Stotra› ¯BURNELL,T. PW102520

This appears to be some sort of 'editorial' comment by the author regarding the headword.

In the prior coding of these cases, this (?) or (!) was put in the homonym field of the xml form. However, this makes no sense. For now, I'm going to simply add it as part of the 'key2' field. In effect, the change is:

OLD:
<H1>100{gOtamasa}1{gOtamasa}(?)¦ 
NEW:
<H1>100{gOtamasa}1{gOtamasa(?)}¦

This is similar to the way some other special symbols, such as * are currently shown as part of the key2 field in PW, such as:

<H1>100{aMSakaraRa}1{*aMSakaraRa}¦ •n. {%Theilung%} PW8
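
The OLD/NEW change above could be sketched as a single regular-expression substitution. This is only an illustration under the assumption that the mark always sits directly between the closing brace of key2 and the ¦ character, as in the 13 lines listed; the function name is hypothetical.

```python
import re

# Header of an old-format line whose homonym slot holds (?) or (!).
EDITORIAL = re.compile(r"^(<H1>\d{3}\{[^}]*\}1\{[^}]*)\}(\([?!]\))¦")

def move_mark_into_key2(line: str) -> str:
    """Move a trailing (?) or (!) inside the key2 braces; leave other lines alone."""
    return EDITORIAL.sub(r"\1\2}¦", line)

print(move_mark_into_key2("<H1>100{gOtamasa}1{gOtamasa}(?)¦ •Adj."))
```

Lines with an ordinary homonym number such as `^1` do not match the pattern and pass through unchanged.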

gasyoun commented 7 years ago

Maybe it's time to kill ¦ as well, or no need?

funderburkjim commented 7 years ago

Question on 3rd form of 'tva'

Is the highlighted mark a smudge or an accent? If accent, which kind (udAtta, anudAtta, svarita)?

funderburkjim commented 7 years ago

We could kill ¦ in the displays if it is deemed offensive. But I think we should keep it in the digitization. Reason: it helps identify the part of the text that is viewed as the headword.

SergeA commented 7 years ago

Question on 3rd form of 'tva'

त्व॒, तु॒अ॒ tva̱, tu̱a̱ 3 anudattas. They mean this word is not accented (cannot be accented).

funderburkjim commented 7 years ago

non-italic German: markup needed?

In the print, German text can be italicized or not (see above).

In the digitization pw.txt, the italic text is coded in the same way as in all the dictionaries {%X%}.

However, the non-italic German text is also delimited by markup: ‹X› (Single left/right-pointing angle quotation mark). For example: ‹Ein Relativum vor dem zweiten› for part of the above 'tva' example.

This coding of non-italic text is, I think, unique to the pw.txt digitization; it does not occur in pwg.

I think we should drop markup of non-italic German text.

Comments?

funderburkjim commented 7 years ago

3 anudattas

Have you noticed any other anudatta accents in pw dictionary? The pw.txt digitization codes udatta and svarita, but I don't identify any instances of anudatta being coded. It would seem odd if anudatta occurs only in this entry, that's why I ask if there are other instances in the print.

funderburkjim commented 7 years ago

`<sic>`

There are 28 cases where the <sic> markup appears in the pw.txt digitization. Based on examination of the first couple of cases, this is markup that Thomas added -- it is not part of the printed text.

My inclination is to delete the 'sic' markup.

Here is the common interpretation of 'sic'.

A Latin word for “thus,” used to indicate that an apparent error is part of quoted material and not an editorial mistake:
“The learned geographer asserts that 'the capital of the United States is Washingtown [sic].'”

However, at least in some of the cases, Google translate of the prior word(s) shows no obvious error.

It would be good for someone to check all these cases, and evaluate whether any of the 'sic' coding should be retained.

Here is a raw listing of the lines in pw.txt containing <sic>. If you examine any of these cases, please feel free to edit the gist in whatever way seems helpful.

SergeA commented 7 years ago

Have you noticed any other anudatta accents in pw dictionary?

No, I didn't, 'cause I never use this dic. But a quick revision brings more examples: मे॒1; नौ॒1 ; व॒1 ; च॒1

gasyoun commented 7 years ago

We could kill ¦ in the displays if it is deemed offensive. But I think we should keep it in the digitization.

Agree and understand.

This coding of non-italic text is, I think, unique to the pw.txt digitization; it does not occur in pwg.

Ok, but is it a strong enough argument? If I want to search the German-only part, will I still be able to do it after killing the markup?

gasyoun commented 7 years ago

The raw listing needs attention of @fxru or @zaaf2 .

funderburkjim commented 7 years ago

anudAtta not coded

Thanks for other examples of anudAtta in print.

Looking at the 'nO' example, we see that the printed text shows anudAtta for homonym 1 and udAtta for homonym 2.

In the digitization, the udAtta is coded, but not the anudAtta.

I double-checked the original digitization from Thomas, and there also anudAtta is not coded.

This is a new fact about the pw.txt digitization.

funderburkjim commented 7 years ago

word group code

In the digitization, there is a 3-digit numeric code for each entry. This code was introduced by Thomas, using some undocumented algorithm; it is not part of the printed text. The meta-line format of the digitization maintains this code, for the sake of information preservation. Here is how the coding looks in the original digitization and in the meta-line conversion:

ORIGINAL
<H1>000{a}1{a}^1¦ ‹Pron. der 3ten Person. Davon›
<H1>100{aMSitA}1{aMSitA}¦ •f. {%das Erbesein , ...
CURRENT  (the <e> field in the meta-line)
<L>1<pc>1001-1<k1>a<k2>a<h>1<e>000
<L>19<pc>1001-1<k1>aMSitA<k2>aMSitA<e>100
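
As an illustration of how the ORIGINAL header maps onto the meta-line fields, here is a small sketch. It assumes the L-number and page-column value are supplied from elsewhere (they are not part of the `<H1>` header itself), and the function name is hypothetical.

```python
import re

# <H1>XXX{KEY1}1{KEY2}^H¦ where the ^H homonym part is optional.
H1 = re.compile(r"^<H1>(\d{3})\{([^}]*)\}1\{([^}]*)\}(?:\^(\d+))?¦")

def meta_line(h1_line: str, lnum: int, pc: str) -> str:
    """Build a meta-line from an old header line plus an L-number and page-column."""
    m = H1.match(h1_line)
    if m is None:
        raise ValueError("not an <H1> header line")
    code, key1, key2, hom = m.groups()
    h_field = f"<h>{hom}" if hom else ""
    # The 3-digit group code is preserved in the <e> field.
    return f"<L>{lnum}<pc>{pc}<k1>{key1}<k2>{key2}{h_field}<e>{code}"

print(meta_line("<H1>000{a}1{a}^1¦ ‹Pron. der 3ten Person. Davon›", 1, "1001-1"))
```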

summary of codes

Here is a summary of the codes, the frequency of occurrence, and likely meanings for two codes. The meaning of the other codes is not known.

code	frequency	meaning
000	3707
001	613
004	420
100	126461	Adjective or Noun
107	735
108	561
500	3286	Verb
501	1
999	3

While the idea of classifying entries by something like part of speech is useful, my preliminary impression is that these codes need improvement to be of real use. But such improvement is a task for another day.
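
A frequency summary like the one above can be reproduced from the meta-lines with a few lines of Python. This is a sketch only; it assumes each meta-line starts with `<L>` and carries exactly one `<e>` field, and the function name is made up.

```python
import re
from collections import Counter

E_FIELD = re.compile(r"<e>(\d{3})")

def count_group_codes(lines):
    """Tally the 3-digit <e> codes found on meta-lines."""
    counts = Counter()
    for line in lines:
        if not line.startswith("<L>"):
            continue  # body lines and <LEND> carry no group code
        m = E_FIELD.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

sample = [
    "<L>1<pc>1001-1<k1>a<k2>a<h>1<e>000",
    "<L>19<pc>1001-1<k1>aMSitA<k2>aMSitA<e>100",
]
print(count_group_codes(sample))
```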

funderburkjim commented 7 years ago

missed Greek

@jmigliori Here is a case where Greek text was missed in PW, due to a digitization error. Would you fill it in?

jmigliori commented 7 years ago

ταῦρος

funderburkjim commented 7 years ago

@jmigliori Got it. Thank you!

funderburkjim commented 7 years ago

Markup normalization, part 1

Here is the first batch of markup normalization changes for pw.txt. The idea is to make the markup:

- more similar to other dictionaries where practicable,
- simpler to parse (e.g. for <ls>),
- free of non-informative or duplicative markup (e.g. PW# and ellipsis).

Here are the current categories:

Reformat [Page]: [Page1.001-2] -> [Page1001-2] (remove the period after the volume digit)
Remove 'PW#' codes. At the end of each entry in the original pw digitization, Thomas put a sequence number in the form PW<number>. These are almost identical to the current L-number (Cologne record identifier), which appears in the 'meta' line of the new format. For example:
```
<L>20<pc>1001-1<k1>aMSI<k2>aMSI<e>100
OLD:
{#aMSI}¦ •Adv. ‹mit› #{kar} {%theilen.%}  PW20
NEW:
{#aMSI}¦ •Adv. ‹mit› #{kar} {%theilen.%}
<LEND>
```
<g>X</g> -> <lang n="greek">X</lang>
<R>X</R> -> <lang n="russian">X</lang>
<A>X</A> -> X
#{X} -> {#X#} pw.txt markup for Devanagari (in SLP1). Bring coding to form of other dictionaries.
¯X -> <ls>X</ls> Literary source coding. The <ls> form is used in MW, and ACC currently.
… -> (ellipsis character to space character). In earlier digitizations (notably MW, pw, pwg), Thomas used the ellipsis as a sort of 'sticky space', meaning that words joined by ellipses stick together into some kind of semantic unit. However, this markup is erratically used, and not currently useful.
<sic> -> <sic/> The slash to indicate well-formed empty xml element.
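
Most of the categories above are plain substitutions, so they can be sketched as an ordered table of regular expressions. This is an illustration, not the conversion script actually used. The `¯X -> <ls>X</ls>` rule is left out on purpose, because the thread does not say where X ends; the rule list and function name are assumptions.

```python
import re

# (pattern, replacement) pairs, applied in order to each line of pw.txt.
RULES = [
    (re.compile(r"\[Page(\d)\.(\d+-\d+)\]"), r"[Page\1\2]"),    # drop period after volume digit
    (re.compile(r"\s*PW\d+\s*$"), ""),                          # trailing PW<number> code
    (re.compile(r"<g>(.*?)</g>"), r'<lang n="greek">\1</lang>'),
    (re.compile(r"<R>(.*?)</R>"), r'<lang n="russian">\1</lang>'),
    (re.compile(r"<A>(.*?)</A>"), r"\1"),                       # keep text, drop the tag
    (re.compile(r"#\{([^}]*)\}"), r"{#\1#}"),                   # Devanagari (SLP1) markup
    (re.compile("…"), " "),                                     # 'sticky space' ellipsis
    (re.compile(r"<sic>"), "<sic/>"),                           # well-formed empty element
]

def normalize(line: str) -> str:
    """Apply every rule in sequence and return the rewritten line."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line

print(normalize("#{kar} {%theilen…sich%} <sic> PW20"))
```

Keeping the rules in one list makes the order explicit, which matters if a later rule could touch text produced by an earlier one.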

gasyoun commented 7 years ago

In the digitization, the udAtta is coded, but not the anudAtta.

Cool.

Remove 'PW#' codes. At the end of each entry in the original pw digitization, Thomas put a sequence number in the form PW<number>. These are almost identical to the current L-number (Cologne record identifier), which appears in the 'meta' line of the new format.

Kill 'em?

However, this markup is erratically used, and not currently useful.

Was not aware. It was not documented in MW.

drdhaval2785 commented 7 years ago

A potential explanation of the need to encode anudAtta in single-vowel words.

As per Sanskrit grammar, in a word, whatever is not udAtta or svarita is treated as anudAtta (अनुदात्तं पदमेकवर्जम्). So when someone marks some vowel as udAtta or svarita, we understand that rest are anudAtta.

Single vowel words present a separate difficulty. So they need to be marked.

gasyoun commented 7 years ago

<H1>501{khid}1{khid}¦ ²1) #{khidaªti} ‹und› #{*khindati} {%*drücken…,…niederdrücken…,…betrüben.%} ²2) #{*khintte…,…khidyate…,…khidyati} ‹und› #{khidati} (¯BHA7G.P.). {%sich…gedrückt…fühlen…,…niedergeschlagen…sein…,…sich…Etwas…zu…Herzen…nehmen…,…eine…Qual…empfinden…;…eine…Ermüdung…~…,…eine…Erschlaffung…verspüren.%} #{khinna} {%niedergedrückt…,…niedergeschlagen…;…ermüdet…,…erschlafft.%}

khid is dhatu, so I think 501 is an error and should be 500.

107 and 108 are Adj. and noun as well, but can't grasp the difference with 100.

999 I guess the coder forgot and made something unique so he can kill it after.

000, 001, 004 are Adj. and noun as well. I was thinking that some occur only as the last part of a compound or contain an upasarga, but no.

funderburkjim commented 7 years ago

Kill em? [PW codes]

Yes, that's what I'm doing. They are duplicative of the L-number codes.

I think 501 is an error

Since there's only one, and it looks like a root, I agree. Will make the change.

funderburkjim commented 7 years ago

anudatta in single-vowel words.

@drdhaval2785 mentions that this category of words is where an explicit anudatta marking might occur in the printed text of PW. By a quick examination of the scan for these headwords, we could decide if the digitization is missing an anudAtta coding or not. This would be simple, though time-consuming.

For possible future reference, the 'single_vowel_words.txt' file in this gist shows the meta-line for all headwords in pw.txt with just one vowel in the 'key1' headword spelling.
Also, the last field says (by examining the key2 field) whether the vowel is coded with an accent or not.

3022 single vowel words
svarita 4
udAtta 290
anudAtta 1   (This is 'tva\'  mentioned above)
NOACCENT 2727

1948 of these have group code '500', so are probably roots.
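
A classification of this kind could be sketched as follows. The accent characters are partly assumptions: `/` for udAtta and `\` for anudAtta appear in examples in this thread, while `^` for svarita is assumed from the usual Cologne SLP1 convention and is not confirmed here. The function name is hypothetical.

```python
import re

SLP1_VOWELS = set("aAiIuUfFxXeEoO")
# Accent marks as they would appear in key2; '^' for svarita is an assumption.
ACCENTS = {"/": "udAtta", "^": "svarita", "\\": "anudAtta"}
KEYS = re.compile(r"<k1>([^<]*)<k2>([^<]*)")

def accent_of_single_vowel_word(meta_line: str):
    """Return the accent label for a single-vowel headword, or None otherwise."""
    m = KEYS.search(meta_line)
    if m is None:
        return None
    key1, key2 = m.groups()
    if sum(ch in SLP1_VOWELS for ch in key1) != 1:
        return None  # not a single-vowel headword
    for mark, name in ACCENTS.items():
        if mark in key2:
            return name
    return "NOACCENT"

print(accent_of_single_vowel_word("<L>249<pc>1003-3<k1>akz<k2>akz<e>500"))
```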

SergeA commented 7 years ago

Examples of anudatta in multi-syllable words. e̱na̱ 1 e̱na̱ 2 und e̱nā̱ ba̱ta̱ 1

Also I suppose all the headwords described as "enklitischer" must be marked with anudattas. However the word vas 1 enklitischer is not marked. Maybe print error.

funderburkjim commented 7 years ago

Question: meaning(s) of `*` in PW

There are many instances in PW where words are preceded by an asterisk. Sometimes the word is Sanskrit, sometimes the word is German.

Example from Page 1:

What is the significance (meaning) of all these `*` ?

Is the usage documented by the author somewhere in the Front Matter for PW ?

gasyoun commented 7 years ago

Had to reread the prefaces.

If * before a literary source - it is quoted rarely. In Preface of 1st volume PWG (1855).

(Gedruckte Werke aus der Sanskrit-Literatur, die nur ganz gelegentlich citirt werden, sind mit einem Sternchen bezeichnet.)

If * before a Sanskrit word, German meaning - it is "invented" by a grammarian or lexicographer and is not met in any other literary source. In Preface of 1st volume PWK (1879).

Ein Wort, eine Bedeutung, eine Construction oder ein Genus, die bis jetzt nur von Grammatikern oder Lexicographen aufgeführt werden, sind mit * bezeichnet worden.

I would ask you, Jim, to add markup for such cases as well:

Zwei Zahlen ohne Angabe eines Buches verweisen auf die zweite Auflage meiner Chrestomathie.

Two numbers without mention of a book = Böhtlingk's Chrestomathie, 2nd ed.

Der am Ende eines Titels in Klammern stehende Name bezeichnet den Gelehrten, der die Beiträge für dieses Wörterbuch aus dem angegebenen Buche ganz oder zum grössten Theile geliefert hat.

Surnames (Capeller, Delbruck, Garbe, Geldner, Jolly, Leskien, Muir, Pischel, Schiefner, Schroder, Windisch; Kern, Weber, Stenzler - are they there?) after literary sources mean that they were provided by that scholar.

SergeA commented 7 years ago

If * before a Sanskrit word, German meaning - it is "invented" by a grammarian or lexicographer and is not met in any other literary source. In Preface of 1st volume PWK (1879).

Not necessarily "invented". Bohtlingk does not say so. Just "not met". For the readers this means: be careful! there is a danger of a false word. This corresponds to MW's mark "L."

funderburkjim commented 7 years ago

Thanks for clarification of meaning of asterisk.

Regarding markup, the digitization has those 'asterisks'. That seems adequate currently. For instance, one could search for all headwords whose key2 begins with an asterisk.
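
Such a search is a one-line test on the meta-line. A minimal sketch, assuming the asterisk is kept as the first character of the `<k2>` field; the sample meta-line below is made up for illustration (its page value is not from the digitization):

```python
import re

K2 = re.compile(r"<k2>([^<]*)")

def has_starred_key2(meta_line: str) -> bool:
    """True when the key2 field of a meta-line begins with an asterisk."""
    m = K2.search(meta_line)
    return m is not None and m.group(1).startswith("*")

# Hypothetical meta-line for the aMSakaraRa entry shown earlier.
print(has_starred_key2("<L>8<pc>1001-1<k1>aMSakaraRa<k2>*aMSakaraRa<e>100"))
```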

funderburkjim commented 7 years ago

Question regarding italic/non-italic in PW.

German text within PW appears sometimes italicized, sometimes non-italicized. What is the meaning of the difference? My suspicion is that the italicized text is in the nature of a translation of the sense of a word; while the non-italic text pertains to meta-information about the word, such as details about its grammatical forms or the forms of other words used with the word.

See also the question above -- I still think the digitization markup of non-italic German is superfluous, and am inclined to remove it in this round of housekeeping.

funderburkjim commented 7 years ago

Subdivisions in PW entries, part 1

There are several kinds of subdivisions within the PW text. In the digitization, Thomas did a lot of work to identify and mark these subdivisions. The main problem I have with the markup is that it is obscure. In the construction of the xml form and the display of the xml form, these markups have been converted to more usable forms. In this present work on the pw.txt digitization, I'm pushing this change of notation down to pw.txt.

The most prevalent kind of subdivision appears in the text as 1) etc., a) etc. and α) etc. The number sequence is top level; a letter sequence of subsections may be embedded within a particular one of the number sequence sections; and occasionally a sequence of greek-letter subdivisions may be embedded within a particular one of the letter sequence sections. It is like an outline with three layers of indentation.

Type	Thomas notation	xml notation
number	²1) ²2), etc	`<div n="1">— 1)`
letter	³a), ³b), etc	`<div n="2">— a)`
greek	¹a), ¹b), etc.	`<div n="3">— α)`

Notes:

The printed text has the mdash — before all but the first member of a subsection. For instance, 1) xxxx — 2) xxxxx — 3) xxxxx. Thomas' notation did not include any of the mdashes. The xml notation includes the mdash even for the first member of the sequence, e.g. <div n="1">— 1) xxxx <div n="1">— 2) xxxx <div n="1">— 3) .
For the Greek, Thomas used Latin alphabet to represent the Greek letters. The xml notation shows the Greek letters directly.
The n="1", n="2", n="3" attributes of the div element in the xml notation are not strictly necessary, since they could be inferred by analyzing X in the pattern that follows the div tag: — X) (e.g., if X is a digit sequence, then n must be '1', etc.). However, using the attribute makes the processing easier.
The closing 'div' tag is not present in the pw.txt. It will be added in the construction of the pw.xml, the xml form of the digitization.
The <div> elements always start on a new line. This is not strictly necessary, but adding these line breaks makes individual lines of the pw.txt digitization shorter and more coherent, and thereby makes this form of the digitization easier to understand.

Here is a before/after example of the markup changes thus far, including the divisions. This is for entry aMSumant.

OLD:  (one line in pw.txt -- I've split it for the purpose of this comment)
<H1>100{aMSumant}1{aMSuma/nt}¦ ²1) •Adj. ³a) {%reich an †Soma-Pflanzen oder -Saft.%} ³b) 
{%faserig.%} ³c) {%strahlenreich.%} ²2) •m. ³a) {%die Sonne%} ¯250,18. ³b) ‹N.pr.› ¹a) ‹verschiedener 
Männer› ¯106,18. ¹b) ‹eines Berges.› ²3) •f. #{°matI} ³a) {%®Hedysarum_gangeticum.%} ³b) ‹N.pr. eines 
Flusses.› PW30

NEW:  (including also the enclosing 'meta' lines)
<L>30<pc>1001-2<k1>aMSumant<k2>aMSuma/nt<e>100
{#aMSuma/nt#}¦ 
<div n="1">— 1) •Adj. 
<div n="2">— a) {%reich an †Soma-Pflanzen oder -Saft.%} 
<div n="2">— b) {%faserig.%} 
<div n="2">— c) {%strahlenreich.%} 
<div n="1">— 2) •m. 
<div n="2">— a) {%die Sonne%} <ls>250,18.</ls> 
<div n="2">— b) ‹N.pr.› 
<div n="3">— α) ‹verschiedener Männer› <ls>106,18.</ls> 
<div n="3">— β) ‹eines Berges.› 
<div n="1">— 3) •f. {#°matI#} 
<div n="2">— a) {%®Hedysarum_gangeticum.%} 
<div n="2">— b) ‹N.pr. eines Flusses.›
<LEND>
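
The change of notation in the table above can be sketched as one substitution that also inserts the line breaks. This is an illustration under two assumptions: the superscript characters ², ³, ¹ are used only as division markers, and only the Greek mappings a -> α and b -> β are attested in this thread, so other letters are left as they are. The function name is hypothetical.

```python
import re

LEVEL = {"²": "1", "³": "2", "¹": "3"}   # Thomas' marker -> div level
GREEK = {"a": "α", "b": "β"}             # only these two are attested above
MARK = re.compile(r"([²³¹])([0-9]+|[a-z]+)\)")

def _to_div(m):
    level, label = LEVEL[m.group(1)], m.group(2)
    if level == "3":
        label = GREEK.get(label, label)  # unknown letters stay Latin
    return f'\n<div n="{level}">— {label})'

def to_div_lines(text: str):
    """Split one old-format entry body into lines, one per <div> subdivision."""
    return [line.rstrip() for line in MARK.sub(_to_div, text).split("\n")]

old = "{#aMSuma/nt#}¦ ²1) •Adj. ³a) {%faserig.%} ²2) •m. ³b) ‹N.pr.› ¹a) ‹verschiedener Männer›"
for line in to_div_lines(old):
    print(line)
```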

funderburkjim commented 7 years ago

Subdivisions in PW entries, part 2

The second major subdivision of entries is for the prefixed form of roots. Rather than having a separate entry for gam, upagam, udgam, etc., this dictionary has one entry for gam, and then a slew of subdivisions of the gam entry for the different prefixed forms of gam. Again, Thomas has done most of the hard work of identifying and marking these prefix subdivisions. I'm merely changing the notation.

Type	Thomas notation	xml notation
prefix	`<+> {#nis#}`	`<div n="p">— Mit {#nis#}`

Full example, of root akz:

OLD: 
<H1>500{akz}1{akz}¦ , #{*akzati} ‹und› #{*akznoti} , •Partic. #{azwa} ²1) {%erreichen , erlangen%}: #{AkzARa/}. ²2) {%durchdringen , erfüllen.%}
<+> #{nis} {%entmannen , verschneiden.%}
<+> #{sam} ‹(› #{akzase}) {%durchdringen.%} PW249

NEW:  (so far)
<L>249<pc>1003-3<k1>akz<k2>akz<e>500
{#akz#}¦ , {#*akzati#} ‹und› {#*akznoti#} , •Partic. {#azwa#} 
<div n="1">— 1) {%erreichen , erlangen%}: {#AkzARa/#}. 
<div n="1">— 2) {%durchdringen , erfüllen.%}
<div n="p">— Mit {#nis#} {%entmannen , verschneiden.%}
<div n="p">— Mit {#sam#} ‹(› {#akzase#}) {%durchdringen.%}
<LEND>
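
The prefix notation change is a single anchored substitution. A minimal sketch, assuming every prefix subdivision starts its own line with `<+> #{...}` as in the akz example; the function name is made up:

```python
import re

# Old prefix marker at the start of a line: <+> #{nis}
PREFIX = re.compile(r"^<\+> #\{([^}]*)\}")

def convert_prefix_line(line: str) -> str:
    """Rewrite '<+> #{X}' as '<div n="p">— Mit {#X#}'; other lines are unchanged."""
    return PREFIX.sub(r'<div n="p">— Mit {#\1#}', line)

print(convert_prefix_line("<+> #{nis} {%entmannen , verschneiden.%}"))
```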

funderburkjim commented 7 years ago

Subdivisions in PW entries, part 3

Thomas marks several other likely subdivisions. These occur much less frequently, and there is more variation in some of the details of markup. Also, some of these markup patterns have been identified as divisions by me, and may be unwarranted. All of these have been marked as division type 'm' (for miscellaneous); there are 2622 of them at this writing.

Type	Thomas notation	xml notation	count
Causal	`<Caus.>`	`<div n="m">— •Caus.`	1750+
Intensive	‹--› ‹•Intens.›	`<div n="m">— •Intens.`	280
See	‹--› ‹Vgl.›	`<div n="m">— •Vgl.`	105
Partic ?	‹--› ‹Partic.›	`<div n="m">— •Partic.`	35
Incorrect for	‹--› ‹Fehlerhaft für›	`<div n="m">— Fehlerhaft für`	5

Notes:

I don't know the word for which Partic. is an abbreviation. GUESS: Participle?
- Is there a location in the front matter or elsewhere that lists such abbreviations?
- Is it appropriate to consider this a 'subdivision' ?
Fehlerhaft für
- Suspicious that there are only 5; the word Fehlerhaft occurs 88 times.
- Is it appropriate to consider this a 'subdivision' ?
There are still 850 instances of mdash (coded as ‹--›). I don't think these are involved in subdivisions, but it is hard to say. My current intent is just to translate these ‹--› to the unicode mdash — without including division markup.

gasyoun commented 7 years ago

The xml notation shows the Greek letters directly.

Converted on the fly?

However, using the attribute makes the processing easier.

Exactly.

makes individual lines of the pw.txt digitization shorter

And that's important. Too many difficulties already, we do not want to increase them.

•Adj.

What is the • and why it's left?

a separate entry for gam, upagam, udgam

Can we have a full list of the sopasarga forms now, Jim?

I don't know the word for which Partic. is an abbreviation. GUESS: Participle?

Yes, all kinds of participles. Not only udakta, but also adAna, dRzyamAna.

Is there a location in the front matter or elsewhere that lists such abbreviations?

No. There are only literary sources. The rest was obvious in 1850. Is it not so for you, Jim? :fallen_leaf:

Suspicious that there are only 5; the word Fehlerhaft occurs 88 times.

I have researched it in the past. There are many words used with the same meaning. One of such (not popular, but still), lies.

funderburkjim commented 7 years ago

What is the • and why it's left?

It is markup added by Thomas. I think it is attached to words which are abbreviations of Grammatical terms. I'll be changing these to <ab>Adj.</ab> (standard notation for abbreviations), and will generate a list at that time.

full list of the sopasarga forms ?

Should be readily generated from the new form, since the general pattern will be like the 'n="p"' pattern shown above.
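
A sketch of how such a list might be pulled from the new form, assuming each entry runs from its `<L>` meta-line to `<LEND>` and each prefixed form starts a line with `<div n="p">— Mit {#...#}`. The function name is hypothetical.

```python
import re

K1 = re.compile(r"<k1>([^<]*)")
PREFIX_DIV = re.compile(r'^<div n="p">— Mit \{#([^#]*)#\}')

def prefixed_forms(lines):
    """Return (root, prefix) pairs, one per prefix subdivision."""
    pairs, root = [], None
    for line in lines:
        if line.startswith("<L>"):
            m = K1.search(line)
            root = m.group(1) if m else None
        elif line.startswith("<LEND>"):
            root = None
        elif root is not None:
            m = PREFIX_DIV.match(line)
            if m:
                pairs.append((root, m.group(1)))
    return pairs

entry = [
    "<L>249<pc>1003-3<k1>akz<k2>akz<e>500",
    "{#akz#}¦ , {#*akzati#} ‹und› {#*akznoti#} , •Partic. {#azwa#}",
    '<div n="p">— Mit {#nis#} {%entmannen , verschneiden.%}',
    '<div n="p">— Mit {#sam#} ‹(› {#akzase#}) {%durchdringen.%}',
    "<LEND>",
]
print(prefixed_forms(entry))
```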

Yes, all kinds of participles.

Thanks for info.

No list of abbreviations like Partic. by author.

Too bad. But an opportunity for us to generate a list that will aid modern readers.

gasyoun commented 7 years ago

Great.

Should be readily generated from the new form.

Hurray!

But an opportunity for us to generate a list that will aid modern readers.

Exactly.

funderburkjim commented 7 years ago

IAST conversion, list to check

There are many aspects that have arisen in the course of converting the AS (number-letter) coding within pw.txt to modern IAST. I'm classifying the contexts in which AS coding occurs in three parts:

in literary source abbreviations. There are about 45,000 distinct cases here. Will defer discussion here until coordinating the new digitization with our previous work.
In text preceded (in the original digitization) by † (unicode DAGGER). This dagger seems to indicate a particular font-type and spacing in the printed text. There are about 4900 different words so marked.
Other. There are about 120 distinct cases here; in fact many of these also appear with a dagger, but with an intervening ''; example `†Karṇa`. So they are much like the second class.

Although I've made about 300 miscellaneous corrections in the course of the work thus far, I'm sure that there are other spelling errors in the last two (non-<ls>) groups. There is a gist list iast_check1.txt .

The list has 4998 cases (a small number of these are duplicates). The ✓ in a case indicates that the spelling is probably correct, since the spelling (when converted to SLP1) appears as a headword in pw. There are 2933 of these, and 2065 cases marked TODO. Each case also shows the frequency of occurrence within pw.txt. Taking frequencies into account, there are 26818 text instances that are DONE (marked with ✓), and 4910 that are marked TODO and remain to be checked (so about 15% of the text instances are unaccounted for among the pw headwords).

The TODO items are further divided into frequently occurring spellings (3 or more instances), and these are marked with an asterisk: TODO*. These 255 cases are the most important, in the sense that they account for nearly half of the instances.

There are also some fairly obvious mis-spellings (obviousness is in the eye of the beholder), e.g., variants of Kṛṣṇa.

Correction by eye

It would be good to get many of the TODO cases examined and corrected by eye.

Some of the words are spelled correctly, like Durgā : 449 : TODO* (PW has this under the adjective Durga); these can be marked as OK: Durgā : 449 : TODO* : OK .
Words which are (almost certainly) mis-spelled can be marked with the correct spelling, such as Durgá : 1 : TODO : Durgā (The accent was no-doubt a mis-reading of the circumflex, which is what pw printed text uses for long vowels in his peculiar IAST.)
A very small third category of obvious cases includes German words. I think Aussehen : 1 : TODO : is of this type; maybe the solution can be to flag as OTHER: Aussehen : 1 : TODO : OTHER.

These corrections could be made directly within the gist list, or in a local copy of the gist if that's more convenient. I could make the gist 'Public' if that seems helpful.

If @SergeA has some time to examine, he can probably do many of the TODO cases quite readily. Others are welcome to join in!

There will certainly be some which can't be determined 'by eye' -- they will need a UI type environment so that the print and context can be examined readily. Maybe I'll consider such a UI when the obvious cases are handled.
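
For anyone who wants to tally progress on the list, here is a rough sketch of reading the cases. It assumes each line has the colon-separated shape of the examples above (spelling : frequency : status : optional review note); how the ✓ cases are written in the gist is not shown in this comment, so they are not handled specially. The function names are made up.

```python
import re
from collections import Counter

def parse_case(line: str):
    """Split 'word : freq : status [: note]' into its parts."""
    parts = re.split(r"\s*:\s*", line.strip())
    word, freq, status = parts[0], int(parts[1]), parts[2]
    note = parts[3] if len(parts) > 3 and parts[3] else None
    return word, freq, status, note

def instance_totals(lines):
    """Sum frequencies per status, treating TODO* the same as TODO."""
    totals = Counter()
    for line in lines:
        _, freq, status, _ = parse_case(line)
        totals[status.rstrip("*")] += freq
    return totals

cases = [
    "Durgā : 449 : TODO* : OK",
    "Durgá : 1 : TODO : Durgā",
    "Aussehen : 1 : TODO :",
]
print(instance_totals(cases))
```

The 15% figure quoted above is the same kind of frequency-weighted share: 4910 / (26818 + 4910) is roughly 0.155.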

gasyoun commented 7 years ago

I could make the gist 'Public' if that seems helpful.

Yes.

funderburkjim commented 7 years ago

Gist list should now be public. I guess that means it is open to collaboration and editing. Someone else should give a try to editing it.

funderburkjim commented 7 years ago

Question re an addition by Thomas

Under headword 'yadi', Thomas expanded an abbreviation M. to Mānavadharmaśāstra.

Current coding is:

<div n="1">— 1) {%wenn%} , ‹mit Indic. , Conj. , Pot. und Fut. in der älteren Sprache› ; ‹gewöhnlich 
einfacher Nachsatz ohne Partikel.› {#ya/di cit , yadi ha vE , ya/dI/t , ya/dyu#} ‹(35 , 25.36 , 23)› , {#yadyu 
vE#}. ‹In den späteren Werken (von <is1>Mānavadharmaśāstra</is1> an)›

Should we leave the expansion or revert to the M. of the print ?

funderburkjim commented 7 years ago

Abbreviation expansion

Here is another example where Thomas expanded an abbreviated word. Clearly the two J. of the printed text refer to the previous Jaǵus (modern iast Yajus). In this case, I think it may be helpful to leave the expansion. What do others think?

From pw.txt: {#yajuzwa/s#}¦ •Adv. {%von Seiten des †Jag4us , in Beziehung auf das †Jag4us , im Gebiete des †Jag4us%} <ls>21,2.</ls> <ls>A7PAST.C2R.9,16,4.</ls>

Apparently Thomas did a lot of editing of the pw digitization back in 2005 or so; since the resulting digitization gives no typographical clue with regard to such abbreviation expansions, there's no systematic way to search for them. But it may be helpful to know this feature of the digitization, when, as here, we stumble upon such a case.

gasyoun commented 7 years ago

Should we leave the expansion or revert to the M. of the print ?

I would stay with print.

In this case, I think it may be helpful to leave the expansion.

One case against 45k does not change a bit. I would not mix.

Apparently Thomas did a lot of editing of the pw digitization back in 2005

That's interesting. If there are hundreds of such, it would be a pity to kill them, but if rare...

funderburkjim commented 7 years ago

Reverted expansions of M. and J. back to M. and J., in agreement with text. @gasyoun Thanks for feedback.

funderburkjim commented 7 years ago

Abbreviations marked with •

The check_dot list contains items in the pw.txt digitization that Thomas marked with the • (unicode BULLET) character. There are only 77 distinct entries in this file, but 200,000 or so instances of these in the digitization.

The Wikipedia article on German abbreviations has several of these.
Maybe @gasyoun could provide German and English Translations for these, which could then be used as Tooltips in the displays.

The intent is to change the markup of these in pw.txt to the xml form <ab>X</ab>.

In doing so, there are several subquestions.

form `•*X`

Many of these abbreviations occur in two forms in the list: •X and •*X. From discussion above of meaning of *, the * is really a separate piece of information, which says something about the legitimacy of the following word. So, I think it would be more accurate if Thomas had used the coding *•X since the * is commenting on the abbreviation.
Thus, the proposed coding of •*X is *<ab>X</ab>.

•gaṇa

Occurs 603 times. This is not an abbreviation. An example usage is {#*aRIva#}¦ •gan2a {#zuBrAdi#}. which says, I think, that the word aRIva is in the word-collection zuBrAdi.
This gaṇa information is also present in MW, and probably several other dictionaries. However we have not developed markup for it in any dictionary. I would like to find the source document where these gaṇas are defined; presumably, the other words in a particular gana would shed light on the possible meaning of the particular word.

In the example above, the headword itself is marked with asterisk; and indeed almost all of the instances of •gaṇa are for headwords similarly marked with asterisk. It is interesting to contrast this usage with that of MW. When searching for 'gana' in the text of MW (Advanced Search), the first example is 'ajasraM' ind. perpetually, for ever, ever. [गण स्वर्-आदि, &c ]. But in pw.txt, under 'ajasra', there is no mention of a gana.

My inclination as to how to recode gana is:

Recode •gaṇa as <is>gaṇa</is> (The <is> tag is being used to identify the iast-sanskrit words in pw.txt)
Defer further markup in pw.txt until a later time when we enhance the markup of gana information in the various dictionaries.

•Patron. and •Patronn.

There are only 3 instances of •Patronn. (patronymic). I think these should be considered print errors in favor of the more common 1-n version.

•Beinn. and •*Bein.

There are only 7 •Beinn. instances. I think these should be changed in favor of •Bein.

•»s.u.

I think the » should be dropped - it represents nothing in the printed text, and may have been added just because the grapheme » appears to be pointing to something, and the abbreviation s.u. in German means roughly 'see under' (some following headword).

Should m,f,n, and Adj be marked with `<lex>`

By far the most common abbreviations in check_dot are •Adj. , •m., •f. , •n.. In MW, the gender information for nominals is marked with <lex> (e.g. <lex>m.</lex>). Perhaps we should use this <lex> tag in PW for these 4 abbreviations.

What do others think?

•adj. (lower-case)

Nearly all of these occur followed by Comp. Example under hw aMSukAnta, {#aMSukAnta#}¦ •m. {%Zipfel eines Gewandes , ~ Tuches%} <ls>296,10</ls> ‹(am Ende eines› •adj. ‹Comp.› •f. {#A#}).

Maybe this pair of words should be a separate abbreviation: <ab>adj. Comp.</ab> I'm unsure of the meaning.

gasyoun commented 7 years ago

I think it would be more accurate if Thomas had used the coding *•X

Agree

source document where these gaṇas are defined

Dhaval? @drdhaval2785

I think these should be considered print errors in favor of the more common 1-n version.

Agree.

I think these should be changed in favor of •Bein.

Agree, Bein = a leg, and there is no such word as Beinn and never was.

I think the » should be dropped

Hmm, maybe not dropped, but moved to XML? Because it would later give us a hint with hyperlinks - what can and should be linked.

Perhaps we should use this <lex> tag in PW for these 4 abbreviations.

Totally agree.

adj. Comp.

Might be a good idea. The am Ende (2136 cases) text means, that when the headword is a 2nd part of a word, it has this ending. Anyway it makes more sense than just <gram n="adj">adj.</gram> <noti>Comp.</noti> Am Anfange is the opposite (160 cases, strange, I would suppose there are thousands of them). And I found a combination of both as well, am Anfange und am Ende einiger Compp.. What I see here is Compp. instead of expected Comp.

funderburkjim commented 7 years ago

check_dot4.txt

check_dot4.txt in gist shows distribution of markup as described above:

<ab> The ones marked as simple abbreviations
<lex> The ones marked as lexical categories (gender/adj).
- Note, for now, treating lower-case 'adj.' same as upper-case 'Adj.'
<is> This is the coding just for gaṇa (not treated as abbreviation).

asterisks moved outside the markup.

For instance, under headword aMSaka: <div n="1">— 2) *<lex>n.</lex> {%Tag.%}

regarding 's.u.'

It is marked as an abbreviation <ab>s.u.</ab> and the special » character removed. I don't think removing » causes any information loss. We still will be able to analyze such cases further, by examining the following word. Here is an example: {#aRvI#}¦ <ab>s.u.</ab> {#aRu#}.
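
The distribution described in this comment could be sketched as below. It is an illustration only: it assumes the marked item runs up to the next whitespace, closing parenthesis or › character, which the thread does not state, and the names are hypothetical.

```python
import re

LEX = {"Adj.", "adj.", "m.", "f.", "n."}   # treated as lexical categories
GANA = {"gan2a", "gaṇa"}                   # not an abbreviation
# • then an optional *, an optional », then the marked item itself.
BULLET = re.compile(r"•(\*?)»?([^\s›)]+)")

def _mark(m):
    star, item = m.group(1), m.group(2)
    if item in GANA:
        return f"{star}<is>gaṇa</is>"
    tag = "lex" if item in LEX else "ab"
    return f"{star}<{tag}>{item}</{tag}>"   # asterisk moved outside the markup

def convert_bullets(text: str) -> str:
    """Replace Thomas' •X markup with <ab>, <lex> or <is> elements."""
    return BULLET.sub(_mark, text)

print(convert_bullets("•*n. {%Tag.%}"))
```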

drdhaval2785 commented 7 years ago

document where these gaṇas are defined;

https://github.com/drdhaval2785/SanskritVerb/blob/master/Data/gaNapATha_SLP.txt

funderburkjim commented 7 years ago

@drdhaval2785 Could you write a 'readme' type file that explains how to read the gaNapATha file?

For instance, in the PW example given above {#*aRIva#}¦ <is>gaṇa</is> {#zuBrAdi#}., what is the list corresponding to zuBrAdi? Similarly, from the MW example, how does one find the स्वर्-आदि gaṇa?

drdhaval2785 commented 7 years ago

There is not much to write to readme.

I will write it here itself.

SuBrAdiByaSca 4.1.123

SuBra","vizwapura","brahmakfta","SatadvAra","SatAvara","SatAvara","SalAkA","SAlAcala","SalAkABrU","leKABrU","vimAtf","viDavA","kiMkasA","rohiRI","rukmiRI","diSA","SAlUka","ajabasti","SakanDi","lakzmaRaSyAmayor vAsizWe","goDA","kfkalAsa","aRIva","pravAhaRa","Barata","BArama","mukaRqu","maGazwu","makazwu","karpUra","itara","anyatara","AlIQa sudatta","sucakzas","sunAman","kadru","tuda","akASApa","kumArIkA","kiSorikA","kuveRikA","jihmASin","pariDi","vAyudatta","kakala","KawvA","ambikA","aSokA","SudDapiNgalA","KaqonmattA","anudfzwi","jaratin","bAlavardin","vigraja","vIja","Svan","aSman","aSva","ajira|

Here the first line is the Astadhyayi rule and number which refers to this gaNa. Second line is blank. Third line is list of words in that gaNa.

I am not sure from where I got this file.
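
Going only by the three-line layout described here, a reader for the file could be sketched like this. Everything about the format beyond that description is an assumption: non-blank lines are taken to alternate between a rule line and a word-list line, and words are taken to be separated by `","` with a trailing `|`, as in the excerpt. The function names are made up.

```python
def parse_ganas(text: str):
    """Map each rule name to its rule number and list of member words."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    ganas = {}
    # Pair each rule line with the word-list line that follows it.
    for header, words in zip(lines[0::2], lines[1::2]):
        name, _, rule = header.rpartition(" ")
        members = [w.strip('"') for w in words.rstrip("|").split('","')]
        ganas[name] = {"rule": rule, "members": members}
    return ganas

def ganas_containing(ganas, word: str):
    """Names of all gaṇas whose member list contains the given word."""
    return [name for name, g in ganas.items() if word in g["members"]]

sample = 'SuBrAdiByaSca 4.1.123\n\nSuBra","vizwapura","aRIva","ajira|\n'
print(ganas_containing(parse_ganas(sample), "aRIva"))
```

Looking a headword up by membership, rather than by gaṇa name, sidesteps any difference in how the gaṇa name is spelled in pw.txt and in this file.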

gasyoun commented 7 years ago

Third line is list of words in that gaNa.

Every ganapatha has same number of lines?
