Closed funderburkjim closed 7 years ago
So there is good news after all as well.
The page breaks in PW are of the form `ƒPage2.176-2ƒ`. These are changed to `[Page2.176-2]`, so the delimiters are square brackets. This is the form of other dictionaries.
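A minimal sketch of this delimiter change, assuming every page break has the shape volume, period, page, hyphen, column seen in the one example given. The function name `convert_page_breaks` is hypothetical and not part of the actual conversion scripts.

```python
import re

# Assumed shape: ƒPage<volume>.<page>-<column>ƒ (taken from the single example above).
PAGE_BREAK = re.compile(r"ƒ(Page\d+\.\d+-\d+)ƒ")


def convert_page_breaks(text: str) -> str:
    """Replace the ƒ...ƒ delimiters around a page break with square brackets."""
    return PAGE_BREAK.sub(r"[\1]", text)


print(convert_page_breaks("some text ƒPage2.176-2ƒ more text"))
```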
The usual coding for a homonym number in the prior form of the digitization is `^1` in this example:
<H1>000{a}1{a}^1¦ ‹Pron. der 3ten Person. Davon› ETC
GENERAL FORM
<H1>XXX{KEY1}1{KEY2}^H¦ [rest of line]
However, in 13 cases, we have either `(?)` or `(!)` instead of `^H`:
<H1>100{gOtamasa}1{gOtamasa}(?)¦ •Adj. ‹mit› #{arka} •m. ‹Name verschiedener †Sa7man.› PW37110
<H1>100{catu}1{catu}(!)¦ •Adj. {%der vierte%} ¯TAITT.A7R.1,8,4. PW38575
<H1>100{cIraRIya}1{cIraRIya}(?)¦ {%ein best. Spiel%} ¯Ind.St.15,419. PW40368
<H1>100{jaNgapUga}1{*jaNgapUga}(!)¦ •m. {%wickedness , sin.%} PW41343
<H1>100{jalambala}1{*jalambala}(!)¦ •n. ²1) {%a stream.%} ²2) {%Collyrium.%} PW42139
<H1>100{jinendraBUti}1{jinendraBUti}(?)¦ •m. = #{jinendrabudDi}. PW42840
<H1>100{jEhmAkani}1{jEhmAkani}(!)¦ •m. ‹desgl.› •Pl. {%sein Geschlecht.%} PW43275
<H1>000{qambura}1{qambura}(?)¦ ¯HEMA7DRI.4.1,638,10. PW43914
<H1>100{tAndana}1{*tAndana}(!)¦ •m. {%Wind.%} PW45207
<H1>100{tigala}1{tigala}(?)¦ •m. ‹N.pr. eines Mannes.› PW45672
<H1>100{tilakanija}1{tilakanija}(!)¦ •m. •Pl. ‹N.pr. eines Volkes.› PW45921
<H1>100{viWaNka}1{*viWaNka}(!)¦ •Adj. {%bad , vile.%} PW102347
<H1>100{viTUtistotra}1{viTUtistotra}(!)¦ •n. ‹Titel eines †Stotra› ¯BURNELL,T. PW102520
This appears to be some sort of 'editorial' comment by the author regarding the headword.
In the prior coding of these cases, this `(?!)` was put in the homonym field of the xml form. However, this makes no sense. For now, I'm going to simply add it as part of the 'key2' field.
In effect, the change is:
OLD:
<H1>100{gOtamasa}1{gOtamasa}(?)¦
NEW:
<H1>100{gOtamasa}1{gOtamasa(?)}¦
This is similar to the way some other special symbols, such as `*`, are currently shown as part of the key2 field in PW, such as:
<H1>100{aMSakaraRa}1{*aMSakaraRa}¦ •n. {%Theilung%} PW8
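The OLD/NEW change above can be sketched as a single regular-expression substitution. This is only an illustration under the assumption that all 13 lines follow the `<H1>XXX{KEY1}1{KEY2}(?)¦` shape shown. The name `move_marker_into_key2` is hypothetical.

```python
import re

# Matches a headline whose key2 field is followed by (?) or (!) before the broken bar.
EDITORIAL = re.compile(r"^(<H1>\d{3}\{[^}]*\}\d\{[^}]*)\}(\([?!]\))¦")


def move_marker_into_key2(line: str) -> str:
    """Move a trailing (?) or (!) inside the closing brace of the key2 field."""
    return EDITORIAL.sub(r"\1\2}¦", line)


old = "<H1>100{gOtamasa}1{gOtamasa}(?)¦ •Adj."
print(move_marker_into_key2(old))
```

Lines without such a marker, including those whose key2 already starts with `*`, are left unchanged.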
Maybe it's time to kill ¦ as well, or no need?
Is the highlighted mark a smudge or an accent? If accent, which kind (udAtta, anudAtta, svarita)?
We could kill ¦ in the displays if it is deemed offensive. But I think we should keep it in the digitization. Reason: it helps identify the part of the text that is viewed as the headword.
Question on 3rd form of 'tva'
त्व॒, तु॒अ॒ tva̱, tu̱a̱ 3 anudattas. They mean this word is not accented (can't be accented).
In the print, German text can be italicized or not (see above).
In the digitization pw.txt, the italic text is coded in the same way as in all the dictionaries: `{%X%}`.
However, the non-italic German text is also delimited by markup: `‹X›` (single left/right-pointing angle quotation marks). For example: `‹Ein Relativum vor dem zweiten›` for part of the above 'tva' example.
This coding of non-italic text is, I think, unique to the pw.txt digitization; it does not occur in pwg.
I think we should drop markup of non-italic German text.
Comments?
3 anudattas
Have you noticed any other anudatta accents in pw dictionary? The pw.txt digitization codes udatta and svarita, but I don't identify any instances of anudatta being coded. It would seem odd if anudatta occurs only in this entry, that's why I ask if there are other instances in the print.
<sic>
There are 28 cases where the `<sic>` markup appears in the pw.txt digitization. Based on examination of the first couple of cases, this is markup that Thomas added -- it is not part of the printed text.
My inclination is to delete the 'sic' markup.
Here is the common interpretation of 'sic'.
A Latin word for “thus,” used to indicate that an apparent error is part of quoted material and not an
editorial mistake:
“The learned geographer asserts that 'the capital of the United States is Washingtown [sic].'”
However, at least in some of the cases, Google translate of the prior word(s) shows no obvious error.
It would be good for someone to check all these cases, and evaluate whether any of the 'sic' coding should be retained.
Here is a raw listing of the lines in pw.txt containing `<sic>`. If you examine any of these cases, please feel free to edit the gist in whatever way seems helpful.
Have you noticed any other anudatta accents in pw dictionary?
No, I didn't, 'cause I never use this dic. But a quick revision brings more examples: मे॒1; नौ॒1 ; व॒1 ; च॒1
We could kill ¦ in the displays if it is deemed offensive. But I think we should keep it in the digitization.
Agree and understand.
This coding of non-italic text is, I think, unique to the pw.txt digitization; it does not occur in pwg.
Ok, but is it strong enough argument? If I want to search the German only part, will I be able to do it after killing the markup as well?
The raw listing needs attention of @fxru or @zaaf2 .
Thanks for other examples of anudAtta in print.
Looking 'nO' example, we see that the printed text shows anudAtta for homonym 1 and udAtta for homonym 2.
In the digitization, the udAtta is coded, but not the anudAtta.
I double-checked the original digitization from Thomas, and there also anudAtta is not coded.
This is a new fact about the pw.txt digitization.
In the digitization, there is a 3-digit numeric code for each entry. This code was introduced by Thomas, using some undocumented algorithm; it is not part of the printed text. The meta-line format of the digitization maintains this code, for the sake of information preservation. Here is how the coding looks in the original digitization and in the meta-line conversion:
ORIGINAL
<H1>000{a}1{a}^1¦ ‹Pron. der 3ten Person. Davon›
<H1>100{aMSitA}1{aMSitA}¦ •f. {%das Erbesein , ...
CURRENT (the <e> field in the meta-line)
<L>1<pc>1001-1<k1>a<k2>a<h>1<e>000
<L>19<pc>1001-1<k1>aMSitA<k2>aMSitA<e>100
Here is a summary of the codes, the frequency of occurrence, and likely meanings for two codes. The meaning of the other codes is not known.
code | frequency | meaning |
---|---|---|
000 | 3707 | |
001 | 613 | |
004 | 420 | |
100 | 126461 | Adjective or Noun |
107 | 735 | |
108 | 561 | |
500 | 3286 | Verb |
501 | 1 | |
999 | 3 | |
While the idea of classifying entries by something like part of speech is useful, my preliminary impression is that these codes need improvement to be of real use. But such improvement is a task for another day.
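A frequency summary like the one above could be reproduced from the meta-lines roughly as follows. This is a sketch that assumes every meta-line starts with `<L>` and uses only the fields visible in the examples (`L`, `pc`, `k1`, `k2`, `h`, `e`). `parse_meta` and `tally_codes` are hypothetical helper names.

```python
import re
from collections import Counter

# One "<tag>value" pair per field; values are assumed not to contain '<'.
META_FIELD = re.compile(r"<(L|pc|k1|k2|h|e)>([^<]*)")


def parse_meta(line: str) -> dict:
    """Split a meta-line such as <L>1<pc>1001-1<k1>a<k2>a<h>1<e>000 into fields."""
    return dict(META_FIELD.findall(line.strip()))


def tally_codes(lines) -> Counter:
    """Count how often each 3-digit <e> code occurs among the meta-lines."""
    return Counter(
        parse_meta(line).get("e") for line in lines if line.startswith("<L>")
    )


sample = [
    "<L>1<pc>1001-1<k1>a<k2>a<h>1<e>000",
    "<L>19<pc>1001-1<k1>aMSitA<k2>aMSitA<e>100",
]
print(tally_codes(sample))
```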
@jmigliori Here is a case where Greek text was missed in PW, due to a digitization error. Would you fill it in?
ταῦρος
@jmigliori Got it. Thank you!
Here are the first batch of markup normalization changes for pw.txt. The idea is to make the markup
<ls>
)Here are the current categories:
- `[Page1.001-2]` -> `[Page1001-2]` (remove the period after the volume digit).
- Remove 'PW#' codes. At the end of each entry in the original pw digitization, Thomas put a sequence number in the form `PW<number>`. These are almost identical to the current L-number (Cologne record identifier), which appears in the 'meta' line of the new format. For example:
<L>20<pc>1001-1<k1>aMSI<k2>aMSI<e>100
OLD:
{#aMSI}¦ •Adv. ‹mit› #{kar} {%theilen.%} PW20
NEW:
{#aMSI}¦ •Adv. ‹mit› #{kar} {%theilen.%}
<LEND>
- `<g>X</g>` -> `<lang n="greek">X</lang>`
- `<R>X</R>` -> `<lang n="russian">X</lang>`
- `<A>X</A>` ->
- `#{X}` -> `{#X#}` pw.txt markup for Devanagari (in SLP1). Bring coding to form of other dictionaries.
- `¯X` -> `<ls>X</ls>` Literary source coding. The `<ls>` form is currently used in MW and ACC.
- `…` -> ` ` (ellipsis character to space character). In earlier digitizations (notably MW, pw, pwg), Thomas used the ellipsis as a sort of 'sticky space', meaning that words joined by ellipses stick together into some kind of semantic unit. However, this markup is erratically used, and not currently useful.
- `<sic>` -> `<sic/>` The slash indicates a well-formed empty xml element.

In the digitization, the udAtta is coded, but not the anudAtta.
Cool.
Remove 'PW#' codes. At the end of each entry in the original pw digitization, Thomas put a sequence number in the form `PW<number>`. These are almost identical to the current L-number (Cologne record identifier), which appears in the 'meta' line of the new format.
Kill 'em?
However, this markup is erratically used, and not currently useful.
Was not aware. It was not documented in MW.
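For reference, the normalization categories listed earlier in this thread can be sketched as an ordered list of regular-expression substitutions. This is an illustration only, not the actual conversion script. The extent of a literary source after `¯` (taken here as the run of characters up to the next space or closing parenthesis) is a guess, the `<A>` rule is omitted because its target form is not given, and `RULES` and `normalize` are hypothetical names.

```python
import re

# (pattern, replacement) pairs, applied in order to one line of pw.txt.
RULES = [
    (re.compile(r"\[Page(\d)\.(\d+-\d+)\]"), r"[Page\1\2]"),        # drop period after volume digit
    (re.compile(r"\s*PW\d+\s*$"), ""),                              # trailing PW<number> code
    (re.compile(r"<g>(.*?)</g>"), r'<lang n="greek">\1</lang>'),
    (re.compile(r"<R>(.*?)</R>"), r'<lang n="russian">\1</lang>'),
    (re.compile(r"#\{(.*?)\}"), r"{#\1#}"),                         # Devanagari (SLP1) markup
    (re.compile(r"¯([^\s)]+)"), r"<ls>\1</ls>"),                    # literary source (extent assumed)
    (re.compile(r"…"), " "),                                        # 'sticky space' ellipsis
    (re.compile(r"<sic>"), "<sic/>"),
]


def normalize(line: str) -> str:
    """Apply every rule in RULES to a single line."""
    for pattern, replacement in RULES:
        line = pattern.sub(replacement, line)
    return line


print(normalize("•Adv. ‹mit› #{kar} {%theilen.%} PW20"))
```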
A potential explanation about need of encoding anudAtta in single vowel words.
As per Sanskrit grammar, in a word, whatever is not udAtta or svarita is treated as anudAtta (अनुदात्तं पदमेकवर्जम्). So when someone marks some vowel as udAtta or svarita, we understand that rest are anudAtta.
Single vowel words present a separate difficulty. So they need to be marked.
<H1>501{khid}1{khid}¦ ²1) #{khidaªti} ‹und› #{*khindati} {%*drücken…,…niederdrücken…,…betrüben.%} ²2) #{*khintte…,…khidyate…,…khidyati} ‹und› #{khidati} (¯BHA7G.P.). {%sich…gedrückt…fühlen…,…niedergeschlagen…sein…,…sich…Etwas…zu…Herzen…nehmen…,…eine…Qual…empfinden…;…eine…Ermüdung…~…,…eine…Erschlaffung…verspüren.%} #{khinna} {%niedergedrückt…,…niedergeschlagen…;…ermüdet…,…erschlafft.%}
`khid` is dhatu, so I think 501 is an error and should be 500.
107 and 108 are Adj. and noun as well, but can't grasp the difference with 100.
999 I guess the coder forgot and made something unique so he can kill it after.
000, 001, 004 Adj. and noun as well. Was thinking that some occur only as last part of a composita or contain an upasarga, but not.
Kill em? [PW codes]
Yes, that's what I'm doing. They are duplicative of the L-number codes.
I think 501 is an error
Since there's only one, and it looks like a root, I agree. Will make the change.
anudatta in single-vowel words.
@drdhaval2785 mentions that this category of words is where an explicit anudatta marking might occur in the printed text of PW. By a quick examination of the scan for these headwords, we could decide if the digitization is missing an anudAtta coding or not. This would be simple, though time-consuming.
For possible future reference, the 'single_vowel_words.txt' file in this gist shows the
meta-line for all headwords in pw.txt with just one vowel in the 'key1' headword spelling.
Also, the last field says (by examining the key2 field) whether the vowel is coded with an accent or not.
3022 single vowel words
svarita 4
udAtta 290
anudAtta 1 (This is 'tva\' mentioned above)
NOACCENT 2727
1948 of these have group code '500', so are probably roots.
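A rough sketch of how such a classification might be computed from key1 and key2. The SLP1 vowel inventory is standard. The accent characters are partly assumed: `/` for udAtta and `\` for anudAtta appear in the examples in this thread, while `^` for svarita is an assumption. `vowel_count` and `accent_of` are hypothetical names, not the script that produced the gist.

```python
# SLP1 vowels; each is a single character (E = ai, O = au).
SLP1_VOWELS = set("aAiIuUfFxXeEoO")

# '/' and '\' are seen in this thread (aMSuma/nt, tva\); '^' for svarita is assumed.
ACCENTS = {"/": "udAtta", "^": "svarita", "\\": "anudAtta"}


def vowel_count(key1: str) -> int:
    """Number of vowels in an SLP1 headword spelling."""
    return sum(ch in SLP1_VOWELS for ch in key1)


def accent_of(key2: str) -> str:
    """Name of the first accent mark found in key2, or NOACCENT."""
    for ch in key2:
        if ch in ACCENTS:
            return ACCENTS[ch]
    return "NOACCENT"


for k1, k2 in [("tva", "tva\\"), ("akz", "akz"), ("aMSumant", "aMSuma/nt")]:
    print(k1, vowel_count(k1), accent_of(k2))
```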
Examples of anudatta in multi-syllable words: e̱na̱ 1, e̱na̱ 2 und e̱nā̱, ba̱ta̱ 1.
Also I suppose all the headwords described as "enklitischer" must be marked with anudattas. However the word vas 1 enklitischer is not marked. Maybe print error.
`*` in PW
There are many instances in PW where words are preceded by an asterisk. Sometimes the word is Sanskrit, sometimes the word is German.
Example from Page 1:
**What is the significance (meaning) of all these `*`?**
Is the usage documented by the author somewhere in the Front Matter for PW ?
Had to reread the prefaces.
If * before a literary source - it is quoted rarely. In Preface of 1st volume PWG (1855).
(Gedruckte Werke aus der Sanskrit-Literatur, die nur ganz gelegentlich citirt werden, sind mit einem Sternchen bezeichnet.)
If * before a Sanskrit word, German meaning - it is "invented" by a grammarian or lexicographer and is not met in any other literary source. In Preface of 1st volume PWK (1879).
Ein Wort, eine Bedeutung, eine Construction oder ein Genus, die bis jetzt nur von Grammatikern oder Lexicographen aufgeführt werden, sind mit * bezeichnet worden.
I would ask you Jim to add markup to such case as well:
Zwei Zahlen ohne Angabe eines Buches verweisen auf die zweite Auflage
meiner Chrestomathie.
Two digits without mentioning of a book = Boethling's Chrestomathie 2nd ed.
Der am Ende eines Titels in Klammern stehende Name bezeichnet den Gelehrten, der die Beiträge für dieses Wörterbuch aus dem angegebenen Buche ganz oder zum grössten Theile geliefert hat.
Surnames (Capeller, Delbruck, Garbe, Geldner, Jolly, Leskien, Muir, Pischel, Schiefner, Schroder, Windisch; Kern, Weber, Stenzler - are they there?) after literary sources mean that they were provided by that scholar.
If * before a Sanskrit word, German meaning - it is "invented" by a grammarian or lexicographer and is not met in any other literary source. In Preface of 1st volume PWK (1879).
Not necessarily "invented". Bohtlingk does not say so. Just "not met". For the readers this means: be careful! there is a danger of a false word. This corresponds to MW's mark "L."
Thanks for clarification of meaning of asterisk.
Regarding markup, the digitization has those 'asterisks'. That seems adequate currently. For instance, one could search for all headwords whose key2 begins with an asterisk.
German text within PW appears sometimes italicized, sometimes non-italicized. What is the meaning of the difference? My suspicion is that the italicized text is in the nature of a translation of the sense of a word; while the non-italic text pertains to meta-information about the word, such as details about its grammatical forms or the forms of other words used with the word.
See also the question above -- I still think the digitization markup of non-italic German is superfluous, and am inclined to remove it in this round of housekeeping.
There are several kinds of subdivisions within the PW text. In the digitization, Thomas did a lot of work to identify and mark these subdivisions. The main problem I have with the markup is that it is obscure. In the construction of the xml form and the display of the xml form, these markups have been converted to more usable forms. In this present work on the pw.txt digitization, I'm pushing this change of notation down to pw.txt.
The most prevalent kind of subdivisions appear in the text as `1)` etc., `a)` etc. and `α)` etc. The number sequence is top level; a letter sequence of subsections may be embedded within a particular one of the number sequence sections; and occasionally a sequence of greek-letter subdivisions may be embedded within a particular one of the letter sequence sections. It is like an outline with three layers of indentation.
Type | Thomas notation | xml notation |
---|---|---|
number | ²1) ²2), etc | <div n="1">— 1) |
letter | ³a), ³b), etc | <div n="2">— a) |
greek | ¹a), ¹b), etc. | <div n="3">— α) , |
Notes:
- `1) xxxx — 2) xxxxx — 3) xxxxx`. Thomas' notation did not include any of the mdashes. The xml notation includes the mdash even for the first member of the sequence, e.g. `<div n="1">— 1) xxxx <div n="1">— 2) xxxx <div n="1">— 3)`.
- The `n="1"`, `n="2"`, `n="3"` attributes of the div element in the xml notation are not strictly necessary, since they could be inferred by analyzing X in the pattern following the div tag: `— X)` (e.g., if X is a digit sequence, then n must be '1', etc.). However, using the attribute makes the processing easier.
- `<div>` elements always start on a new line. This is not strictly necessary, but adding these line breaks makes individual lines of the pw.txt digitization shorter and more coherent, and thereby makes this form of the digitization easier to understand.

Here is a before/after example of the markup changes thus far, including the divisions. This is for entry aMSumant.
OLD: (one line in pw.txt -- I've split it for the purpose of this comment)
<H1>100{aMSumant}1{aMSuma/nt}¦ ²1) •Adj. ³a) {%reich an †Soma-Pflanzen oder -Saft.%} ³b)
{%faserig.%} ³c) {%strahlenreich.%} ²2) •m. ³a) {%die Sonne%} ¯250,18. ³b) ‹N.pr.› ¹a) ‹verschiedener
Männer› ¯106,18. ¹b) ‹eines Berges.› ²3) •f. #{°matI} ³a) {%®Hedysarum_gangeticum.%} ³b) ‹N.pr. eines
Flusses.› PW30
NEW: (including also the enclosing 'meta' lines)
<L>30<pc>1001-2<k1>aMSumant<k2>aMSuma/nt<e>100
{#aMSuma/nt#}¦
<div n="1">— 1) •Adj.
<div n="2">— a) {%reich an †Soma-Pflanzen oder -Saft.%}
<div n="2">— b) {%faserig.%}
<div n="2">— c) {%strahlenreich.%}
<div n="1">— 2) •m.
<div n="2">— a) {%die Sonne%} <ls>250,18.</ls>
<div n="2">— b) ‹N.pr.›
<div n="3">— α) ‹verschiedener Männer› <ls>106,18.</ls>
<div n="3">— β) ‹eines Berges.›
<div n="1">— 3) •f. {#°matI#}
<div n="2">— a) {%®Hedysarum_gangeticum.%}
<div n="2">— b) ‹N.pr. eines Flusses.›
<LEND>
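The OLD-to-NEW division rewrite in this example can be sketched as below. This is a simplified illustration, not the actual conversion code: it assumes the markers always look like `²1)`, `³a)`, `¹a)`, and the Greek mapping covers only the two letters attested in the example (other letters are passed through unchanged). `convert_divisions` is a hypothetical name.

```python
import re

LEVEL = {"²": "1", "³": "2", "¹": "3"}   # Thomas' marker -> div level
GREEK = {"a": "α", "b": "β"}             # only the letters attested in the example

# A marker, then a number or lowercase letters, then ')'; leading spaces are absorbed.
DIV = re.compile(r"\s*([²³¹])([0-9]+|[a-z]+)\)")


def convert_divisions(text: str) -> str:
    """Rewrite Thomas' division markers as <div> elements, each on a new line."""
    def repl(match: re.Match) -> str:
        level = LEVEL[match.group(1)]
        label = match.group(2)
        if level == "3":
            label = GREEK.get(label, label)
        return f'\n<div n="{level}">— {label})'
    return DIV.sub(repl, text)


old = "{#aMSuma/nt#}¦ ²1) •Adj. ³a) {%faserig.%} ²2) •m. ³b) ‹N.pr.› ¹a) ‹verschiedener Männer›"
print(convert_divisions(old))
```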
The second major subdivision of entries is for the prefixed form of roots. Rather than having a separate entry for gam, upagam, udgam, etc., this dictionary has one entry for gam, and then a slew of subdivisions of the gam entry for the different prefixed forms of gam. Again, Thomas has done most of the hard work of identifying and marking these prefix subdivisions. I'm merely changing the notation.
Type | Thomas notation | xml notation |
---|---|---|
prefix | <+> {#nis#} | <div n="p">— Mit {#nis#} |
Full example, of root akz:
OLD:
<H1>500{akz}1{akz}¦ , #{*akzati} ‹und› #{*akznoti} , •Partic. #{azwa} ²1) {%erreichen , erlangen%}: #{AkzARa/}. ²2) {%durchdringen , erfüllen.%}
<+> #{nis} {%entmannen , verschneiden.%}
<+> #{sam} ‹(› #{akzase}) {%durchdringen.%} PW249
NEW: (so far)
<L>249<pc>1003-3<k1>akz<k2>akz<e>500
{#akz#}¦ , {#*akzati#} ‹und› {#*akznoti#} , •Partic. {#azwa#}
<div n="1">— 1) {%erreichen , erlangen%}: {#AkzARa/#}.
<div n="1">— 2) {%durchdringen , erfüllen.%}
<div n="p">— Mit {#nis#} {%entmannen , verschneiden.%}
<div n="p">— Mit {#sam#} ‹(› {#akzase#}) {%durchdringen.%}
<LEND>
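Once prefix subdivisions are in the `n="p"` form, listing the prefixed forms of each root becomes fairly mechanical. The sketch below assumes the layout of the NEW example (a meta-line starting with `<L>`, one `<div>` per line, `<LEND>` closing the entry) and captures only the first `{#...#}` after `Mit`. `prefixed_forms` is a hypothetical helper.

```python
import re

PREFIX_DIV = re.compile(r'<div n="p">— Mit \{#([^#]+)#\}')
K1 = re.compile(r"<k1>([^<]*)")


def prefixed_forms(lines) -> dict:
    """Map each headword (k1) to the list of prefixes found in its n="p" divisions."""
    result = {}
    root = None
    for line in lines:
        if line.startswith("<L>"):
            match = K1.search(line)
            root = match.group(1) if match else None
        elif line.startswith("<LEND>"):
            root = None
        elif root is not None:
            match = PREFIX_DIV.match(line)
            if match:
                result.setdefault(root, []).append(match.group(1))
    return result


entry = [
    "<L>249<pc>1003-3<k1>akz<k2>akz<e>500",
    "{#akz#}¦ , {#*akzati#} ‹und› {#*akznoti#} , •Partic. {#azwa#}",
    '<div n="1">— 1) {%erreichen , erlangen%}: {#AkzARa/#}.',
    '<div n="p">— Mit {#nis#} {%entmannen , verschneiden.%}',
    '<div n="p">— Mit {#sam#} ‹(› {#akzase#}) {%durchdringen.%}',
    "<LEND>",
]
print(prefixed_forms(entry))
```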
Thomas marks several other likely subdivisions. These occur much less frequently, and there are more variations in some of the details of markup. Also, some of these markup patterns have been identified as divisions by me, and may be unwarranted. All of these have been marked as division type 'm' (for miscellaneous); there are 2622 of them at this writing.
Type | Thomas notation | xml notation | count |
---|---|---|---|
Causal | <Caus.> | <div n="m">— •Caus. | 1750+ |
Intensive | ‹--› ‹•Intens.› | — •Intens. | 280 |
See | ‹--› ‹Vgl.› | — •Vgl. | 105 |
Partic ? | ‹--› ‹Partic.› | <div n="m">— •Partic. | 35 |
Incorrect for | ‹--› ‹Fehlerhaft für› | — Fehlerhaft für | 5 |
Notes:
The word `Fehlerhaft` occurs 88 times.

The xml notation shows the Greek letters directly.
Converted on the fly?
However, using the attribute makes the processing easier.
Exactly.
makes individual lines of the pw.txt digitization shorter
And that's important. Too many difficulties already, we do not want to increase them.
•Adj.
What is the •
and why it's left?
a separate entry for gam, upagam, udgam
Can we have a full list of the sopasarga forms now, Jim?
I don't know the word for which Partic. is an abbreviation. GUESS: Participle?
Yes, all kinds of participles. Not only udakta, but also adAna, dRzyamAna.
Is there a location in the front matter or elsewhere that lists such abbreviations?
No. There are only literary sources. The rest was obvious in 1850. Is it not so for you, Jim? :fallen_leaf:
Suspicious that there are only 5; the word Fehlerhaft occurs 88 times.
I have researched it in the past. There are many words used with the same meaning. One of such (not popular, but still), lies
.
What is the • and why it's left?
It is markup added by Thomas. I think it is attached to words which are abbreviations of grammatical terms. I'll be changing these to `<ab>Adj.</ab>` (standard notation for abbreviations), and will generate a list at that time.
full list of the sopasarga forms ?
Should be readily generated from the new form. Since the general pattern will be like the 'n="p"' pattern shown above.
Yes, all kinds of participles.
Thanks for info.
No list of abbreviations like `Partic.` by author.
Too bad. But an opportunity for us to generate a list that will aid modern readers.
Great.
Should be readily generated from the new form.
Hurray!
But an opportunity for us to generate a list that will aid modern readers.
Exactly.
There are many aspects that have arisen in the course of converting the AS (number-letter) coding within pw.txt to modern IAST. I'm classifying the contexts in which AS coding occurs in three parts:
Although I've made about 300 miscellaneous corrections in the course of the work thus far, I'm sure that there are other spelling errors in the last two (non-`<ls>`) groups. There is a gist list iast_check1.txt.
The list has 4998 cases (a small number of these are duplicates). The ✓ in a case indicates that the spelling is probably correct, since the spelling (when converted to SLP1) appears as a headword in pw. There are 2933 of these, and 2065 cases marked TODO. Each case also shows the frequency of occurrence within pw.txt. Taking frequencies into account, there are 26818 text instances that are DONE (marked with ✓), and 4910 that are marked TODO and remain to be checked (so about 15% of the words are unaccounted for among the pw headwords).
The TODO items are further divided into frequently occurring spellings (3 or more instances), and these are marked with an asterisk: TODO*. These 255 cases are the most important, in the sense that they account for nearly half of the instances.
There are also some fairly obvious misspellings (obviousness is in the eye of the beholder), e.g., variants of Kṛṣṇa.
It would be good to get many of the TODO cases examined and corrected by eye.
- `Durgā : 449 : TODO* :` (PW has this under the adjective Durga); these can be marked as OK: `Durgā : 449 : TODO* : OK`.
- `Durgá : 1 : TODO : Durgā` (The accent was no doubt a misreading of the circumflex, which is what the pw printed text uses for long vowels in his peculiar IAST.)
- `Aussehen : 1 : TODO :` is of this type; maybe the solution can be to flag as OTHER: `Aussehen : 1 : TODO : OTHER`.

These corrections could be made directly within the gist list, or in a local copy of the gist if that's more convenient. I could make the gist 'Public' if that seems helpful.
If @SergeA has some time to examine, he can probably do many of the TODO cases quite readily. Others are welcome to join in!
There will certainly be some which can't be determined 'by eye' -- they will need a UI type environment so that the print and context can be examined readily. Maybe I'll consider such a UI when the obvious cases are handled.
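As a small aid for anyone working through the list, the case lines could be tallied by status along these lines. This sketch assumes every line has the colon-separated shape `spelling : frequency : status : note` seen in the TODO examples. How ✓ cases are laid out in the gist is not shown here, so that part is a guess. `parse_case` and `tally_instances` are hypothetical names.

```python
from collections import Counter


def parse_case(line: str):
    """Split 'spelling : frequency : status : note' into its four fields."""
    parts = [part.strip() for part in line.split(":")]
    parts += [""] * (4 - len(parts))          # the note may be missing
    spelling, freq, status, note = parts[:4]
    return spelling, int(freq), status, note


def tally_instances(lines) -> Counter:
    """Sum the text-instance frequencies for each status value."""
    totals = Counter()
    for line in lines:
        _, freq, status, _ = parse_case(line)
        totals[status] += freq
    return totals


cases = [
    "Durgā : 449 : TODO* :",
    "Durgá : 1 : TODO : Durgā",
    "Aussehen : 1 : TODO :",
]
print(tally_instances(cases))
```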
I could make the gist 'Public' if that seems helpful.
Yes.
Gist list should now be public. I guess that means it is open to collaboration and editing. Someone else should give a try to editing it.
Under headword 'yadi', Thomas expanded an abbreviation `M.` to `Mānavadharmaśāstra`.
Current coding is:
<div n="1">— 1) {%wenn%} , ‹mit Indic. , Conj. , Pot. und Fut. in der älteren Sprache› ; ‹gewöhnlich
einfacher Nachsatz ohne Partikel.› {#ya/di cit , yadi ha vE , ya/dI/t , ya/dyu#} ‹(35 , 25.36 , 23)› , {#yadyu
vE#}. ‹In den späteren Werken (von <is1>Mānavadharmaśāstra</is1> an)›
Should we leave the expansion or revert to the `M.` of the print?
Here is another example where Thomas expanded an abbreviated word. Clearly the two `J.` of the printed text refer to the previous `Jaǵus` (modern iast `Yajus`). In this case, I think it may be helpful to leave the expansion. What do others think?
From pw.txt:
{#yajuzwa/s#}¦ •Adv. {%von Seiten des †Jag4us , in Beziehung auf das †Jag4us ,
im Gebiete des †Jag4us%} <ls>21,2.</ls>
<ls>A7PAST.C2R.9,16,4.</ls>
Apparently Thomas did a lot of editing of the pw digitization back in 2005 or so; since the resulting digitization gives no typographical clue with regard to such abbreviation expansions, there's no systematic way to search for them. But it may be helpful to know this feature of the digitization, when, as here, we stumble upon such a case.
Should we leave the expansion or revert to the M. of the print ?
I would stay with print.
In this case, I think it may be helpful to leave the expansion.
One case against 45k does not change a bit. I would not mix.
Apparently Thomas did a lot of editing of the pw digitization back in 2005
That's interesting. If there are hundreds of such, it would be a pity to kill them, but if rare...
Reverted expansions of M. and J. back to M. and J., in agreement with text. @gasyoun Thanks for feedback.
The check_dot list contains items in the pw.txt digitization that Thomas marked with the • (unicode BULLET) character. There are only 77 distinct entries in this file, but 200,000 or so instances of these in the digitization.
The Wikipedia article on German abbreviations has several of these.
Maybe @gasyoun could provide German and English Translations for these, which could then be
used as Tooltips in the displays.
The intent is to change the markup of these in pw.txt to the xml form `<ab>X</ab>`.
In doing so, there are several subquestions.
`•*X`
Many of these abbreviations occur in two forms in the list: `•X` and `•*X`. From the discussion above of the meaning of `*`, the `*` is really a separate piece of information, which says something about the legitimacy of the following word. So, I think it would be more accurate if Thomas had used the coding `*•X`, since the `*` is commenting on the abbreviation.
Thus, the proposed coding of `•*X` is `*<ab>X</ab>`.
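The proposed recoding of `•X` and `•*X` can be sketched as one substitution. The shape of an abbreviation token (ASCII letters with optional internal and final periods) is an assumption. Special cases discussed in this thread, such as gaṇa or the gender abbreviations, would need their own handling before or after this step. `convert_bullets` is a hypothetical name.

```python
import re

# Bullet, optional asterisk, then a token such as Adj. / m. / s.u. (token shape assumed).
BULLET_AB = re.compile(r"•(\*?)([A-Za-z]+(?:\.[A-Za-z]+)*\.?)")


def convert_bullets(text: str) -> str:
    """Rewrite •X as <ab>X</ab> and •*X as *<ab>X</ab>."""
    return BULLET_AB.sub(lambda m: f"{m.group(1)}<ab>{m.group(2)}</ab>", text)


print(convert_bullets("•Adj. {%faserig.%} •*m. {%Wind.%}"))
```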
Occurs 603 times.
This is not an abbreviation. An example usage is `{#*aRIva#}¦ •gan2a {#zuBrAdi#}.`, which says, I think, that the word aRIva is in the word-collection `zuBrAdi`.
This gaṇa information is also present in MW, and probably several other dictionaries. However we have not developed markup for it in any dictionary. I would like to find the source document where these gaṇas are defined; presumably, the other words in a particular gana would shed light on the possible meaning of the particular word.
In the example above, the headword itself is marked with asterisk; and indeed almost all of the instances of •gaṇa are for headwords similarly marked with asterisk. It is interesting to contrast this usage with that of MW. When searching for 'gana' in the text of MW (Advanced Search), the first example is 'ajasraM' ind. perpetually, for ever, ever. [गण स्वर्-आदि, &c ]. But in pw.txt, under 'ajasra', there is no mention of a gana.
My inclination as to how to recode gana is: `•gaṇa` as `<is>gaṇa</is>` (The `<is>` tag is being used to identify the iast-sanskrit words in pw.txt).
There are only 3 instances of `•Patronn.` (patronymic). I think these should be considered print errors in favor of the more common 1-n version.
There are only 7 `•Beinn.` instances. I think these should be changed in favor of `•Bein.`
I think the » should be dropped - it represents nothing in the printed text, and may have been added just because the grapheme » appears to be pointing to something, and the abbreviation s.u. in German means roughly 'see under' (some following headword).
<lex>
By far the most common abbreviations in check_dot are `•Adj.`, `•m.`, `•f.`, `•n.`. In MW, the gender information for nominals is marked with `<lex>` (e.g. `<lex>m.</lex>`).
Perhaps we should use this `<lex>` tag in PW for these 4 abbreviations.
What do others think?
Nearly all of these occur followed by Comp. Example under hw aMSukAnta,
{#aMSukAnta#}¦ •m. {%Zipfel eines Gewandes , ~ Tuches%} <ls>296,10</ls> ‹(am Ende eines› •adj. ‹Comp.› •f. {#A#}).
Maybe this pair of words should be a separate abbreviation: <ab>adj. Comp.</ab>
I'm unsure of the meaning.
I think it would be more accurate if Thomas had used the coding *•X
Agree
source document where these gaṇas are defined
Dhaval? @drdhaval2785
I think these should be considered print errors in favor of the more common 1-n version.
Agree.
I think these should be changed in favor of •Bein.
Agree, Bein = a leg, and there is no such word as Beinn and never was.
I think the » should be dropped
Hmm, maybe not dropped, but moved to XML? Because it would later give as hint with hyperlinks - what can and should be linked.
Perhaps we should use this `<lex>` tag in PW for these 4 abbreviations.
Totally agree.
adj. Comp.
Might be a good idea. The `am Ende` (2136 cases) text means that when the headword is the 2nd part of a word, it has this ending. Anyway it makes more sense than just `<gram n="adj">adj.</gram> <noti>Comp.</noti>`.
`Am Anfange` is the opposite (160 cases; strange, I would suppose there are thousands of them). And I found a combination of both as well, `am Anfange und am Ende einiger Compp.` What I see here is `Compp.` instead of the expected `Comp.`
check_dot4.txt in gist shows distribution of markup as described above:
- `<ab>` The ones marked as simple abbreviations.
- `<lex>` The ones marked as lexical categories (gender/adj).
- `<is>` This is the coding just for gaṇa (not treated as abbreviation).
- For instance, under headword aMSaka: `<div n="1">— 2) *<lex>n.</lex> {%Tag.%}`
- It is marked as an abbreviation `<ab>s.u.</ab>` and the special » character removed. I don't think removing » causes any information loss. We will still be able to analyze such cases further, by examining the following word. Here is an example:
{#aRvI#}¦ <ab>s.u.</ab> {#aRu#}.
document where these gaṇas are defined;
https://github.com/drdhaval2785/SanskritVerb/blob/master/Data/gaNapATha_SLP.txt
@drdhaval2785 Could you write a 'readme' type file that explains how to read the gaNapATha file?
For instance, in the PW example given above, `{#*aRIva#}¦ <is>gaṇa</is> {#zuBrAdi#}.`, what is the list corresponding to `zuBrAdi`? Similarly, from the MW example, how to find the स्वर्-आदि gaṇa?
There is not much to write to readme.
I will write it here itself.
SuBrAdiByaSca 4.1.123
SuBra","vizwapura","brahmakfta","SatadvAra","SatAvara","SatAvara","SalAkA","SAlAcala","SalAkABrU","leKABrU","vimAtf","viDavA","kiMkasA","rohiRI","rukmiRI","diSA","SAlUka","ajabasti","SakanDi","lakzmaRaSyAmayor vAsizWe","goDA","kfkalAsa","aRIva","pravAhaRa","Barata","BArama","mukaRqu","maGazwu","makazwu","karpUra","itara","anyatara","AlIQa sudatta","sucakzas","sunAman","kadru","tuda","akASApa","kumArIkA","kiSorikA","kuveRikA","jihmASin","pariDi","vAyudatta","kakala","KawvA","ambikA","aSokA","SudDapiNgalA","KaqonmattA","anudfzwi","jaratin","bAlavardin","vigraja","vIja","Svan","aSman","aSva","ajira|
Here the first line is the Astadhyayi rule and number which refers to this gaNa. Second line is blank. Third line is list of words in that gaNa.
I am not sure from where I got this file.
Third line is list of words in that gaNa.
Every ganapatha has same number of lines?
This issue is devoted to meta-line conversion of PW dictionary.
The markup of the Cologne digitization of this dictionary is quite complex. In addition to adapting the form of the digitization to the meta-line form, attention will be given to making the markup less idiosyncratic, while maintaining informational equivalence with the original markup that Thomas Malten devised.
I anticipate this will be a rather lengthy process. I will aim to indicate in this issue (and perhaps related issues) all of the changes and choices made in the process of markup conversion.
It will likely be relevant to consider the markup of PWG dictionary during this process, as Thomas did the digitizations of both PWG and PW at about the same time, ca. 2005/6 and used many of the same coding conventions in both dictionaries.