sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

unmarked abbreviations #88

Closed funderburkjim closed 7 months ago

funderburkjim commented 2 years ago

There are quite a few unmarked abbreviations in pw.txt. Derive a procedure for identifying and marking many of these.

funderburkjim commented 2 years ago

This work done in abbrev directory.

It is known that there remain unmarked abbreviations within text marked as italic. (These were intentionally avoided in current work, due to potential interaction with possible automated english translation work).

funderburkjim commented 2 years ago

italic text.

Decided to add <ab> markup to italic text fragments.

There are over 170000 italic text fragments (135000 entries) in pw.

In these about 10500 additional abbreviation markups were added (9800 of them for 'best.' bestimmte - a certain (kind of)).
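A minimal sketch of how such marking inside italic fragments could be done, assuming italic text is delimited by {% and %} as in the pw.txt lines quoted in this thread. The function name and the sample line are invented for illustration; this is not the script actually used in the abbrev directory.

```python
import re

# Italic fragments are assumed to be delimited by {% ... %}.
ITALIC = re.compile(r"\{%(.*?)%\}")

def mark_in_italics(line, abbrevs):
    """Wrap each listed abbreviation in <ab>...</ab>, but only inside italic fragments."""
    def mark_fragment(frag):
        text = frag.group(1)
        for ab in abbrevs:
            # The lookbehind skips matches inside a longer word (e.g. 'unbest.')
            # and matches that directly follow '>' (already marked).
            pattern = r"(?<![\w>])" + re.escape(ab)
            text = re.sub(pattern, lambda hit: "<ab>" + hit.group(0) + "</ab>", text)
        return "{%" + text + "%}"
    return ITALIC.sub(mark_fragment, line)

# Hypothetical sample line, not taken from pw.txt.
line = "{#x#}¦ <lex>m.</lex> {%eine best. Pflanze%} best. {%<ab>best.</ab> Hochzahl%}"
print(mark_in_italics(line, ["best."]))
```

Only the unmarked 'best.' inside the first italic fragment is changed; the occurrence outside italics and the one already marked are left alone.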

freq_ab_2.txt gives the updated abbreviation frequency counts.

Andhrabharati commented 2 years ago

@funderburkjim It appears that you had checked only for "candidates" preceded by a space.

There are more unmarked candidates that could be identified from this list (made by "breaking" the pw.txt at spaces and taking '.' entries)- pw_hunt (ab and lex tags).txt
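The "breaking at spaces" step can be sketched roughly as follows. This is only an illustration under assumptions: it skips text already inside ab, lex or ls tags, and the function name and sample lines are made up, so the actual pw_hunt list may have been produced differently.

```python
import re
from collections import Counter

# Spans that are already marked and should not be reported again.
MARKED = re.compile(r"<(ab|lex|ls)\b[^>]*>.*?</\1>")

def dotted_candidates(lines):
    """Count whitespace-separated tokens ending in '.' outside ab/lex/ls markup."""
    counts = Counter()
    for line in lines:
        unmarked = MARKED.sub(" ", line)
        for token in unmarked.split():
            # Strip brackets and markup delimiters; the final dot is kept.
            token = token.strip("()[]{}%#,;:")
            if len(token) > 1 and token.endswith(".") and token[0].isalpha():
                counts[token] += 1
    return counts

# Hypothetical sample lines in the style of pw.txt.
sample = [
    "{#guRagfhya#}¦ <lex>Adj.</lex> {%eine Vorliebe habend , <ab>f.</ab> V. empfänglich%}",
    "(vgl. {#amuka#}) so <ab>v. a.</ab> {%Gattin%}",
]
print(dotted_candidates(sample))
```

Tokens such as a sentence-final word followed by a period are also collected, which is why a list built this way needs further weeding.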

I had used the file updated by you above (csl-orig/v02/pw/pw.txt).

Of course, few in this list might need some textual correction as I had seen randomly (but not corrected).

funderburkjim commented 2 years ago

My 'candidates' were the abbreviations that already have tooltips: pwab_input.txt

I did not require a preceding space, though most of the unmarked abbreviations were in fact preceded by a space.

Your pwhunt list has over 5300 items. Most of these are surely NOT abbreviations. It might be that there are some in this list which ARE abbreviations, but which do not yet appear in pwab_input.txt. And it would be good to (a) add these to pwab_input.txt and (b) mark them with <ab>X</ab> markup.

@thomasincambodia could supply us with German tooltips, but we should not dump a list of 5300 words in his lap to weed through.

The list needs to be reduced by excluding words that are in a list of known German words. Could @drdhaval2785 or @Andhrabharati do this?
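A rough sketch of such a reduction, assuming a plain set of known German words and a set of abbreviations that already have tooltips. Both sets below are tiny stand-ins, and the function name is hypothetical.

```python
def reduce_candidates(candidates, german_words, known_abbrevs):
    """Drop tokens that already have tooltips or that are ordinary words plus a period."""
    kept = []
    for token in candidates:
        if token in known_abbrevs:
            continue  # already in the tooltip list
        stem = token.rstrip(".")
        if stem.lower() in german_words:
            continue  # ordinary word at the end of a sentence
        kept.append(token)
    return kept

# Stand-in data; a real run would load a full word list and the tooltip file.
german_words = {"kuh", "geltend", "empfänglich"}
known_abbrevs = {"vgl.", "best."}
candidates = ["geltend.", "vgl.", "urspr.", "Kuh.", "viell."]
print(reduce_candidates(candidates, german_words, known_abbrevs))
```

A filter like this would also discard any abbreviation that happens to coincide with a full German word, so the discarded items may still deserve a quick look.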

Andhrabharati commented 2 years ago

@funderburkjim

My main intention in giving the list, is to point to the unmarked (listed) ab texts with opening/closing (or both) brace.

  1. In both places where it is present, <ab>m.</ab> is not a contraction to be marked with <ab> tag, it is the sound 'm'. So, the marking has to be removed.
  2. At <L>35309 and <L>35693, <ab>f.</ab> does not seem to indicate the feminine gender [Thomas might provide the correct expansion]; and at <L>74652 the marking could be changed to <lex>f.</lex>, as is throughout the text.

And as I had mentioned elsewhere long back, there are too many unlisted abbr.s in these German works (PWG and pwk); one needs good know-how to 'handle' these.

Andhrabharati commented 1 year ago

<L>35309: {#*gAMmanya#}¦ <lex>Adj.</lex> {%sich für eine Kuh haltend <ab>f.</ab> e. K. geltend.%}

<ab>f.</ab> e. K. seems to be for für eine Kuh that is present before it!

<L>35693: {#guRagfhya#}¦ <lex>Adj.</lex> (<lex>f.</lex> {#A#}) {%eine Vorliebe für Vorzüge habend , <ab>f.</ab> V. empfänglich%}

<ab>f.</ab> V. seems to be for für Vorzüge that is present before it!

My recent exercise with GRA makes me somewhat confident enough to indulge into marking the pw abbr.s as well; shall I take up the work, @funderburkjim ?

funderburkjim commented 11 months ago

marking the pw abbr.s as well

Yes. I think abbreviation markup is a useful aspect of the cdsl digitizations. Often difficult, as we learned with GRA, but the end result is helpful to any users of the displays.

Andhrabharati commented 11 months ago

@funderburkjim / @drdhaval2785,

Would you pl. reopen this issue, so that I can post the abbr. entities to be validated and filled up? [I thought I should leave the "filling-up" task to @maltenth or @fxru, as the current abbr.s have both German and Latin expansions. I could try the German ones, but that would not be a FULL filling-up.]

I had tried my best to mark all the abbr. entities throughout the file, but there could be a few (probably very few!) left still.

The global abbr.s counts now are 102777 total, and 295 unique [as against CDSL file data of 91898 and 63 resp.]. The local abbr.s counts now are 996 total, and 580 unique [as against CDSL file data of 8 and 7 resp.; there are 4 (2) addl. entries that belong to is-abbr.s].
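Counts of this kind can be reproduced along the following lines. This is a sketch under assumptions: global abbreviations are taken to be <ab>X</ab> and local ones <ab n="T">X</ab>, and a local abbreviation is counted as unique per (abbreviation, tooltip) pair, which may not be how the figures above were computed.

```python
import re
from collections import Counter

AB = re.compile(r'<ab(?: n="([^"]*)")?>(.*?)</ab>')

def count_abbrevs(lines):
    """Return total and unique counts for global and local abbreviation markup."""
    glob, local = Counter(), Counter()
    for line in lines:
        for m in AB.finditer(line):
            tip, abbr = m.group(1), m.group(2)
            if tip is None:
                glob[abbr] += 1           # <ab>X</ab>
            else:
                local[(abbr, tip)] += 1   # <ab n="T">X</ab>
    return {
        "global_total": sum(glob.values()), "global_unique": len(glob),
        "local_total": sum(local.values()), "local_unique": len(local),
    }

# Sample in the style of a pw.txt line, shortened for illustration.
sample = ['{%<ab n="???">N. N.</ab> gehörig%}. so <ab>v. a.</ab> '
          '{%Gattin des <ab n="???">N. N.</ab>%} zu lesen <ab>st.</ab>']
print(count_abbrevs(sample))
```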

funderburkjim commented 11 months ago

Reopening

Andhrabharati commented 11 months ago

Here are the abbr.s marked in pwk and pwkvn, both global type & local type--

Global abbr. (pwk).txt Global abbr. (pwkvn).txt [pwkvn has 13 "new" global abbr.s wrt the pwk.]

Quite a few of these are present in GRA as well, wherein I had tried expanding them into German. But, for the reason mentioned in my prev. post, I thought I should better leave the work here to someone else.

Local abbr. (pwk).txt Local abbr. (pwkvn).txt [It may be noted that SCH is helpful in identifying some local abbr.s, as they are in expanded form as against the pwk(vn).]

There is one entity "N. N." that occurred once in pwkvn and thrice in pwk pages (and PWG also has it 6 times), that I could not 'decipher' and marked as <ab n="???">N. N.</ab>.

Though I had tried filling up the local abbr.s (looking mostly at the preceding text), I feel there might be some grammatical forms that need to be considered; as such, these need someone (proficient in German language) looking at them all over once, to check and make them proper.

@thomasincambodia could supply us with German tooltips, but we should not dump a list of 5300 words in his lap to weed through.

@funderburkjim,

Now that the count of global abbr.s is just about 300 (294 in pwk and addl. 13 in pwkvn), can they be sent to @maltenth and/or @fxru, for expanding into German and Latin forms?

----------------------

PS. I had made the pwkvn file also in the similar format as the pwk file now, which has unearthed multiple errors that were corrected in the process.

gasyoun commented 11 months ago

[I thought I should leave the "filling-up" task to @maltenth or @fxru, as the current abbr.s have both German and Latin expansions. I could try the German ones, but that would not be a FULL filling-up.]

@Andhrabharati I do not believe in what you wrote. There is no other person other than you able to fulfill this task. @fxru at best could partly verify it. (= "need someone (proficient in German language) looking at them all over once, to check and make them proper.")

I had tried my best to mark all the abbr. entities throughout the file, but there could be a few (probably very few!) left still.

Your best is more than needed and more than enough.

N. N.

Could you quote the cases in full?

Tried a few from PWK:

<ab>überh.</ab>
<ab>übertr.</ab>
<ab>Ueberh.</ab> Ueberhaupt
<ab>Uebers.</ab>
<ab>Uebertr.</ab>
<ab>unbest.</ab>
<ab>uneig.</ab>
<ab>Uneig.</ab>
<ab>ungedr.</ab>
<ab>Unterschr.</ab>
<ab>urspr.</ab> ursprünglich
<ab>v. a.</ab>
<ab>v. l.</ab>
<ab>v. u.</ab>
<ab>Verb.</ab>
<ab>Vergl.</ab>
<ab>vgl.</ab>
<ab>Vgl.</ab>
<ab>viell.</ab>
<ab>Voc.</ab>
<ab>z. B.</ab>
<ab>Zahladv.</ab> Zahladverb

maltenth commented 11 months ago

N. N. is used when you don't yet know who is going to be nominated, for example, as the head of a section in a conference, planned years before. It's an abbr. for Latin 'Nomen nominandum'.


Andhrabharati commented 11 months ago

N. N.

Could you quote the cases in full?

N. N. is used when you don't yet know who is going to be nominated, for example, as the head of a section in a conference, planned years before. It's an abbr. for Latin 'Nomen nominandum'.

Here are the two entries in pwk that have the entity--

<L>8321<pc>1097-2<k1>amuka<k2>amuka<e>000 {#amuka#}¦ <lex>Pron.</lex> (<lex>f.</lex> {#A#}) {%der und der,%} die Stelle eine Namens vertretend und unserem {%<ab n="???">N. N.</ab>%} entsprechend. <LEND>

<L>8322<pc>1097-2<k1>amukIya<k2>amukIya<e>100 {#amukIya#}¦ <lex>Adj.</lex> {%<ab n="???">N. N.</ab> gehörig%}. <lex>f.</lex> {#A#} so <ab>v. a.</ab> {%Gattin des <ab n="???">N. N.</ab>%} so ist wohl zu lesen <ab>st.</ab> {#amukIda#} <ls>Ind. St. 5,370</ls> und {#amukidA#} bei <ls>GOLD.</ls> <LEND>

And the pwkvn case is--

<L>13704<pc>7-315-c<k1>asOyaja<k2>asOyaja {#asOyaja#}¦ I. Genauer {%die Formel%} „<ab n="???">N. N.</ab> {#yaja#}“. <LEND>

The PWG cases are--

[under the entry <L>1694<pc>1-0123<k1>adas<k2>ada/s<h>1] Dieses pron. und seine Ableitungen werden auch zur Bezeichnung unbestimmter, im Augenblick nicht zu nennender Personen oder Gegenstände verwendet, <ab>z. B.</ab> in den Formeln des <ls>AV.</ls> an den Stellen, welche der Name desjenigen einzunehmen hat, gegen den die Formel gerichtet ist: {#tEzwvA\ sarvE^ra\Bi zyA^mi\ pASE^rasAvAmuzyAyaRAmuzyAH putra#} {%mit diesen Banden allen binde ich dich <ab n="???">N. N.</ab>, von <ab n="???">N. N.</ab> stammend, der <ab n="???">N. N.</ab> Sohn%} <ls n="AV.">4,16,9</ls>;

[under the entry <L>5175<pc>1-0376<k1>amuka<k2>amuka] {#amuka#}¦ (von {#amu#}) pron. {%der und der%}, die Stelle eines Namens vertretend und unserm {%<ab n="???">N. N.</ab>%} entsprechend: {#ahamamukaH sAkzI#} <ls>YĀJÑ. 2,87.</ls> {#amukena#} <ls n="YĀJÑ. 2,">88.</ls> {#amukaputra#} {%der Sohn von <ab n="???">N. N.</ab>%} <ls n="YĀJÑ. 2,">86.</ls> {#amukasUnu#} <ls n="YĀJÑ. 2,">88.</ls> <ls>MAHĪDH.</ls> zu <ls>VS. 10,30.</ls>

and the next one under the entry <L>46026<pc>4-0784<k1>purAkalpa<k2>purAkalpa, appears to be not conforming to what @maltenth has explained above--

{#purAkalpa#} (= {#yugAntare#} <ab>Erkl.</ab>) {#etadAsIt#} <ls>PAT.</ls> in <ls>Ind. St. 5,163, **N. N.** 3.</ls> [as this occurred under an ls-entity, there is no ab-tag here (as I mostly left many abbr.s unmarked under the ls-places)]

Here is the corresp. Ind. St. 5,163 page, showing the Note. 3 (***)-- image

I think, the N. appears to have been repeated by error here.

Andhrabharati commented 11 months ago

@maltenth, Glad to see you coming-in!!

And, would you pl. consider having a re-look at the 28 <sic/> places in pwk? (which probably seem to have been marked by you, as per @funderburkjim)

<sic/> 28 possible error in text. Needs further investigation

Here are the extracted lines from my latest pwk file-- pwk-(sic) cases.txt

[BTW, why two of these have no preceding '|' mark, @funderburkjim ?]

Andhrabharati commented 11 months ago

It's an abbr. for Latin 'Nomen nominandum'.

When searched for this expansion, Google came up with Nomen Nescio, which matches with the Skt. amuka very aptly. [Looks like Nomen Nominandum is somewhat related to 'nominating' and Nomen Nescio is used to signify an anonymous or non-specific entity.]

"die Stelle eine Namens vertretend und unserem N. N. entsprechend." is Google-translated to English as "representing the place of a name and corresponding to our N. N."

What do you think, @maltenth ?

funderburkjim commented 10 months ago

@Andhrabharati Request you to upload your latest temp_pwk_ab_1.txt. My focus is on the abbreviation markup related to your abbreviation files above.

Andhrabharati commented 10 months ago

@funderburkjim As far as the abbr.s are concerned, my earlier posted file pw (AB v1).zip (25 July) can be taken as the working file. [I had mentioned the same while posting it]

This can be used to check and workout the abbr. expansions, if nothing else. [Probably Thomas and/or Felix Rau could be reached out to help in the process.]

As your temp_pwk_ab_1 is based on it, the same can be used by you now for the said purpose.

My latest file has another entry (Dass.) added now; whereas the earlier file has only dass. marked.

funderburkjim commented 10 months ago

minor corrections to pwABv1

See corrections_ab_0.txt.

Noticed during xml validity check on temp_pw_ab_0.txt.

Andhrabharati commented 10 months ago

You have already looked at the AB_0 file and arrived at the next version at pwk issue 95.

I think the corrections that you suggested now are already incorporated there.

Andhrabharati commented 10 months ago

As mentioned in my earlier post (just above your latest post), you could start with your temp_pwk_ab_1 file now.

funderburkjim commented 10 months ago

My work on abbreviations in this issue set up to begin at pwkissues/issue88 directory. Nothing much done yet.

It was very confusing to resolve the two versions of pwk from AB. We should both aim to avoid such version confusion as we independently work on pwk.

Andhrabharati commented 10 months ago

No confusion, @funderburkjim !

Just use the temp_pwk_ab_1 file posted by you, at the other issue #95.

funderburkjim commented 10 months ago

@Andhrabharati FYI: I've been working on this, and may have something to post in a few days.

Andhrabharati commented 10 months ago

Good to know, @funderburkjim !

And I am going back home tomorrow (after 3 months); so we can have the long awaited ls "filling" too (with what I had done those days).

But before that, you might have to update the ls entities again, as my recent work had some extra markings.

Andhrabharati commented 10 months ago

And are the latin expansions also being made along with german, for the abbr.s?

funderburkjim commented 10 months ago

resolving <ab>X</ab> differences

I've now completed a resolution of the differences between the global abbreviation markup present in (a) the current cdsl version of pw.txt and (b) your version mentioned above. The issue88/readme.txt contains a summary, and many more details are in the ablists directory.

@Andhrabharati You need to determine whether you accept the revision I made in your version. The revision is temp_pw_ab_2.zip

All in all 673 lines in your file were changed, as detailed in change_pw_ab_2.txt. The vast majority of these involve the 'v. l.' and 'vgl' abbreviations. I checked the scans for about 100 of these, and for some reason the cdsl abbreviation almost always agreed with the scan -- this is why I made changes to your file for these.

By contrast, over 48242 lines in the cdsl version were changed! It was gratifying to see how much improvement (including many typos) these changes add to the cdsl version.

I realize there are many other markup changes that you have added and which need to be included in the cdsl version. But before attending to these, I want to know if you accept the changes I made which are in temp_pw_ab_2.txt

Andhrabharati commented 10 months ago

@funderburkjim

Sometimes my work gets some wrong corrections, with regex. & normal expression variation; I notice quite a few and correct them back while working. This 'vgl.' -> 'v.l.' (and then 'v.l.' -> 'v. l.') seems to be one of such type, which was not noticed by me. As this accounts for the most of the differences (637 out of 673), you may continue further with the temp_pw_ab_2.txt.

[I could not find time today to look at the data; even if there might be some changes that I meant to be there (with a reason), they could always be changed later.]
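One plausible way such a 'vgl.' -> 'v. l.' slip can arise (a guess, not confirmed anywhere in this thread) is a search pattern in which the dots are left unescaped while the editor is in regex mode. The sample text below is invented.

```python
import re

text = "<ab>vgl.</ab> {#a#} <ab>v.l.</ab> {#b#}"

# Unescaped dots match any character, so 'vgl.' is caught as well as 'v.l.'.
wrong = re.sub(r"v.l.", "v. l.", text)

# Escaped dots match only literal periods, so 'vgl.' is left alone.
right = re.sub(r"v\.l\.", "v. l.", text)

print(wrong)
print(right)
```

In the first result both abbreviations end up as 'v. l.'; in the second only the intended one is changed.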

funderburkjim commented 10 months ago

wrong corrections, with regex.

Same for me. ...

they could always be changed later.

Yes. I'll continue

Andhrabharati commented 10 months ago

@funderburkjim

I could take sometime today to look at the temp_pw_ab_2.txt.

There are just three lines to be corrected in it-- change_pw_ab_2 (AB).txt

funderburkjim commented 10 months ago

local abbreviations.

Work continues to be in directory pwkissues/issue88.

These changes take into account file change_pw_ab_2.AB.txt mentioned a couple of comments above.

temp_pw_3.txt and temp_pw_ab_3.txt are now in agreement for both <ab>X</ab> and <ab n="T">X</ab>.

funderburkjim commented 10 months ago

the footnote line

There is one change mentioned in change_pw_ab_2 that deserves comment - the footnote line.

The default display, shown below, seems acceptable to me.

Here are the displays of this line (available in my local installation) under entry ली,

cdsl display

image

ab display

image

Andhrabharati commented 10 months ago

@funderburkjim

I have noticed a couple of ??? entries in your ab_local1.txt file.

Here are some of them, as resolved in my later working--

0,1 a. u.=??? <ab n="???">a. u.</ab> ;; could this be "u. a." instead?
0,3 Chr.=??? <ab n="nach">n.</ab> <ab n="???">Chr.</ab> ;; <ab n="nach Christus">n. Chr.</ab>
3,3 N. N.=??? <ab n="???">N. N.</ab> ;; <ab n="Nomen Nescio">N. N.</ab>
0,2 NO=??? <ab n="???">NO</ab> ;; <ab n="Nord-Ost">NO</ab>
0,1 NW=??? <ab n="???">NW</ab> ;; <ab n="Nord-West">NW</ab>
0,1 SO=??? <ab n="???">SO</ab> ;; <ab n="Süd-Ost">SO</ab>
0,1 SW=??? <ab n="???">SW</ab> ;; <ab n="Süd-West">SW</ab>
0,1 u.=??? <ab n="???">u.</ab> ;; <ab n="unterthan">u.</ab>

Also noticed this remark in readme.txt --

dass. When is it an abbreviation?

It is an abbr. where it means dasselbe (= the same thing) similar to id. used in many other works, while the plain dass (= that) is not an abbr. And, this occurs mostly at the end of a line [or 'sense'], or before an ls-entity.

Andhrabharati commented 10 months ago

There are three places in my file(s), having no space between >{; these are to be with a space in between > {--

(68606) </ab>{%reichend, hindringend%}
(106037) </hom>{#u#}
(340332) </ab>{#pUrvaBA°#}
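A small sketch for locating and fixing such spots, assuming that a '>' immediately followed by '{' is never intended in the file. The function names are hypothetical.

```python
import re

MISSING_SPACE = re.compile(r">\{")

def find_missing_space(lines):
    """Return 1-based line numbers where '>' is directly followed by '{'."""
    return [n for n, line in enumerate(lines, start=1) if MISSING_SPACE.search(line)]

def add_space(line):
    """Insert a single space between '>' and '{'."""
    return MISSING_SPACE.sub("> {", line)

lines = [
    "<lex>Adj.</lex> {%stinkend%}",
    "</ab>{%reichend, hindringend%}",
]
print(find_missing_space(lines))
print(add_space(lines[1]))
```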

funderburkjim commented 10 months ago

additional changes

See change_pw_ab_4.txt.

Changes per previous two comments in AB version.
One additional change: }< -> } < (1 in AB version)

Similar changes in cdsl version change_pw_4.txt.

funderburkjim commented 10 months ago

@maltenth has agreed to review the local abbreviations (<ab n="tooltip">X</ab>). He also suggests that the displays should show the 'tooltip' rather than the abbreviation for these local abbreviations.

I'll work with him to develop materials to make his review as efficient as possible.

Andhrabharati commented 10 months ago

And, what about the global abbr.s?

funderburkjim commented 10 months ago

I guess you are referring to tooltips for global abbrs. We can deal with those (hopefully with Thomas help) after the local abbrs are done.

funderburkjim commented 10 months ago

dev_tm

https://sanskrit-lexicon.uni-koeln.de/work/pwk_tm/web/

A version of the displays with the tooltips displayed for the local abbreviations. Currently, these are rendered as @TIP (in blue). The '@' is temporary, for Thomas.

funderburkjim commented 10 months ago

ab_local_tm_0.txt
extracts the entries of pw.txt with local abbreviations (currently 843 such entries). And for each of these,

When used with an appropriate text editor (such as notepad++), the links are 'active' (in notepad++, the link will open in browser window when double-clicked).

The <ab n="TIP">X</ab> can be edited in this file, if necessary.
The resulting edited file can be sent back to me, and will be the basis for a further update of pw.txt.

We'll have to see whether @maltenth finds this a useful workflow.

Andhrabharati commented 10 months ago

I wonder if @maltenth entertains slp1!

funderburkjim commented 9 months ago

user corrections

I just finished installing the user corrections in csl-orig. This includes changes to 20 lines for pw. The local versions relating to this issue have been similarly updated, in my temp_pw_4.txt and temp_pw_4_ab.txt. See

Andhrabharati commented 9 months ago

Yes, @funderburkjim; I was watching the activity and already changed the pwk and MW files at my end.

funderburkjim commented 9 months ago

abbreviation corrections

@maltenth provided corrections to local abbreviations. See ab_local_tm_0_corr.txt.

The changes were extracted into change_pw_5.txt.
Application of these changes resulted in temp_pw_5.txt, the cdsl version of the pw digitization. 116 lines changed (compare to roughly 1000 local abbreviations).
temp_pw_5.txt has been installed at Cologne.
These changes were also applied manually to Andhrabharati's version. See change_pw_ab_5.txt. This results in temp_pw_ab_5.zip.
The cdsl and ab versions agree with respect to global and local abbreviations.
The local abbreviations, with counts, may be seen at ab_local5.txt.

In the displays, the local abbreviations are now shown 'expanded' per Thomas' suggestion, and in blue text. For example <ab n="medicinischem">med.</ab>.

image
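The 'expanded' rendering can be pictured with a sketch like the one below. It is not the Cologne display code; it only assumes that a local abbreviation <ab n="TIP">X</ab> is shown as TIP, that unresolved "???" tooltips are left as they are, and that global <ab>X</ab> markup is untouched. The function name, the marker parameter, and the sample line are invented.

```python
import re

LOCAL_AB = re.compile(r'<ab n="([^"]*)">(.*?)</ab>')

def expand_local(line, marker=""):
    """Replace each local abbreviation by its tooltip text, optionally prefixed by a marker."""
    def repl(m):
        tip = m.group(1)
        if tip == "???":
            return m.group(0)  # unresolved tooltip: keep the markup
        return marker + tip
    return LOCAL_AB.sub(repl, line)

# Hypothetical sample line.
line = 'in <ab n="medicinischem">med.</ab> Sinne <ab>v. a.</ab> <ab n="???">N. N.</ab>'
print(expand_local(line))
print(expand_local(line, marker="@"))
```

The marker argument mimics the temporary '@' prefix used in the dev_tm display mentioned earlier in this thread.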

global tooltips

There are now 300+ global abbreviations in pw digitization. I attempted to assign tooltips to these, which are in pwab_input.txt in csl-pywork repository. These may also be seen, along with counts, in ab_glob5.txt.

gasyoun commented 9 months ago

Are they to be applied to PWG as well?

funderburkjim commented 9 months ago

Makes sense to apply to PWG as well. Before that, there is still work to be done in current round of PW editing.

und, unter

Thomas deferred consideration of the 'u.' abbreviations, because they are somewhat tricky to correct: u. could stand for 'und' or 'unter'. There are about 150 instances of these. I will prepare a document for him in this regard.

missing global tooltips.

There are still about 30 abbreviations with unknown tooltips (marked '?' in ab_glob5.txt). @maltenth and/or @Andhrabharati may be able to resolve these, as well as review the other tooltips.

When these tasks are accomplished, we would be ready to apply these to pwg (and pwkvn). @Andhrabharati Agree?

funderburkjim commented 9 months ago

typo correction in pw

Thomas noticed several typos (of German text) during the course of reviewing the local abbreviations, and expressed an interest in getting engaged in the further correction process of pw. Great! Deciding how to approach such typo correction will be subject for another issue.

Andhrabharati commented 9 months ago

@maltenth and/or @Andhrabharati may be able to resolve these, as well as review the other tooltips.

Here are some in the list--

inbes. 18,18 ? to be replaced with "insbes."
instebs. 1,1 ? - ? to be replaced with "insbes."
Kalb. 1,1 ? - ? no dot after Kalb; it is a typo in the text.
Pt. 1,1 ? - ? to be replaced with Pl.; it is a print error.
Red. 5,5 ? - ? to be replaced with "Bed."

I also corrected more typing errors than I would have expected to find. ⋯ ⋯ So I have some sympathy for the frequent moaning of AB and also want to atone for these shortcomings by getting engaged in the further correction process of pw. Maybe you have some ideas about what can be done.

Pl. see my remark at the end of this post!!

In the displays, the local abbreviations are now shown 'expanded' per Thomas' suggestion

This makes me believe that Thomas himself had done so in the text file itself earlier, as I had doubted elsewhere.

Andhrabharati commented 9 months ago

also want to atone for these shortcomings by getting engaged in the further correction process of pw.

Though not a shortcoming as such, this post may probably be looked at by @funderburkjim and @maltenth.

Andhrabharati commented 9 months ago

When these tasks are accomplished, we would be ready to apply these to pwg (and pwkvn). @Andhrabharati Agree?

Sure it is the way to go, @funderburkjim; but just like to point out that PWG has many more items ab-marked.

Andhrabharati commented 9 months ago

<L>25977<pc>2041-3<k1>kAkakulAyaganDika (; case 135)
124754 new {#kAkakulAyaganDika#}¦ <lex>Adj.</lex> {%stinkend wie ein Krähennest%} <ls>AIT. ĀR. 352,3</ls> <ab n="a.u. = ab usu - as usual ?">a. u.</ab>

It is in fact, v. u. (a print error)-- image

which is mentioned 'correctly' at the other word at pwk 1200-1 (<L>16624<pc>1197-3<k1>i<k2>i<h>3) image