sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

Fresh Look, starting with `<is>` tag #95

Closed funderburkjim closed 6 months ago

funderburkjim commented 1 year ago

Work initially related to https://github.com/sanskrit-lexicon/CORRECTIONS/issues/419.

funderburkjim commented 1 year ago

Work in this issue is done in the pwkissues/issue95 directory of this repository.

start with latest pw.txt

@Andhrabharati please start with latest csl-orig/v02/pw/pw.txt. A few (19) changes were made during development of transcode script. You could name this file 'temp_pw_0.txt'.

transcode script

The pw_transcode.py script converts pw from one transcoding to another.

First, make the 'pwtranscode' directory the current directory in your terminal.

Note 1: If you convert from slp1 to iast and then (without making changes to the iast version) immediately convert the iast version back to slp1 (under differently named file), then that differently named file and the original file should be identical.

Note 2: Conversion is applied to (a) both the k1 and k2 fields of metaline and (b) the {#X#} elements of the text.
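A minimal sketch of what Notes 1 and 2 describe, not the actual pw_transcode.py: the two-way table below is a toy, hypothetical mapping (the real rules live in the transcoder files), and `convert_line` is an illustrative name. It only shows conversion being limited to the k1/k2 fields and the {#X#} spans, and the round-trip property being checked.

```python
import re

# Toy, hypothetical slp1 -> iast table; NOT the real transcoder rules.
TOY_SLP1_TO_IAST = {"A": "ā", "S": "ś", "N": "ṅ"}
TOY_IAST_TO_SLP1 = {v: k for k, v in TOY_SLP1_TO_IAST.items()}

META_FIELD = re.compile(r"<(k1|k2)>([^<]*)")   # k1 and k2 fields of a metaline
SANSKRIT = re.compile(r"\{#(.*?)#\}")          # {#X#} elements of the text


def convert(text, table):
    """Character-by-character conversion with the given table."""
    return "".join(table.get(ch, ch) for ch in text)


def convert_line(line, table):
    """Convert only the k1/k2 fields and the {#X#} spans; leave the rest alone."""
    line = META_FIELD.sub(
        lambda m: "<%s>%s" % (m.group(1), convert(m.group(2), table)), line)
    return SANSKRIT.sub(
        lambda m: "{#" + convert(m.group(1), table) + "#}", line)


def round_trip_ok(lines):
    """Note 1: slp1 -> iast -> slp1 should reproduce the input exactly."""
    forward = [convert_line(x, TOY_SLP1_TO_IAST) for x in lines]
    back = [convert_line(x, TOY_IAST_TO_SLP1) for x in forward]
    return back == lines
```

With real files, the same check amounts to comparing the original slp1 file with the re-converted one line by line.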

Andhrabharati commented 1 year ago

I had taken the recent pw.txt from csl-orig, for my present working.

Will incorporate the 19 changes done by you now in my file.

And, a big "Thank you" for the conversion scripts.

Andhrabharati commented 1 year ago

Noted that 8 of 19 were already changed during my working.

funderburkjim commented 1 year ago

retain line-numbering

As with the work on Gra, request you maintain the line numbering in revisions of pw.txt. Then at the end of the pw revisions, we can remove unneeded blank lines.

funderburkjim commented 1 year ago

@Andhrabharati So I can follow your comments (such as at https://github.com/sanskrit-lexicon/CORRECTIONS/issues/419#issuecomment-1640198843), why don't you upload a zip of your current pw.txt.

Andhrabharati commented 1 year ago

@funderburkjim

Still quite a bit of work is remaining to cleanup the data to give out my prelim. file.

I had only looked at the portions marked as italic; there are quite many places not marked so in the text (but are in italics, in print) that need to be identified.

My present focus is on marking the abbr.s inside italics as well as outside.

Pl. wait for few more days.

Meanwhile, you may start looking/working on BHS, which I had made earlier & recently 'marked' citation numbers after GRA. Shall I post the same at the csl-devanagari repo (that discussed this point), which you could then take-up in a relevant repo?

Andhrabharati commented 1 year ago

Also quite many places are not marked with is-tag!!

Andhrabharati commented 1 year ago

@funderburkjim

Just tried converting the slp1 file at my end, and got

<L>1<pc>1001-1<k1>अ<k2>अ<h>1<e>000 1. {#अ#}¦ Pron. der 3ten Person. Davon {#अस्मै॑ , अस्यै॑ , अस्मा॑त् , अस्या॑स् , अस्य॑ , अस्मि॑न् , अस्या॑म् , आभ्या॑म् , एभि॑स् , आभि॑स् , एभ्य॑स् , आभ्य॑स् , एषा॑म् , आसा॑म् एषु॑ , आसु॑#} {%Diesem , diesem hier%} <ab>u.s.w.</ab> Unbetont <ab>Subst.</ab> {%ihm , ihr%} <ab>u.s.w.</ab> <ab>Vgl.</ab> {#अयम् , अया , इदम् , इम , इयम् , एन , एना#}. <LEND>

instead of [from my earlier file pw_AB_08.txt]

<L>1<pc>1001-1<k1>a<k2>a<h>1<e>000 1. {#अ#}¦ Pron. der 3ten Person. Davon {#अस्मै꣫, अस्यै꣫, अस्मा꣫त्, अस्या꣫स्, अस्य꣫, अस्मि꣫न्, अस्या꣫म्, आभ्या꣫म्, एभि꣫स्, आभि꣫स्, एभ्य꣫स्, आभ्य꣫स्, एषा꣫म्, आसा꣫म् एषु꣫, आसु꣫#} {%Diesem, diesem hier%} <ab>u. s. w.</ab> Unbetont <ab>Subst.</ab> {%ihm, ihr%} <ab>u. s. w.</ab> <ab>Vgl.</ab> {#अयम्, अया, इदम्, इम, इयम्, एन, एना#}. <LEND>

The remark is about the Vedic svara conversion.

Probably the underlying "rule" files (in the transcoder folder) are not the ones that we had 'finalised' earlier for the PW group.

Would you pl. check this once?

funderburkjim commented 1 year ago

Would you pl. check this once?

pw-style devanagari accents

@Andhrabharati Yes, you are right regarding conversion. To get the 'pw' style devanagari accents, you will need to

I think this will solve that problem.


Note: 1 typo noticed, under (slp1) <L>132690<pc>7240-3<k1>svardfS, {#suArdf/Z#} -> {#suArdf/S#}

funderburkjim commented 1 year ago

Shall I post the same at the csl-devanagari repo

Sure, go ahead. I'll take a look.

Also please note that I need a posting of your current pw; so I can respond to the <is n="abbrev">X</is> question you raised.

Andhrabharati commented 1 year ago

@funderburkjim

Posted my BHS file at the relevant repo.

Andhrabharati commented 1 year ago

Got 9 more abbr. type is-entities, in the non-italic part (while checking the dot-ending words)--

<is n="Acchāvāka">A.</is> <is n="Iṣṭi">I.</is> <is n="Kālidāsa">K.</is> <is n="Magundī">M.</is> <is n="Tīrtha">T.</is> <is n="Trigarta">Tr.</is> <is n="Uṣṇih">U.</is> <is n="Uttaraphalgunī">Uttaraph.</is> <is n="Virāj">V.</is>

And, noted that some entries listed in the pwis_mw.txt are in fact typos. This prompts me to look at the full set of <is>-words now, to "close" the issue.

Andhrabharati commented 1 year ago

And, noted that some entries listed in the pwis_mw.txt are in fact typos. This prompts me to look at the full set of <is>-words now, to "close" the issue.

Just showing an example word on this point, <is>Kānda</is>--

<L>81565<pc>5003-3<k1>maNgalika<k2>maNgalika/<e>100 {#maNgalika/#}¦ (wohl <lex>n.</lex>) <ab>Pl.</ab> vielleicht <ab>Bez.</ab> {%der Lieder des 18ten%} <is>Kānda</is> im <ls>AV.</ls> <LEND>

image

MW entry for kānda is

image

And, the MW entry for maṅgalika is

image

Finally, the pwk print has this as

image

Thus, we can see that this entry has both a typo error (Kānda) as well as a print error (Kāṇda) in pwk, whereas it should’ve been Kāṇḍa.

Andhrabharati commented 1 year ago

BTW, this above example reminds me of the very initial comments on the ls-entity display of PWG (and pwk) posted by me--

first and next (Note 4)

But it appears that either these posts have skipped Jim's attention, or he didn't see any value in this point.

I feel REALLY bad whenever I see Rv, Av, etc. on CDSL PWG/pwk search results, while the MW display renders them 'appropriately' as RV, AV etc..

Andhrabharati commented 1 year ago

And, noted that some entries listed in the pwis_mw.txt are in fact typos.

The first entry in which I noted this discrepancy between the is-words and the mw-words is dvipa, which occurred 25 times, either by itself (Dvipa 6 times-- all in error) or as part of another word (dvipa 19 times-- all marked as notmw); whereas it should've been Dvīpa or dvīpa respectively.
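A small sketch of how such is-words could be tallied for review. The function name and the regex are assumptions for illustration, not the script that produced pwis_mw.txt; the pattern accepts both plain `<is>` and `<is n="...">` forms.

```python
import re
from collections import Counter

# Matches <is>X</is> and <is n="...">X</is>; captures X.
IS_TAG = re.compile(r'<is(?: n="[^"]*")?>(.*?)</is>')


def count_is_words(lines):
    """Return a Counter of the text found inside <is> elements."""
    counts = Counter()
    for line in lines:
        counts.update(IS_TAG.findall(line))
    return counts


def containing(counts, fragment):
    """List the distinct is-words that contain `fragment`, ignoring case."""
    return sorted(w for w in counts if fragment.lower() in w.lower())
```

Calling `containing(count_is_words(lines), "dvipa")` would then list the candidates to check against the print.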

Andhrabharati commented 1 year ago

retain line-numbering

As with the work on Gra, request you maintain the line numbering in revisions of pw.txt. Then at the end of the pw revisions, we can remove unneeded blank lines.

Sorry for having 'violated' this, @funderburkjim !

Rather, I haven't violated but just implemented the style I started in GRA, in this pw as well.

I have started with minimal line-number changes (limited to 'embedding' [Pagexxxx] into other lines), for now; but I have more changes in mind, to prepare this pw in a "standard style" to be followed in the other CDSL works as well.

Feel free to revise any parts of gra9. Go ahead and add entries such as the missed headwords.

… … … …

Friendly reminder - keep a note file as you change; this will be a guide to me of your changes. These notes will be helpful to me in constructing the displays from your revised version. The 'tags' files you provided will also find use when the display programs are revised.

@Andhrabharati The baton is now in your hands for the next leg of this Grassman marathon. Good luck!

Hope you'd allow me a 'free-hand'(!!) here also, as done at GRA recently.

Andhrabharati commented 1 year ago
  1. Any line starts only with one of the 5 types-- <L>, Header, <div, <F> and <LEND>
  2. No blank lines are present within the entry portion; and just a single blank line is present when a new entry starts.

These two were the binding-principles reg. the text-lines that I followed in the pw.txt file, and did the following replacements--

image

and this can be taken as my starting file, [the split lines are marked as ;; split]--

pw_CDSL_0.zip

Is this in compliance with our earlier posts 1 and 2?
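The two line rules stated above could be checked roughly as follows. This is a sketch under assumptions: "Header" is taken to mean the line immediately following the `<L>` metaline, and `check_line_rules` is a hypothetical helper, not part of the repository.

```python
def check_line_rules(lines):
    """Return (line_number, message) pairs for violations of the two rules."""
    problems = []
    inside = False          # True between a metaline and its <LEND>
    expect_header = False   # True for the line right after a metaline
    blanks = 0              # consecutive blank lines seen between entries
    for num, line in enumerate(lines, start=1):
        if line.strip() == "":
            if inside:
                problems.append((num, "blank line inside entry"))
            else:
                blanks += 1
                if blanks > 1:
                    problems.append((num, "more than one blank line between entries"))
            continue
        if line.startswith("<L>"):
            if inside:
                problems.append((num, "metaline before <LEND>"))
            inside, expect_header, blanks = True, True, 0
        elif line.startswith("<LEND>"):
            inside, expect_header = False, False
        elif expect_header:
            expect_header = False   # assumed to be the Header line
        elif not line.startswith(("<div", "<F>")):
            problems.append((num, "unexpected line start"))
    return problems
```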

Andhrabharati commented 1 year ago

Hope you'd allow me a 'free-hand'(!!) here also, as done at GRA recently.

If you have other thoughts, I shall post only the relevant <is and <ab strings, retaining the cdsl text 'form' as is, though it amounts to a minor rework at my end (withholding my current plan).

If you happen to agree (I just hope you would!), then I shall start posting what all I have done so far [having finished the abbr. portion], and my prelim. file.

funderburkjim commented 1 year ago

pw_cdsl_0

These observations based on work in pwkissues/issue95/compare0 directory.

Generation of displays (locally) using pw_CDSL_0 encounters no problems. The generated pw.xml validates with pw.dtd. Great!

A couple of minor observations: I renamed the file pw_CDSL_0.txt to temp_pw_ab_0.txt

  1. At line 106035 replace <L>55397 with a blank line
  2. When comparing with current pw.txt, I did not see any extra lines corresponding to ';; split'
    • for instance at <L>13120, both the AB version and the current pw.txt have 4 lines. So why the ';; split' ?
  3. I found no lines starting with < at second character, e.g. I found no lines starting with image, so the comment above is confusing.

Seems ok to proceed with further revisions.

Andhrabharati commented 1 year ago

2. When comparing with current pw.txt, I did not see any extra lines corresponding to ';; split'

* for instance at <L>13120, both the AB version and the current pw.txt have 4 lines. So why the ';; split' ?

@funderburkjim

pw_CDSL_0.txt is not the version that I am working with; it is just regenerated from pw.txt to match the lines with my AB file.

Here is the screenshot comparing the two files--

image

And you may see the split in my AB file at "Mit {#kar#}", breaking the prev. line into two lines.

3. I found no lines starting with < at second character, e.g. I found no lines starting with image, so the comment above is confusing.

There are no lines starting with •<ab etc.

My comment clearly shows the replacement of line starting with <ab getting merged into the prev. line as •<ab \n<ab -> •<ab (total 102 such replacements)

See the first such occurrence at lines 24204-6 in pw.txt

<div n="1">— 2) Praep. mit
[Page1063-1]
<ab>Acc.</ab>

that get merged in pw_CDSL_0.txt (line 23955) as <div n="1">— 2) Praep. mit •[Page1063-1] •<ab>Acc.</ab>

These are all (almost) the cases of what I mentioned above as "limited to 'embedding' [Pagexxxx] into other lines".
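The merger shown above can be sketched like this. It is an illustration only, covering the in-entry case from the example; `merge_page_lines` is a hypothetical name, and the exact spacing around the • marker is copied from the merged line quoted above.

```python
import re

PAGE = re.compile(r"^\[Page[^\]]+\]$")          # a line holding only [Pagexxxx]
STARTS = ("<L>", "<div", "<F>", "<LEND>")       # line starts that stay on their own line


def merge_page_lines(lines):
    """Fold a standalone [Pagexxxx] line, and a continuation line after it,
    into the preceding line, marking each former line break with ' •'."""
    out = []
    pending = False  # True right after a [Pagexxxx] line was folded in
    for line in lines:
        if out and out[-1] and PAGE.match(line):
            out[-1] += " •" + line
            pending = True
        elif pending and line and not line.startswith(STARTS):
            out[-1] += " •" + line
            pending = False
        else:
            out.append(line)
            pending = False
    return out
```

Because every former line break is marked, the original line numbering could in principle be restored by splitting at the marker again.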

Andhrabharati commented 1 year ago

Generation of displays (locally) using pw_CDSL_0 encounters no problems. The generated pw.xml validates with pw.dtd. Great!

This file has no "real" changes made, except the line mergers at [Pagexxxx].

funderburkjim commented 1 year ago

Got it. Ready for 'real' changes.

Andhrabharati commented 1 year ago

Here is the prelim. file to go through meanwhile (as you had done with my GRA file earlier, without any notes).

pw (AB v1).zip

This can be used to check and workout the abbr. expansions, if nothing else. [Probably Thomas and/or Felix Rau could be reached out to help in the process.]

I will start posting the notes from tomorrow morning, indicating various changes went into the file to get the prelim. file at my end (as of now), as I am too tired now. [It is just past mid-night here.]

funderburkjim commented 1 year ago

first observations on ab_1

Work in directory pwkissues/issue95/compare1.

xml

A few changes are required for the constructed pw.xml to be well-formed and to be validated against the corresponding one.dtd.

metaline differences

There are 14 fewer metalines in ab_1 than in ab_0. I haven't examined metaline differences more closely.

Andhrabharati commented 1 year ago
  • Note <bio> element is used for scientific names of animals. I think <bio> should be used instead of <zoo>, for consistency with MW markup.

@funderburkjim

Yes, I've seen the <bio tag being used in MW; but as I was not bothered much about the tags those days, I did not do anything on that.

Now, as I am looking at 'every aspect' (almost!) in the text file(s), I feel that the bio tag should not be used for the animals (and fauna). [I took liberty to introduce more meaningful and appropriate tags, as I am working on the FULL data of late; and it is up to you, whether to accept and use the new (or revised) tags or to discard them and continue with the old (current) tags.]

Bio(logy) encompasses both Bot(any) and Zoo(logy), broadly the immovable (स्थिर) and movable (चर) living bodies (animates) [of course, there are few exceptions-- there are some plants that 'move' and some animals that 'do not move' (sessile animals)].

And as <bot tag is employed for the Botanical entities, it is more appropriate to use <zoo tag for the Zoological entities.

[BTW, the MW file includes items like <bio>Canopus</bio>, which is not an animal in strict sense but a Star; so it should not be bio-tagged.]

Andhrabharati commented 1 year ago
  • new element 'iw'

The PW group (and some other works) has Substantives (mostly proper names) in wide-spaced lettering, that cover Sanskrit as well as other language entities; and I thought <is should be limited to 'is sanskrit' and hence added a new tag <iw 'is wide-spaced' for other languages.

Andhrabharati commented 1 year ago

new value 'unknown' for n-attribute of 'lang' element

* '???' cannot be used in the attribute value list in the DTD: `<!ATTLIST lang n (greek | ???) #IMPLIED>` is not valid DTD.

I have modified the <lang n="???"> as <lang n="UNK">, and hope this is alright.

I vaguely remember seeing the script somewhere (but unable to recall now), apart from PWG (5-0078) that has the same string image

that got carried in to pwk (4223-2). image

And both PWG.txt and pw.txt have this "UNK" language string missing altogether.

Could someone identify this script/word?

Andhrabharati commented 1 year ago
  • extend the allowed values for n-attribute of 'is' element

  • new 'n' attribute for <bot> element

I thought it is alright to adopt the n-attribute process in is- and bot- elements as well (just like in ab- elements), to expand the in-line (or local) abbreviations of those types.

funderburkjim commented 1 year ago

zoo tag

Since you have a strong opinion on this, I'll add the 'zoo' element to the dtd.

iw tag

Thanks for explanation. Sounds reasonable.

UNK

This is acceptable. I'll change the dtd to accept this as a value for n attribute of lang element.

n-attribute for bot, is elements

This should be ok. The basicdisplay will need modification for the tooltip to display.

funderburkjim commented 1 year ago

compare metalines ab_0 v. ab_1

See results under 'compare1/readme.txt' at 'compare_hw step 2'. 31 of the 34 differences between the metalines in temp_pw_ab_0.txt and temp_pw_ab_1.txt are ab1 corrections to errors in ab0.

The other 3 (marked 'ab1 error?') should be corrected in temp_pw_ab_1.txt. Also, there is 1 misc. suggested correction. @Andhrabharati Request you to make these 4 corrections in temp_pw_ab_1. Agree?
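A comparison of this kind can be sketched as below. It is not the compare_hw step from compare1/readme.txt, just an assumed minimal version that keys each metaline by its `<L>` number and reports the ones that differ or that exist in only one file.

```python
import re

L_NUMBER = re.compile(r"^<L>([^<]+)<pc>")   # captures the entry number of a metaline


def metalines(lines):
    """Map L-number -> metaline for every line that starts with <L>...<pc>."""
    found = {}
    for line in lines:
        m = L_NUMBER.match(line)
        if m:
            found[m.group(1)] = line.rstrip("\r\n")
    return found


def compare_metalines(old_lines, new_lines):
    """Return (changed, only_old, only_new) between two versions."""
    old, new = metalines(old_lines), metalines(new_lines)
    changed = [(k, old[k], new[k]) for k in old if k in new and old[k] != new[k]]
    only_old = [k for k in old if k not in new]
    only_new = [k for k in new if k not in old]
    return changed, only_old, only_new
```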

funderburkjim commented 1 year ago

text after <LEND>

in temp_pw_ab_1, 3027 instances with text following <LEND>. One is <LEND>〉 and the rest are like <LEND>•[Page1003-3].

@Andhrabharati Are these temporary markup?

Andhrabharati commented 1 year ago

The other 3 (marked 'ab1 error?') should be corrected in temp_pw_ab_1.txt. Also, there is 1 misc. suggested correction. @Andhrabharati Request you to make these 4 corrections in temp_pw_ab_1. Agree?

ab1 errors ? (based on differences between metalines in ab0, ab1 versions.)

only ab0: <L>13353<pc>1158-1<k1>Akarika<k2>Akarika<e>100
only ab1: <L>13353<pc>1158-1<k1>AkAraka<k2>AkAraka<e>100
ab1 error? cf pwg, alphabetical order (Note: 'pw print error')

pwk (1158-1) has image I had corrected this as per scan, though I noticed the alpha. order error and the error in the HW [the cited text is having Akarika only , and so does the PWG entry]; I thought of changing it in the next round of Header proofing (that would take a longer time!). Now that you have raised the point, shall correct this now itself.

only ab0: <L>19684<pc>1240-3<k1>upanikzepa<k2>upanikzepa<e>100
only ab1: <L>19684<pc>1240-3<k1>upanikzepa<k2>upanikzepa<e>100ṇ
ab1 error? ṇ

Yes, this letter got here by error.

only ab0: <L>78979<pc>4253-2<k1>Barb<k2>°Barb<e>500
only ab1: <L>78979<pc>4253-2<k1>Barb<k2>*Barb<e>500
ab1 error?

This is to be taken as a print error. A skt. root cannot and does not have the contraction mark before it; it always occurs on its own. And note the '*' mark at the following variant root BarB. image

only ab0: <L>120161<pc>7058-2<k1>samaha<k2>samaha.<e>100
only ab1: <L>120161<pc>7058-2<k1>samaha<k2>sa\ma\ha\<e>100
ab1 correction. additionally, ab1 error?: (add comma) {#praSasta#}, {#saDana#}

Does this pwk (7058-1) snippet answer the point? image

So in summary, I need to correct only 2 places out of these 4.

Andhrabharati commented 1 year ago

One is <LEND>〉

This character got here by error.

and the rest are like <LEND>•[Page1003-3].

@Andhrabharati Are these temporary markup?

Yes, and you had accepted the <LEND> [Pagexxx] lines recently in GRA.

Now that the line-breaks around [Pagexxxx] lines are looked at, we can remove this • character throughout. [It will be reintroduced in the upcoming major change shortly!!]

Andhrabharati commented 1 year ago

Next, I will start posting the changes made and then this (first-part of) Fresh-look issue can be closed, as it is growing longer.

[I could not do this yesterday, having been engaged in some pressing chores.]

Andhrabharati commented 1 year ago

The IAST corrections matter could be continued in the parent issue (PW IAST corrections #419), as this <is> tag issue might be closed after my change notes are posted.

Andhrabharati commented 1 year ago

I have started with the simplest point, as mentioned at Space before punctuation marks (reg. PWG, pwk and pwkvn) #855 ,

and the counts now stand thus in my version of pwk--

image

Notes.

  1. All the 7 places reg. full-stop are within the Devanagari slp1 strings denoting the danda (6) or double danda (1).
  2. There are 7 ';;' places now that show the running remarks in the text line, which would get removed (after AB's revision).
Andhrabharati commented 1 year ago

Next point is separating out the <ab- and <is- elements from within italic strings.

Any person closely looking at the print pages, can notice that

With this background, the separation of various text strings from italics has been carried out.

The abbr. counts now stand thus- image

And here is a summary of the global abbr.s that occur in italics & outside- image

and the local abbr.s that occur in italics & outside- image

Finally, the total occurrences apart, the unique abbr. counts are thus- image

Andhrabharati commented 1 year ago

The <is> details in a similar manner cannot be posted yet, as some work is yet to be taken up, as mentioned above. [Quite a few <is> unique strings as listed to be in mw (by Jim earlier) might get corrected; a reduction of over 1000 is estimated.]

Andhrabharati commented 1 year ago

One interesting point noticed is that at some places, the abbr.(s) in print pages are present in expanded form in the text file, most probably done by @maltenth (or who else could it be?) while applying his markups on the typed text [it is highly doubtful that the typists at India would have done this expansion].

Also seen that at many places the marked italic strings are not so in the print; and at far more places the italic strings of the print are present in normal face in the typed text.

There is no way except a full reading wrt the print to "correct" these points completely, I suppose.

funderburkjim commented 1 year ago

ab1: <L>120161<pc>7058-2<k1>samaha<k2>sa\ma\ha\<e>100

I agree that the print has a comma. But this comma is missing in temp_pw_ab_1:

{#sa\ma\ha\#}¦ <lex>Adv.</lex> {%irgendwie, so oder so%}. Nach <ls>SĀY.</ls> <lex>Adj.</lex> 
(= {#praSasta#} {#saDana#} <ab>u. s. w.</ab>) im <ab>Voc.</ab>
               COMMA MISSING

Doesn't that comma need to be added to pw_1 ?

Accept your point on Barb. Will start a print change file for this and perhaps other future print changes that arise.

Andhrabharati commented 1 year ago

But this comma is missing in temp_pw_ab_1

Sorry, I was looking at my current file that has undergone more changes; it has the comma here.

funderburkjim commented 1 year ago

<LEND>[Pagexxx] accepted in GRA

No, this is a case where you did something in GRA that I was not aware of. If I had noticed it, I would have complained.

The <LEND> line is important since it marks the end of an 'entry' which starts with the metaline. When a (python) program processes an xxx.txt file, it must identify this end-of-entry line. There are (at least) two possible ways to make this identification

  1. line == "<LEND>"
  2. line.startswith("<LEND>")

(1) would NOT recognize <LEND>[Pagexxx]. (2) would recognize both <LEND>[Pagexxx] and <LEND>

Although I have thought of (1) as the default, I have (AFAIK) used (2) in all existing code, and (2) doesn't care if there is additional information -- in particular, programs work with <LEND>[Pagexxx].

I still have a fondness for

<LEND>
[Pagexxx]

<L>....

Conclusion: I DO accept <LEND>[Pagexxx] if it is important for your analysis. And I will continue to use method (2) to recognize the end of entry.
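The difference between the two tests can be seen in a short sketch. The reader below is an assumed minimal version, not the existing CDSL code; it uses method (2) to close an entry and then reports which entries would have been missed by method (1).

```python
def read_entries(lines):
    """Group lines into entries: from a metaline (<L>...) to the <LEND> line.
    End of entry is recognized with startswith, i.e. method (2)."""
    entries = []
    current = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("<L>"):
            current = [line]
        elif current is not None:
            current.append(line)
            if line.startswith("<LEND>"):
                entries.append(current)
                current = None
    return entries


def trailing_after_lend(entries):
    """Text following <LEND> on the closing line; these are exactly the
    entries that the strict test line == "<LEND>" would not recognize."""
    return [e[-1][len("<LEND>"):] for e in entries if e[-1] != "<LEND>"]
```

A count of `trailing_after_lend(read_entries(lines))` gives the kind of figure reported earlier for text following <LEND>.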

Andhrabharati commented 1 year ago

I think we can get rid of that [Pagexxxx] after <LEND> altogether, as the page change-over would anyway be reflected in the next meta-line's <pc> value. In effect, this is a repetition of information.

And as you had mentioned elsewhere, this <pc> or [Page....] info is not used anywhere except to link to the resp. page to display when clicked on it.

My thinking is that this [Pagexxxx] need/should not be on a separate line.

funderburkjim commented 1 year ago

this (first-part of) Fresh-look issue can be closed, as it is growing longer

Agree.

the IAST corrections matter could be continued in the parent issue #419

Prefer you to start a new issue here in PWK repository when you're ready.

Andhrabharati commented 1 year ago

Prefer you to start a new issue here in PWK repository when you're ready.

In such a case, the referred parent issue can be closed. No need to keep it open until a new issue is opened for the <is elements.

I see many issues still remain open in various repos, though their purpose is served.

funderburkjim commented 1 year ago

the abbr.(s) in print pages are present in expanded form in the text file,

Please post the examples you have noticed. We can ask @maltenth if he recalls some reason. Or, we may find a pattern. I can also check against an early version of pw.txt (in case I introduced these expansions ! )

at many places the marked italic strings are not so in the print; and at far more places the italic strings of the print are present in normal face in the typed text.

Again, post some examples if they are at hand.

I have wondered about the significance of italic/non-italic text in PW. Maybe if this distinction were conceptually clear, we could find some way to identify (and correct) many of these mistakes in pw.txt.

Andhrabharati commented 1 year ago

I have wondered about the significance of italic/non-italic text in PW.

The italics mostly denote the meaning/explanation portions in German language, as I could see.

gasyoun commented 1 year ago

The italics mostly denote the meaning/explanation portions in German language, as I could see.

Wonder if the preface gives a clue, if reread.

gasyoun commented 1 year ago

I vaguely remember seeing the script somewhere (but unable to recall now), apart from PWG (5-0078) that has the same string

https://ru.wikipedia.org/wiki/%D0%91%D0%B0%D0%B3%D0%B0%D1%82%D1%83%D1%80

монг. baγatur (ᠪᠠᠭᠠᠲᠦᠷ )

ᠪᠠᠭᠠᠲᠦᠷ

Andhrabharati commented 1 year ago

монг. baγatur (ᠪᠠᠭᠠᠲᠦᠷ )

@gasyoun

I could see the letter y in between and the letter t is not matching the character in the PWG print; so the word appears to be the (Mongolian) baga(?)yur.

Can your (Mongolian) friend tell why the PWG has the (Mongolian) lettering upside-down and then left-to-right (or in other words, rotated by 180 degrees)?