sanskrit-lexicon / PWK

Sanskrit-Wörterbuch in kürzerer Fassung, 7 Bände Petersburg 1879-1889

pwkvn revisions, based on AB version(s) #103

Closed funderburkjim closed 4 months ago

funderburkjim commented 7 months ago

Adapt @Andhrabharati revisions of pwkvn 'dictionary' to cdsl.

Here is a devanagari coding for AB to start with (done by Jim, since the pw_transcode program of issue95/pwtranscode does not work now on AB's local system)

pwkvn_AB_v.1_deva.zip

Let's continue discussion of pwkvn in this issue.

Andhrabharati commented 7 months ago

@funderburkjim

Just opened the devanagari file posted by you above.

This appears to have been converted with MW style of accents; pl. regenerate the file with PWG style accents.

Andhrabharati commented 7 months ago

Just tried the conversion again at my end, and it worked without any hassle!

Noticed that I was using the MW style option "slp1 deva" earlier instead of the PWG option "slp1 deva1";

PS C:\pw-transcode> python pw_transcode.py slp1 deva1 .\pw_v2.txt .\pw_v2_deva.txt

[Worked giving the output file.]

and the MW option still has the same problem that I had reported earlier.

PS C:\pw-transcode> python pw_transcode.py slp1 deva .\pw_v2.txt .\pw_v2_deva.txt
Traceback (most recent call last):
  File "C:\pw-transcode\pw_transcode.py", line 149, in <module>
    lineout = convert_metaline(line,tranin,tranout)
  File "C:\pw-transcode\pw_transcode.py", line 71, in convert_metaline
    k1a = transcode(k1,tranin,tranout)
  File "C:\pw-transcode\pw_transcode.py", line 56, in transcode
    y = transcoder.transcoder_processString(x,tranin,tranout)
  File "C:\pw-transcode\transcoder.py", line 257, in transcoder_processString
    transcoder_fsm(from1,to)
  File "C:\pw-transcode\transcoder.py", line 74, in transcoder_fsm
    tree = ET.parse(filein)
  File "C:\Program Files\Python39\lib\xml\etree\ElementTree.py", line 1224, in parse
    tree.parse(source, parser)
  File "C:\Program Files\Python39\lib\xml\etree\ElementTree.py", line 580, in parse
    self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

[Stopped with error report]
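A ParseError at line 1, column 0 means the XML parser rejected the very first bytes of the file it was handed, so looking at those bytes may help narrow the problem down. The following is only a diagnostic sketch and not part of pw_transcode; the function name `diagnose_xml` is hypothetical, and it should be pointed at whichever XML file the transcoder loads for the slp1/deva pair.

```python
import xml.etree.ElementTree as ET


def diagnose_xml(path):
    """Return a short report on why ET.parse might reject `path`.

    Hypothetical helper: it only reads the file and changes nothing.
    """
    with open(path, "rb") as f:
        head = f.read(16)
    if not head:
        return "empty file"
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]  # skip a UTF-8 byte-order mark
    if not head.lstrip().startswith(b"<"):
        return "does not start with '<': first bytes are %r" % head
    try:
        ET.parse(path)
    except ET.ParseError as err:
        return "parse error: %s" % err
    return "ok"
```

A report of "empty file" or "does not start with '<'" would suggest that the file on disk is damaged or is not the XML the program expects, rather than a fault in the transcoding logic itself.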

Andhrabharati commented 7 months ago

Now, I have converted the pwkvn file also.

funderburkjim commented 7 months ago

From your comments, I conclude that, for the 'pw' dictionaries (pw, pwkvn, pwg)

Is this conclusion correct?

A second question is: Do you need me to run any conversion ?

Andhrabharati commented 7 months ago

Now that I have all 4 of PWG, PWGVN, pwk and pwkvn in devanagari at my end, you don't need to do anything about this.

Looks like the MW type conversion doesn't work at my end now; but that anyway is not at all required for me. And it is surprising that it worked for me initially (when I gave you feedback that it's not what I wanted).

funderburkjim commented 7 months ago

Ok. Good. I am puzzled that some of the conversions work for you and some don't.
Since you have what you need at the moment, we can leave the non-working conversions as a mystery, at least for now.

funderburkjim commented 7 months ago

This issue is no longer relevant; it is superseded by #102. Closing.

Andhrabharati commented 7 months ago

I think you're mistaken, @funderburkjim !!

The issues 102 and 103 are about two different portions of pwk, the main text (pw) and the VN part (pwkvn) respectively.

Or did you feel like not 'acting' on this pwkvn part?

funderburkjim commented 7 months ago

reopening

funderburkjim commented 7 months ago

You're right -- my mistake. Have not given up on pwkvn revisions.

funderburkjim commented 6 months ago

@Andhrabharati I'm ready to start work on installing your revisions to pwkvn. Have you already posted your revised version?

Andhrabharati commented 6 months ago

You may recall that you have opened this issue with what you had "transcoded" using MW-style accents, of what I had posted 25 days back (in the 102 issue)!!

funderburkjim commented 6 months ago

@Andhrabharati I see pwkvn_AB.v.1.zip file at this comment in 102.

Please confirm this is your latest form that I should start from.

Andhrabharati commented 6 months ago

Yes, that's the latest file from my side. [I haven't taken up the 'proofing-work' yet.]

funderburkjim commented 6 months ago

temp_pwkvn_ab_0b.zip

In comparing AB version with cdsl version, I am comfortable taking the AB version as is WITH ONE EXCEPTION:

The cdsl version uses a special form <althws>X</althws> to identify multiple (alternate) headwords. There are 1553 of these. The temp_pwkvn_ab_0b.txt file above is the same as pwkvn_AB.v.1.txt, except that these althws lines have been re-inserted. This form is used to allow access to entries under the alternate spelling(s). For example:

image

Request @Andhrabharati to accept these (of course they may be subject to correction).

Agree?

Andhrabharati commented 6 months ago

I would like to differ on this, @funderburkjim !!

I request you to continue the process adopted in GRA for 'handling' the multiple HWs, across all CDSL works, namely-- to add this addl. line in the xml file, and not in the txt file (which however would have the comma-separated words in the k2-field).

Also it is to remind you that out of ~1500 such lines, nearly ~1000 might need to be removed being just the indexed words in the Vol.7 (of the VN words of Vol.s 1-6); I would again like to reiterate that most of these are not alternate-HWs, but just the adjacent HW entries in those pages and have no 'commonality' like being with variant spellings (or accents) of a word, or having common body-portion etc. to term them as alt. HWs. [I had raised this point earlier also; but did not pursue to the end!!]

And then, the main pw.txt is having couple of thousands of multiple words in the header-line after the meta-line, from which the entries are to be populated into the meta-line's k2-field (as above).

Finally, I am quite surprised (and happy) that you are considering to accept my file AS-IS (thus saving much of your time).

And here is my revised file, with a few (quick) minor changes-- pwkvn_AB v.2.zip

funderburkjim commented 6 months ago

I request you to continue https://github.com/sanskrit-lexicon/GRA/issues/32#issuecomment-1612342970

OK - I've reviewed what was done in Grassman and believe the same technique can be applied to pwkvn for the 'extra' headwords.

Work is done in https://github.com/sanskrit-lexicon/PWK/tree/master/pwkissues/issue103.

There was a minor error (</ab>act.</ab>) in version 2. Here is the revision.

temp_pwkvn_ab_2a.zip

I've confirmed that the displays can be generated from this version 2a.

I could install 2a as the cdsl version now, but suggest you first make version 3 with necessary revisions to 'k2' field of metaline(s) (following the pattern of GRA). Then I'll regenerate the 'alternate headwords' for pwkvn, and install to Cologne.

@Andhrabharati Agree?

Andhrabharati commented 6 months ago

Glad to see you agree for adopting the GRA style (handling of multiple HWs) for other works as well.

Though I myself could do what you suggested further (to populate the k2-field), I have another opinion.

I would rather suggest you to take a 'programmatic' approach for making the comma-separated list to populate the k2-field of metaline from the header portion of the header-line.

This is not a single operation to be done in just pwkvn, but would be applicable to all other cdsl works in the coming days. So, better avoid manual working!!!

What do you say, @funderburkjim ?

Andhrabharati commented 6 months ago

On a 2nd thought, felt quite a few places might better be done manually. [These would've to be cross-checked again, if done programmatically.]

So, started the work at my end for pwkvn, as asked by Jim.

Andhrabharati commented 6 months ago

  1. The L- numbers 1, 1769, 3238, 4954, 5974, 8175, 9408, 9409 & 22069 [total 9 no.s] are missing, wherever a preceding <H> line occurs in the data.
  2. There are two places (lines 32092 & 80993) where a space occurs at the line-beginning.
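
Both checks in the list above can be automated. The sketch below assumes that metalines begin with `<L>n<pc>` and that L-numbers are meant to be consecutive integers starting at 1; the function name `check_lines` is hypothetical, and L-numbers with a decimal part (if any exist in the file) are simply skipped by the pattern.

```python
import re

# Assumption: a metaline starts with <L>, an integer, then <pc>.
META_RE = re.compile(r"^<L>(\d+)<pc>")


def check_lines(lines):
    """Return (missing_L_numbers, line_numbers_starting_with_a_space)."""
    seen = set()
    leading_space = []
    for lineno, line in enumerate(lines, start=1):
        if line.startswith(" "):
            leading_space.append(lineno)
        m = META_RE.match(line)
        if m:
            seen.add(int(m.group(1)))
    # Every integer from 1 up to the largest L seen is expected to occur.
    missing = [n for n in range(1, max(seen, default=0) + 1) if n not in seen]
    return missing, leading_space
```

To run it on a file, something like `check_lines(open(path, encoding="utf-8").read().splitlines())` should do, with `path` pointing at the text file under review.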

Andhrabharati commented 6 months ago

Here is the file with <k2n> field (with comma-separated entities) added after the <k2> field of the metalines-- pwkvn_AB v.3.zip

Only issue I foresee with this is that the homonym data got 'merged' into the k2-element(s), which might pose a minor hurdle in correlating with the pwk (main) data.

I would however suggest Jim to try out the programmatic approach also once, so that the manual work may be compared with it to see how many differences might come out. [This would be deciding the future course of action in the other cdsl works.]

Andhrabharati commented 6 months ago

@maltenth / @fxru / @gasyoun ,

Any idea what the ° before the HW entry means, where it definitely isn't a 'filling' character (from a preceding entry)? [I could see the * character being mentioned in the Vorwort, but nothing regarding the °.]

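This is the first entry having * in the pwkvn portion (pwk 1-282): image https://github.com/sanskrit-lexicon/PWK/assets/75209130/92dc2bda-aec6-4f5b-806b-d30fffdeabbc

This is the first entry where ° isn't a 'filling' character in the pwkvn portion (pwk 1-285): image https://github.com/sanskrit-lexicon/PWK/assets/75209130/59bb3d0e-7402-4adc-bd3e-1186bcc92c82

And, here is the first entry where ° is a 'filling' character in the pwkvn portion (pwk 1-282): image https://github.com/sanskrit-lexicon/PWK/assets/75209130/28ff55b2-793b-41c0-a8cf-0959f02aec96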

Andhrabharati commented 6 months ago

Did few more generic corrections-- pwkvn_AB v.3.zip

@funderburkjim pl. use this as my final version on pwkvn.

funderburkjim commented 6 months ago

algorithm for k2

I'll work on generating pwkvn_hwextra.txt file from k2, based on v.3.

Note: I count 1492 such k2, which is close to the 1553 althws !

Interesting idea to develop a programmatic generation of the 'k2' field based on the broken-bar line.
While your work in v3 is fresh in mind, why don't you summarize what you've learned in an informal 'word algorithm' --- that will help me get started with developing a Python program .
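
As a possible starting point for such a program, here is a minimal sketch. It assumes that headwords on a header line are wrapped in `{#...#}`, that an optional `*` sits directly before the opening brace, and that the header portion ends at the broken bar `¦`, as in the entry lines quoted elsewhere in this thread. The function name `k2_from_header_line` is hypothetical, and the sketch does not expand `°` abbreviations or resolve `[H]` variants, which would still need manual review.

```python
import re

# Assumption: each headword is {#...#}, optionally preceded by '*'.
HW_RE = re.compile(r"(\*?)\{#(.*?)#\}")


def k2_from_header_line(line):
    """Build a comma-separated headword list from the part before '¦'."""
    header, sep, _body = line.partition("¦")
    if not sep:
        return ""  # no broken bar, so not treated as a header line
    return ", ".join(star + hw for star, hw in HW_RE.findall(header))
```

For the line `{#acirAMSu#} und {#acirABA#}¦ I. 1.` this gives `acirAMSu, acirABA`. Comparing such output with the manually prepared k2 values would show how many entries really need hand work.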

maltenth commented 6 months ago

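From the introduction: "Die im pw ganz fehlenden Wörter resp. Bedeutungen oder ein solches Genus habe ich wieder wie bisher mit ° bezeichnet, während * besagt, daß das Betreffende daselbst noch nicht belegt ist." In English: "As before, I have marked with ° the words or meanings, or a given gender, that are entirely missing in pw, while * indicates that the item in question is not yet attested there."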

° = words or meanings missing in pw.

Andhrabharati commented 6 months ago

@maltenth

I think, the above is applicable just to SCH, not to pw(k).

On a 2nd look, I find that many of the °XX places denote suffixes (or terminations) in the pw main portion, so it is very improbable to say that pw itself treats this character to indicate "what is not in pw"!! [Its presence is not confined to pwkvn portion alone.]

Of course, there are many cases of prefixes (YY°) too, wherein quite a few can be considered as 'filling' characters!

maltenth commented 6 months ago

my bad. I will have another try later.

Andhrabharati commented 6 months ago

Note: I count 1492 such k2, which is close to the 1553 althws !

This prompted me to look back at my file again, and found that I had missed some cases having [H], namely L-2180, 11138, 11139, 18327, 19607 (that were marked earlier by Jim) and L-9059 (missed by both Jim & AB).

And L-203 had no comma or other text to get detected in my previous working. So, I thought I should better change the entry to have 'und', as (809) {#acirAMSu#} <ls>ŚIŚ. 6,71</ls> und {#acirABA#} <ls>KIR. 4,24</ls>.¦

This corresponds to the pw main entry L-1143 {#aciradyuti#}, {#°praBA#}, {#°BAs#}, *{#°rocis#}, *{#acirAMzu#} und *{#acirABA#}¦ <lex>f.</lex> {%Blitz (von kurzem Lichte)%}.

and the pwkvn entry L-9895 {#acirAMSu#} und {#acirABA#}¦ I. 1.

The corresponding change done in the entry of L-9059 is (36230): {#nizkA/vam#} ({#ni[H]zkA/vam#})¦, lies {%zerzausend, zerstückelnd, zerreissend%} und <ab>vgl.</ab> {#sku#}.

The revised meta-lines stand as

<L>203<pc>1-283-c<k1>acirAMSu<k2>acirAMSu <k2n>acirAMSu, acirABA
<L>2180<pc>2-289-a<k1>anizwubDa<k2>anizwubDa <k2n>anizwubDa, aniHzwubDa
<L>9059<pc>6-303-a<k1>nizkAvam<k2>nizkA/vam <k2n>nizkA/vam, niHzkA/vam
<L>11138<pc>7-299-b<k1>anizwubDa<k2>anizwubDa <k2n>anizwubDa, aniHzwubDa
<L>11139<pc>7-299-b<k1>anizWitASa<k2>anizWitASa <k2n>anizWitASa, aniHzWitASa
<L>18327<pc>7-349-c<k1>dOzWulya<k2>dOzWulya <k2n>*dOzWulya, *dOHzWulya
<L>19607<pc>7-360-c<k1>pratinisfjya<k2>pratinisfjya <k2n>*pratinisfjya, *pratiniHsfjya

Now the AB count is 1500, with 30 extra L-s [813, 838, 1295, 2213, 2215, 2652, 3221, 3955, 6433, 7341, 9059, 9267, 9693, 10880, 11394, 11969, 15415, 16887, 17362, 17562, 17696, 19167, 19915, 20504, 20698, 20721, 21352, 21506, 21947, 22563] and 83 leftover L-s [488, 882, 1539, 1629, 1683, 2008, 2064, 2508, 2883, 2885, 2893, 2924, 2977, 3129, 3138, 3176, 3280, 3365, 3446, 3511, 3538, 3542, 3551, 3615, 3813, 3883, 4034, 4094, 4098, 4345, 4385, 4595, 4826, 4839, 4850, 4927, 4946, 5504, 5828, 5888, 6102, 6244, 6251, 6255, 6270, 6376, 6441, 6691, 6746, 6747, 6759, 6814, 7052, 7110, 7164, 7168, 7178, 7287, 7402, 7438, 7594, 7713, 7717, 7950, 8156, 8347, 8376, 8437, 8508, 8626, 8675, 8757, 8776, 8844, 8971, 9027, 9139, 9321, 9381, 9387, 9402, 10569, 16901] in comparison with Jim's earlier work.
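
The figures above are consistent with each other if both lists are counted per L-number: 1553 - 83 + 30 = 1500. A comparison of this kind comes down to two set differences, as in the sketch below. The function name `compare_l_numbers` is hypothetical; "extra" means an L-number found only in the AB list, and "leftover" means one found only in the earlier cdsl list.

```python
def compare_l_numbers(ab, cdsl):
    """Return (extra, leftover): L-numbers only in `ab`, and only in `cdsl`."""
    ab, cdsl = set(ab), set(cdsl)
    return sorted(ab - cdsl), sorted(cdsl - ab)


# Consistency check on the counts reported in this thread.
assert 1553 - 83 + 30 == 1500
```

Feeding it the L-numbers of the metalines with multiple k2 entries on one side and the L-numbers of the althws lines on the other should reproduce the two lists given above.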

Noted that some of the leftover L-s do contain HWs inside the body portion, but they would count as derived words [and there are many such words (couple of thousands!?) in the main pw portion, which I had kept on hold for handling some other time in future] and some others that definitely cannot be treated as entries, being corrections etc.

Andhrabharati commented 6 months ago

why don't you summarize what you've learned in an informal 'word algorithm' --- that will help me get started with developing a Python program .

I would rather suggest that you should follow whatever process you had undertaken to mark these as althws initially.

Andhrabharati commented 6 months ago

BTW, I have put the comma-separated entries in a new field k2n, for my convenience.

The k2-content should be replaced with the k2n-content, finally.

funderburkjim commented 6 months ago

AB.v3 wrong format

@Andhrabharati Started work on generating pwkvn_hwextra.txt from your v3. I was expecting the format that we agreed upon with gra.

GRA sample
<L>10<pc>0003<k1>aMsya<k2>a/Msya, a/Msia
<L>25<pc>0004<k1>akutra<k2>a-ku/tra, a-ku/trA
<L>26<pc>0004<k1>akuDryac<k2>a-kuDrya^c, akuDri/ac
<L>51<pc>0006<k1>akzi<k2>a/kzi, akzi/

<L>1078<pc>0116<k1>arva<k2>2. arva, arvan, arvaRa
etc.

But pwkvn-v3 has a different format.

PWKVN
<L>2<pc>1-282-a<k1>a<k2>a   <k2n>2. a°
<L>3<pc>1-282-a<k1>aMSa<k2>aMSa <k2n>aMSa
<L>115<pc>1-283-a<k1>agfhapati<k2>agfhapati <k2n>*agfhapati, *agfhapatika

Request you to provide a version v3a in the gra format that we agreed upon, without the <k2n>.
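
For reference, a minimal sketch of how a GRA-style k2 field could be read back into separate headwords. It assumes that k2 is the last field of the metaline, that alternates are separated by commas, and that a leading homonym number such as `2. ` should be dropped; the function name `k2_headwords` is hypothetical and this is not the actual hwextra generator.

```python
import re

# Assumption: fields appear in the order L, pc, k1, k2, with k2 running to end of line.
META_FIELDS_RE = re.compile(
    r"^<L>(?P<L>[^<]*)<pc>(?P<pc>[^<]*)<k1>(?P<k1>[^<]*)<k2>(?P<k2>.*)$"
)
HOMONYM_RE = re.compile(r"^\d+\.\s*")


def k2_headwords(metaline):
    """Split the k2 field of a metaline into a list of headwords."""
    m = META_FIELDS_RE.match(metaline)
    if m is None:
        raise ValueError("not a metaline: %r" % metaline)
    parts = [p.strip() for p in m.group("k2").split(",")]
    # Drop empty pieces and strip a leading homonym number from each part.
    return [HOMONYM_RE.sub("", p) for p in parts if p]
```

With the GRA sample line for L 1078 this returns `arva`, `arvan` and `arvaRa`; every item after the first would be a candidate extra headword.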

Andhrabharati commented 6 months ago

@funderburkjim

You can do the simple global replacement <k2>(.*?)\t<k2n> -> <k2>, to get the GRA format. Pl. see my above post reg. the same.
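
In Python, the same replacement could look like the sketch below. It assumes the k2 and k2n fields are separated by a tab in the file (the samples in this thread show it as spaces); the function name `k2n_to_k2` is hypothetical.

```python
import re

# Same pattern as the global replacement: old k2 value, a tab, then the <k2n> tag.
K2N_RE = re.compile(r"<k2>(.*?)\t<k2n>")


def k2n_to_k2(metaline):
    """Drop the old k2 value and promote the k2n value to k2."""
    return K2N_RE.sub("<k2>", metaline, count=1)
```

Lines without a `<k2n>` field pass through unchanged, so the function can be applied to every line of the file.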

Anyway, here is the file, with my corrections today (as mentioned above)-- pwkvn_AB v.3a.zip

funderburkjim commented 6 months ago

Got v3a . Thanks.

funderburkjim commented 6 months ago

ab_v3a now installed as cdsl version of pwkvn. pwkvn_hwextra.txt recomputed. This done in csl-orig/v02/pwkvn/althws

This issue now closeable.

funderburkjim commented 6 months ago

pwkvn also added to the hwnorm1 list, and hwnorm1, csl-apidev revised accordingly.

maltenth commented 6 months ago

@Andhrabharati

the circle °xx works the same way in sch and pwk-vn: it indicates that xx is to be added to the preceding string (or as you like to term it, "filling"). When there is no preceding string of characters to attach °xx to, it always means "new entry" or "missing entry".

funderburkjim commented 4 months ago

@Andhrabharati I think this issue can be closed. Do you agree?

Andhrabharati commented 4 months ago

I guess, @maltenth wanted this issue to be reopened for some reason.

From my side, I just wish that you "append" the pwkvn portion to pwk main text; as otherwise it remains inaccessible to many people in general who look at cdsl pwk.