MD subheadwords - GitHub issues

funderburkjim commented 10 months ago

Objective: make an 'mw-style' version of md.

funderburkjim commented 10 months ago

md_1b_subhw.txt is a proposed intermediate form for sub-headwords. @Andhrabharati please take a look!

funderburkjim commented 10 months ago

readme_md_1b_subhw.txt provides some explanation of the conventions in md_1b_subhw.txt.

Note - I am aware that this is incomplete in various ways. From this form (when corrected), I think a revised 'mdnew.txt' can be made.

The current objective, as I see it, is to 'correct' this file manually.

gasyoun commented 10 months ago

@Andhrabharati with MD, the subheadwords become separate entries (sort of as in MW), with up to 8000 headwords; about 10% have been verified manually by Jim, and the rest would eventually take him 4 more weeks. So I hope that in a few days it will be clear what is missing for you to take the job over from him, thanks.

Andhrabharati commented 10 months ago

md_1b_subhw.txt is a proposed intermediate form for sub-headwords. @Andhrabharati please take a look!

@funderburkjim

I have spent some time looking into your file today.

I must express my feeling openly that you had spent much of your time (over two weeks, as mentioned by @gasyoun) for a wrong purpose; the reason being, you had erred at many places and also at times missed 'capturing' the intent of the author.

Speaking of the author's intent, I have noticed that MD has clearly indicated the purpose of using italics, which surely applies to Boethlingk as well, MD having closely followed BR's lexicons (in theme and style). [You may recall that we were pondering the significance of italics in pwk sometime back, a question that remains unanswered so far.]

Andhrabharati commented 10 months ago

Without going into much detail, I thought I should at least give an example entry to compare Jim's work (derived by AB) with what it should've been-- L-12291 Jim.txt L-12291 AB.txt

And if a basic text in the above (AB) manner is made, then the rest of the work could easily be done by Jim (programmatically). [I presume that the Scharf-Sandhi program (does it handle the svara?) is without any errors; I haven't checked it (for I do not need other's code for such operations!)]

funderburkjim commented 10 months ago

Is 'svara' simple-vowel-sandhi, e.g. 'a+a' -> 'A'?

Andhrabharati commented 10 months ago

No, I am talking about the places where accents (as at pra-śaṃsā́ + ālāpa) are involved.

funderburkjim commented 10 months ago

scharfsandhi doesn't handle accents.

Also not ‿.

There may also be instances where the parts to combine are not handled as expected by scharfsandhi.

My conclusion is that for the task of joining the parts (to get k1 from k2), I should write a separate module (perhaps making use of what you refer to in your comment "I do not need other's code for such operations!").

funderburkjim commented 10 months ago

comparison of subhw form of AB and Jim

L-12291-jim-corr.txt. My correction of my version for L=12291.

compare_H.txt comparison of AB version of L=12291 to Jim's correction.

From this comparison, there are the same number of subheadwords.

There is only 1 difference in H: (ab) <H2>pra-śamaṃ-kara != (jim) <H3>praśamaṃ-kara. So AB and Jim have a close agreement on what is H2 and what is H3.
( There also may be H4 - e.g. L=12991 at {@-āhāra,@} ) Thus, AB and Jim have essentially identical conception of H2 and H3.

AB's form has no place for the identification of what is or is not a headword represented in MW, PW, etc. But this is not critical -- such analysis can be made after the sub-headwords are identified and parsed.

Similarly, the 'pfx + sfx' part of Jim's form is absent in AB's form. The 'sfx' is given from the {@-sfx@}, and so the 'pfx' could be deduced from AB form -- But this is probably not needed anyway, and was present for internal use by Jim's work.
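A minimal sketch of that deduction, using the kiṃ-nara entry (L=5901) quoted later in this thread. The function names are hypothetical, and the rule "prefix = everything before the last hyphen of the headword" is an assumption made for illustration, not necessarily the rule used to build md_1b_subhw.txt.

```python
import re

# Matches one bold sub-headword group such as {@-nāmaka,@} and captures
# the text after the leading hyphen (a trailing comma is dropped).
SUBHW = re.compile(r"\{@-([^@]*?),?@\}")


def prefix_of(headword):
    """Assumed rule: the prefix is everything before the last hyphen."""
    return headword.rsplit("-", 1)[0]


def expand(headword, line):
    """Return 'pfx + sfx -> pfx-sfx' strings for each {@-sfx@} in line."""
    pfx = prefix_of(headword)
    return [f"{pfx} + {sfx} -> {pfx}-{sfx}" for sfx in SUBHW.findall(line)]


print(expand("kiṃ-nara", "<H2> {@-nāmadheya,@} "))
```

Groups that do not start with a hyphen, such as {@ikā),@}, are ignored by this pattern, and no sandhi is applied when joining.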

AB version uses one line (with a tab) for each subhw. While Jim's version uses two lines for each subhw. This difference is non-material.

Conclusion: AB's form and Jim's form are functionally equivalent.

funderburkjim commented 10 months ago

@Andhrabharati Will you undertake the task of completing the subhw markup according to your form?

If so, you may find that you can start with my md_1b_subhw.txt, but discard the ;; subhw ... lines. The result differs from the current csl-orig version in these details:

- several corrections (see change_notes_0b.txt)
- metalines have two revisions:
  - k2 (slp1) inferred from the iast form after broken-bar
  - <e> S or V or V1 or X used to determine the substantive entries where {@-X@} indicates a subhw.

If you decide to start with your md_AB_V2.txt, you may need to take into account change_notes_0b.txt. Also there is question of the prefixed-verb forms. It might be easier for me to follow your work if you separated the task into two parts: substantive subhws first and then verbal subhws.

Let me know how you plan to proceed.

Andhrabharati commented 10 months ago

@Andhrabharati Will you undertake the task of completing the subhw markup according to your form?

This was the intention, when I had asked about the MD task earlier!!

Now, I am just contemplating whether to delegate the task to @AnnaRybakovaT (not at all doubting her capacity to understand and do things; she has indeed been doing good jobs) or do it myself (looking at the complexity and the time-factor involved; I presume, I am unbeatable in quicker working).

( There also may be H4 - e.g. L=12991 at {@-āhāra,@} )

Yes, the H4 entries would also be marked, wherever seen.

AB's form has no place for the identification of what is or is not a headword represented in MW, PW, etc.

I see no practical value in marking the presence or absence of entries in a work wrt to some other work; hence having no interest in this part.

Similarly, the 'pfx + sfx' part of Jim's form is absent in AB's form.

I haven't given my full file idea, which would be slightly beyond what Jim has proposed.

Pl. have a look at my revised L-12291 file, L-12291 AB (revised).txt [I remember mentioning (in MW repo, sometime earlier), in my opinion, how this additional 'fillings' (absent or shortened in some manner, in the print) should be marked. In the present MW filling, one cannot identify/know whether the filling has happened at the beginning or at the end of the 'printed string'; which I would like to be clearly shown.]

k2 (slp1) inferred from the iast form after broken-bar

Would you be finally making the 'new' entries with the iast text before the broken-bar or after? and any plan to 'pad' the devanagari strings as well to these entries?

I have put the iast string before the bar (for now), as "iast header¦ body", taking that devanagari text would not be there,

If devanagari also is going to be 'padded', then the notation "deva header¦ body" would be appropriate (as in the rest of the text file).

If you decide to start with your md_AB_V2.txt, you may need to take into account change_notes_0b.txt.

Would be doing many more corrections as well(!!), see for example, wrt your

<L>5901<pc>068-2<k1>kiMnara
 {@-nāmaka (ikā),@} -> {@-nāmaka,@} {@(ikā),@}

[as in md_1b_subhw]

<L>5901<pc>068-2<k1>kiMnara<k2>kiM-nara<e>S
{#kiMnara#}¦ kiṃ-nara, <lex>m.</lex> {%fabulous being (half man 🞄half animal) in the service of Kubera; <ab>N.</ab> of 🞄various persons%}; 
;; subhw 1:Y:kiṃ + nāmaka -> kiṃ-nāmaka
<H2> {@-nāmaka,@} ({@ikā),@} 
;; subhw 2:Y:kiṃ + nāmadheya -> kiṃ-nāmadheya
<H2> {@-nāmadheya,@} 
;; subhw 3:N:kiṃ + ta -> kiṃ-ta
<H2> {@-ta,@} 🞄<lex>a.</lex> occasioned by what? 
;; subhw 4:N:kiṃ + *m -> kiṃ-*m
<H2> {@-m,@} why?
<LEND>

[as in print, with missed matter in typing]

<L>5901<pc>068-2<k1>kiMnara<k2>kiMnara
{#kiMnara#}¦ kiṃ-nara, <lex>m.</lex> {%fabulous being (half man 🞄half animal) in the service of Kubera; <ab>N.</ab> of 🞄various persons%};
<div/>{@-nāmaka (ikā), -nāmadheya, -nāman,@} <lex>a.</lex> having what name?
<div/>{@-nimitta@} <lex>a.</lex> occassioned by what?
<div/>{@-m,@} why?
<LEND>
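The metalines in the two samples above differ only in their fields (the md_1b_subhw form adds a hyphenated k2 and an `<e>` field). A small sketch of reading such a line into a dictionary follows; `parse_metaline` is a hypothetical helper, and it assumes that field values never contain a `<` character.

```python
import re

# One field is a tag in angle brackets followed by its value,
# which runs up to the next '<' or the end of the line.
FIELD = re.compile(r"<(\w+)>([^<]*)")


def parse_metaline(line):
    """Split a metaline such as <L>..<pc>..<k1>.. into a tag -> value dict."""
    return {tag: value for tag, value in FIELD.findall(line)}


meta = parse_metaline("<L>5901<pc>068-2<k1>kiMnara<k2>kiM-nara<e>S")
print(meta["k1"], meta["k2"], meta.get("e"))
```

Using `meta.get("e")` keeps the same code usable for metalines that have no `<e>` field.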

<e> S or V or V1 or X used to determine the substantive entries where {@-X@} indicates a subhw.

Isn't MW having the <e>-field used for <Hn>-category of the entry? Using the same field elsewhere for a different purpose would be conflicting [in name of 'Consistency of style']!!

I did not fully understand the X (non-substantive, non-verbal) type; probably it could (or might have to) be further divided into some 'meaningful' types.

Also there is question of the prefixed-verb forms. It might be easier for me to follow your work if you separated the task into two parts: substantive subhws first and then verbal subhws.

I see that the dhAtus are clearly shown with all-CAPs (iast) in the print; so there is no need for explicit markup further. Probably, we can think of adding the √ to those strings (as done in my recent works, and your acceptance thereof).

And yes, doing the work in two separate sessions is a good idea.

Let me know how you plan to proceed.

I would like to

mark and retain the 'grouped entries' as alt. HWs as is in the text file (with k2-field having the comma-separated entries), and split them in xml file as done in GRA, pwkvn
mark the 𝑃. (Purāṇa) occurrences
mark and retain the 'grouped entries' even at the sub-HW level [see the L-5901 example above]
⋯ ⋯ ⋯

Andhrabharati commented 10 months ago

Just by a cursory browsing, noticed that MD also requires too many markup corrections, as in AP90 that I had mentioned long back.

Andhrabharati commented 10 months ago

Gone through the change_notes_0b.txt, and seen that 3 corrections made there are not required to be done,

<L>983<pc>009-3<k1>aDarAt
{#aDarAt#}¦ adharā́t, <lex>ad.</lex> below; {@-āt,@}
 {@-āt,@} -> {@-tāt,@}  (print-change cf. MW)
 ;; AB the intent is to show the alt. entry without accent, and has PWG corroboration. 
 ;; AB and the change done (adharatāt) deviates the alphabetic order in the print.
 ;; AB no change required here.
---+

<L>1883<pc>019-2<k1>apatita
 {@-anyo'nya-tyāgin,@} -> {@-anyonya-tyāgin,@}  PRINT CHANGE
 ;; AB what prompted this to be changed?
---+

<L>3144<pc>034-2<k1>asamaYja
 {#°sa#} -sa, -> {#°sa#} {@-sa,@}
;; AB when after devanagari string, the iast is not 'bold-lettered'; so no change here.
---+

Andhrabharati commented 10 months ago

Thus, AB and Jim have essentially identical conception of H2 and H3.

This recalls me saying earlier that our mind-wavelengths match; we have similar thoughts on what to do, but different thoughts on how to do!! [I'd choose the simpler (necessary and sufficient) way, and you'd choose the rigorous way.]

Andhrabharati commented 10 months ago

one interesting observation in the MD text:

the ळ (slp1 L; cdsl iast ł) is rendered as iast ḷ [like ऌ (slp1 x; cdsl iast ḷ)] at many places (a big confusion)! [Guess this might be so, in some other cdsl works as well.]

gasyoun commented 10 months ago

Conclusion: AB's form and Jim's form are functionally equivalent.

In that case let's hand it over to @Andhrabharati, our biggest Indian contributor ("I am unbeatable in quicker working" - and unmatched). @AnnaRybakovaT will remain busy and will not reach MD in 2024.

I did not fully understand the X (non-substantive, non-verbal) type; probably it could (or might have to) be further divided into some 'meaningful' types.

This @Andhrabharati remains crucial for @funderburkjim

I see that the dhAtus are clearly shown with all-CAPs (iast) in the print; so there is no need for explicit markup further. Probably, we can think of adding the √ to those strings (as done in my recent works, and your acceptance thereof).

no need for explicit markup further - disagree. CAPS are easy to catch for the eye, but invisible for the computer code. So additional markup is badly needed. We need to unite humans and robots )

This recalls me saying earlier that our mind-wavelengths match

@Andhrabharati if you would only know how important it remains for me.

Guess this might be so, in some other cdsl works as well

A pity, but yes. SLP1 would be unbeatable for restoring the real picture.

Andhrabharati commented 10 months ago

I see that the dhAtus are clearly shown with all-CAPs (iast) in the print; so there is no need for explicit markup further. Probably, we can think of adding the √ to those strings (as done in my recent works, and your acceptance thereof).

no need for explicit markup further - disagree. CAPS are easy to catch for the eye, but invisible for the computer code. So additional markup is badly needed. We need to unite humans and robots )

In fact, MD himself has used the √ symbol to denote the roots, when inside the body matter.

While at the entry level the roots are shown in all-CAPs, in the body matter (after the √ symbol) they are shown in all-small letters [except at 2 (out of 1794) places-- √ CHṚD under L-7736 and √ CHĀ under L-7757; probably these two should be considered as print-changes to small letters in the name of consistency! Agree, @funderburkjim ?].

Andhrabharati commented 10 months ago

Also out of 329 places of √ <hom> cases (part of the above 1794), 6 are shown as all-CAPS--

√ <hom>1.</hom> VAS √ <hom>2.</hom> CHAD √ <hom>2.</hom> PAT √ <hom>2.</hom> RUH √ <hom>2.</hom> VṚDH √ <hom>2.</hom> VID
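Counts like these can be reproduced with a short script. The sketch below is only an illustration (the names `ROOT` and `caps_roots` are hypothetical): it assumes a root reference is written as √, an optional `<hom>…</hom>` tag, and then a run of letters, and it reports the ones that are in all-CAPS.

```python
import re
import unicodedata

# √, optional homonym tag, then a run of letters (digits/underscore excluded).
ROOT = re.compile(r"√\s*(?:<hom>[^<]*</hom>\s*)?([^\W\d_]+)")


def caps_roots(text):
    """Return the root strings after √ that are written in capital letters."""
    # NFC keeps letters such as Ṛ as single code points for the regex.
    text = unicodedata.normalize("NFC", text)
    return [root for root in ROOT.findall(text) if root.isupper()]


sample = "√ <hom>1.</hom> VAS √ <hom>2.</hom> CHAD √ vas"
print(caps_roots(sample))
```

The same loop, without the `isupper()` filter, gives the total number of √ references to compare against.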

Andhrabharati commented 10 months ago

@Andhrabharati, our biggest Indian contributor

@gasyoun, is there someone outside India, that has done (or doing) such a voluminous (comprehensive) work like me? [Leaving Thomas who got the texts typed, and Jim who is 'presenting' all these texts to the public (in multiple ways); but those are "other dimensions" of the overall work!]

maltenth commented 10 months ago

well, immediately the name of Donald Trump comes to mind

On Mon, Jan 8, 2024, 14:26 Andhrabharati @.***> wrote:

@Andhrabharati https://github.com/Andhrabharati, our biggest Indian contributor

@gasyoun https://github.com/gasyoun, is there someone outside India, that has done (or doing) such a voluminous (comprehensive) work like me?

funderburkjim commented 10 months ago

comment on L-12291.AB.revised.txt

A couple of suggestions, with examples. See L-12291.AB.revised_JF.txt.

Would you be finally making the 'new' entries with the iast text before the broken-bar or after?

discussed in the suggestions.

Would be doing many more corrections as well

Good! Would be interested to see your revised form for <L>5901<pc>068-2<k1>kiMnara,

Isn't MW having the <e>-field used for <Hn>-category of the entry?

Yes. and I anticipate same usage in revised md.txt. The <e>SVX of md_1b_subhw.txt is anticipated to be dropped in generated md.txt

we can think of adding the √ to those strings

Good idea.

mark and retain the 'grouped entries' as alt. HWs

I'd like to see your proposed form for kiMnara example

325 matches for "<H[23]>[^(]*)" in buffer: md_1b_subhw.txt

There are also some cases where {@-X@} occurs within a parenthetical group. I'm sure these are not handled properly in my version. Would be good to see your coding for one of these cases.
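The buffer search quoted above reads like an editor regexp in which a bare `)` is a literal character (an assumption; the editor is not named in the thread). A rough Python equivalent needs the parenthesis escaped. The helper name below is hypothetical; it lists the H2/H3 lines in which a `)` is reached with no `(` between the tag and that `)`.

```python
import re

# <H2> or <H3>, then any run of characters other than '(', then a literal ')'.
PAT = re.compile(r"<H[23]>[^(]*\)")


def lines_with_stray_close(lines):
    """Return the lines where ')' follows <H2>/<H3> with no '(' in between."""
    return [ln for ln in lines if PAT.search(ln)]


sample = [
    "<H2> {@-nāmaka,@} ({@ikā),@} ",   # '(' comes first: not reported
    "<H2> {@-ta),@} x",                # ')' with no '(' before it: reported
]
print(lines_with_stray_close(sample))
```

Applied to the lines of md_1b_subhw.txt, the length of the returned list would be the figure to compare with the 325 matches mentioned above.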

Regarding alternate hws: Using 'k2' for this purpose good. Is it possible to defer this as a separate step AFTER the subhw markup?

<L>983<pc>009-3<k1>aDarAt

I thought the final word was aDarAttAt as quoted in RV in MW. This would be ok alphabetically. But maybe it is as you say, an accent variant.

{@-anyo'nya-tyāgin,@} -> {@-anyonya-tyāgin,@} Why the change?

I don't think the avagraha should be part of k1. Not sure about k2.

{#iqA#}¦ íḍā, {#iLA#} íḷā,

Good catch -- Should use ł as iast for slp1 L , as in MW. Similarly should use łh as iast for slp1 |.
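A sketch of how such cases could be listed for review. It uses only the three correspondences named in this thread (slp1 x = ḷ, L = ł, | = łh); the helper name is hypothetical, and it assumes the iast form directly follows the {#...#} string, as in the line quoted above.

```python
import re
import unicodedata

# SLP1 -> CDSL IAST for the three letters discussed in this thread.
SLP1_TO_IAST = {"x": "ḷ", "L": "ł", "|": "łh"}

# A {#slp1#} string, an optional broken bar, then the following iast word.
PAIR = re.compile(r"\{#([^#]*)#\}¦?\s*([^\s,;]+)")


def suspect_pairs(text):
    """List (slp1, iast) pairs where slp1 has L or | but the iast shows ḷ."""
    text = unicodedata.normalize("NFC", text)
    wrong = unicodedata.normalize("NFC", SLP1_TO_IAST["x"])
    return [
        (slp1, iast)
        for slp1, iast in PAIR.findall(text)
        if ("L" in slp1 or "|" in slp1) and wrong in iast
    ]


print(suspect_pairs("{#iqA#}¦ íḍā, {#iLA#} íḷā,"))
```

This is a heuristic only: a word that contains both ळ and ऌ would be reported even when its iast is correct, so the output is a list to check by eye rather than a list of certain errors.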

funderburkjim commented 10 months ago

Please note that previous comment has been expanded from its original form.

funderburkjim commented 10 months ago

italics : MD vs. PW

Note the quote in previous comment.

MD uses italics for 'comments', non-italic for translations

PW(K) is the opposite: italics for translation, non-italic for comment.

PWG same principle as PWK:

@maltenth --- Have I got that straight?

Andhrabharati commented 10 months ago

well, immediately the name of Donald Trump comes to mind

@maltenth

I gave full credits to you and Jim both, whose contributions are the cornerstones of CDSL; and I was just referring to further works like corrections, refining etc. on these texts. It appears that you had taken me wrongly (in your sarcastic post).

Anyway, would you pl. shed a light on the abbr. "M. or N." in MD text (p. 35), like you had helped identifying the "N. N." earlier in Boethlingk's lexicons?

I think this denotes two person names (starting with M and N), but unable to go further.

maltenth commented 10 months ago

@Andhrabharati, our biggest Indian contributor

@gasyoun, is there someone outside India, that has done (or doing) such a voluminous (comprehensive) work like me?

well, immediately the name of Donald Trump comes to mind

my reaction to your remark does not refer to the factual substance of what you or @gasyoun said but to the implication that you were not praised highly enough.

funderburkjim commented 10 months ago

@Andhrabharati Is work on md (subhw) progressing? Anything needed from me?

Andhrabharati commented 10 months ago

It has gone to a much more advanced stage than discussed above, but is stalled now @funderburkjim.

I do not need anything from you.

gasyoun commented 10 months ago

@Andhrabharati no, you have become the third whale, the third pillar. Outside and inside India. Hope @funderburkjim agrees. No need to stop the work @Andhrabharati as no one comes even close to the level of depth of corrections or speed.

funderburkjim commented 10 months ago

Only kudos for @Andhrabharati contributions to cdsl ! I hope these contributions continue.

There are numerous instances in these issues where my lack of carrying forward his suggestions are mentioned by AB. These are almost always due to my inability to keep up with him -- he can make improvement suggestions faster than I can process them! My aim is to eventually take into account ALL AB's suggestions. Let him be patient with my limitations.

maltenth commented 10 months ago

I would like to offer my unreserved apology to @Andhrabharati for my remarks, and request him to speedily resume his unmatched contributions.

On Sun, Jan 14, 2024, 02:23 Mārcis Gasūns @.***> wrote:

@Andhrabharati https://github.com/Andhrabharati no, you have become the third whale, the third pillar. Outside and inside India. Hope @funderburkjim https://github.com/funderburkjim agrees. No need to stop the work @Andhrabharati https://github.com/Andhrabharati as no one comes even close to he level of depth of corrections or speed.

Andhrabharati commented 10 months ago

My (in fact, even Jim and others') contributions to CDSL are voluntary, and no one has asked us for doing them. It is purely out of personal interest, that we are all doing these works.

And my intention in asking @gasyoun (in my original post that has led to this 'storm in a teacup' of misunderstanding) is not for getting appraisals from anyone, but just to get the list of 'team' contributing to the 'corrections process' made [whose names are lying beneath the covers (or in Jim's mind) till now]. It is not at all a competitive work, but a collaborative effort that needs to be in such projects as CDSL.

@maltenth , I did not anticipate an apology from you; I am really sorry, if my words have prompted you for this.

I might resume the working (in my way) after a small gap; but shall be attending to the smaller stuff like what Dhaval is assigning me these days (after he took over the baton of doing the csl-corrections, from Jim.)

gasyoun commented 10 months ago

@Andhrabharati as I'm preparing a paper on the Future of Cologne, such a list of the 'team' contributing to the 'corrections process' will be badly wanted. Hope @maltenth can keep an eye on it as well, as only @funderburkjim has seen it yet and there is plenty of space for improvement for sure.

funderburkjim commented 10 months ago

Just want to mention I'm working on the md subheadword project. This is done by editing a work-form. This is what I've edited thus far, with one unedited sample at the bottom.

temp_md_subhw_sample.txt

I am only editing the lines under the lines below the '* +' lines, but not the lines beginning with '1'.

Andhrabharati commented 10 months ago

@funderburkjim

May I ask you not to spend your time in this MD_subhw issue, but to focus on other issues?

Once I am back to work (probably within a week or so), I shall post my MD file; and most likely you would not hesitate to "take" the same. Thus your time and effort [the result of which may not be 'used finally'] on this issue might be wasted.

Andhrabharati commented 10 months ago

Just want to mention I'm working on the md subheadword project. This is done by editing a work-form. This is what I've edited thus far, with one unedited sample at the bottom.

temp_md_subhw_sample.txt

I am only editing the lines under the lines below the '* +' lines, but not the lines beginning with '1'.

@funderburkjim Just thought of looking into your file, and noted that it has nearly 1% wrong entries that were 'added' (10 out of 1126).

2 aṃsa-kūṭa--pṛṣṭha {@-pṛṣṭha,@} <lex>n.</lex> ridge of the shoulder.
;;  aṃsa--pṛṣṭha

2 a-gaṇ-ay-at--i-tvā    {@-i-tvā,@} <ab>gd.</ab> {!<ab>id.</ab> = disregarding!}
;; a-gaṇ--i-tvā

2 a-guṇa-jña--vat   {@-vat,@} 🞄<lex>a.</lex> void of merit, bad; 
;; a-guṇa--vat

2 a-guṇa-jña--śīla  {@-śīla,@} <lex>a.</lex> of bad disposition, 🞄worthless.
;; a-guṇa--śīla

2 agra-nakha--nāsikā    {@-nāsikā,@} <lex>f.</lex> tip of the nose, — beak; 
;; agra--nāsikā

2 acira--prabhā {@-prabhā,@} {@-bhās,@} {@-rocis,@} {@-‿aṃśu,@} {@-‿ābhā,@} <lex>f.</lex> <ab>id.</ab>.
;; could be further expanded & there are quite a few that were left out from expansion.

2 a-deśa-jña--stha  {@-stha,@} <lex>a.</lex> absent <ab>fr.</ab> his country, absentee.
;; a-deśa--stha

2 anyathā-darśana--prathā   {@-prathā,@} <lex>f.</lex> becoming different; 🞄
;; anyathā--prathā

2 apara-pakṣá--rātrá    {@-rātrá,@} <lex>m.</lex> {@i,@} <lex>f.</lex> second half of 🞄the night; 
;; apara--rātrá 

2 apara-pakṣá--vaktra   {@-vaktra,@} <lex>n.</lex> <lex>a.</lex> {%metre%}.
;; apara--vaktra

@gasyoun / @drdhaval2785 These are the type of errors (and accent related ones, wrongly put or ignored many times) that I had mentioned to be present in the CDSL MW text, long ago when I started interacting at GitHub and also quite recently.
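In most of the entries listed above, the difference appears to be which base the suffix is attached to: the full preceding compound (as in the work-form) or that compound with its last member dropped (as in the ";;" lines). The sketch below generates both readings for review; the function name is hypothetical, and the "drop the last member" rule is an assumption that does not cover every case (for a-gaṇ-ay-at it yields a-gaṇ-ay, not the a-gaṇ given above).

```python
def candidates(previous, sfx):
    """Return (full, reduced) expansions of a suffix in the '--' work-form.

    full    : suffix attached to the whole preceding compound
    reduced : suffix attached after dropping the last hyphenated member
    """
    full = f"{previous}--{sfx}"
    reduced = f"{previous.rsplit('-', 1)[0]}--{sfx}"
    return full, reduced


for prev, sfx in [("aṃsa-kūṭa", "pṛṣṭha"), ("a-guṇa-jña", "vat")]:
    print(candidates(prev, sfx))
```

Choosing between the two readings still needs a human decision per entry; the code only puts the alternatives side by side.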

funderburkjim commented 10 months ago

I'm glad we agree in 99% of cases.

If you use aṃsa--pṛṣṭha , then alphabetical ordering fails.

funderburkjim commented 10 months ago

It is good that those, whose knowledge of Sanskrit is much greater than mine, are closely examining the cdsl versions of the dictionaries.

Andhrabharati commented 10 months ago

Though the alphabetical order is generally followed, we DO come across the entries at a 'wrong' place/order in various dictionaries.

Ultimately what matters is whether the word is 'proper' or not.

Pl. see the screenshot--

funderburkjim commented 10 months ago

Achsel = armpit (google translate). How do we know that aṃsa-kūṭa--pṛṣṭha is not a word unique to MD. How do we know that it is not proper?

Andhrabharati commented 10 months ago

MD, MW and others rarely go beyond Boethlingk.

For now see, what STC has (showing the three independent compounds formed from aṃsa) --

When I post my analysis on MD, rather applicable to most of the CDSL dictionaries, the reason for all such cases would at once be clear.

Andhrabharati commented 10 months ago

How do we know that it is not proper?

aṃsa is the shoulder (that is 'visible' on top side); no doubt about this, right?

aṃsa-kūṭa is the kūṭa projection (hump) between the shoulders of the oxen.

pṛṣṭha is rear or back; thus, the rear of the 'top-side' (aṃsa) shoulder is the 'underneath' (aṃsa-pṛṣṭha) armpit. And there is no such 'element' that is at the rear of the bull's hump (other than the skin underneath!!), in the whole creation.

sanskrit-lexicon / MD

MD subheadwords #12
