MW supplement fresh look, part 6: Missed comp. word header entries from the Annexure pages

Andhrabharati commented 3 years ago

I would be posting the split portions here, as I complete them.

Andhrabharati commented 3 years ago

Summary of comp. word headers in main pages, as filtered from the present mw text-

"¦ in <ab>comp.</ab>" : 851 no.s
"¦ (in <ab>comp.</ab>" : 157 no.s
"¦ in comp." : 5 no.s
"¦ (in comp."  : 5 no.s

Summary of comp. word headers in annexure pages, as listed by AB-

comp. word headers (Annexure)-1 : count 118 (letter अ)
comp. word headers (Annexure)-2 : count 83 (letters आ to ऋ)
comp. word headers (Annexure)-3 : count 105 (letters ए to घ)
comp. word headers (Annexure)-4 : count 101 (letters च to न)
comp. word headers (Annexure)-5 : count 89 (letters प to स)

Now these are to be appropriately placed in the data.
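For anyone wanting to reproduce counts like the main-page ones above, here is a minimal Python sketch. It assumes the text is read line by line and that each header variant is matched as a plain substring; the function name `count_comp_headers` and the sample lines are illustrative only and are not taken from any existing Cologne script.

```python
from collections import Counter

# The four header variants listed in the summary above.
PATTERNS = [
    "¦ (in <ab>comp.</ab>",
    "¦ in <ab>comp.</ab>",
    "¦ (in comp.",
    "¦ in comp.",
]

def count_comp_headers(lines):
    """Count lines containing each header variant (at most one hit per line)."""
    counts = Counter()
    for line in lines:
        for pattern in PATTERNS:
            if pattern in line:
                counts[pattern] += 1
                break
    return counts

# Made-up sample lines, only to show the call.
sample = [
    "<s>aMho</s> ¦ (in <ab>comp.</ab> for <s>aMhas</s>)",
    "<s>x</s> ¦ in <ab>comp.</ab> for <s>y</s>",
    "<s>z</s> ¦ <lex>m.</lex> some other entry",
]
for pattern, n in count_comp_headers(sample).items():
    print(pattern, ":", n)
```

To run it on the real data, one could pass an open file object (for example the local copy of mw.txt, opened with `encoding="utf-8"`) instead of `sample`.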

gasyoun commented 3 years ago

> Now these are to be appropriately placed in the data.

A huge amount, I would say.

Andhrabharati commented 3 years ago

just seen the csl-orig issues, which are mostly one entry at a time.

had I taken that route (which is not my way of handling things), the forum would have been just "flooded" (with thousands of ind. entries)!!

funderburkjim commented 3 years ago

The 'one-entry' csl-orig issues you noticed are documenting 'user' corrections generated when a user submits a correction via the 'corrections' link in displays.

Andhrabharati commented 3 years ago

understood; and I am aware of the error reporting url(s) too.

and I meant using the same path.

gasyoun commented 3 years ago

> The 'one-entry' csl-orig issues you noticed are documenting 'user' corrections generated when a user submits a correction via the 'corrections' link in displays.

This way, anyone who ever wants to can find out what was done with the suggestion. Otherwise it would exist only in Jim's email.

funderburkjim commented 3 years ago

ADD2b.txt: This is from 2010.

It contains the supplement digitization BEFORE the work to integrate the supplement into the body. There are 7141 records. The records are in an xml form.

They are likely in the same sequence as the printed supplement.

One interesting feature is that all the 'comp.' records are present. I found a note from a 2010 readme that many 'comp.' entries were removed intentionally; the reason is not clear.

Within those 7141 records, there are 522 containing "in <ab>comp.</ab>"; this number is quite close to the 118 + 83 + 105 + 101 + 89 = 496 cases noted above by AB.
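As a quick check of that comparison, the arithmetic can be written out in Python. The dictionary keys are just labels for AB's five annexure lists; the only inputs are the counts quoted in this thread.

```python
# Counts of comp. word headers per annexure list, as given by AB above.
annexure_counts = {
    "Annexure-1 (अ)": 118,
    "Annexure-2 (आ to ऋ)": 83,
    "Annexure-3 (ए to घ)": 105,
    "Annexure-4 (च to न)": 101,
    "Annexure-5 (प to स)": 89,
}

annexure_total = sum(annexure_counts.values())
add2b_comp_records = 522  # records in ADD2b.txt containing in <ab>comp.</ab>

print(annexure_total)                       # 496
print(add2b_comp_records - annexure_total)  # 26
```

So the two tallies differ by 26 records; the thread does not say which records account for the gap.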

For future reference, the location of ADD2B.txt on Cologne server is mwupdate/additions/step1/.

Andhrabharati commented 3 years ago

I have 'sensed' the reason, and would post tomorrow.

Happy to say that I am resuming the work from tomorrow.

funderburkjim commented 3 years ago

Glad you are resuming annexure work soon. Maybe we can finish those compound word header entries this week.

If it would be helpful to split the work, let me know.

Andhrabharati commented 3 years ago

I've changed the plan.

Now going to cover the comp. word headers and the "members" together.

-----------------

I would rather wish you should change the left grouped words also, for the sake of consistency of the 'theme'.

funderburkjim commented 3 years ago

Please elaborate on "change the left grouped words also, for the sake of consistency of the 'theme'".

Andhrabharati commented 3 years ago

Please look at my remark (in MW#105),

https://github.com/sanskrit-lexicon/MWS/issues/105#issuecomment-793426577

to which your reply was as under-

> body portion identical

> Where new records were required, I have used this principle. ... ... ... However, for existing 'or/and' groups, I have not retro-changed.

Andhrabharati commented 3 years ago

I was referring to those un-changed cases.

Andhrabharati commented 3 years ago

In the main pages of MW99, a little under 1000 entries of comp. word headers at the beginning of a new para are seen; but not inside any para. These are all of "in comp. for xxx" type headers, which are the variant forms that result due to grammatical rules.

However, there are many cases against this in the Annexure pages. Most probably they are just indicators/markers showing where the indicated comp. words are to be placed. Some of these indicate an accent change with respect to the main page entry. These are all of "in comp." type (without "for xxx") headers.

But some of those (5 or 6) were already added into the data as "sup" entries. These could be deleted in the light of the first statement above.

The Annexure pages contain just 55 no.s of the "in comp. for xxx" type headers, as in the main pages.

I would be starting with this 55 no.s lot first.

Andhrabharati commented 3 years ago

BTW, I could get my hands on a 1997 file of MW99 text, and it contained the gender endings (with or without braces) that I've been talking about all these days!!

So my presumption that they were absent from the beginning of the digitization is wrong.

This indicates that two MAJOR "discardings" of text or characters have happened in the "evolution" of MW99 digitization.

The first is the gender (nominative) endings, at the very inception, and the second is the semicolon at the end of the "senses" in each entry (probably in recent times).

I guess @funderburkjim was not part of the team when these two things happened; or, if he was, he has since "matured" to see that every correction has to be verified before "committing" the changes, the way he does at present.

Getting these "lost" items back into the data requires a hell of an exercise, to pay for the negligence. (Or were they deliberate actions?!!)

funderburkjim commented 3 years ago

We can explore the history of the 'lost gender endings' sometime.

Thomas did his digitization work around 1996. The work among Thomas, Peter, Malcolm, and me began about 2004.

There is a lot of saved data on the Cologne server that was generated regarding MW99. Roaming these archives, I think the original form from Thomas that we (Peter, Malcolm, jim) started with is in the file 'MONIER.ALL' (update/orig/MONIER.ALL on Cologne server). It has an internal line %***This File is E:\SANSKRIT\MONIER\MONIER.ALL, Last update 30.11.04 .

Specifically related to the 'gender endings' that AB finds so fascinating, here is the first relevant entry (6th homonym of 'a'):

<H1>100{a}1{a}6¦ •m. ‹N.…of…Vishnu› ‹¯L.› (especially…as…the…first…of…the…three…sounds…in…the…sacred…syllable›¨#{om}). _ MW000007

Current record of mw.txt:

<L>7<pc>1,1<k1>a<k2>a<h>6<e>1
<hom>6.</hom> <s>a</s> ¦ <lex>m.</lex> <ab>N.</ab> of <s1 slp1="vizRu">Viṣṇu</s1>, <ls>L.</ls> (especially as the first of the three sounds in the sacred syllable <s>om</s>). <info lex="m"/>
<LEND>

This example shows that the gender ending 'as' is missing from all versions.
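For readers who want to inspect such records programmatically, here is a minimal Python sketch of splitting one mw.txt record into its metaline fields and body. The field tags (`L`, `pc`, `k1`, `k2`, `h`, `e`) are read off the example above; the function name `parse_record` and the assumption that every record is a metaline, body lines, and a closing `<LEND>` are mine, not a description of any official Cologne tool.

```python
import re

# Matches "<tag>value" pairs on the metaline, e.g. "<k1>a".
META_FIELD_RE = re.compile(r"<(L|pc|k1|k2|h|e)>([^<]*)")

def parse_record(text):
    """Split one record into (metadata dict, body text)."""
    lines = text.strip().splitlines()
    if len(lines) < 3 or lines[-1].strip() != "<LEND>":
        raise ValueError("expected a metaline, a body, and a closing <LEND>")
    meta = dict(META_FIELD_RE.findall(lines[0]))
    body = "\n".join(lines[1:-1])
    return meta, body

record = """<L>7<pc>1,1<k1>a<k2>a<h>6<e>1
<hom>6.</hom> <s>a</s> ¦ <lex>m.</lex> <ab>N.</ab> of <s1 slp1="vizRu">Viṣṇu</s1>, <ls>L.</ls> (especially as the first of the three sounds in the sacred syllable <s>om</s>). <info lex="m"/>
<LEND>"""

meta, body = parse_record(record)
print(meta["L"], meta["pc"], meta["k1"], meta["h"])
print("<lex>m.</lex>" in body)
```

With the body isolated like this, a check for any particular marker (a gender ending, a trailing semicolon) becomes a simple string or regex test on `body`.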

funderburkjim commented 3 years ago

However, there is still another version from Thomas, which shows that sometimes the gender information was present at an early stage.

For example this display: https://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/tamil/index.html

was made by Thomas before the collaboration with Peter et al.

If you look up 'akula' you see

akula | mfn. not of good family , low ; (%{as}) m. N. of S3iva L. ; (%{A}) f. N. of Pa1rvati1 L.
-- | --

Note: The data for this display on Cologne server is in file scans/MWScan/tamil/dat/mwd.txt.

The MONIER.ALL data does not have the gender endings:

<H1>100{akula}1{a-kula}¦ •mfn. ‹not…of…good…family…,…low› 
<H4>100{akula}1{a-kula}¦ •m. ‹N.…of…S3iva› ‹¯L.› 
<H4>100{akulA}1{a-kulA}¦ •f. ‹N.…of…Pa1rvati1› ‹¯L.› MW000199

The 'mwd.txt' is in a different format from MONIER.ALL. But the data for 6th homonym of 'a' in mwd.txt does NOT have the gender ending: <en>6 m. N. of Vishnu L. (especially as the first of the three sounds in the sacred syllable %{om}). </en>
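To see how such gender endings could be located in mwd.txt-style text, here is a small Python sketch. It assumes, based only on the 'akula' example above, that a gender ending appears as `(%{...})` immediately followed by a gender label such as `m.` or `f.`; the regex and the function name `gender_endings` are illustrative, and real data may contain other shapes.

```python
import re

# "(%{as}) m." -> ending "as", gender "m". The alternation tries "mfn" first.
GENDER_ENDING_RE = re.compile(r"\(%\{([^}]+)\}\)\s+(mfn|m|f|n)\.")

def gender_endings(entry_text):
    """Return (ending, gender) pairs found in one entry's text."""
    return GENDER_ENDING_RE.findall(entry_text)

akula = ("mfn. not of good family , low ; (%{as}) m. N. of S3iva L. ; "
         "(%{A}) f. N. of Pa1rvati1 L.")
a_6 = ("6 m. N. of Vishnu L. (especially as the first of the three "
       "sounds in the sacred syllable %{om}). ")

print(gender_endings(akula))  # [('as', 'm'), ('A', 'f')]
print(gender_endings(a_6))    # []
```

The second call returns nothing because `%{om}` there is not wrapped as `(%{om})` before a gender label, which matches the observation that this entry carries no gender ending.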

Andhrabharati commented 3 years ago

yes, this mwd.txt is what I got from the earlier cologne site.

It has ~10k gender endings, though they are not present throughout.

gasyoun commented 3 years ago

> It has ~10k gender ending

Ok, now we are in trouble. So after 1997 and before 2004 important data was lost that is relevant to keywords.

> We can explore the history of the 'lost gender endings' sometime.

Seems it does not sound too fascinating to Jim )

Andhrabharati commented 3 years ago

> One interesting feature is that all the 'comp.' records are present. I found a note from a 2010 readme that many 'comp.' entries were removed intentionally; the reason is not clear.

> In the main pages of MW99, a little under 1000 entries of comp. word headers at the beginning of a new para are seen; but not inside any para. These are all of "in comp. for xxx" type headers, which are the variant forms that result due to grammatical rules.

> However, there are many cases against this in the Annexure pages. Most probably they are just indicators/markers showing where the indicated comp. words are to be placed. Some of these indicate an accent change with respect to the main page entry. These are all of "in comp." type (without "for xxx") headers.

> But some of those (5 or 6) were already added into the data as "sup" entries. These could be deleted in the light of the first statement above.

@funderburkjim, what do you think we should do, about these "in comp." type (without "for xxx") headers?

Remove the 5-6 entries that were added earlier (as mentioned above), or "place" all the other "skipped" ones into the data?

The insertion of the skipped ones mandates working on a couple of thousand entries in the main pages as well (which are not in the print as separate entries), in the name of consistency!!

gasyoun commented 3 years ago

> The insertion of the skipped ones mandates working on a couple of thousand entries in the main pages as well (which are not in the print as separate entries), in the name of consistency!!

May I ask for a single example?

Andhrabharati commented 3 years ago

You may just look at the very first page, and you will see about 10 such entries.

funderburkjim commented 3 years ago

The first 'in comp.' example in Annexure is 'aMsa' (p. 1308):

Looking on p. 1, we have 'aMsa' and some compounds:

In this case, I think that the text of p. 1308 aMsa (in comp.); is present only to facilitate the reading of the Annexure; it is proper that this text not be repeated as we merge the annexure entries into the body of the text.

Rather the merger here requires only:

- revise text for existing compound aMsaDrI
- insert entries for new compounds aMsapIWa and aMsoccaya

funderburkjim commented 3 years ago

Here are all '(in comp.)' examples noticed on p. 1308 of Annexure.

p. 1308, col 1 (in comp.)
aMsa no
aMho yes 
col 2:
akfta no
2 akza no
4 akza no
col 3:
akzi no
akzoBya no
aKaRqa no
agastya no

Those marked 'no' do not need to have the text '(in comp.)' entered -- they are analogous to 'aMsa'.

But 'aMho' does need a 'sup' entry with (in <ab>comp.</ab> for <s>aMhas</s>) text.
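The rule described above (keep "in comp. for xxx" headers as 'sup' entries, treat bare "in comp." headers as reading aids) can be sketched as a small Python classifier. The regexes, the function name `classify_header`, and the three result labels are hypothetical; they only encode the distinction drawn in this comment and accept both the marked-up and the plain spelling of the header.

```python
import re

# "in comp. for xxx", with or without <ab>/<s> markup; captures xxx.
COMP_FOR_RE = re.compile(
    r"\(?in (?:<ab>)?comp\.(?:</ab>)?\s+for\s+(?:<s>)?([^<)\s]+)"
)
# Bare "in comp." with or without <ab> markup.
COMP_RE = re.compile(r"\(?in (?:<ab>)?comp\.(?:</ab>)?")

def classify_header(text):
    """Return (action, base_word) for one annexure header text."""
    match = COMP_FOR_RE.search(text)
    if match:
        return ("add_sup_entry", match.group(1))
    if COMP_RE.search(text):
        return ("skip_marker", None)
    return ("not_comp_header", None)

print(classify_header("(in <ab>comp.</ab> for <s>aMhas</s>)"))
# ('add_sup_entry', 'aMhas')
print(classify_header("(in comp.);"))
# ('skip_marker', None)
```

The "for xxx" pattern is tested first because every such header also matches the bare pattern.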

That's the way it looks to me.

What do you think, @Andhrabharati ?

Andhrabharati commented 3 years ago

Good to see your response after a long interval, @funderburkjim !

This is exactly what I mentioned above earlier.

So now we need to just add about 50+ "in comp. for xxx" type headers from the annexure pages; and then remove the 5-6 "in comp." type (without "for xxx") headers.

funderburkjim commented 3 years ago

I've made a few changes to MW elsewhere, and would like to make mw_iast.txt consistent. Would that be ok with you if I do that now, @Andhrabharati ?

Andhrabharati commented 3 years ago

The following is the beginning portion of the file I was doing (but stopped)- (in comp. for xxx) headers -1.txt

I can resume the work in a day or two. [Presently (for about a week now) I am working on Vacaspatyam, and it is coming out quite well. Did some formatting, editorial corrections, and now resolving the abbr. (finished all entries with 40+ occurrences so far, and some more). Would need your opinion once I finish the rest, to continue further.]

Andhrabharati commented 3 years ago

> I've made a few changes to MW elsewhere, and would like to make mw_iast.txt consistent. Would that be ok with you if I do that now, @Andhrabharati ?

Absolutely no issues. Pl. go ahead.

I've seen your recent 4th April update as the latest. Is there any further one?

funderburkjim commented 3 years ago

Why don't I do the work to integrate your in.comp.for.xxx.headers.-1.txt now, before you do further work on mw.

Andhrabharati commented 3 years ago

ok; then you can get the file by tomorrow night, my time.

Andhrabharati commented 3 years ago

(the full file)

funderburkjim commented 3 years ago

Will await your notification before proceeding.

gasyoun commented 3 years ago

> it is proper that this text not be repeated as we merge the annexure entries into the body of the text.

Agree.

Andhrabharati commented 3 years ago

@funderburkjim

I tried to open the MW file(s) to resume the work the other day; but my mind seems unwilling to be diverted from the present Vacaspatyam abbr.s task.

So continuing with Vacaspatyam, and so far finished all 5 & above occurrences and part of the rest.

I might need another 3-4 days to finish this work, so that I can be back to MW again.

If you wish, you may go ahead with whatever updates you have in mind on MW.

(I would suggest that you not touch the portion of the comp. word headers that I gave before, so that piece of work can be done as a complete one, once I take it up and finish.)

gasyoun commented 3 years ago

> I might need another 3-4 days to finish this work, so that I can be back to MW again.

I'm absolutely in love with what I see.

Andhrabharati commented 1 year ago

@funderburkjim

This is the issue I was referring to at https://github.com/sanskrit-lexicon/SKD/issues/16#issuecomment-1355955825.

Hoping that you would be covering the annexure pages in your present work, I guess this issue may be closed now.

Andhrabharati commented 1 year ago

As I would be covering this point in my current review, this issue is closable now.

sanskrit-lexicon / MWS

MW supplement fresh look, part 6: Missed comp. word header entries from the Annexure pages #104
