Closed Andhrabharati closed 1 year ago
Summary of comp. word headers in main pages, as filtered from the present mw text-
"¦ in <ab>comp.</ab>" : 851 no.s
"¦ (in <ab>comp.</ab>" : 157 no.s
"¦ in comp." : 5 no.s
"¦ (in comp." : 5 no.s
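A count like the one above could be reproduced with a small script. This is only a sketch: it assumes the four strings are matched literally, one count per line, and the sample lines are made up for illustration rather than taken from mw.txt.

```python
from collections import Counter

# The four header variants listed above, copied literally.
PATTERNS = [
    "¦ in <ab>comp.</ab>",
    "¦ (in <ab>comp.</ab>",
    "¦ in comp.",
    "¦ (in comp.",
]

def count_comp_headers(lines):
    """Count the lines containing each literal pattern (one count per line)."""
    counts = Counter()
    for line in lines:
        for pat in PATTERNS:
            if pat in line:
                counts[pat] += 1
    return counts

# Hypothetical sample lines, for illustration only.
sample = [
    "<s>aMho</s> ¦ (in <ab>comp.</ab> for <s>aMhas</s>)",
    "<s>xxx</s> ¦ in <ab>comp.</ab> for <s>yyy</s>",
    "<s>zzz</s> ¦ <lex>m.</lex> some sense",
]
counts = count_comp_headers(sample)
for pat in PATTERNS:
    print(pat, ":", counts[pat])
```

To run it on the real data, one would pass the lines of the mw text file (opened with UTF-8 encoding) instead of `sample`.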
Summary of comp. word headers in annexure pages, as listed by AB-
comp. word headers (Annexure)-1 : count 118 (letter अ)
comp. word headers (Annexure)-2 : count 83 (letters आ to ऋ)
comp. word headers (Annexure)-3 : count 105 (letters ए to घ)
comp. word headers (Annexure)-4 : count 101 (letters च to न)
comp. word headers (Annexure)-5 : count 89 (letters प to स)
Now these are to be appropriately placed in the data.
> Now these are to be appropriately placed in the data.
Huge amount, would I say.
Just saw the csl-orig issues, which are mostly one entry at a time.
Had I taken that route (which is not my way of handling things), the forum would have been just "flooded" (with thousands of individual entries)!!
The 'one-entry' csl-orig issues you noticed are documenting 'user' corrections generated when a user submits a correction via the 'corrections' link in displays.
understood; and I am aware of the error reporting url(s) too.
and I meant using the same path.
The 'one-entry' csl-orig issues you noticed are documenting 'user' corrections generated when a user submits a correction via the 'corrections' link in displays.
This way if one ever wanted to he can find what's done with the suggestion. Otherwise it would be only there in Jim's email.
ADD2b.txt This is from 2010.
It contains the supplement digitization BEFORE the work to integrate the supplement into the body. There are 7141 records. The records are in an xml form.
They are likely in the same sequence as the printed supplement.
One interesting feature is that all the 'comp.' records are present. I found a note from a 2010 readme that many 'comp.' entries were removed intentionally; the reason not clear.
Within those 7141 records, there are 522 containing 'in <ab>comp.</ab>'; this number is quite close to the (+ 118 83 105 101 89) = 496 cases noted above by AB.
For future reference, the location of ADD2B.txt on Cologne server is mwupdate/additions/step1/.
I have 'sensed' the reason, and would post tomorrow.
Happy to say that I am resuming the work from tomorrow.
Glad you are resuming annexure work soon. Maybe we can finish those compound word header entries this week.
If it would be helpful to split the work, let me know.
I've changed the plan.
Now going to cover the comp. word headers and the "members" together.
‐----------------
I would rather wish you should change the left grouped words also, for the sake of consistency of the 'theme'.
Please elaborate on change the left grouped words also, for the sake of consistency of the 'theme'.
Pl look at my remark (in MW#105),
https://github.com/sanskrit-lexicon/MWS/issues/105#issuecomment-793426577
to which your reply was as under-
body portion identical
Where new records were required, I have used this principle. ... ... ... However, for existing 'or/and' groups, I have not retro-changed.
I was referring to those un-changed cases.
In the main pages of MW99, a little under 1000 entries of comp. word headers at the beginning of a new para are seen; but not inside any para. These are all of "in comp. for xxx" type headers, which are the variant forms that result due to grammatical rules.
However, there are many cases against this in the Annexure pages. Most probably they are just indicators/markers for where the indicated comp. words are to be placed. Some of these indicate an accent change w.r.t. the main page entry. These are all of "in comp." type (without "for xxx") headers.
But some of those (5 or 6) were already added into the data as "sup" entries. These could be deleted in the light of the first statement above.
The Annexure pages contain just 55 no.s of the "in comp. for xxx" type headers, as in the main pages.
I would be starting with this 55 no.s lot first.
BTW, I could get my hands on a 1997 file of MW99 text, and it contained the gender endings (with or without braces) that I've been talking about all these days!!
So my presumption that they were absent from the beginning of the digitization is wrong.
This indicates that two MAJOR "discardings" of text or characters have happened in the "evolution" of MW99 digitization.
The first is the gender (nominative) endings, lost at the very inception, and the second is the semicolon at the end of the "senses" in each entry (probably lost in recent times).
Guess @funderburkjim is not part of the team at these two happenings; or if he was a part, then he would have got "matured" over the time to see that every correction has to be verified before "committing" the changes the way he is doing presently.
Getting these "lost" items back into the data requires a hell of exercise, to pay for the negligence. (or were they some deliberate actions?!!)
We can explore the history of the 'lost gender endings' sometime.
Thomas did his digitization work around 1996. The work among Thomas, Peter, Malcolm, and me began about 2004.
There is a lot of saved data on Cologne server that was generated regarding MW99.
Roaming these archives, I think the original form from Thomas that we (Peter, Malcolm, Jim) started with is in the file 'MONIER.ALL' (update/orig/MONIER.ALL on Cologne server).
It has an internal line: %***This File is E:\SANSKRIT\MONIER\MONIER.ALL, Last update 30.11.04
Specifically related to the 'gender endings' that AB finds so fascinating, here is the first relevant entry (6th homonym of 'a'):
<H1>100{a}1{a}6¦ •m. ‹N.…of…Vishnu› ‹¯L.› (especially…as…the…first…of…the…three…sounds…in…the…sacred…syllable›¨#{om}). _ MW000007
Current record of mw.txt:
<L>7<pc>1,1<k1>a<k2>a<h>6<e>1
<hom>6.</hom> <s>a</s> ¦ <lex>m.</lex> <ab>N.</ab> of <s1 slp1="vizRu">Viṣṇu</s1>, <ls>L.</ls> (especially as the first of the three sounds in the sacred syllable <s>om</s>). <info lex="m"/>
<LEND>
This example shows that the gender ending 'as' is missing from all versions.
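For comparisons like this across versions, it can help to read mw.txt record by record. The following is a minimal sketch that assumes every record starts with a metaline such as `<L>7<pc>1,1<k1>a<k2>a<h>6<e>1` and ends with `<LEND>`, as in the record quoted above. The function names are hypothetical and are not part of any Cologne tooling.

```python
import re

# Fields seen in the metaline quoted above: <L>, <pc>, <k1>, <k2>, <h>, <e>.
META_RE = re.compile(r"<(L|pc|k1|k2|h|e)>([^<]*)")

def parse_metaline(line):
    """Turn '<L>7<pc>1,1<k1>a...' into a dict of field name -> value."""
    return dict(META_RE.findall(line))

def iter_records(lines):
    """Yield (meta, body_lines) for each <L> ... <LEND> record."""
    meta, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<L>"):
            meta, body = parse_metaline(line), []
        elif line.startswith("<LEND>"):
            if meta is not None:
                yield meta, body
            meta, body = None, []
        elif meta is not None:
            body.append(line)

record = [
    "<L>7<pc>1,1<k1>a<k2>a<h>6<e>1",
    '<hom>6.</hom> <s>a</s> ¦ <lex>m.</lex> <ab>N.</ab> of Viṣṇu <info lex="m"/>',
    "<LEND>",
]
for meta, body in iter_records(record):
    print(meta["L"], meta["pc"], meta["k1"], len(body))
```

The body line in `record` is shortened from the entry above to keep the example compact.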
However, there is still another version from Thomas, which shows that sometimes the gender information was present at an early stage.
For example this display: https://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/tamil/index.html
was made by Thomas before the collaboration with Peter et al.
If you look up 'akula' you see
akula | mfn. not of good family , low ; (%{as}) m. N. of S3iva L. ; (%{A}) f. N. of Pa1rvati1 L.
-- | --
Note: The data for this display on Cologne server is in file scans/MWScan/tamil/dat/mwd.txt.
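The gender endings in this older format could be located with a regular expression. This is a sketch only: it assumes the endings always appear as `(%{...})` immediately followed by a gender label (`m.`, `f.`, `n.` or `mfn.`), which is what the 'akula' line shows. Other layouts in mwd.txt would need their own patterns.

```python
import re

# '(%{as}) m.' -> ending 'as', gender 'm'. 'mfn' is tried first so that it
# is not cut short by the single-letter alternatives.
GENDER_ENDING_RE = re.compile(r"\(%\{([^}]+)\}\)\s+(mfn|m|f|n)\.")

def gender_endings(text):
    """Return a list of (ending, gender) pairs found in one entry's text."""
    return GENDER_ENDING_RE.findall(text)

akula = ("mfn. not of good family , low ; (%{as}) m. N. of S3iva L. ; "
         "(%{A}) f. N. of Pa1rvati1 L.")
print(gender_endings(akula))
```

Counting the entries for which `gender_endings` returns a non-empty list would give a rough figure for how many endings survive in a given file.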
The MONIER.ALL data does not have the gender endings:
<H1>100{akula}1{a-kula}¦ •mfn. ‹not…of…good…family…,…low›
<H4>100{akula}1{a-kula}¦ •m. ‹N.…of…S3iva› ‹¯L.›
<H4>100{akulA}1{a-kulA}¦ •f. ‹N.…of…Pa1rvati1› ‹¯L.› MW000199
The 'mwd.txt' is in a different format from MONIER.ALL.
But the data for 6th homonym of 'a' in mwd.txt does NOT have the gender ending:
<en>6 m. N. of Vishnu L. (especially as the first of the three sounds in the sacred syllable %{om}). </en>
yes, this mwd.txt is what I got from the earlier cologne site.
It has ~10k gender endings, though not fully throughout.
It has ~10k gender ending
Ok, now we are in trouble. So after 1997 and before 2004 important data was lost that is relevant to keywords.
> We can explore the history of the 'lost gender endings' sometime.
Seems it does not sound too fascinating to Jim )
> One interesting feature is that all the 'comp.' records are present. I found a note from a 2010 readme that many 'comp.' entries were removed intentionally; the reason not clear.

> In the main pages of MW99, a little under 1000 entries of comp. word headers at the beginning of a new para are seen; but not inside any para. These are all of "in comp. for xxx" type headers, which are the variant forms that result due to grammatical rules.
> However, there are many cases against this in the Annexure pages. Most probably they are just indicators/markers for where the indicated comp. words are to be placed. Some of these indicate an accent change w.r.t. the main page entry. These are all of "in comp." type (without "for xxx") headers.
> But some of those (5 or 6) were already added into the data as "sup" entries. These could be deleted in the light of the first statement above.
@funderburkjim, what do you think we should do, about these "in comp." type (without "for xxx") headers?
Remove the 5-6 entries that were added earlier (as mentioned above), or "place" all the other "skipped" ones into the data?
The insertion of the skipped ones mandates working on couple of thousands in the main pages as well (which are not in the print as separate entries), in the name of consistency!!
> The insertion of the skipped ones mandates working on couple of thousands in the main pages as well (which are not in the print as separate entries), in the name of consistency!!
May I ask for a single example?
You may just look at the very first page, and you will see about 10 such entries.
The first 'in comp.' example in Annexure is 'aMsa' (p. 1308):
Looking on p. 1, we have 'aMsa' and some compounds:
In this case, I think that the text of p. 1308, 'aMsa (in comp.);', is present only to facilitate the reading of the Annexure; it is proper that this text not be repeated as we merge the annexure entries into the body of the text.
Rather the merger here requires only:
Here are all '(in comp.)' examples noticed on p. 1308 of Annexure.
p. 1308, col 1 (in comp.)
aMsa no
aMho yes
col 2:
akfta no
2 akza no
4 akza no
col 3:
akzi no
akzoBya no
aKaRqa no
agastya no
Those marked 'no' do not need to have the text '(in comp.)' entered -- they are analogous to 'aMsa'.
But 'aMho' does need a 'sup' entry with the text (in <ab>comp.</ab> for <s>aMhas</s>).
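The yes/no distinction above could be approximated mechanically. In this sketch a header counts as a "variant" when "in comp." is followed by "for <s>xxx</s>" (like aMho, which needs a 'sup' entry) and as a plain "marker" otherwise (like aMsa). The regular expression and the labels are assumptions made for illustration, and each result would still need checking against the print.

```python
import re

# Optional '(' and optional <ab> tags; group 1 captures the 'for xxx' target.
COMP_RE = re.compile(
    r"\(?in (?:<ab>)?comp\.(?:</ab>)?(?: for <s>([^<]+)</s>)?"
)

def classify_header(text):
    """Return ('variant', target), ('marker', None) or ('none', None)."""
    m = COMP_RE.search(text)
    if m is None:
        return ("none", None)
    if m.group(1):
        return ("variant", m.group(1))
    return ("marker", None)

print(classify_header("(in <ab>comp.</ab> for <s>aMhas</s>)"))
print(classify_header("aMsa (in comp.);"))
```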
That's the way it looks to me.
What do you think, @Andhrabharati ?
Good to see your response after a long interval, @funderburkjim !
This is exactly what I mentioned above earlier.
So now we need to just add about 50+ "in comp. for xxx" type headers from the annexure pages; and then remove the 5-6 "in comp." type (without "for xxx") headers.
I've made a few changes to MW elsewhere, and would like to make mw_iast.txt consistent. Would that be ok with you if I do that now, @Andhrabharati ?
The following is the beginning portion of the file I was doing (but stopped)- (in comp. for xxx) headers -1.txt
I can resume the work in a day or two. [Presently (for about a week now) I am working on Vacaspatyam, and it is coming out quite well. Did some formatting and editorial corrections, and am now resolving the abbr.s (finished all those with 40+ occurrences so far, and some more). Would need your opinion once I finish the rest, to continue further.]
> I've made a few changes to MW elsewhere, and would like to make mw_iast.txt consistent. Would that be ok with you if I do that now, @Andhrabharati ?
Absolutely no issues. Pl. go ahead.
I've seen your recent 4th April update as the latest. Is there any further one?
Why don't I do the work to integrate your in.comp.for.xxx.headers.-1.txt now, before you do further work on mw?
OK; then you will get the file by tomorrow night (my time).
(the full file)
Will await your notification before proceeding.
> it is proper that this text not be repeated as we merge the annexure entries into the body of the text.
Agree.
@funderburkjim
I tried to open the MW file(s) to resume the work the other day, but my mind seems unwilling to be diverted from the present Vacaspatyam abbr.s task.
So continuing with Vacaspatyam, and so far finished all 5 & above occurrences and part of the rest.
I might need another 3-4 days to finish this work, so that I can be back to MW again.
If you wish, you may go ahead with whatever updates you have in mind on MW.
(I would suggest that you not touch the portion of the comp. word headers that I gave before, so that I can do that piece of work as a complete one, once I take it up and finish.)
> I might need another 3-4 days to finish this work, so that I can be back to MW again.
I'm absolutely in love with what I see.
@funderburkjim
This is the issue I was referring to at https://github.com/sanskrit-lexicon/SKD/issues/16#issuecomment-1355955825.
Hoping that you would be covering the annexure pages in your present work, I guess this issue may be closed now.
As I would be covering this point in my current review, this issue is closable now.
I would be posting the split portions here, as I complete them.