Closed funderburkjim closed 1 year ago
@Andhrabharati Here is the IAST version for your use.
temp1_mw_extra_iast.zip
NOTE: above not used. Instead use temp_mw_01_iast.zip mentioned in comment below.
I'm handing the baton to you for further accent correction.
Request: Please return to me a file WITH THE SAME NUMBER OF LINES. This will make it much easier for me to analyze what you have done and then install into csl-orig.
If you find the need to change the number of lines, please defer these changes.
You may want to review the two_accent.txt file of sanskrit-lexicon/MWS#142, or you may prefer to let @AnnaRybakovaT do this.
LOOKING FORWARD TO WHAT YOU FIND!
@funderburkjim
I can start this only after a couple of days.
Meanwhile, can this update (so far done) be made public?
I am sure no one would notice the difference, as has been the case all these years!!
[This would be a silent (but worthy) improvement on the existing data.]
The changes (of sanskrit-lexicon/MWS#142) already are public!
I'll aim to install the latest user corrections to MW before you turn to accent review, and will post a comment here when these user corrections have been made, and will revise your iast version accordingly.
@funderburkjim
If @AnnaRybakovaT is to take up the two-accents file, she can easily finish the task with the 'tool' that I suggested at https://github.com/sanskrit-lexicon/MWS/issues/142#issuecomment-1362341730
The purpose of reviewing the two-accent file is to compare consistency of mw.txt (as shown in the two_accent.txt file) with the mw scans for this collection of headwords. For each metaline of the two_accent file, either
PW is not needed for this analysis.
We are not attempting to get consistency between MW and PW, which would be a much more subtle determination.
We are attempting to get mw.txt consistent with mw scan in the 'accent' dimension.
OK, this is a fairly simple task then.
@funderburkjim
Just opened the file and saw that it contains 178 cases of two (or more) accents, and not 177 as mentioned by you.
The line <L>45102.1<pc>257,3<k1>kartavE<k2>ka/rtavE/<e>1
is missed in your search for some reason.
I used the same regex-- <k2>[^<]*[\/^][^<,]*[\/^]
as you did.
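For reference, the regex quoted above can be exercised in a few lines of Python; the first sample metaline below is the kartavE line from this thread, the second is a hypothetical accent-free line added for contrast.

```python
import re

# Regex quoted above: a <k2> field containing two (or more) accent
# marks ("/" or "^") before the next tag or comma.
TWO_ACCENTS = re.compile(r'<k2>[^<]*[/^][^<,]*[/^]')

lines = [
    "<L>45102.1<pc>257,3<k1>kartavE<k2>ka/rtavE/<e>1",  # from this thread
    "<L>1<pc>1,1<k1>a<k2>a<e>1",                        # hypothetical, no accents
]
hits = [l for l in lines if TWO_ACCENTS.search(l)]
print(len(hits))  # 1 (only the kartavE metaline)
```

Running the same search over the whole of mw.txt is what should produce the 178 matches mentioned above.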
not attempting to get consistency between MW and PW, which would be a much more subtle determination.
@Andhrabharati do you believe it would ever make sense?
Strictly speaking, YES; they should tally when the same word is being referred to.
I would just bring up the point that BR have chosen to put accents on Devanagari text, and MW opted to put them on Roman text.
It would be a mistake for anyone to consider that they use different notations; it is just a script difference, with no absolute accent difference between them. If they are transcribed into any other script for comparison, they have to tally -- no second thought on this point.
And the -ar and a few other endings that Böhtlingk had opted for could easily be taken care of, to be in sync with all the other works. [It's not at all a big deal, if one is serious enough.]
@funderburkjim
Here are my remarks wrt your readme_extra file-- readme_extra (AB).txt
You may go through these once and take appropriate action.
scripts for comparison, they have to tally -- no 2nd thought on this point.
That would be of utmost interest, as I'm interested in an index - index to all the words from Sanskrit dictionaries, accents included, where known.
And the -ar and a few other endings that Böhtlingk had opted for could easily be taken care of, to be in sync with all the other works. [It's not at all a big deal, if one is serious enough.]
Yes, all such issues are noted. Not tens of them.
As the two_accent file is very small, I started looking at it and have seen 100 entries so far (out of 178).
Noted 44 <k2> corrections (and accordingly in the following header line's <s> text), and 3 <k1> corrections out of the 100.
[I'm looking mostly at the meta-line alone, in a constrained manner!!]
Also noted a few cases where the accent is added though it is not in print.
Possibly, there might be contra-cases where the accent in print got missed in the text. [We've seen many such cases in the annexure portion, when we looked at it about 2 years back.]
This makes me think (and decide) to read the full HWs once, instead of just the entries with accent marks (in the text), in my next perusal.
The changes (of sanskrit-lexicon/MWS#142) already are public!
@funderburkjim
I see that the CDSL search has agnī-varuṇau only,
as against the agnī́-váruṇau in the iast file I got from you.
What is causing this difference?
Finished looking at the 178 entries identified above.
And the file with my remarks for your necessary action is hereunder, @funderburkjim -- two_accents_iast (AB).txt
[I am sure that you can identify the differing entries very easily from this file; as such I did not mark them separately.]
-------------------
Some interesting findings (in the full iast file)--
- there are ~60 <k2> strings having a space in between, of which some are to be removed.
- there are ~460 <k2> strings having a comma in between, which should properly be marked and listed as OR group entities.
@Andhrabharati I have passed the baton to you. Thus I will leave these corrections to you to do. Remember -- I am working on 'user corrections' today for mw, and will likely finish tomorrow. At that time I will construct a new iast version for your use and post it here. Then, you can put your corrections into that file.
OK; then, things will take place sometime later at my end.
I haven't touched the iast file yet, except for random browsing.
Thanks!
Just a small query, @funderburkjim: would you have left the corrections in these two files, if they were undertaken by @AnnaRybakovaT?
It is better that you make the corrections mentioned in readme_extra.AB.txt, and two_accents_iast.AB.txt, since you fully understand exactly what needs to be done. If Anna had reviewed two_accents, probably you and she would have worked together, and in the end you would have included the corrections in your iast file.
Once you've made corrections to iast file and made the revised file available, my task will include such steps as:
Hopefully there will be only a few of these. We may have to develop some new procedure to make this less burdensome.
- there are ~460 strings having a comma in between, which should properly be marked and listed as OR group entities.
I guess these, and any more such found in my perusal, would have to be made into addl. entries, with appropriate taggings.
This makes me think (and decide) to read the full HWs once, instead of just the entries with accent marks (in the text), in my next perusal.
Hurray.
there are ~60 strings having a space in between, of which some are to be removed.
there are ~460 strings having a comma in between, which should properly be marked and listed as OR group entities.
Would love to hear more details in 2023.
I'm not sure how to get your 460.
My first thought was that by 'comma between' you were talking about commas in the 'k2' field of metalines, which occurs when a word shows two accent patterns. I include the 'multiple accent patterns' comment below for possible future reference.
But then I realized that you were likely talking about cases that could be considered as alternate headwords which have been 'missed' -- and any such might very well require new entries.
I thought you were going to focus on corrections to accent markup.
If these multiple headwords are top of mind for you now, we can deal with them first AND IN A SEPARATE ISSUE. It will help me think with you on this if you provide a file of the 460. Or, consideration of these 460 can be done after your accent correction -- your choice.
identified by Comma in k2 due to multiple accent types
189 matches for "<info or="[^";]+"
15 matches for <info and="[^";]+
This takes care of 189+15 = 214 of your 460 cases.
160 matches for <L>.*?<k2>.*?,.*<e>[1-4][A]$
None of these have, nor need to have, <info or/ clauses.
For some headwords (e.g. acyuta), there is a comma in the first entry (L=1909), and there definitely should be an accompanying <info or in the first entry.
But what about the next 'A' entry L=1909.1 ?
Sometimes (as in L=1909.1), I copied the k2 of the 'parent' L=1909, and more often I did not so copy. In the cases where I did copy the parent k2, I never made an <info or in the child (e.g. L=1909.1).
PROBABLY it would be better never to copy the k2 from parent to child.
In other words, these 160 account for 160 of your 460.
115 matches for <L>.*?<k2>.*?,.*<e>[1-4][B]$
Partially similar with the 'B' cases (and 2 'C' cases).
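The counting described above can be sketched with the regexes as quoted. The sample metalines below are hypothetical (loosely modelled on the acyuta example), so the printed counts will not match the 189/15/160/115 figures, which come from the full mw.txt.

```python
import re

# Patterns quoted in this comment; counts of 189, 15, 160, 115 above
# refer to the full mw.txt, not to these toy samples.
pat_info_or  = re.compile(r'<info or="[^";]+')
pat_info_and = re.compile(r'<info and="[^";]+')
pat_comma_A  = re.compile(r'<L>.*?<k2>.*?,.*<e>[1-4][A]$')
pat_comma_B  = re.compile(r'<L>.*?<k2>.*?,.*<e>[1-4][B]$')

samples = [  # hypothetical metalines, for illustration only
    '<L>1909<pc>9,2<k1>acyuta<k2>a/cyuta,acyu/ta<e>1A',
    '<L>1910<pc>9,2<k1>acyutatva<k2>acyuta—tva/<e>1',
]
counts = {
    'info or':  sum(bool(pat_info_or.search(l))  for l in samples),
    'info and': sum(bool(pat_info_and.search(l)) for l in samples),
    'comma A':  sum(bool(pat_comma_A.search(l))  for l in samples),
    'comma B':  sum(bool(pat_comma_B.search(l))  for l in samples),
}
print(counts)  # {'info or': 0, 'info and': 0, 'comma A': 1, 'comma B': 0}
```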
I've now finished user corrections for mw.
temp_mw_01_iast.zip is ready for you. (Construction notes are in readme_iast.txt.)
Just 36 corrections!
@funderburkjim
Would you be willing to consider my working, if I do some (kind of) major changes? Of course, I well understand the need for keeping the no. of lines the same, and would abide by that 'norm'.
I will go step-by-step, so that you can 'follow' the process without much effort.
As I have to look at every page and entry, I think this is the best opportunity to read the text fully wrt the print, and incorporate necessary changes/corrections in the mw text.
I have some good reasons, to have decided on taking up this path.
The 'additional' work (from what you listed above) I foresee at your end is mostly to write (or re-run, as I am sure you would've already written those earlier) some small programs, to correlate my working with the CDSL text.
I think @Andhrabharati has constantly shown his interest in doing major overhauls in one go. I think the way we can do this is as follows.
Does this make sense to all concerned? If this goes through, we would be able to take maximum advantage of @Andhrabharati's potential.
Quite happy to see you coming in, @drdhaval2785 !!
You have come up with a good proposal, and I would like to say that we two can do the work on MW and involve Jim at a later (final?) stage, so that he can be on other major works-- PWG, pwk, and a similar (biblio-related) work on Vacaspatyam(!!) that I am shortly going to offer him. [I'd just hope that you don't "disappear" for a long duration in-between.]
Observation-1a:
There are two lines (575366 & 575369) having two broken vertical bars; there should be only one per line.
<s>yāvac—chakti</s> ¦ (for <s>-śak°</s>; <ls>A.</ls>) or <s>yāvac—chak°ti-tas</s> (<ls>Kād.</ls>), ¦ <lex>ind.</lex> according to power.
to be changed as
<s>yāvac—chakti</s> (for <s>-śak°</s>; <ls>A.</ls>) or <s>yāvac—chak°ti-tas</s> (<ls>Kād.</ls>), ¦ <lex>ind.</lex> according to power.
I would be looking for trivial (and non-intrusive) errors as well (like this) in my working.
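A check for Observation-1a can be mechanized; this is a sketch (the helper name is mine), run here on the yāvac—chakti body line quoted above.

```python
# Flag body lines whose count of the broken bar "¦" differs from the
# expected one-per-line (see Observation-1a). Intended to be run on
# body lines only; helper name is illustrative.
def bar_anomalies(lines):
    """Return (1-based line number, bar count) for lines with != 1 bar."""
    return [(i, l.count('¦')) for i, l in enumerate(lines, 1)
            if l.count('¦') != 1]

line = ('<s>yāvac—chakti</s> ¦ (for <s>-śak°</s>; <ls>A.</ls>) or '
        '<s>yāvac—chak°ti-tas</s> (<ls>Kād.</ls>), ¦ <lex>ind.</lex> '
        'according to power.')
print(bar_anomalies([line]))  # [(1, 2)]
```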
@Andhrabharati
I am OK with the suggestion put forward by you. I may not be able to chip in on daily basis, but maybe weekly basis.
Well understood, @drdhaval2785 !
Even I am not looking for any daily works. Shall we wait for @funderburkjim to give his nod, or do we move ahead? Of course I might be starting in the new year (2023), as @gasyoun said somewhere above.
And I would like to close this 'accent' related issue with the corrections based on my above posted two files, and start a new issue "Thorough review of MW text", as my proposal is far beyond just accents.
Observation-1b:
There is one case each of ,¦ (660424) and ¦, (102199), which need to be changed to , ¦
Observation-1c:
There are just 6342 instances of , ¦ , while there are 214073 instances of [^,] ¦ ; i.e. almost everywhere in the text the comma is missing before the broken bar in the body-line.
A cursory look at any page of MW clearly shows that every HW entry has a comma separating it from the word-ending (if given) or the gender info, before the meaning etc. is started. This is the notation adopted by MW. So we need to insert the comma in almost all those 214K cases.
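The mechanical part of that insertion could be sketched as a regex substitution (the function name and the sample body line are mine; this is only a starting point, since not every one of the ~214K sites will actually want a comma).

```python
import re

# Insert a comma before " ¦" wherever the preceding character is not
# already a comma. A sketch only -- results still need manual review.
def add_comma_before_bar(line):
    return re.sub(r'([^,]) ¦', r'\1, ¦', line)

before = '<s>agni</s> <lex>m.</lex> ¦ fire, sacrificial fire'
print(add_comma_before_bar(before))
# <s>agni</s> <lex>m.</lex>, ¦ fire, sacrificial fire
```

The substitution is idempotent: lines that already have , ¦ are left untouched.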
Observation-1d:
There are almost 2100 lines having ¦ <s> , and I see that most of them need some correction (relocating the ¦, and/or some addl. marking) to make them 'truly represent' what the print book indicates.
This finishes my study on the body marker ¦ .
Now I wait for the response from @funderburkjim and @drdhaval2785, to know if they agree with how I am going to work with the MW text, and are willing to incorporate all such corrections in the CDSL file.
If they feel that I am doing some irrelevant and uncalled-for (extra) work, I don't have to proceed this way; I will instead try to limit myself to what they suggest, so that my time and effort are spent in a useful manner.
The organizational ideas above by Andhrabharati and Dhaval seem constructive. As in Dhaval's 4-points, let's start slowly, methodically. And with good documentation.
I would like to be kept in the loop in the beginning. Once the method stabilizes, I would be happy to have my main function reduced to installation of revisions on Cologne server (currently I am the only one with ssh access to this server).
Suggest each 'step' (or small set of steps) have:
https://github.com/sanskrit-lexicon/MWS/issues/NNN
https://github.com/sanskrit-lexicon/MWS./mwsissues/issueNNN
Let @Andhrabharati do a first step to get this multiple-step process started!
Good to hear this, @funderburkjim !
Hopefully, by next Christmas we'd have MW text brought closer to the print with some value added markings.
Merry Christmas to you, and thanks for spending time to correct the accents portion to a major extent; I do not know if anyone earlier had 'bothered' about this important point, but I sure did.
I have some good reasons, to have decided on taking up this path.
I believe we are ready for the yearly Skype call. How about 5th of January? We had it 12 am NY time @funderburkjim?
"Thorough review of MW text", as my proposal is far beyond just accents.
Of much interest to listen to the scope over Skype.
Hopefully, by next Christmas we'd have MW text brought closer to the print with some value added markings.
Yeah a year of weekly loops sounds reasonable.
thanks for spending time to correct the accents portion to a major extent; I do not know if anyone earlier had 'bothered' about this important point, but I sure did.
As far as I'm aware it was never considered an issue. For me it's important because now I will be able to add accents to my index of all known Sanskrit words.
And I would like to close this 'accent' related issue with the corrections based on my above posted two files, and start a new issue "Thorough review of MW text", as my proposal is far beyond just accents.
Let @Andhrabharati do a first step to get this multiple-step process started!
On second thought, I think it is better to close this issue here itself as is (as these points would anyway be covered in the wholesome reading), and start a new issue for a full 'reviewing'.
And I have separated out the trailing 'info' tags, and also removed the slp1 texts throughout (under 's1' tags: 53100, and 'ab n=' tags: 2540). [The slp1 strings could easily be regenerated if and when required; and the info tags could appropriately be attached (from the file) at the end, or regenerated. Personally, I consider all other tags as just fanciful, except the 'sup, rev, and, or' tags, which have some practical use.]
This facilitates a free (and unobtrusive) reading of the file.
Here are the two files, that I have made from the temp_mw_01_iast.txt file-- temp_mw_01_iast_plain (AB).zip trailing info tags.zip
I hope this is acceptable.
Do you have Python on your computer?
Two programs added to issue145: diff_to_changes_dict.py and updateByLine.py.
These help to analyze what AB did.
Do you have Python on your computer?
Yes, I have it.
These help to analyze what AB did.
I just used regex process, no programming.
It is awkward to have space-characters in file names. I renamed
temp_mw_01_iast_plain (AB).txt
to temp_mw_02_iast.txt
trailing info tags.txt
to temp_trailing_info.txt
The .gitignore file in this repository has a statement 'temp*', which means that filenames starting with 'temp' are ignored by GIT. i.e., they exist locally, but are ignored when commits are made. They are 'working files', but don't cause bloat at Github.
There is some art in deciding what should be temp and what should not be temp.
temp_mw_01_iast.txt is the original iast version (for this repository). temp_mw_02_iast.txt is AB's revision.
Both files have the same number of lines. Hurray!
wc -l temp_mw_*_iast.txt
880519 temp_mw_01_iast.txt
880519 temp_mw_02_iast.txt
This is where the diff_to_changes_dict.py file is useful. It is applicable since the two files have the same number of lines.
# Create temp_change_02_iast.txt
I chose the output to be a 'temp' file, since I'm not sure at this point what to expect.
python diff_to_changes_dict.py temp_mw_01_iast.txt temp_mw_02_iast.txt temp_change_02_iast.txt
# This is printed to terminal
# 276923 changes written to temp_change_02_iast.txt
# WOW a lot of lines are changed. Will discuss my impressions of these below.
updateByLine.py constructs a new file from an old file and a change file. In the present situation it is really not needed, but we can use it as follows to check all is well:
python updateByLine.py temp_mw_01_iast.txt temp_change_02_iast.txt temp.txt
Output to terminal is
880519 lines read from temp_mw_01_iast.txt
880519 records written to temp.txt
276923 change transactions from temp_change_02_iast.txt
276923 of type new
Now we can compare temp_mw_02_iast.txt and temp.txt, using diff utility (part of git bash terminal)
diff temp_mw_02_iast.txt temp.txt | wc -l
# output is 0. This means there are 0 lines that differ, i.e. there is no difference.
# This is as expected.
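The behavior described for diff_to_changes_dict.py and updateByLine.py boils down to a line-number-keyed change dictionary. The actual scripts are in the issue145 folder; what follows is my reconstruction of the idea, not their code.

```python
# Build a change dict from two same-length files, then re-apply it.
# Reconstruction of the described behavior, not the actual scripts.
def changes(old_lines, new_lines):
    """Map 1-based line number -> new text, for lines that differ."""
    assert len(old_lines) == len(new_lines)  # same line count is required
    return {i: new for i, (old, new) in enumerate(zip(old_lines, new_lines), 1)
            if old != new}

def apply_changes(old_lines, change_dict):
    """Rebuild the new file from the old file plus the change dict."""
    return [change_dict.get(i, line) for i, line in enumerate(old_lines, 1)]

old = ['<k2>agni/', '<k2>agni^', '<k2>indra']
new = ['<k2>agni/', '<k2>agni/', '<k2>indra']
d = changes(old, new)
print(d)                             # {2: '<k2>agni/'}
print(apply_changes(old, d) == new)  # True
```

This is also what the round-trip check above verifies: applying the 276923-entry change file to temp_mw_01_iast.txt must reproduce temp_mw_02_iast.txt exactly.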
Just noticed I should have put these comments in sanskrit-lexicon/mw-dev#2!
I've copied issue145 folder to issue146.
Agree that we can close this issue145 now.
believe we are ready for the yearly Skype call. How about 5th of January? We had it 12 am NY time
@gasyoun That date/time is fine with me.
That date/time is fine with me.
Right? @Andhrabharati , @drdhaval2785 , @SergeA ?
I will not be able to join on that day. Saturday or Sunday after 4 pm IST would be suitable for me.
Saturday better than Sunday. Saturday 4PM IST = Saturday 5:30AM EST Quite early.
What about Saturday 8PM IST = Saturday 9:30AM EST =? Saturday 7:30PM MSK
Further review of accents in MWS, based on the version of MW at sanskrit-lexicon/MWS#142; namely, the version of mw.txt in the sanskrit-lexicon/csl-orig repository at v02/mw/mw.txt at commit 360db2b.