Closed funderburkjim closed 1 year ago
@Andhrabharati Here is the IAST version for your use.
temp1_mw_extra_iast.zip
NOTE: above not used. Instead use temp_mw_01_iast.zip mentioned in comment below.
I'm handing the baton to you for further accent correction.
Request: Please return to me a file WITH THE SAME NUMBER OF LINES. This will make it much easier for me to analyze what you have done and then install into csl-orig.
If you find the need to change the number of lines, please defer these changes.
You may want to review the two_accent.txt file of sanskrit-lexicon/MWS#142, or you may prefer to let @AnnaRybakovaT do this.
LOOKING FORWARD TO WHAT YOU FIND!
@funderburkjim
I can start this only after a couple of days.
Meanwhile, can this update (so far done) be made public?
I am sure no one would notice the difference, as has been the case all these years!!
[This would be a silent (but worthy) improvement on the existing data.]
The changes (of sanskrit-lexicon/MWS#142) already are public!
I'll aim to install the latest user corrections to MW before you turn to accent review, and will post a comment here when these user corrections have been made, and will revise your iast version accordingly.
@funderburkjim
If @AnnaRybakovaT is to take up the two-accents file, she can easily finish the task with the 'tool' that I suggested at https://github.com/sanskrit-lexicon/MWS/issues/142#issuecomment-1362341730
The purpose of reviewing the two-accent file is to compare consistency of mw.txt (as shown in the two_accent.txt file) with the mw scans for this collection of headwords. For each metaline of the two_accent file, either
PW is not needed for this analysis.
We are not attempting to get consistency between MW and PW, which would be a much more subtle determination.
We are attempting to get mw.txt consistent with mw scan in the 'accent' dimension.
OK, this is a fairly simple task then.
@funderburkjim
Just opened the file and saw that it contains 178 cases of two (or more) accents, and not 177 as mentioned by you.
The line <L>45102.1<pc>257,3<k1>kartavE<k2>ka/rtavE/<e>1
is missed in your search for some reason.
I used the same regex-- <k2>[^<]*[\/^][^<,]*[\/^]
as you did.
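For reference, the regex quoted above can be exercised in a few lines of Python; the first sample metaline below is the kartavE line from this thread, the second is a hypothetical accent-free line added for contrast.

```python
import re

# Regex quoted above: a <k2> field containing two (or more) accent
# marks ("/" or "^") before the next tag or comma.
TWO_ACCENTS = re.compile(r'<k2>[^<]*[/^][^<,]*[/^]')

lines = [
    "<L>45102.1<pc>257,3<k1>kartavE<k2>ka/rtavE/<e>1",  # from this thread
    "<L>1<pc>1,1<k1>a<k2>a<e>1",                        # hypothetical, no accents
]
hits = [l for l in lines if TWO_ACCENTS.search(l)]
print(len(hits))  # 1 (only the kartavE metaline)
```

Running the same search over the whole of mw.txt is what should produce the 178 matches mentioned above.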
not attempting to get consistency between MW and PW, which would be a much more subtle determination.
@Andhrabharati do you believe it would ever make sense?
Strictly speaking, YES; they should tally when the same word is being referred to.
I would just bring up the point that BR have chosen to put accents on Devanagari text, and MW opted to put them on Roman text.
It would be a mistake for anyone to consider that they use different notations; it is just a script difference, with no absolute accent difference between them. If they are transcribed into any other script for comparison, they have to tally -- no second thought on this point.
And the -ar and a few other endings that Böhtlingk had opted for could easily be taken care of, to be in sync with all the other works. [It's not at all a big deal, if one is serious enough.]
@funderburkjim
Here are my remarks wrt your readme_extra file-- readme_extra (AB).txt
You may go through these once and take appropriate action.
scripts for comparison, they have to tally -- no 2nd thought on this point.
That would be of utmost interest, as I'm interested in an index - index to all the words from Sanskrit dictionaries, accents included, where known.
And the -ar and a few other endings that Böhtlingk had opted for could easily be taken care of, to be in sync with all the other works. [It's not at all a big deal, if one is serious enough.]
Yes, all such issues are noted. Not tens of them.
As the two_accent file is very small, I started looking at it and have seen 100 entries so far (out of 178).
Noted 44 <k2> corrections (and accordingly in the following header line's <s> text), and 3 <k1> corrections out of the 100.
[I'm looking mostly at the meta-line alone, in a constrained manner!!]
Also noted a few cases where the accent is added though it is not in print.
Possibly, there might be contra-cases where the accent in print got missed in the text. [We've seen many such cases in the annexure portion, when we looked at it about 2 years back.]
This makes me think (and decide) to read the full HWs once, instead of just the entries with accent marks (in the text), in my next perusal.
The changes (of sanskrit-lexicon/MWS#142) already are public!
@funderburkjim
I see that the CDSL search has agnī-varuṇau only,
as against the agnī́-váruṇau in the iast file I got from you.
What is causing this difference?
Finished looking at the 178 entries identified above.
And the file with my remarks for your necessary action is hereunder, @funderburkjim -- two_accents_iast (AB).txt
[I am sure that you can identify the differing entries very easily from this file; as such I did not mark them separately.]
-------------------
Some interesting findings (in the full iast file)--
- there are ~60 <k2> strings having a space in between, of which some are to be removed.
- there are ~460 <k2> strings having a comma in between, which should properly be marked and listed as OR group entities.
@Andhrabharati I have passed the baton to you. Thus I will leave these corrections to you to do. Remember -- I am working on 'user corrections' today for mw, and will likely finish tomorrow. At that time I will construct a new iast version for your use and post it here. Then, you can put your corrections into that file.
OK; then, things will take place sometime later at my end.
I haven't touched the iast file yet, except for random browsing.
Thanks!
Just a small query, @funderburkjim: would you have left the corrections in these two files, if they were undertaken by @AnnaRybakovaT?
It is better that you make the corrections mentioned in readme_extra.AB.txt, and two_accents_iast.AB.txt, since you fully understand exactly what needs to be done. If Anna had reviewed two_accents, probably you and she would have worked together, and in the end you would have included the corrections in your iast file.
Once you've made corrections to iast file and made the revised file available, my task will include such steps as:
Hopefully there will be only a few of these. We may have to develop some new procedure to make this less burdensome.
- there are ~460 strings having a comma in between, which should properly be marked and listed as OR group entities.
I guess these, and any more such found in my perusal, would have to be made into addl. entries, with appropriate taggings.
This makes me think (and decide) to read the full HWs once, instead of just the entries with accent marks (in the text), in my next perusal.
Hurray.
there are ~60 strings having a space in between, of which some are to be removed.
there are ~460 strings having a comma in between, which should properly be marked and listed as OR group entities.
Would love to hear more details in 2023.
I'm not sure how to get your 460.
My first thought was that by 'comma between' you were talking about commas in the 'k2' field of metalines, which occurs when a word shows two accent patterns. I include the 'multiple accent patterns' comment below for possible future reference.
But then I realized that you were likely talking about cases that could be considered as alternate headwords which have been 'missed' -- and any such might very well require new entries.
I thought you were going to focus on corrections to accent markup.
If these multiple headwords are top of mind for you now, we can deal with them first AND IN A SEPARATE ISSUE. It will help me think with you on this if you provide a file of the 460. Or, consideration of these 460 can be done after your accent correction -- your choice.
identified by Comma in k2 due to multiple accent types
189 matches for "<info or="[^";]+"
15 matches for <info and="[^";]+
This takes care of 189+15 = 214 of your 460 cases.
160 matches for <L>.*?<k2>.*?,.*<e>[1-4][A]$
None of these have, nor need to have, <info or/ clauses.
For some headwords (e.g. acyuta), there is a comma in the first entry (L=1909), and there definitely should be an accompanying <info or in the first entry.
But what about the next 'A' entry L=1909.1 ?
Sometimes (as in L=1909.1), I copied the k2 of the 'parent' L=1909, and more often I did not so copy. In the cases where I did copy the parent k2, I never made an <info or in the child (e.g. L=1909.1).
PROBABLY it would be better never to copy the k2 from parent to child.
In other words, these 160 account for 160 of your 460.
115 matches for <L>.*?<k2>.*?,.*<e>[1-4][B]$
Partially similar with the 'B' cases (and 2 'C' cases).
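The counting described above can be sketched with the regexes as quoted. The sample metalines below are hypothetical (loosely modelled on the acyuta example), so the printed counts will not match the 189/15/160/115 figures, which come from the full mw.txt.

```python
import re

# Patterns quoted in this comment; counts of 189, 15, 160, 115 above
# refer to the full mw.txt, not to these toy samples.
pat_info_or  = re.compile(r'<info or="[^";]+')
pat_info_and = re.compile(r'<info and="[^";]+')
pat_comma_A  = re.compile(r'<L>.*?<k2>.*?,.*<e>[1-4][A]$')
pat_comma_B  = re.compile(r'<L>.*?<k2>.*?,.*<e>[1-4][B]$')

samples = [  # hypothetical metalines, for illustration only
    '<L>1909<pc>9,2<k1>acyuta<k2>a/cyuta,acyu/ta<e>1A',
    '<L>1910<pc>9,2<k1>acyutatva<k2>acyuta—tva/<e>1',
]
counts = {
    'info or':  sum(bool(pat_info_or.search(l))  for l in samples),
    'info and': sum(bool(pat_info_and.search(l)) for l in samples),
    'comma A':  sum(bool(pat_comma_A.search(l))  for l in samples),
    'comma B':  sum(bool(pat_comma_B.search(l))  for l in samples),
}
print(counts)  # {'info or': 0, 'info and': 0, 'comma A': 1, 'comma B': 0}
```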
I've now finished user corrections for mw.
temp_mw_01_iast.zip is ready for you. (Construction notes are in readme_iast.txt.)
Just 36 corrections!
@funderburkjim
Would you be willing to consider my working, if I do some (kind of) major changes? Of course, I well understand the need for keeping the no. of lines the same, and would abide by that 'norm'.
I will go step-by-step, so that you can 'follow' the process without much effort.
As I have to look at every page and entry, I think this is the best opportunity to read the text fully wrt the print, and incorporate necessary changes/corrections in the mw text.
I have some good reasons, to have decided on taking up this path.
The 'additional' work (from what you listed above) I foresee at your end is mostly to write (or re-run, as I am sure you would've already written those earlier) some small programs, to correlate my working with the CDSL text.
I think @Andhrabharati has constantly shown his interest in doing major overhauls in one go. I think the way we can do this is as follows.
Does this make sense to all concerned? If this goes through, we would be able to take maximum advantage of @Andhrabharati's potential.
Quite happy to see you coming in, @drdhaval2785 !!
You have come up with a good proposal, and I would like to say that we two can do the work on MW and involve Jim at a later (final?) stage, so that he can be on other major works-- PWG, pwk, and a similar (biblio-related) work on Vacaspatyam(!!) that I am shortly going to offer him. [I'd just hope that you don't "disappear" for a long duration in-between.]
Observation-1a:
There are two lines (575366 & 575369) having two broken vertical bars; there should be only one per line.
<s>yāvac—chakti</s> ¦ (for <s>-śak°</s>; <ls>A.</ls>) or <s>yāvac—chak°ti-tas</s> (<ls>Kād.</ls>), ¦ <lex>ind.</lex> according to power.
to be changed as
<s>yāvac—chakti</s> (for <s>-śak°</s>; <ls>A.</ls>) or <s>yāvac—chak°ti-tas</s> (<ls>Kād.</ls>), ¦ <lex>ind.</lex> according to power.
I would be looking for trivial (and non-intrusive) errors as well (like this) in my working.
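A check for Observation-1a can be mechanized; this is a sketch (the helper name is mine), run here on the yāvac—chakti body line quoted above.

```python
# Flag body lines whose count of the broken bar "¦" differs from the
# expected one-per-line (see Observation-1a). Intended to be run on
# body lines only; helper name is illustrative.
def bar_anomalies(lines):
    """Return (1-based line number, bar count) for lines with != 1 bar."""
    return [(i, l.count('¦')) for i, l in enumerate(lines, 1)
            if l.count('¦') != 1]

line = ('<s>yāvac—chakti</s> ¦ (for <s>-śak°</s>; <ls>A.</ls>) or '
        '<s>yāvac—chak°ti-tas</s> (<ls>Kād.</ls>), ¦ <lex>ind.</lex> '
        'according to power.')
print(bar_anomalies([line]))  # [(1, 2)]
```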
@Andhrabharati
I am OK with the suggestion put forward by you. I may not be able to chip in on daily basis, but maybe weekly basis.
Well understood, @drdhaval2785 !
Even I am not looking for any daily works. Shall we wait for @funderburkjim to give his nod, or do we move ahead? Of course I might be starting in the new year (2023), as @gasyoun said somewhere above.
And I would like to close this 'accent' related issue with the corrections based on my above posted two files, and start a new issue "Thorough review of MW text", as my proposal is far beyond just accents.
Observation-1b:
There is one case each of ,¦ (660424) and ¦, (102199), which need to be changed to , ¦
Observation-1c:
There are just 6342 instances of , ¦ , while there are 214073 instances of [^,] ¦ ; i.e. almost everywhere in the text the comma is missing before the broken bar in the body-line.
A cursory look at any page of MW clearly shows that every HW entry has a comma separating it from the word-ending (if given) or the gender info, before the meaning etc. is started. This is the notation adopted by MW. So we need to insert the comma in almost all those 214K cases.
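The mechanical part of that insertion could be sketched as a regex substitution (the function name and the sample body line are mine; this is only a starting point, since not every one of the ~214K sites will actually want a comma).

```python
import re

# Insert a comma before " ¦" wherever the preceding character is not
# already a comma. A sketch only -- results still need manual review.
def add_comma_before_bar(line):
    return re.sub(r'([^,]) ¦', r'\1, ¦', line)

before = '<s>agni</s> <lex>m.</lex> ¦ fire, sacrificial fire'
print(add_comma_before_bar(before))
# <s>agni</s> <lex>m.</lex>, ¦ fire, sacrificial fire
```

The substitution is idempotent: lines that already have , ¦ are left untouched.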
Observation-1d:
There are almost 2100 lines having ¦ <s> , and I see that most of them need some correction (relocating the ¦, and/or some addl. marking) to make them 'truly represent' what the print book indicates.
This finishes my study on the body marker ¦ .
Now I wait for the response from @funderburkjim and @drdhaval2785, to know if they agree with how I am going to work with the MW text, and are willing to incorporate all such corrections in the CDSL file.
If they feel that I am doing some irrelevant and uncalled-for (extra) work, I don't have to proceed this way; I will instead try to limit myself to what they suggest, so that my time and effort are spent in a useful manner.
The organizational ideas above by Andhrabharati and Dhaval seem constructive. As in Dhaval's 4-points, let's start slowly, methodically. And with good documentation.
I would like to be kept in the loop in the beginning. Once the method stabilizes, I would be happy to have my main function reduced to installation of revisions on Cologne server (currently I am the only one with ssh access to this server).
Suggest each 'step' (or small set of steps) have:
https://github.com/sanskrit-lexicon/MWS/issues/NNN
https://github.com/sanskrit-lexicon/MWS./mwsissues/issueNNN
Let @Andhrabharati do a first step to get this multiple-step process started!
Good to hear this, @funderburkjim !
Hopefully, by next Christmas we'd have MW text brought closer to the print with some value added markings.
Merry Christmas to you, and thanks for spending time to correct the accents portion to a major extent; I do not know if anyone earlier had 'bothered' about this important point, but I sure did.
I have some good reasons, to have decided on taking up this path.
I believe we are ready for the yearly Skype call. How about 5th of January? We had it 12 am NY time @funderburkjim?
"Thorough review of MW text", as my proposal is far beyond just accents.
Of much interest to listen to the scope over Skype.
Hopefully, by next Christmas we'd have MW text brought closer to the print with some value added markings.
Yeah a year of weekly loops sounds reasonable.
thanks for spending time to correct the accents portion to a major extent; I do not know if anyone earlier had 'bothered' about this important point, but I sure did.
As far as I'm aware it was never considered an issue. For me it's important because now I will be able to add accents to my index of all known Sanskrit words.
And I would like to close this 'accent' related issue with the corrections based on my above posted two files, and start a new issue "Thorough review of MW text", as my proposal is far beyond just accents.
Let @Andhrabharati do a first step to get this multiple-step process started!
On second thought, I think it is better to close this issue here itself as is (as these points would anyway be covered in the wholesome reading), and start a new issue for a full 'reviewing'.
And I have separated out the trailing 'info' tags, and also removed the slp1 texts throughout (under 's1' tags: 53100, and 'ab n=' tags: 2540). [The slp1 strings could easily be regenerated if and when required; and the info tags could appropriately be attached (from the file) at the end, or regenerated. Personally, I consider all other tags as just fanciful, except the 'sup, rev, and, or' tags, which have some practical use.]
This facilitates a free (and unobtrusive) reading of the file.
Here are the two files, that I have made from the temp_mw_01_iast.txt file-- temp_mw_01_iast_plain (AB).zip trailing info tags.zip
I hope this is acceptable.
Do you have Python on your computer?
Two programs added to issue145: diff_to_changes_dict.py and updateByLine.py.
These help to analyze what AB did.
Do you have Python on your computer?
Yes, I have it.
These help to analyze what AB did.
I just used regex process, no programming.
It is awkward to have space-characters in file names. I renamed
temp_mw_01_iast_plain (AB).txt
to temp_mw_02_iast.txt
trailing info tags.txt
to temp_trailing_info.txt
The .gitignore file in this repository has a statement 'temp*', which means that filenames starting with 'temp' are ignored by GIT. i.e., they exist locally, but are ignored when commits are made. They are 'working files', but don't cause bloat at Github.
There is some art in deciding what should be temp and what should not be temp.
temp_mw_01_iast.txt is the original iast version (for this repository). temp_mw_02_iast.txt is AB's revision.
Both files have the same number of lines. Hurray!
wc -l temp_mw_*_iast.txt
880519 temp_mw_01_iast.txt
880519 temp_mw_02_iast.txt
This is where the diff_to_changes_dict.py file is useful. It is applicable since the two files have the same number of lines.
# Create temp_change_02_iast.txt
I chose the output to be a 'temp' file, since I'm not sure at this point what to expect.
python diff_to_changes_dict.py temp_mw_01_iast.txt temp_mw_02_iast.txt temp_change_02_iast.txt
# This is printed to terminal
# 276923 changes written to temp_change_02_iast.txt
# WOW a lot of lines are changed. Will discuss my impressions of these below.
updateByLine.py constructs a new file from an old file and a change file. In the present situation it is really not needed, but we can use it as follows to check all is well:
python updateByLine.py temp_mw_01_iast.txt temp_change_02_iast.txt temp.txt
Output to terminal is
880519 lines read from temp_mw_01_iast.txt
880519 records written to temp.txt
276923 change transactions from temp_change_02_iast.txt
276923 of type new
Now we can compare temp_mw_02_iast.txt and temp.txt, using diff utility (part of git bash terminal)
diff temp_mw_02_iast.txt temp.txt | wc -l
# output is 0. This means there are 0 lines that differ, i.e. there is no difference.
# This is as expected.
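The behavior described for diff_to_changes_dict.py and updateByLine.py boils down to a line-number-keyed change dictionary. The actual scripts are in the issue145 folder; what follows is my reconstruction of the idea, not their code.

```python
# Build a change dict from two same-length files, then re-apply it.
# Reconstruction of the described behavior, not the actual scripts.
def changes(old_lines, new_lines):
    """Map 1-based line number -> new text, for lines that differ."""
    assert len(old_lines) == len(new_lines)  # same line count is required
    return {i: new for i, (old, new) in enumerate(zip(old_lines, new_lines), 1)
            if old != new}

def apply_changes(old_lines, change_dict):
    """Rebuild the new file from the old file plus the change dict."""
    return [change_dict.get(i, line) for i, line in enumerate(old_lines, 1)]

old = ['<k2>agni/', '<k2>agni^', '<k2>indra']
new = ['<k2>agni/', '<k2>agni/', '<k2>indra']
d = changes(old, new)
print(d)                             # {2: '<k2>agni/'}
print(apply_changes(old, d) == new)  # True
```

This is also what the round-trip check above verifies: applying the 276923-entry change file to temp_mw_01_iast.txt must reproduce temp_mw_02_iast.txt exactly.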
Just noticed I should have put these comments in sanskrit-lexicon/mw-dev#2!
I've copied issue145 folder to issue146.
Agree that we can close this issue145 now.
believe we are ready for the yearly Skype call. How about 5th of January? We had it 12 am NY time
@gasyoun That date/time is fine with me.
That date/time is fine with me.
Right? @Andhrabharati , @drdhaval2785 , @SergeA ?
I will not be able to join on that day. Saturday or Sunday after 4 pm IST would be suitable for me.
Saturday better than Sunday. Saturday 4PM IST = Saturday 5:30AM EST Quite early.
What about Saturday 8PM IST = Saturday 9:30AM EST =? Saturday 7:30PM MSK
Further review of accents in MWS, based on the version of MW at sanskrit-lexicon/MWS#142; namely, the version of mw.txt in the sanskrit-lexicon/csl-orig repository at v02/mw/mw.txt at commit 360db2b.