Closed funderburkjim closed 1 year ago
[Sorry that I am progressing further, before the 1st file content is accepted.] [Reminder: §4 and §6 are yet to be taken up]
Thanks, @Andhrabharati
These are in addition to those mentioned several comments up. Most, if not all, are mentioned in AB's documentation above.
1 &c.</ls>
-> </ls> &c.
</s>+ <s>
=> +
</s>√ <s>
=> √
</s> √ <s>
=> √
</s> <s>
=>
</ls>&c.
=> </ls> &c.
&c.</ls>
=> </ls> &c.
- √
=> -√
&</ls>
=> </ls> &
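A minimal sketch of how the clear-cut items in the list above could be applied programmatically. Only the pairs whose "before" and "after" sides are unambiguous in this thread are included; the names `REPLACEMENTS` and `apply_replacements` are hypothetical, and this is not the program actually used on mw_iast.txt.

```python
# Ordered (old, new) pairs copied from the unambiguous items in the list.
# Pairs whose spacing is unclear in the thread are deliberately left out.
REPLACEMENTS = [
    ("</ls>&c.", "</ls> &c."),
    ("&c.</ls>", "</ls> &c."),
    ("&</ls>", "</ls> &"),
    ("- √", "-√"),
]


def apply_replacements(line, pairs=REPLACEMENTS):
    """Apply each literal replacement in order; return the new line and hit counts."""
    counts = {}
    for old, new in pairs:
        hits = line.count(old)
        if hits:
            counts[old] = hits
            line = line.replace(old, new)
    return line, counts
```

Returning the per-pattern counts makes it easy to compare against the case counts quoted in the comments (for example the "205 cases" figure) before committing a change.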
As far as I can tell, the metaline problems mentioned with mw_iast_AB_1.txt have not been corrected in mw_iast_AB_2.txt. The following file shows all cases (900+) where mw_iast_AB_2.txt metaline differs from the original mw_iast.txt. Most, perhaps all, of these differences are inadvertent errors in mw_iast_AB_2.txt.
Highest priority is to correct those metalines in mw_iast_AB_2.txt. @Andhrabharati can you do this?
This is the file mentioned in previous comment:
Because of the rather large number (900+) of metaline problems, it might be safer to write a program to revert all the metalines in mw_iast_AB_2.txt back to their value in mw_iast.txt. Let me know if you prefer this program approach, and I'll do it tomorrow.
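A rough sketch of what such a revert program might look like. It assumes (this is not confirmed in the thread) that a metaline is any line starting with `<L>`, and that the k-th metaline of the edited file corresponds to the k-th metaline of the original. The function name `revert_metalines` is hypothetical.

```python
def revert_metalines(original_lines, edited_lines):
    """Replace every metaline in edited_lines with the matching original metaline.

    Assumption: metalines start with "<L>" and occur in the same order and
    number in both files. Body lines of the edited file are kept untouched.
    Returns the new list of lines and the number of metalines that differed.
    """
    originals = [ln for ln in original_lines if ln.startswith("<L>")]
    edited_count = sum(1 for ln in edited_lines if ln.startswith("<L>"))
    if edited_count != len(originals):
        raise ValueError(
            f"metaline count mismatch: {len(originals)} vs {edited_count}"
        )
    remaining = iter(originals)
    out, changed = [], 0
    for ln in edited_lines:
        if ln.startswith("<L>"):
            orig = next(remaining)
            if orig != ln:
                changed += 1
            out.append(orig)
        else:
            out.append(ln)
    return out, changed
```

The count check is there so that the program refuses to run if entries were added or removed; in that case positional matching would silently pair the wrong metalines.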
Now that you are agreeing on the "good corrections", why not you do them programmatically, once I point them? (This was my initial expectation starting with just mentioning the points.)
This would eliminate my manual correction errors and we can close them quickly.
Once these "punctuation related" items are done with, which have a role to play in next phases of processes, the actual "text related" corrections could be started.
But let me check why these metaline errors remain in my file; it is a matter of highest concern.
> Once these "punctuation related" items are done with, which have a role to play in next phases of processes, the actual "text related" corrections could be started.
Understood.
* 205 cases: `</s> <s>` => ` `
I could see only these (and 4 of them), and two are corrected with a comma between; other two are under my ;; comments, that are to be appropriately corrected.
> Metaline problems remain
> Highest priority is to correct those metalines in mw_iast_AB_2.txt. @Andhrabharati can you do this?
I could trace out the issue. I was doing some macro based bulk replacements on a file in another window, and this had conflicted with the manual find/replace operations here.
I am really sorry that I had wasted your time, due to this error. [Lesson to remember: Never do parallel operations with "single clipboard", esp. when working on bulk texts.]
I had corrected all those meta-lines, except some 4 or 5, which are having my ;; comments.
Now pushing my 2a file with these changes. (And I hope only the intended changes are present in my file.) [Felt bad for my error, hence not attempting more changes this time, though there are plenty in line!]
BTW, there are <page col> tags inside the Koeln data, though not as <pc> (as I said earlier) but as <pcol> [about 1500 of them], which are used to specify the actual (p,c) content. So the <pb> should have been meant only to indicate a page/column break within the entry text (as seen correctly used thus at many places); that's the reason for my comment while doing the annexure data.
Summary: This means where there is no break within the entry text, this <pcol> tag should be used.
@Andhrabharati
Have finished a first analysis of mw_iast_AB_2a.txt.
LOOKS GOOD!
My procedure has been to:
<div
);
or R(
or (O
)) to a separate file for later review
The result currently is that all but one trivial difference is removed.
For some reason, you omitted the first line <?xml version="1.0" encoding="UTF-8"?> from mw_iast_ab_2a.txt.
Also, no problems with metalines at all (although I did notice you intentionally corrected a few <pc> values).
Tomorrow I will review some of the ad-hoc kinds of changes you made -- I doubt if I'll have any complaints, but want to do a thorough examination nonetheless.
So, I think you can go ahead with your next phase, mw_iast_AB_3.txt (based on AB_2a).
Note on 205 cases: `</s> <s>` => ` `: These are changes made to mw_iast.txt to be in concord with mw_iast_AB_2[a].txt.
> why not you do them programmatically, once I point them
I think that is what I'm doing with my controlled sequence of alterations of mw_iast.txt.
I'll document these programmatic steps as time permits.
I took <pc> in the metaline to indicate the start of the entry, not the end; and <pb> to indicate a (p,c) break within the entry text.
Is this understanding correct? I did not find them detailed in the one or two places where the tagging (markup) is described.
@funderburkjim
> The result currently is that all but one trivial difference is removed. For some reason, you omitted the first line <?xml version="1.0" encoding="UTF-8"?> from mw_iast_ab_2a.txt.
This was not intended; it happened by mistake, I guess. I have now added the line back into the file.
> LOOKS GOOD!
> I think that is what I'm doing with my controlled sequence of alterations of mw_iast.txt.
> So, I think you can go ahead with your next phase, mw_iast_AB_3.txt (based on AB_2a).
May I ask you to post your updated mw_iast file after your reviews (for me to do next phase of work on it), so that I would also have some feeling of going progressively? (I understand that you're updating the file anyway).
Probably we can work on a diff. title mw_iast_AB.txt, so that no conflicts with mw_iast.txt creep-in till this reaches a sufficiently reasonable position to get back that (mw_iast.txt or mw.txt) name for public release.
Or do you wish to wait till my corrections are "over", which I don't think would take anything less than 3-4 months' time?
The main reason why I wanted this progressive way, is not to increase the number of files at the repo, with as much size as 50MB each (even if it is of unlimited capacity).
With Github's inherent handling of files with updates (internal to the file), one single file with all changes marked in it should be the way to go, in my opinion; rather two files- one in your format, and one in my format (single lines records).
> With Github's inherent handling of files with updates (internal to the file), one single file with all changes marked in it should be the way to go, in my opinion; rather two files- one in your format, and one in my format (single lines records).
2 is ok, 20 would be a mess.
@Andhrabharati
In reviewing changes of mw_iast_AB_2a.txt, I am finding a few that I think you need to change.
It will take me a day or two to finish this review and prepare the list of suggested changes.
Please wait until I do this before you post another version.
Sure @funderburkjim; in fact I am not doing any "active work" on this now, but awaiting your response to my above posting.
And here are two more points in the list (to be corrected)-
§8. The possessive indicator ('s) should not be within the <ls> or <s> strings, but only after them.
`'s to be replaced with 's (3 occurrences) & 's with 's `(1 occurrence)
Likewise the possessive ' within <ls> is to be moved out of the <ls> string; within <s>, however, it needs a closer look, as it is also used as an avagraha (elision) mark.
§9. matching pairs
closing quote wrong : 1 line
‘perhaps it is and is not and is not expressible in words' in <L>257473; to be corrected as ‘perhaps it is and is not and is not expressible in words’
`)` count (closing brace) more than `(` count (opening brace): 12 lines
<L>1427.1, <L>6380.2, <L>15163.2, <L>26674.1, <L>32084.1, <L>80435.2, <L>81840, <L>85951.1, <L>129615, <L>135272.1, <L>136734.11, <L>214028.2
`)` count (closing brace) less than `(` count (opening brace): 13 lines
<L>1427.05, <L>6380.1, <L>15163.1, <L>22291.11, <L>37246.2, <L>80435.1, <L>81334.1, <L>119349.1, <L>124485, <L>136733.1, <L>138717.1, <L>139159, <L>206833
`>` count (closing angle) more than `<` count (opening angle): 14 lines
<L>39985, <L>67152, <L>74853, <L>75084, <L>87700, <L>90634, <L>96141.2, <L>97263.1, <L>126361, <L>220610, <L>263600, <L>263600.1, <L>263600.2, <L>263600.3
(smart quotes)
In <L>9426.2, `the esoteric` should be ‘the esoteric’ [with smart quote marks as everywhere else]
In <L>23830.1, `first father' should be ‘first father’ [with smart quote marks as everywhere else]; and this is the only place where apostrophe is used as a closing quote mark!
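The bracket and angle counts in §9 can be reproduced with a simple per-line count. This is only a sketch of that kind of check, not the script that produced the lists above; the function name `unbalanced_pairs` is made up for illustration, and it counts characters without trying to pair them up.

```python
def unbalanced_pairs(line, pairs=(("(", ")"), ("<", ">"))):
    """Return {pair: closing_count - opening_count} for every unbalanced pair.

    A positive value means more closing than opening characters on the line,
    a negative value means fewer. Balanced pairs are omitted.
    """
    report = {}
    for open_ch, close_ch in pairs:
        diff = line.count(close_ch) - line.count(open_ch)
        if diff:
            report[open_ch + close_ch] = diff
    return report
```

Because entries in my file are single-line records, a per-line count is enough there; on the multi-line Koeln format the counts would have to be accumulated per entry instead.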
's to be replaced with 's
More than that, ’ instead of '
https://en.wikipedia.org/wiki/List_of_typographical_symbols_and_punctuation_marks
> 's to be replaced with 's
> More than that, ’ instead of '
That comes somewhat later in my list; presently listing the ones at/in the tags; and esp. the ones not consistent in form wrt others within the text.
§10. Ellipsis character
In the 10 places where ... (three dots) or .... (4 dots) occur, they could be replaced with …, the ellipsis character, which is the actual mark used in such places (by convention). (<L>7684, <L>12824, <L>16115, <L>30000.2, <L>36736, <L>39674, <L>40786, <L>42778.2, <L>74916)
Note: although <L>39674 has the three dots and a couple of + signs in the data, they are not displayed!
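For §10, a small sketch of how the dot runs could be located and replaced. It assumes, as suggested above, that both the three-dot and four-dot runs become a single … character; whether a four-dot run should keep a sentence period is an editorial choice this sketch does not settle. The names `DOT_RUN` and `to_ellipsis` are hypothetical.

```python
import re

# Match a run of exactly three or four periods (not part of a longer run).
DOT_RUN = re.compile(r"(?<!\.)\.{3,4}(?!\.)")


def to_ellipsis(text):
    """Replace each 3- or 4-dot run with U+2026; return (new_text, replacements)."""
    return DOT_RUN.subn("\u2026", text)
```

Single periods (as in `&c.` or abbreviations) are left alone, and longer runs are skipped so they can be looked at by hand.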
Now a radical point is coming up.
§11. Marking of Sanskṛit & English (or Anglicised Sanskrit) words, as mentioned by MW himself and CONSISTENTLY followed throughout the book.
Sanskṛit words (or letters) are denoted in 4 styles - (i) Main line in Nāgari type, with equivalents in Indo-Italic type[1] (ii) Subordinate line (Under the Nāgari) in thick Indo-Romanic type[1] (iii) Branch line, in thick Indo-Romanic type (iv) Branch line in Indo-Italic type
English (and Anglicised Sanskrit) words are denoted in 3 styles - (i) Normal (non-bold, non-italic) initial cap. letters with or without diacritics, for proper nouns. (ii) Normal (non-bold, non-italic) initial cap. letters for plural forms. (iii) Strictly English words like Aryan, Vedic, Brahman [for equivalent आर्य (ārya), वैदिक (vaidika), ब्राह्मण (brāhmaṇa)] etc.
I have seen that such "English" words as mentioned here are marked in both slp1 notation and another mixed (mostly the <s> type Sanskrit) notation in the data now. [There are over 50k such words on the whole.]
Though this is highly debatable, I would strongly suggest to have them all in the "English" forms alone as in the book and probably put under the <ns> (non-Sanskrit) tagging as was the case earlier.
[We have no "moral right" to change the author's style/idea- The marking of words as Sanskrit or English,]
(Of course, the changes such as the (sh > ṣ), (ṛi > ṛ) in transliteration are somewhat alright as per modern/current usage for Sanskrit words, but I would say these also need to be as per the book (when they are intended as English words)- the Vishnu, Krishna etc. are too popular spellings to be changed to Viṣṇu and Kṛṣṇa etc.)
§12. number not having a space or another number on either side (except when followed by st, nd, rd, & th)
[0-9]?
<ls>Kathās. vi, 58 (and 1 32?) </ls>
;; should be <ls>Kathās. vi, 58 (& 132 ?)</ls>
;; all others to have a space before ?
[0-9]c
8ch.
;; should be <ls>Sch.</ls>
<ls>Sāh. iv, 14c/v</ls>
;; should be <ls>Sāh. iv, 14 c/v</ls>
<ls>ib. 11305 0c.</ls>
;; should be <ls>ib. 11305</ls> &c.
<ls>RV. x, 12c</ls>
;; should be <ls>RV. x, 120</ls>
p.802col.2, p.1118col.1., p.1200col.3
;; all these should have a space between the number and "col."
[0-9]f : 90 occurrences
;; should be with a space before f
[0-9]i, [0-9]I
<i>upp107475iya</i>
;; to be changed to <i>uppāiya</i>
;; all others (8- i and 2- I) should be [0-9]1, not [0-9]i
[0-9]l
<ab>Introd.</ab> 5l;
;; should be 51 not 5l
<ls>MatsyaP. iii, 9l f.</ls>
;; should be 91 not 9l
0law-book
;; no 0 here
[0-9]o, [0-9]O
;; all these (12- 0 and 1- O) should be [0-9]0
[0-9]of
<ls>BhP. iv, 1, 4of.</ls>
;; should be <ls>BhP. iv, 1, 40 f.</ls>
[0-9]seq : 6 occurrences
;; should be with a space before seq
[0-9]x, [0-9]X
<ls>2 x 33</ls>
;; there is nothing to be <ls> tagged here; should be just 2 x 33
metre of 4x 12
;; metre of 4 x 12
4X 999 syllables
;; 4 x 999 syllables
;; if the x here is to be changed to a multiplication symbol [× (U+00D7)], there are 122 more such places "[0-9] x"
:[0-9]
<ls>BhP. iii:23, 37.</ls>
;; should be <ls>BhP. iii, 23, 37.</ls>
<ls>Hcat. i, 3, 903:3/4</ls>
;; to be changed to <ls>Hcat. i, 3, 903 a/b</ls>
[0-9]:
<ls>RV. ix, 81</ls>,2:
;; to be changed to <ls>RV. ix, 81, 2</ls>:
<s>aparādhaṃ</s> √ 1: <s>kṛ</s>
;; to be changed to <s>aparādhaṃ</s> √ <hom:1.</hom> <s>kṛ</s>
<ab>Vārtt.</ab> 1:
;; to be changed to <ls>Vārtt. </ls> :
<L>69797<pc>377,3<k1>ghuṇ<k2>ghuṇ<e>1 <s>ghuṇ</s> ¦ <ab>cl.</ab> 6. <ab>P.</ab> <s>°ṇati</s>, to go or move about, 48:
;; this should have some Ref. of Dhatup. before 48. (Dhātup. ???, 48 )
<ls>Śak. 7d.</ls>
;; should be <ls>Śak. 7 d.</ls>
<ls>IW. 517n.1</ls>
;; should be <ls>IW. 517, n. 1</ls>
<ls>BhP. x, 4r, 40.</ls>
;; should be <ls>BhP. x, 41, 40.</ls>
Ist or 3rd
;; should be 1st or 3rd
(10thcentury), ‘9handed (?)’, 5jewels
;; space is missing here (10th century, 9 handed, 5 jewels)
"1n" : 8 such places, all before <ls>Kāś.</ls>
;; to be changed to "in"
<ls>Ka1y.</ls>
: 2 places
;; to be changed to <ls>Kaiy.</ls>
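Many of the §12 patterns ([0-9]c, [0-9]f, [0-9]l, [0-9]o, [0-9]x and so on) are instances of one search: a digit immediately followed by a letter, except for ordinal endings. A sketch of such a search is below; it covers only the "digit then letter" half of §12 (not `:[0-9]`, `Ist`, or `Ka1y`), and the names are hypothetical.

```python
import re

# A digit directly followed by an ASCII letter, unless that letter starts a
# complete ordinal suffix (st, nd, rd, th) that ends at a word boundary.
DIGIT_LETTER = re.compile(r"\d(?!(?:st|nd|rd|th)\b)[A-Za-z]")


def suspicious_digit_letter(line):
    """Return every digit+letter pair on the line that is not an ordinal."""
    return [m.group(0) for m in DIGIT_LETTER.finditer(line)]
```

Note that `10thcentury` is still reported, because the `th` there is not followed by a word boundary, which matches the intent of the missing-space item above. Each hit still needs a manual look, since the right correction differs case by case (a space, a 0 for o, a 1 for l, and so on).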
> I would strongly suggest to have them all in the "English" forms alone
They are not pure English words and need additional markup anyway.
@Andhrabharati Here are comments concluding my analysis of mw_iast_AB_2a.txt.
These 200+ changes are, in my method of analysis, individual (i.e., not done by me via programmatic string or regexp replacements). iast_ab_changes.txt
After applying the regular expression changes, and these individual changes, to mw_iast.txt, I am left with a version equivalent to mw_iast_AB_2a.txt.
Other users should take a look at this file -- you can see the level of detail that AB has brought to his task. Kudos to AB!
During examination of the 200+ changes mentioned above, I noted a handful (16) of further changes which I think should be made in mw_iast_AB_2a.txt before proceeding with a next version. ab_iast_changes.txt
@Andhrabharati please take a look and make these changes in your tab-form file.
I suggest making these changes directly in mw_iast_AB_2a.txt and then doing
git add mw_iast_AB_2a.txt
git commit -m "changes per ab_iast_changes.txt. Ref: https://github.com/sanskrit-lexicon/MWS/issues/96"
git push
This will put the revised mw_iast_AB_2a.txt in the repository at Github. We can take this revised mw_iast_AB_2a.txt as the base for further work.
Regarding AB's §8-12.
Such change ideas should definitely be examined in detail.
HOWEVER, I strongly suggest NOT NOW.
The issue we are working on now is 'MW supplement fresh look'.
Let's stick to and finish that objective before addressing the other problems!
@Andhrabharati What do you say?
I do agree to tackling the Annexure work first.
https://github.com/sanskrit-lexicon/MWS/issues/96#issuecomment-765978111
Was just mentioning these, to show that many issues are in the Main text as well to be resolved.
And now I am ready with Annexure data.
> > I would strongly suggest to have them all in the "English" forms alone
> They are not pure English words and need additional markup anyway.
Yes, that's why ns tag; this makes the file easier to manually read, with not much of clutter.
> AB misc. changes
> These 200+ changes are, in my method of analysis, individual (i.e., not done by me via programmatic string or regexp replacements). iast_ab_changes.txt
> Other users should take a look at this file -- you can see the level of detail that AB has brought to his task. Kudos to AB!
yes, these 200+ are to be individually done; no other way.
@funderburkjim
I am using Github Desktop; but the end result (server ⇄ local systems' interaction) is the same with Gitbash or Github Desktop, I guess.
> > > I would strongly suggest to have them all in the "English" forms alone
> > They are not pure English words and need additional markup anyway.
> Yes, that's why ns tag; this makes the file easier to manually read, with not much of clutter.
And it may be noted that some 2500 <ns> are still in the data now; not fully replaced with this slp1 stuff!!
Gone through the two "changes files" posted by Jim and did corrections in my AB_2a file.
And here are my updated "changes files" (where all the concurred entries were removed, to save time in gleaning for essential matter) from @funderburkjim -
At the outset, it appears that except the marking and revision remarks I mentioned, all others were considered well and incorporated by @funderburkjim, though very few were not done for some reason. All these could be seen in my updated changes files above.
Finally did another small correction in my AB_2a file- added spaces on either side of + character, wherever not there (about 50 places of them).
Now the file is being updated at the repository, for further comments.
[I would like to hear a point from Jim- about whether he has not agreed for the marking and revision updates, or has kept them on hold for a future date, so that I can plan for my further work.]
My comments added to the 2 files prepared above by AB:
iast_ab_changes_updated_jim.txt
ab_iast_changes_updated_jim.txt
AB made additional changes in revised mw_iast_AB_2a.txt.
Request @Andhrabharati to make these changes in current mw_iast_AB_2a.txt, then push the changes back up to github.
Then we will be done with this first phase.
> whether he has not agreed for the marking and revision updates, or has kept them on hold for a future date
Once you make the 5 changes above to mw_iast_AB_2a.txt, I will
Glad to hear that the first "update" is going to happen next, after the above works are attended by me.
Sure, will be doing the Annexure part starting with simple corrections; followed by additions and finally the bigger corrections/revisions.
On a second thought, why not we do only the mergers first (or re-looking at non L-beginning or the non-div lines), for the 2500+ lines that were listed in my first point? This would make our reworking on the files (at either side) better. I just got carried away into further mergings, looking at your 3rd phase of corrections for line mergings, at the first entry 'ced'!!
Anyway now I am happy that I have become a "real" team mate with you; it has been nice seeing your way of working.
> 2500+ lines
I see at your comment §1(b), you mentioned lines starting with a space. There are now no such lines. So I am not sure what your previous comment refers to.
Regarding 'mergers' -- if you are referring to correcting the <info n="rev"/> lines, fine with me to work on that first. If you are thinking of something else by "mergers" (such as what you did with ced), please explain further.
Similarly glad to have your insights -- they are improving the mw.txt digitization.
By mergers, I am referring to the 2500+ lines with $ in my file (which are the space etc. lines in the Koeln file).
This 'ced' entry happened to be one such!!
In my language, the "rev" lines will be revisions (with Annexure data), as in your terminology.
; Case 008: L=74913.2, k1=ced, pc=401,3
; Merge 24 following lines into this one line
; Make numerous spacing changes, also some changes of ';' to ','
; One substantive change. Pan 3-1,30 reference changed to Pan 8-1,30 reference.
;; this is what I have suggested doing, to re-look at those 2500+ lines (200+ entries in my file) and do necessary corrections & split where necessary.
;; these are all "corrections" as per the book, not "changes"!
I have updated the AB_2a file with the 5 corrections suggested by Jim.
And I will start "working" with the Annexure data now. (The "line-mergers" can happen later!)
@funderburkjim What should my next updated file be named, continue as 2a or make it as 3?
I just opened the file to start working, and got reminded of my unanswered query here- https://github.com/sanskrit-lexicon/MWS/issues/83#issuecomment-757922415
and a "reminder" for the same here- https://github.com/sanskrit-lexicon/MWS/issues/83#issuecomment-763305415
So having gone through the files Jim has been posting all these days, I have decided to do this way-
(a) keep the existing text as 1st line. [identified as <L> lines]
(b) keep the proposed way how it should be (possibly many times without full tagging) in the 2nd line. [I go for integrating in the "interpreted style"- the option (1) in my above referred posting] [identified as <Ls> lines]
(c) give my remarks in the following line(s), including the additional tagging that would be required in the proposed line. [identified as ;; lines]
And leave the rest of work [reviewing the lines, undertaking the marking(s)/tagging(s) or not, and finally updating the Koeln file] to @funderburkjim, as he is the final authority to decide these.
Will be starting the work with a fresh mind tomorrow.
> I am using Github Desktop; but the end result (server ⇄ local systems' interaction) is the same with Gitbash or Github Desktop, I guess.
So do I.
After AB's changes, this commit (https://github.com/sanskrit-lexicon/MWS/commit/550abf1793c1c8224c2f4d804242ccb4256f64a9) of mw_iast_AB_2a.txt is now the basis for csl-orig/v02/mw.txt (at this commit).
git tells us : 83535 insertions(+), 83581 deletions(-) for mw.txt.
In other words, about 10% of the lines of mw.txt were changed in this commit of mw.txt.
Within text of mw.txt delimited by tag <s>...</s>, text is interpreted as slp1. In particular, a textual period is interpreted as danda.
As example, consider under headword 'iti' this phrase as coded in mw.txt:
<s>ijyA<srs/>DyayanadAnAni tapaH satyaM kzamA damaH . aloBa i/ti mArgo 'yam</s>
Or, rendered as IAST, we still retain the period:
<s>ijyā<srs/>dhyayanadānāni tapaḥ satyaṃ kṣamā damaḥ . alobha íti mārgo 'yam</s>
When viewed in a display with Devanagari output, the result now has danda:
इज्याध्ययनदानानि तपः सत्यं क्षमा दमः । अलोभ इति मार्गो ऽयम्
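To illustrate only the scoping rule (a period is a danda inside <s>...</s> and an ordinary period elsewhere), here is a naive sketch. It does not do the slp1 to Devanagari transliteration that the displays perform, and it does not treat a doubled period specially; the names `S_SPAN` and `periods_to_danda` are made up for this example.

```python
import re

# Non-greedy match of one <s>...</s> span on a line.
S_SPAN = re.compile(r"<s>(.*?)</s>")


def periods_to_danda(line):
    """Turn '.' into the danda (U+0964) inside <s>...</s>; leave other text alone."""
    def fix(match):
        return "<s>" + match.group(1).replace(".", "\u0964") + "</s>"
    return S_SPAN.sub(fix, line)
```

With this, the period in `<ab>cl.</ab>` stays a period, while the one between `damaH` and `aloBa` in the example above becomes a danda.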
> ; these are all "corrections" as per the book, not "changes"!
I'll take your word for this one!
I understand that you can defer work on the mergers until the annexure changes are in place. Deferring is good.
Here are some preliminary thoughts regarding merging that might be useful when we take up the merging corrections later:
Thinking of the 'ced' example, your changes can be viewed as of two types.
The first type of change/correction is definitely desirable.
I think it can be done independently and preliminary to the second type.
Even where a correction goes over two (current) lines, text can be moved from one line to another to make the correction, perhaps leaving one of the lines (temporarily) blank.
We might have to consider (at this correction stage) whether a given line should or should not end with a space.
As to the second type (merging), there is one reason I don't want to do it in general: line-length.
In the mw.txt digitization, it is desirable to me to have modest line length.
If a small change later needs to be made in a line, it is easier to visually identify where the change is made if the total length of the line is shorter.
There are some good reasons for some line merging, mainly so that each line is logically complete.
So there is some tradeoff between readability and logicality -- we would need to develop some principles to guide the merging.
Improvements in logicality of lines has little effect on displays, since the xml form (mw.xml) used by displays always merges all the lines.
However, logicality of the lines in mw.txt could have an indirect advantage of making it more possible to do data mining from the digitization.
For example, a long-time wish is to mine the verb entries of mw for all the many verb forms.
But the current state of mw.txt for verbs makes a data-mining program unfeasible.
> So there is some tradeoff between readability and logicality -- we would need to develop some principles to guide the merging.
We can have a br or div break here, can't we?
@funderburkjim If you elaborate the "verb entries" wish a little more, probably I can tell how to make that wish realised soon!!
Sometimes a different mind (person) gets a solution faster! (fresh mind = fresh idea)
> We can have a br or div break here, can't we?
Possibly such markup could be useful. Let's focus on annexure now.
> In particular, a textual period is interpreted as danda.
> Within text of mw.txt delimited by tag <s>...</s>
Good that not everywhere.
> In the mw.txt digitization, it is desirable to me to have modest line length. If a small change later needs to be made in a line, it is easier to visually identify where the change is made if the total length of the line is shorter.
I use Wrap by Windows mode or more common Wrap by Characters in EmEditor
> For example, a long-time wish is to mine the verb entries of mw for all the many verb forms. But the current state of mw.txt for verbs makes a data-mining program unfeasible.
7 years as of now.
@gasyoun would you explain more about this wish about verbal "data mining"? (even by a personal mail, if you do not like to reiterate here)
I guess, you are more interested in this "verb" topic than any one else (so far), if I understood the events/postings spread across this forum correctly.
> > If a small change later needs to be made in a line, it is easier to visually identify where the change is made if the total length of the line is shorter.
> I use Wrap by Windows mode or more common Wrap by Characters in EmEditor
@gasyoun the point @funderburkjim says is NOT about wrapping to "have the lines to look shorter", but about "locating" a particular string- which is easier in 5-10 lines of wrapped text than in over 50-60 wrapped lines VISUALLY (i.e. not by machine "find")! I do agree with him on this point.
> verbal "data mining"
Yes, me and @Shalu411 are the biggest dhātu fans around, see Jim's Mapping of verbs to MW entries
https://sanskrit-lexicon.github.io/verbs/verbs01/verbs1_merge1_1vmw.html
I've seen this before.
I am more interested to know what "the unfulfilled wish" is about.
As Jim himself has done this, his wording "unfeasible" probably refers to something else.
Hence my enquiry about it.
This issue continues #83.
The changes begin!
@Andhrabharati
Suggest you git add, git commit -m "....", git push as a trial run. Once we're sure the git process works, suggest you commit and push often, so we can comfortably follow your changes.