MW supplement fresh look, part 3

funderburkjim commented 3 years ago

This issue continues #83.

The changes begin!

@Andhrabharati

have synced mwtranscode/mw.txt with csl-orig
have recreated mwtranscode/mw_iast.txt
- The \u... problem is resolved.
Have added you to the corrections team, so you should be able to push to sanskrit-lexicon/mws repository.

Suggest you

git clone the mws repository
make a small number of changes to mw_iast.txt
Then git add, git commit -m "....", git push as a trial run.

Once we're sure the git process works,
suggest you commit and push often, so we can comfortably follow your changes.

gasyoun commented 3 years ago

[Sorry that I am progressing further, before the 1st file content is accepted.] [Reminder: §4 and §6 are yet to be taken up]

Thanks, @Andhrabharati

funderburkjim commented 3 years ago

Analysis of mw_iast_AB_2.txt

additional good corrections noted thus far

These are in addition to those mentioned several comments up. Most, if not all, are mentioned in AB's documentation above.

3901 cases: 1 &c.</ls> -> </ls> &c.
4 cases: </s>+ <s> => +
17 cases: </s>√ <s> => √
177 cases: </s> √ <s> => √
205 cases: </s> <s> =>
19 cases: </ls>&c. => </ls> &c.
22 cases: &c.</ls> => </ls> &c.
97 cases: - √ => -√
9 cases: &</ls> => </ls> &
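
A minimal sketch of how bulk replacements like those listed above could be applied and counted. This is not the program used in this thread: `RULES`, `apply_rules`, and the rule order are assumptions, and only the rules whose search and replacement strings are unambiguous in the listing are included.

```python
# Literal (search, replacement) pairs transcribed from the list above.
# Assumption: the order and the exact whitespace are guesses; the listing
# may have lost spaces when it was rendered.
RULES = [
    ("</s> <s>", " "),
    ("</ls>&c.", "</ls> &c."),
    ("&c.</ls>", "</ls> &c."),
    ("- √", "-√"),
    ("&</ls>", "</ls> &"),
]


def apply_rules(text, rules=RULES):
    """Apply each literal replacement in turn; return (new_text, counts)."""
    counts = {}
    for old, new in rules:
        counts[old] = text.count(old)  # occurrences before replacing
        text = text.replace(old, new)
    return text, counts
```

Comparing the `counts` it returns against the case counts listed above would be one way to check that a rule matches what was intended.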

Metaline problems remain

As far as I can tell, the metaline problems mentioned with mw_iast_AB_1.txt have not been corrected in mw_iast_AB_2.txt. The following file shows all cases (900+) where mw_iast_AB_2.txt metaline differs from the original mw_iast.txt. Most, perhaps all, of these differences are inadvertent errors in mw_iast_AB_2.txt.

Highest priority is to correct those metalines in mw_iast_AB_2.txt. @Andhrabharati can you do this?

funderburkjim commented 3 years ago

This is the file mentioned in previous comment:

metaline_problems.txt

funderburkjim commented 3 years ago

Because of the rather large number (900+) of metaline problems, it might be safer to write a program to revert all the metalines in mw_iast_AB_2.txt back to their value in mw_iast.txt. Let me know if you prefer this program approach, and I'll do it tomorrow.
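
A rough sketch of what such a revert program might look like, under stated assumptions: a metaline starts with `<L>` and its id runs up to `<pc>`; each AB record is one line of the form metaline, TAB, body; and records are matched by `<L>` id. The names `META_ID`, `load_metalines`, and `revert_metalines` are hypothetical, not taken from the repository.

```python
import re

# Assumption: the id is the text between <L> and <pc> at the start of a line.
META_ID = re.compile(r"^<L>([^<]+)<pc>")


def load_metalines(orig_lines):
    """Map each <L> id to its full metaline in the original file."""
    table = {}
    for line in orig_lines:
        line = line.rstrip("\n")
        m = META_ID.match(line)
        if m:
            table[m.group(1)] = line
    return table


def revert_metalines(ab_lines, table, sep="\t"):
    """Replace the metaline part of each AB record with the original one.

    Lines are returned without trailing newlines.
    Returns (new_lines, ids_not_found).
    """
    out, missing = [], []
    for line in ab_lines:
        line = line.rstrip("\n")
        m = META_ID.match(line)
        if not m:
            out.append(line)  # comment or continuation line: keep as is
            continue
        _meta, tab, body = line.partition(sep)
        orig = table.get(m.group(1))
        if orig is None:
            missing.append(m.group(1))  # id itself may be damaged; review by hand
            out.append(line)
        else:
            out.append(orig + tab + body)
    return out, missing
```

Ids reported in `ids_not_found` would still need a manual look, since a damaged `<L>` id cannot be matched this way.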

Andhrabharati commented 3 years ago

Now that you are agreeing on the "good corrections", why not you do them programmatically, once I point them? (This was my initial expectation starting with just mentioning the points.)

This would eliminate my manual correction errors and we can close them quickly.

Once these "punctuation related" items are done with, which have a role to play in next phases of processes, the actual "text related" corrections could be started.

Andhrabharati commented 3 years ago

But let me check why these metaline errors remain in my file; it is a matter of highest concern.

gasyoun commented 3 years ago

Once these "punctuation related" items are done with, which have a role to play in next phases of processes, the actual "text related" corrections could be started.

Understood.

Andhrabharati commented 3 years ago

* 205 cases: `</s> <s>` => ` `

I could see only these (and 4 of them), and two are corrected with a comma between; other two are under my ;; comments, that are to be appropriately corrected.

Metaline problems remain

Highest priority is to correct those metalines in mw_iast_AB_2.txt. @Andhrabharati can you do this?

I could trace out the issue. I was doing some macro based bulk replacements on a file in another window, and this had conflicted with the manual find/replace operations here.

I am really sorry that I had wasted your time, due to this error. [Lesson to remember: Never do parallel operations with "single clipboard", esp. when working on bulk texts.]

I had corrected all those meta-lines, except some 4 or 5, which are having my ;; comments.

Now pushing my 2a file with these changes. (And I hope only the intended changes are present in my file.) [Felt bad for my error, hence not attempting more changes this time, though there are plenty in line!]

BTW, there are <page col> tags inside the Koeln data, though not as <pc> (as I said earlier) but as <pcol> [about 1500 of them] which are used to specify the actual (p,c) content. So the <pb> should have been meant only to indicate a page/column break within the entry text (as seen correctly used thus at many places); that's the reason for my comment while doing the annexure data.

Summary: This means where there is no break within the entry text, this <pcol> tag should be used.

funderburkjim commented 3 years ago

@Andhrabharati

Have finished a first analysis of mw_iast_AB_2a.txt.

LOOKS GOOD!

My procedure has been to:

transform mw_iast_AB_2a.txt into a form comparable to mw_iast.txt. (let's call this ab_iast.txt)
- handle the line breaks (with '$' and <div)
- Transfer the comments (lines starting with ; or R( or (O)) to a separate file for later review
Subject mw_iast.txt to a controlled sequence of alterations, resulting in a final form that is the same as ab_iast.txt.
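
The comment-separation step of this procedure can be sketched as follows. It only illustrates the idea of setting aside lines that start with `;`, `R(` or `(O)`; the function name `split_comments` is hypothetical and the handling of `$` and `<div` line breaks is not attempted here.

```python
# Prefixes that mark a comment line, as named in the procedure above.
COMMENT_PREFIXES = (";", "R(", "(O)")


def split_comments(lines):
    """Return (content_lines, comment_lines), each in original order."""
    content, comments = [], []
    for line in lines:
        if line.startswith(COMMENT_PREFIXES):
            comments.append(line)
        else:
            content.append(line)
    return content, comments
```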

The result currently is that all but one trivial difference is removed. For some reason, you omitted the first line <?xml version="1.0" encoding="UTF-8"?> from mw_iast_ab_2a.txt.

Also, no problems with metalines at all (although I did notice you intentionally corrected a few <pc> values.)

Tomorrow I will review some of the ad-hoc kinds of changes you made -- I doubt if I'll have any complaints, but want to do a thorough examination nonetheless.

So, I think you can go ahead with your next phase, mw_iast_AB_3.txt (based on AB_2a).

Note on 205 cases: </s> <s> => : These are changes made to mw_iast.txt to be in concord with mw_iast_AB_2[a].txt.

funderburkjim commented 3 years ago

why not you do them programmatically, once I point them

I think that is what I'm doing with my controlled sequence of alterations of mw_iast.txt.

I'll document these programmatic steps as time permits.

Andhrabharati commented 3 years ago

I took <pc> in the metaline to mark the start of the entry, not the end; and <pb> to mark a (p,c) break within the entry text.

Is this understanding correct? I did not find them detailed in the one or two places where the tagging (markup) is described.

Andhrabharati commented 3 years ago

@funderburkjim

The result currently is that all but one trivial difference is removed. For some reason, you omitted the first line <?xml version="1.0" encoding="UTF-8"?> from mw_iast_ab_2a.txt.

This was not intended; it happened by mistake, I guess. I have now added the line back into the file.

LOOKS GOOD!

I think that is what I'm doing with my controlled sequence of alterations of mw_iast.txt.

So, I think you can go ahead with your next phase, mw_iast_AB_3.txt (based on AB_2a).

May I ask you to post your updated mw_iast file after your reviews (for me to do next phase of work on it), so that I would also have some feeling of going progressively? (I understand that you're updating the file anyway).

Probably we can work under a different title, mw_iast_AB.txt, so that no conflicts with mw_iast.txt creep in until this reaches a sufficiently reasonable state to take back that name (mw_iast.txt or mw.txt) for public release.

Or do you wish to wait till my corrections are "over", which I don't think would take anything less than 3-4 months' time?

The main reason why I wanted this progressive way, is not to increase the number of files at the repo, with as much size as 50MB each (even if it is of unlimited capacity).

With Github's inherent handling of files with updates (internal to the file), one single file with all changes marked in it should be the way to go, in my opinion; rather two files- one in your format, and one in my format (single lines records).

gasyoun commented 3 years ago

With Github's inherent handling of files with updates (internal to the file), one single file with all changes marked in it should be the way to go, in my opinion; rather two files- one in your format, and one in my format (single lines records).

2 is ok, 20 would be a mess.

funderburkjim commented 3 years ago

@Andhrabharati

In reviewing changes of mw_iast_AB_2a.txt, I am finding a few that I think you need to change.

It will take me a day or two to finish this review and prepare the list of suggested changes.

Please wait until I do this before you post another version.

Andhrabharati commented 3 years ago

Sure @funderburkjim; in fact I am not doing any "active work" on this now, but awaiting your response to my above posting.

And here are two more points in the list (to be corrected)-

§8. The possessive indicator ('s) should not be within the <ls> or <s> strings, but only after them. `'s to be replaced with 's (3 occurrences) & 's with 's `(1 occurrence)

So is the possessive ' within <ls> to be out of <ls> string; within <s> however it needs a closer look, as it is also used as an avagraha (elision) mark.
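
As an illustration of §8 for the `<ls>` case only, here is a small sketch that moves a trailing possessive `'s` out of the tag. The `<s>` case is deliberately left out because, as noted above, the apostrophe there can also be an avagraha. `move_possessive_out` is a hypothetical name.

```python
import re

# 's immediately before a closing </ls> tag
POSSESSIVE_IN_LS = re.compile(r"'s</ls>")


def move_possessive_out(text):
    """Move 's from just inside </ls> to just after it; return (text, count)."""
    return POSSESSIVE_IN_LS.subn("</ls>'s", text)
```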

§9. matching pairs closing quote wrong : 1 line ‘perhaps it is and is not and is not expressible in words' in <L>257473; to be corrected as ‘perhaps it is and is not and is not expressible in words’

) count (closing brace) more than ( count (opening brace) : 12 lines <L>1427.1, <L>6380.2, <L>15163.2, <L>26674.1, <L>32084.1, <L>80435.2, <L>81840, <L>85951.1, <L>129615, <L>135272.1, <L>136734.11, <L>214028.2

) count (closing brace) less than ( count (opening brace) : 13 lines <L>1427.05, <L>6380.1, <L>15163.1, <L>22291.11, <L>37246.2, <L>80435.1, <L>81334.1, <L>119349.1, <L>124485, <L>136733.1, <L>138717.1, <L>139159, <L>206833

> count (closing angle) more than < count (opening angle) : 14 lines <L>39985, <L>67152, <L>74853, <L>75084, <L>87700, <L>90634, <L>96141.2, <L>97263.1, <L>126361, <L>220610, <L>263600, <L>263600.1, <L>263600.2, <L>263600.3

(smart quotes)
In <L>9426.2, `the esoteric` should be ‘the esoteric’ [with smart quote marks as everywhere else].
In <L>23830.1, `first father' should be ‘first father’ [with smart quote marks as everywhere else]; and this is the only place where apostrophe is used as a closing quote mark!
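
The bracket-count checks in §9 can be sketched with a per-line count comparison. This assumes one record per line, as in the single-line-record format; `unbalanced` is a hypothetical helper, not a script from the repository.

```python
def unbalanced(lines, open_ch="(", close_ch=")"):
    """Yield (line_number, opens, closes) for lines whose counts differ."""
    for n, line in enumerate(lines, start=1):
        opens, closes = line.count(open_ch), line.count(close_ch)
        if opens != closes:
            yield n, opens, closes
```

Calling it with `open_ch="<"` and `close_ch=">"` gives the angle-bracket check. Several ids in the two parenthesis lists form adjacent pairs (1427.05/1427.1, 6380.1/6380.2, 15163.1/15163.2, 80435.1/80435.2), which would be consistent with a parenthesis opened in one record and closed in the next, so a per-line count may flag those even where the text is correct.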

gasyoun commented 3 years ago

's to be replaced with 's

More than that, ’ instead of '

https://en.wikipedia.org/wiki/List_of_typographical_symbols_and_punctuation_marks

Andhrabharati commented 3 years ago

's to be replaced with 's

More than that, ’ instead of '

That comes somewhat later in my list; presently listing the ones at/in the tags; and esp. the ones not consistent in form wrt others within the text.

Andhrabharati commented 3 years ago

§10. Ellipsis character The 10 places where ... (three dots) or .... (4 dots) occur, those could be replaced with … the ellipsis character, which is the actual mark used in such places (by convention). (<L>7684, <L>12824, <L>16115, <L>30000.2, <L>36736, <L>39674, <L>40786, <L>42778.2, <L>74916)

Note: <L>39674 though has the three dots and couple of + signs in the data, they are not displayed!
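
A possible way to make the §10 replacement, sketched under the assumption that only runs of exactly three or four dots are meant and that longer or shorter runs should be left alone. `to_ellipsis` is a hypothetical name.

```python
import re

# Exactly 3 or 4 consecutive dots, not part of a longer run.
DOTS = re.compile(r"(?<!\.)\.{3,4}(?!\.)")


def to_ellipsis(text):
    """Replace a run of three or four dots with the ellipsis character."""
    return DOTS.sub("…", text)
```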

Andhrabharati commented 3 years ago

Now a radical point is coming up.

§11. Marking of Sanskṛit & English (or Anglicised Sanskrit) words, as mentioned by MW himself and CONSISTENTLY followed throughout the book.

Sanskṛit words (or letters) are denoted in 4 styles -
(i) Main line in Nāgari type, with equivalents in Indo-Italic type[1]
(ii) Subordinate line (under the Nāgari) in thick Indo-Romanic type[1]
(iii) Branch line, in thick Indo-Romanic type
(iv) Branch line in Indo-Italic type

English (and Anglicised Sanskrit) words are denoted in 3 styles -
(i) Normal (non-bold, non-italic) initial cap. letters with or without diacritics, for proper nouns.
(ii) Normal (non-bold, non-italic) initial cap. letters for plural forms.
(iii) Strictly English words like Aryan, Vedic, Brahman [for equivalent आर्य (ārya), वैदिक (vaidika), ब्राह्मण (brāhmaṇa)] etc.

I see that such "English" words, as mentioned here, are marked in both slp1 notation and another mixed (mostly the <s> type Sanskrit) notation in the data now. [There are over 50k such words on the whole.]

Though this is highly debatable, I would strongly suggest to have them all in the "English" forms alone as in the book and probably put under the <ns> (non-Sanskrit) tagging as was the case earlier. [We have no "moral right" to change the author's style/idea: the marking of words as Sanskrit or English.] (Of course, changes such as (sh > ṣ) and (ṛi > ṛ) in transliteration are somewhat alright as per modern/current usage for Sanskrit words, but I would say these also need to be as per the book when they are intended as English words; Vishnu, Krishna etc. are too popular spellings to be changed to Viṣṇu and Kṛṣṇa etc.)

Andhrabharati commented 3 years ago

§12. number not having a space or another number on either side (except when followed by st, nd, rd, & th)

[0-9]?

<ls>Kathās. vi, 58 (and 1 32?) </ls>
;; should be <ls>Kathās. vi, 58 (& 132 ?)</ls>
;; all others to have a space before ?

[0-9]c

8ch.
;; should be <ls>Sch.</ls>
<ls>Sāh. iv, 14c/v</ls>
;; should be <ls>Sāh. iv, 14 c/v</ls>
<ls>ib. 11305 0c.</ls>
;; should be <ls>ib. 11305</ls> &c.
<ls>RV. x, 12c</ls>
;; should be <ls>RV. x, 120</ls>
p.802col.2,  p.1118col.1., p.1200col.3
;; all these should have a space between the number and "col."

[0-9]f : 90 occurrences ;; should be with a space before f

[0-9]i, [0-9]I

<i>upp107475iya</i>
;; to be changed to <i>uppāiya</i>
;; all others (8- i and 2- I) should be [0-9]1, not [0-9]i

[0-9]l

<ab>Introd.</ab> 5l;
;; should be 51 not 5l
<ls>MatsyaP. iii, 9l f.</ls>
;; should be 91 not 9l
0law-book
;; no 0 here

[0-9]o, [0-9]O

;; all these (12- 0 and 1- O) should be [0-9]0
[0-9]of
<ls>BhP. iv, 1, 4of.</ls>
;; should be <ls>BhP. iv, 1, 40 f.</ls>

[0-9]seq : 6 occurrences ;; should be with a space before seq

[0-9]x, [0-9]X

<ls>2 x 33</ls>
;; there is nothing to be <ls> tagged here; should be just 2 x 33
metre of 4x 12
;; metre of 4 x 12
4X 999 syllables
;; 4 x 999 syllables
;; if the x here is to be changed to a multiplication symbol [× (U+00D7)], there are 122 more such places "[0-9] x"

:[0-9]

<ls>BhP. iii:23, 37.</ls>
;; should be <ls>BhP. iii, 23, 37.</ls>
<ls>Hcat. i, 3, 903:3/4</ls>
;; to be changed to  <ls>Hcat. i, 3, 903 a/b</ls>

[0-9]:

<ls>RV. ix, 81</ls>,2:
;; to be changed to <ls>RV. ix, 81, 2</ls>:
<s>aparādhaṃ</s> √ 1: <s>kṛ</s>
;; to be changed to <s>aparādhaṃ</s> √ <hom>1.</hom> <s>kṛ</s>
<ab>Vārtt.</ab> 1:
;; to be changed to <ls>Vārtt. </ls> :
<L>69797<pc>377,3<k1>ghuṇ<k2>ghuṇ<e>1   <s>ghuṇ</s> ¦ <ab>cl.</ab> 6. <ab>P.</ab> <s>°ṇati</s>, to go or move about, 48:
;; this should have some Ref. of Dhatup. before 48. (Dhātup. ???, 48 )

<ls>Śak. 7d.</ls>
;; should be <ls>Śak. 7 d.</ls>

<ls>IW. 517n.1</ls>
;; should be <ls>IW. 517, n. 1</ls>

<ls>BhP. x, 4r, 40.</ls>
;; should be <ls>BhP. x, 41, 40.</ls>

Ist or 3rd
;; should be 1st or 3rd

(10thcentury), ‘9handed (?)’, 5jewels ;; space is missing here (10th century, 9 handed, 5 jewels)

"1n" : 8 such places, all before <ls>Kāś.</ls> ;; to be changed to "in"

<ls>Ka1y.</ls> : 2 places ;; to be changed to <ls>Kaiy.</ls>
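
Many of the §12 cases are a digit directly followed by a letter. A sketch of a finder for that one shape, skipping the ordinal suffixes st, nd, rd, th, is given below. It is narrower than the full §12 rule (it ignores punctuation cases such as `:[0-9]` and letters before digits), and `suspicious_digit_letter` is a hypothetical name.

```python
import re

# A digit followed by an ASCII letter, unless the letters form a complete
# ordinal suffix (st, nd, rd, th) that ends at a word boundary.
DIGIT_LETTER = re.compile(r"[0-9](?!(?:st|nd|rd|th)\b)[A-Za-z]")


def suspicious_digit_letter(line):
    """Return each digit+letter pair in the line that looks like a typo."""
    return [m.group(0) for m in DIGIT_LETTER.finditer(line)]
```

Because the suffix must end at a word boundary, `3rd` is skipped while `10thcentury` is still reported, which matches the missing-space case listed above.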

gasyoun commented 3 years ago

I would strongly suggest to have them all in the "English" forms alone

They are not pure English words and need additional markup anyway.

funderburkjim commented 3 years ago

@Andhrabharati Here are comments concluding my analysis of mw_iast_AB_2a.txt.

AB misc. changes

These 200+ changes are, in my method of analysis, individual (i.e., not done by me via programmatic string or regexp replacements). iast_ab_changes.txt

After applying the regular expression changes, and these individual changes, to mw_iast.txt, I am left with a version equivalent to mw_iast_AB_2a.txt.

Other users should take a look at this file -- you can see the level of detail that AB has brought to his task. Kudos to AB!

AB further changes suggested

During examination of the 200+ changes mentioned above, I noted a handful (16) of further changes which I think should be made in mw_iast_AB_2a.txt before proceeding with a next version. ab_iast_changes.txt

@Andhrabharati please take a look and make these changes in your tab-form file.
I suggest making these changes directly in mw_iast_AB_2a.txt and then doing

git add mw_iast_AB_2a.txt
git commit -m "changes per ab_iast_changes.txt. Ref: https://github.com/sanskrit-lexicon/MWS/issues/96"
git push

This will put the revised mw_iast_AB_2a.txt in the repository at Github. We can take this revised mw_iast_AB_2a.txt as the base for further work.

funderburkjim commented 3 years ago

Regarding AB's §8-12.

Such change ideas should definitely be examined in detail.

HOWEVER, I strongly suggest NOT NOW.

The issue we are working on now is 'MW supplement fresh look'.

Let's stick to and finish that objective before addressing the other problems!

@Andhrabharati What do you say?

Andhrabharati commented 3 years ago

I do agree to tackling the Annexure work first.

https://github.com/sanskrit-lexicon/MWS/issues/96#issuecomment-765978111

Was just mentioning these, to show that many issues are in the Main text as well to be resolved.

Andhrabharati commented 3 years ago

And now I am ready with Annexure data.

Andhrabharati commented 3 years ago

I would strongly suggest to have them all in the "English" forms alone

They are not pure English words and need additional markup anyway.

Yes, that's why ns tag; this makes the file easier to manually read, with not much of clutter.

Andhrabharati commented 3 years ago

AB misc. changes

These 200+ changes are, in my method of analysis, individual (i.e., not done by me via programmatic string or regexp replacements). iast_ab_changes.txt

Other users should take a look at this file -- you can see the level of detail that AB has brought to his task. Kudos to AB!

yes, these 200+ are to be individually done; no other way.

Andhrabharati commented 3 years ago

@funderburkjim

I am using Github Desktop; but the end result (server ⇄ local systems' interaction) is the same with Gitbash or Github Desktop, I guess.

Andhrabharati commented 3 years ago

I would strongly suggest to have them all in the "English" forms alone

They are not pure English words and need additional markup anyway.

Yes, that's why ns tag; this makes the file easier to manually read, with not much of clutter.

And it may be noted that some 2500 <ns> are still in the data now; not fully replaced with this slp1 stuff!!

Andhrabharati commented 3 years ago

Gone through the two "changes files" posted by Jim and did corrections in my AB_2a file.

And here are my updated "changes files" (where all the concurred entries were removed, to save time in gleaning for essential matter) from @funderburkjim -

ab_iast_changes_updated.txt

iast_ab_changes_updated.txt

At the outset, it appears that except the marking and revision remarks I mentioned, all others were considered well and incorporated by @funderburkjim, though very few were not done for some reason. All these could be seen in my updated changes files above.

Finally did another small correction in my AB_2a file- added spaces on either side of + character, wherever not there (about 50 places of them).
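
The `+` spacing correction could be expressed as a single substitution, sketched here on the assumption that every `+` in the text should end up with exactly one space on each side. `space_plus` is a hypothetical name, and a `+` at the very start or end of a line would gain a leading or trailing space.

```python
import re

# A plus sign with any number of plain spaces (including none) around it.
PLUS = re.compile(r" *\+ *")


def space_plus(text):
    """Normalise every '+' to have exactly one space on either side."""
    return PLUS.sub(" + ", text)
```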

Now the file is being updated at the repository, for further comments.

[I would like to hear a point from Jim- about whether he has not agreed for the marking and revision updates, or has kept them on hold for a future date, so that I can plan for my further work.]

funderburkjim commented 3 years ago

processing revised mw_iast_AB_2a.txt

My comments added to the 2 files prepared above by AB:

iast_ab_changes_updated_jim.txt

ab_iast_changes_updated_jim.txt

further changes by AB

AB made additional changes in revised mw_iast_AB_2a.txt.

ab_2a_rev1_changes.txt

5 changes that AB needs to make.

ab_2_changes_edit.txt

Request @Andhrabharati to make these changes in current mw_iast_AB_2a.txt, then push the changes back up to github.

Then we will be done with this first phase.

whether he has not agreed for the marking and revision updates, or has kept them on hold for a future date

Once you make the 5 changes above to mw_iast_AB_2a.txt, I will

commit the changes thus far (i.e. based on my version of the revised mw_iast_AB_2a.txt) to the mw.txt in csl-orig/v02/mw (This is the base digitization of mw).
be ready for you to begin next phase
- I hope your next phase will be restricted to the mwsupplement (the 'annexure').
- I also hope you will do line 'mergings' (like under ced) only when absolutely necessary, as they make my analysis more difficult.

Andhrabharati commented 3 years ago

Glad to hear that the first "update" is going to happen next, after the above works are attended by me.

Sure, will be doing the Annexure part starting with simple corrections; followed by additions and finally the bigger corrections/revisions.

On second thought, why don't we do only the mergers first (or re-look at the non-L-beginning or non-div lines), for the 2500+ lines that were listed in my first point? This would make reworking the files (on either side) easier. I just got carried away into further mergings, looking at your 3rd phase of corrections for line mergings, at the first entry 'ced'!!

Anyway now I am happy that I have become a "real" team mate with you; it has been nice seeing your way of working.

funderburkjim commented 3 years ago

2500+ lines

I see at your comment §1(b), you mentioned lines starting with a space. There are now no such lines. So I am not sure what your previous comment refers to.

Regarding 'mergers' -- if you are referring to correcting the <info n="rev"/> lines, fine with me to work on that first. If you are thinking of something else by "mergers" (such as what you did with ced), please explain further.

Similarly glad to have your insights -- they are improving the mw.txt digitization.

Andhrabharati commented 3 years ago

By mergers, I am referring to the 2500+ lines with $ in my file (which are the space etc. lines in the Koeln file).

This 'ced' entry happened to be one such!!

In my language, the "rev" lines will be revisions (with Annexure data), as in your terminology.

Andhrabharati commented 3 years ago

; Case 008: L=74913.2, k1=ced, pc=401,3
; Merge 24 following lines into this one line
; Make numerous spacing changes, also some changes of ';' to ','
; One substantive change. Pan 3-1,30 reference changed to Pan 8-1,30 reference.
;; this is what I have suggested doing, to re-look at those 2500+ lines (200+ entries in my file) and do necessary corrections & split where necessary.
;; these are all "corrections" as per the book, not "changes"!

I have updated the AB_2a file with the 5 corrections suggested by Jim.

And I will start "working" with the Annexure data now. (The "line-mergers" can happen later!)

@funderburkjim What should my next updated file be named, continue as 2a or make it as 3?

Andhrabharati commented 3 years ago

I just opened the file to start working, and got reminded of my unanswered query here- https://github.com/sanskrit-lexicon/MWS/issues/83#issuecomment-757922415

and a "reminder" for the same here- https://github.com/sanskrit-lexicon/MWS/issues/83#issuecomment-763305415

So having gone through the files Jim has been posting all these days, I have decided to do this way-

(a) keep the existing text as 1st line. [identified as <L> lines]

(b) keep the proposed way how it should be (possibly many times without full tagging) in the 2nd line. [I go for integrating in the "interpreted style"- the option (1) in my above referred posting] [identified as <Ls> lines]

(c) give my remarks in the following line(s), including the additional tagging that would be required in the proposed line. [identified as ;; lines]

And leave the rest of work [reviewing the lines, undertaking the marking(s)/tagging(s) or not, and finally updating the Koeln file] to @funderburkjim, as he is the final authority to decide these.

Will be starting the work with a fresh mind tomorrow.

gasyoun commented 3 years ago

I am using Github Desktop; but the end result (server ⇄ local systems' interaction) is the same with Gitbash or Github Desktop, I guess.

So do I.

funderburkjim commented 3 years ago

mw_iast_AB_2a.txt

After AB's changes, this commit (https://github.com/sanskrit-lexicon/MWS/commit/550abf1793c1c8224c2f4d804242ccb4256f64a9) of mw_iast_AB_2a.txt is now the basis for csl-orig/v02/mw.txt (at this commit).

git tells us : 83535 insertions(+), 83581 deletions(-) for mw.txt.

In other words, about 10% of the lines of mw.txt were changed in this commit of mw.txt.

funderburkjim commented 3 years ago

comment on danda

Within text of mw.txt delimited by tag <s>...</s>, text is interpreted as slp1. In particular, a textual period is interpreted as danda.

As example, consider under headword 'iti' this phrase as coded in mw.txt: <s>ijyA<srs/>DyayanadAnAni tapaH satyaM kzamA damaH . aloBa i/ti mArgo 'yam</s>

Or, rendered as IAST, we still retain the period: <s>ijyā<srs/>dhyayanadānāni tapaḥ satyaṃ kṣamā damaḥ . alobha íti mārgo 'yam</s>

When viewed in a display with Devanagari output, the result now has danda:

इज्याध्ययनदानानि तपः सत्यं क्षमा दमः । अलोभ इति मार्गो ऽयम्
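
To isolate the rule described here, the following sketch replaces periods with danda only inside `<s>...</s>` spans and leaves everything else untouched. It does not transliterate anything; an actual display would transcode the whole SLP1 string, so this is only an illustration of the scoping. `periods_to_danda` is a hypothetical name.

```python
import re

# Non-greedy match of one <s>...</s> span; DOTALL in case a span wraps lines.
S_SPAN = re.compile(r"<s>(.*?)</s>", re.DOTALL)


def periods_to_danda(text):
    """Replace '.' with the danda sign, but only inside <s>...</s> spans."""
    def fix(match):
        return "<s>" + match.group(1).replace(".", "।") + "</s>"
    return S_SPAN.sub(fix, text)
```

A period in other markup, such as `<ab>cl.</ab>`, is not affected.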

funderburkjim commented 3 years ago

; these are all "corrections" as per the book, not "changes"!

I'll take your word for this one!

funderburkjim commented 3 years ago

Regarding "mergers".

I understand that you can defer work on the mergers until the annexure changes are in place. Deferring is good.

Here are some preliminary thoughts regarding merging that might be useful when we take up the merging corrections later:

Thinking of the 'ced' example, your changes can be viewed as of two types.

correcting the text
- spacing
- commas and semicolons where needed
- Other factual corrections (like the Pan reference correction)
combining the corrected lines (merging)
- you chose to combine ALL the lines.

The first type of change/correction is definitely desirable.

I think it can be done independently and preliminary to the second type.

Even where a correction goes over two (current) lines, text can be moved from one line to another to make the correction, perhaps leaving one of the lines (temporarily) blank.

We might have to consider (at this correction stage) whether a given line should or should not end with a space.

As to the second type (merging), there is one reason I don't want to do it in general: line-length.

In the mw.txt digitization, it is desirable to me to have modest line length.

If a small change later needs to be made in a line, it is easier to visually identify where the change is made if the total length of the line is shorter.

There are some good reasons for some line merging, mainly so that each line is logically complete.

So there is some tradeoff between readability and logicality -- we would need to develop some principles to guide the merging.

Improvements in logicality of lines has little effect on displays, since the xml form (mw.xml) used by displays always merges all the lines.

However, logicality of the lines in mw.txt could have an indirect advantage of making it more possible to do data mining from the digitization.

For example, a long-time wish is to mine the verb entries of mw for all the many verb forms.
But the current state of mw.txt for verbs makes a data-mining program unfeasible.

Andhrabharati commented 3 years ago

So there is some tradeoff between readability and logicality -- we would need to develop some principles to guide the merging.

We can have a br or div break here, can't we?

Andhrabharati commented 3 years ago

@funderburkjim If you elaborate the "verb entries" wish a little more, probably I can tell how to make that wish realised soon!!

Sometimes a different mind (person) gets a solution faster! (fresh mind = fresh idea)

funderburkjim commented 3 years ago

We can have a br or div break here, can't we?

Possibly such markup could be useful. Let's focus on annexure now.

gasyoun commented 3 years ago

In particular, a textual period is interpreted as danda.

Within text of mw.txt delimited by tag <s>...</s>, good that not everywhere.

In the mw.txt digitization, it is desirable to me to have modest line length. If a small change later needs to be made in a line, it is easier to visually identify where the change is made if the total length of the line is shorter.

I use Wrap by Windows mode or more common Wrap by Characters in EmEditor

For example, a long-time wish is to mine the verb entries of mw for all the many verb forms. But the current state of mw.txt for verbs makes a data-mining program unfeasible.

7 years as of now.

Andhrabharati commented 3 years ago

@gasyoun would you explain more about this wish about verbal "data mining"? (even by a personal mail, if you do not like to reiterate here)

I guess, you are more interested in this "verb" topic than any one else (so far), if I understood the events/postings spread across this forum correctly.

Andhrabharati commented 3 years ago

If a small change later needs to be made in a line, it is easier to visually identify where the change is made if the total length of the line is shorter.

I use Wrap by Windows mode or more common Wrap by Characters in EmEditor

@gasyoun the point @funderburkjim says is NOT about wrapping to "have the lines to look shorter", but about "locating" a particular string- which is easier in 5-10 lines of wrapped text than in over 50-60 wrapped lines VISUALLY (i.e. not by machine "find")! I do agree with him on this point.

gasyoun commented 3 years ago

verbal "data mining"

Yes, me and @Shalu411 are the biggest dhātu fans around, see Jim's Mapping of verbs to MW entries

https://sanskrit-lexicon.github.io/verbs/verbs01/verbs1_merge1_1vmw.html

Andhrabharati commented 3 years ago

I've seen this before.

I am more interested to know what "the unfulfilled wish" is about.

Andhrabharati commented 3 years ago

As Jim himself has done this, his wording "unfeasible" probably refers to something else.

Hence my enquiry about it.
