sanskrit-lexicon / mw-dev

Development version of MW dictionary, to collaborate with Andhrabharati

MW full-review-001: Issues wrt the body marker (¦) #2

Open Andhrabharati opened 1 year ago

Andhrabharati commented 1 year ago

The base file is at the end of sanskrit-lexicon/MWS#145 -- https://github.com/sanskrit-lexicon/MWS/issues/145#issuecomment-1365187265

Andhrabharati commented 1 year ago

And the initial observations are at https://github.com/sanskrit-lexicon/MWS/issues/145#issuecomment-1364644309 and https://github.com/sanskrit-lexicon/MWS/issues/145#issuecomment-1364689233

Andhrabharati commented 1 year ago

Do I have write access to the MWS repo to create folders and push the files, as @funderburkjim suggested at https://github.com/sanskrit-lexicon/MWS/issues/145#issuecomment-1364716412?

funderburkjim commented 1 year ago

Sample changes

See sanskrit-lexicon/MWS#145 starting at 'file naming convention' comment. This continues that discussion.

Andhrabharati provides an informal description of what to expect in temp_change_02_iast.txt; See this comment.

temp_change_02_iast.txt shows exactly what is changed. Here are two representative line changes (out of the 276923 lines that were changed).

removal of <info lex=... tags

;---------------------------------------------------
; <L>39<pc>1,1<k1>aṃśaka<k2>aṃśaka<h>1<e>2
150 old <hom>1.</hom> <s>aṃśaka</s> ¦ <lex>mf(<s>ikā</s>)n.</lex> (<ab>ifc.</ab>) forming part.<info lex="m:f#ikA:n"/>
;
150 new <hom>1.</hom> <s>aṃśaka</s> ¦ <lex>mf(<s>ikā</s>)n.</lex> (<ab>ifc.</ab>) forming part.

simplification of <s1> markup

The slp1 transliteration is removed. (In this example, an <info lex... tag is also removed.)

; <L>86<pc>1,2<k1>aṃśula<k2>aṃśula<e>2A
291 old ¦ <ab>N.</ab> of the sage <s1 slp1="cARakya">Cāṇakya</s1>, <ls>L.</ls><info lex="inh"/>
;
291 new ¦ <ab>N.</ab> of the sage <s1>Cāṇakya</s1>, <ls>L.</ls>
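
The two removals shown above can be approximated in a few lines of Python. This is only a sketch: the regular expressions are inferred from the two sample lines in this comment, the name `strip_markup` is hypothetical, and this is not claimed to be how temp_change_02_iast.txt was actually produced.

```python
import re

# Assumed patterns, inferred only from the two sample lines above.
INFO_LEX = re.compile(r'<info lex="[^"]*"/>')
S1_WITH_SLP1 = re.compile(r'<s1 slp1="[^"]*">')


def strip_markup(line: str) -> str:
    """Drop <info lex=.../> tags and the slp1 attribute of <s1> tags."""
    line = INFO_LEX.sub("", line)
    return S1_WITH_SLP1.sub("<s1>", line)


old = ('¦ <ab>N.</ab> of the sage <s1 slp1="cARakya">Cāṇakya</s1>, '
       '<ls>L.</ls><info lex="inh"/>')
print(strip_markup(old))
```

Run on the 'old' form of line 291, this prints the 'new' form shown above.
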

funderburkjim commented 1 year ago

First impressions

@Andhrabharati did these two changes to make it easier to compare the digitization to the printed text.

But should we 'accept' these revisions in csl-orig/v02/mw/mw.txt ?

I think that these changes would have no impact on current displays of MW at Cologne. [Note I have not fully investigated the impact].

info lex

These were originally constructed from analysis of the <lex>..</lex> tags. e.g., <info lex="m:f#ikA:n"/> was derived from <lex>mf(<s>ikā</s>)n.</lex> in the example above. But there is wild variety in what is inside the <lex>...</lex> tags, which this derivation had to take into account. By contrast, the forms in the lex attribute of the info tags are much more regular and amenable to further use. They have been used in the construction of declensions in 'MW Inflected forms' display of the Sanskrit-lexicon home page.
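
To illustrate why the lex attribute is easier to work with than the raw <lex> content, here is a minimal sketch that splits a value such as "m:f#ikA:n" into parts. It assumes, from the single example above, that ':' separates the items and '#' separates a code from an slp1 ending; the function name `parse_lex_attr` is hypothetical and other value shapes in mw.txt are not covered.

```python
from typing import List, Optional, Tuple


def parse_lex_attr(value: str) -> List[Tuple[str, Optional[str]]]:
    """Split an info-lex value into (code, ending) pairs.

    Assumption: ':' separates items and '#' separates a code
    (e.g. 'f') from an slp1 ending (e.g. 'ikA').
    """
    pairs = []
    for item in value.split(":"):
        code, sep, ending = item.partition("#")
        pairs.append((code, ending if sep else None))
    return pairs


print(parse_lex_attr("m:f#ikA:n"))
# [('m', None), ('f', 'ikA'), ('n', None)]
```
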

The <s1 slp1="X">Y</s1> tags were originally constructed as an aid to both analysis and display. The current version of mw displays essentially ignores 'X', just shows the IAST version Y. But earlier versions of mw displays ignored Y and showed 'X' according to the user's output preference; e.g. चाणक्य would be displayed if the user chose Devanagari for Sanskrit output. That display variation is still an interesting and useful idea, in my opinion.

For these reasons, I think that temp_mw_02 should NOT replace csl-orig/v02/mw/mw.txt, because it removes useful markup.

However, I agree that this removal of markup could simplify the daunting task of further comparison of the digitization of MW to the printed MW.

We need to develop some way to resolve these competing interests.

funderburkjim commented 1 year ago

@Andhrabharati See if you can replicate the construction of temp_change_mw_iast_02.txt using Python as described in 'get file of changes' section of sanskrit-lexicon/MWS#145.
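
For orientation, a change file in the layout of the samples above could be built roughly as follows. This is a sketch only, not the procedure from the 'get file of changes' section (which is not reproduced in this thread). It assumes that the old and new files have the same number of lines, that metalines start with `<L>`, and that the separator is a semicolon followed by dashes; `change_records` is a hypothetical name.

```python
from typing import List


def change_records(old_lines: List[str], new_lines: List[str]) -> List[str]:
    """Return change-file lines for every line that differs.

    Assumes both lists are line-for-line parallel.
    """
    if len(old_lines) != len(new_lines):
        raise ValueError("old and new files must have the same number of lines")
    out = []
    meta = ""  # most recent metaline seen (assumed to start with '<L>')
    for lnum, (old, new) in enumerate(zip(old_lines, new_lines), start=1):
        if old.startswith("<L>"):
            meta = old
        if old == new:
            continue
        out.append(";" + "-" * 51)  # separator width is an assumption
        out.append(f"; {meta}")
        out.append(f"{lnum} old {old}")
        out.append(";")
        out.append(f"{lnum} new {new}")
    return out
```

The two input lists would typically come from reading both files with `encoding="utf-8"` and stripping the trailing newline from each line.
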

funderburkjim commented 1 year ago

Do I have write access to mws

@Andhrabharati Do a simple test.

If permission problems occur, ask @drdhaval2785 to do what is necessary to grant you write permission.

Andhrabharati commented 1 year ago

@funderburkjim / @drdhaval2785

I have created a folder (issue146) and 'pushed' two files into it as samples [which means that I do have write access], representing almost all varieties of the corrections intended. I feel these changes are self-explanatory and do not need further documentation.

Just have a look at these and give your feedback.

-------------------------

Tomorrow, I will be writing a few indicative norms that I chose to follow (and update as the work progresses), based on my study of the MW book.

gasyoun commented 1 year ago

I feel these changes are self-explanatory and do not need further documentation.

hope you are not the only one who thinks they do not need further documentation ))

Andhrabharati commented 1 year ago

@Andhrabharati See if you can replicate the construction of temp_change_mw_iast_02.txt using Python as described in 'get file of changes' section of sanskrit-lexicon/MWS#145.

But, WHY?

funderburkjim commented 1 year ago

@Andhrabharati Did you see my comments ?

These mean that what you return needs to include this useful markup.

Here are some specific comments on your MW-p.1(AB).txt file, covering just the first 4 'entries':

LINE 2  L=1
OLD  <hom>1.</hom> <s>a</s> ¦ the first letter of the alphabet
NEW <hom>1.</hom> <s>a</s>, ¦ the first letter of the alphabet;
OK  added punctuation is fine

LINE 5 L=1.1
OLD ¦ the first short vowel inherent in consonants.
NEW ¦ the first short vowel inherent in consonants.
OK  no difference

LINE 8  L=2
OLD <s>a—kāra</s> ¦ <lex>m.</lex> the letter or sound <s>a</s>.<info lex="m"/>
NEW <s>a—kāra</s>, ¦ <lex>m.</lex> the letter or sound <s>a</s>.
NO  added punctuation is fine.  Removal of <info lex="m"/>  not acceptable
ALT <s>a—kāra</s>, ¦ <lex>m.</lex> the letter or sound <s>a</s>.<info lex="m"/>
    The ALT is acceptable because it restores the info tag.

LINE 11  L=3
OLD <hom>2.</hom> <s>a</s> ¦ (<s>pragṛhya</s>, <ab>q.v.</ab>), a vocative particle [<s>a ananta</s>, O <s1 slp1="vizRu">Viṣṇu</s1>], <ls>T.</ls>
NEW <hom>2.</hom> <s>a</s> ¦ (<s>pragṛhya</s>, <ab>q.v.</ab>), a vocative particle [<s>a ananta</s>, O <s1>Viṣṇu</s1>], <ls>T.</ls>;
NO  added punctuation fine.  Need to restore slp1 attribute.
ALT <hom>2.</hom> <s>a</s> ¦ (<s>pragṛhya</s>, <ab>q.v.</ab>), a vocative particle [<s>a ananta</s>, O <s1 slp1="vizRu">Viṣṇu</s1>], <ls>T.</ls>;
   Restored the slp1 attribute, so ALT is acceptable.

There are other kinds of changes you made (such as splitting L=4.5 into two entries, L=4.5 and 4.6). But before considering this and other 'types' of changes beyond the first 4, the dropped markup problem needs to be addressed and solved.

funderburkjim commented 1 year ago

But Why (do the Python exercise suggested) ?

It might be that specialized Python programs can be useful in bridging the gap between your manner of work and the conventions of CSL. For example, I have read that there are Python modules that can read Excel files to some extent. I thought it useful to learn whether you know how to use a Windows terminal to run a Python program.

Andhrabharati commented 1 year ago

I thought it useful to learn whether you know how to use a Windows terminal to run a Python program.

I know the process and use it occasionally; but I have NO intention of using my programming knowledge for any of my CDSL work (I had mentioned this some time back as well, in response to a post by @gasyoun).

Andhrabharati commented 1 year ago

@Andhrabharati Did you see my comments ?

* `We need to develop some way to resolve these competing interests.`

But before considering this and other 'types' of changes beyond the first 4, the dropped markup problem needs to be addressed and solved.

It appears that you did not 'notice' my posts at https://github.com/sanskrit-lexicon/MWS/issues/145#issuecomment-1364638221 and https://github.com/sanskrit-lexicon/MWS/issues/145#issuecomment-1365187265; I have separated out all the trailing tags (info type) into a file, which could be 'padded' at the resp. line endings once the main text is read and corrected wherever necessary.
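
The separate-and-pad idea can be sketched as follows. The layout of the actual file of separated tags is not described here, so the sketch only shows the per-line operation. It assumes that the info tags always sit at the very end of a line and are self-closing; `split_trailing_info` and `pad_info` are hypothetical names.

```python
import re
from typing import Tuple

# Assumption: info tags are self-closing and sit at the very end of a line.
TRAILING_INFO = re.compile(r'((?:<info\b[^>]*/>)+)\s*$')


def split_trailing_info(line: str) -> Tuple[str, str]:
    """Return (text, tags), where tags is the run of trailing <info .../> tags."""
    m = TRAILING_INFO.search(line)
    if m is None:
        return line, ""
    return line[:m.start()], m.group(1)


def pad_info(text: str, tags: str) -> str:
    """Re-attach previously separated tags to a (possibly corrected) line."""
    return text + tags


old = '<s>a—kāra</s> ¦ <lex>m.</lex> the letter or sound <s>a</s>.<info lex="m"/>'
text, tags = split_trailing_info(old)
corrected = text.replace("</s> ¦", "</s>, ¦", 1)  # the punctuation fix for L=2
print(pad_info(corrected, tags))
```

For the L=2 entry this prints the line marked ALT in the earlier comment, that is, the corrected text with the info tag restored.
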

And you've a faithful and sincere associate (@AnnaRybakovaT ) who can help regenerate the slp1 strings for all <s1> tags very easily [if you or Dhaval do not want to do it for whatever reason], and the <ab n= tags with some little effort to distinguish between the English and Sanskrit varieties; I guess this was the intention of having the <abE> earlier, which was the point raised by @drdhaval2785 when he was looking at my Lith. file last year (which still remains unfinished at his end).
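
Regenerating the slp1 strings for bare <s1> tags could look roughly like this. It is a sketch under stated assumptions: the IAST-to-SLP1 table is the standard correspondence, the text is lowercased first (as in the samples "cARakya" and "vizRu"), and 'ai'/'au' are always read as diphthongs, which will be wrong for any vowels in hiatus. The names `iast_to_slp1` and `add_slp1` are hypothetical; the <ab n= tags are not handled.

```python
import re
import unicodedata

_PAIRS = {
    "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "ch": "C", "jh": "J",
    "ṭh": "W", "ḍh": "Q", "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "ā": "A", "ī": "I", "ū": "U", "ṛ": "f", "ṝ": "F", "ḷ": "x", "ḹ": "X",
    "ṃ": "M", "ḥ": "H", "ṅ": "N", "ñ": "Y", "ṭ": "w", "ḍ": "q", "ṇ": "R",
    "ś": "S", "ṣ": "z",
}
# Normalize the keys so lookups match NFC-normalized input.
IAST_TO_SLP1 = {unicodedata.normalize("NFC", k): v for k, v in _PAIRS.items()}

S1_BARE = re.compile(r"<s1>([^<]*)</s1>")


def iast_to_slp1(text: str) -> str:
    """Transliterate lowercased IAST to SLP1, trying two-letter units first."""
    text = unicodedata.normalize("NFC", text).lower()
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in IAST_TO_SLP1:
            out.append(IAST_TO_SLP1[pair])
            i += 2
        else:
            out.append(IAST_TO_SLP1.get(text[i], text[i]))
            i += 1
    return "".join(out)


def add_slp1(line: str) -> str:
    """Give every bare <s1>Y</s1> an slp1 attribute derived from Y."""
    return S1_BARE.sub(
        lambda m: f'<s1 slp1="{iast_to_slp1(m.group(1))}">{m.group(1)}</s1>',
        line,
    )


print(add_slp1("O <s1>Viṣṇu</s1>], <ls>T.</ls>;"))
```

Tags that already carry an slp1 attribute are left untouched, since the pattern matches only the bare `<s1>` form.
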

* `I think that temp_mw_02 should NOT replace csl-orig/v02/mw/mw.txt, because it removes useful markup.`

I never suggested replacing mw.txt with my file directly; I was presuming the above two works (and any modifications in <k2>, <h> and <e> fields in the meta-lines) to be done before that.

Andhrabharati commented 1 year ago

There are other kinds of changes you made (such as splitting L=4.5 into two entries, L=4.5 and 4.6).

I did look into the text matter carefully, and I would like to keep all the related info in a single line/para instead of breaking the lines 'blind-folded' at a semicolon, treating it as a 'sense separator', as has been the case till now. [One of the reasons @funderburkjim gave earlier for not joining those lines is to keep the line length to a minimum; but I see that almost 10% (80K) of the 'lines' in the text have lengths of 100 or more characters even now. So I do not understand what prevents him from making another 2-3% like that.]
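
A line-length share of this kind can be measured with a short script. The sketch below counts characters (not bytes) per line with the trailing newline removed, and it counts every line it is given; whether metalines and <LEND> lines should be excluded is left open. `long_line_share` is a hypothetical name.

```python
from typing import Iterable, Tuple


def long_line_share(lines: Iterable[str], threshold: int = 100) -> Tuple[int, float]:
    """Return (count, fraction) of lines with at least `threshold` characters."""
    lines = list(lines)
    if not lines:
        return 0, 0.0
    long_count = sum(1 for line in lines if len(line.rstrip("\n")) >= threshold)
    return long_count, long_count / len(lines)


sample = ["a" * 100, "b" * 99, "c" * 150, "d"]
print(long_line_share(sample))
# (2, 0.5)
```
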

I was to write the 'guiding norms' for my work today, but it seems it is not worth continuing the process any more, with these gross differences of opinion.

funderburkjim commented 1 year ago

I like the quote from the science fiction novel DUNE: "Beginnings are such delicate times." Let us be patient and make efforts to resolve the "gross differences of opinion." Perhaps Dhaval will have some ideas. Your guiding norms might be useful to him. I'll make no further comments on this project for a while, since my comments seem to upset you; which I do not wish to do.