sanskrit-lexicon / MWS

Monier Monier-Williams, Sir; A Sanskrit-English dictionary. Oxford, 1899

Hindi language markup #65

Closed gasyoun closed 2 months ago

gasyoun commented 5 years ago

Hindi is mentioned 18 times in MW. But this quote has no markup in 180704.

<p><b>In_Hindi_this_root_often_means_<quote>_to_begin._</quote>

Andhrabharati commented 1 year ago

I had tagged all the languages (whether European, Asian or Indic) appropriately in my review working file; but @funderburkjim seems not that interested in any such markups.

funderburkjim commented 1 year ago

Probably AB's markup includes <lang>Hindī</lang> ....
(Incidentally, I only find two instances of Hindi in mw.txt. These can also be found via Advanced Search Display for MW.)

This particular markup (where lang tag is used without an 'n' attribute) is not valid relative to current mw.dtd; so allowing this markup in cdsl requires associated changes elsewhere in the codebase.

I am not currently interested in duplicating this (and related) markup changes in current mw.txt and associated displays.

I don't have a definite opinion yet on how these markup changes fit together and what benefits they provide.

There is some discussion of 'lang' tag and 'cog' tag (and others) at https://github.com/sanskrit-lexicon/MWS/issues/153.
AB's work can be followed at https://github.com/sanskrit-lexicon/mw-dev/ repository.

gasyoun commented 1 year ago

> I don't have a definite opinion yet on how these markup changes fit together and what benefits they provide.

It would make possible different additional indexes to MW.

Andhrabharati commented 3 months ago

@funderburkjim / @gasyoun,

Is this issue closable now?

funderburkjim commented 3 months ago

> I had tagged all the languages (whether European, Asian or Indic) appropriately in my review working file;

@Andhrabharati Is it time for you to share this lang-tagging portion of your work, and for us (probably me) to find a way to make use of it?

Andhrabharati commented 3 months ago

Here is the list of <lang> tagged words in my file--

lang tags.txt

funderburkjim commented 3 months ago

taking up the lang tags.

Current situation in MW for lang tag. There are no instances of the form <lang>X</lang>

The displays of these forms shows Y; no tooltips involved.

The examples in AB's lang.tags file are of the form <lang>X</lang>, e.g. <lang>Gk.</lang>. The current markup is, for example, <ab>Gk.</ab>, and there is a tooltip in mwab_input.txt: Gk. <id>Gk.</id> <disp>Greek</disp>.

If we change <ab>Gk.</ab> to <lang>Gk.</lang>, then the markup will be more informative. As in some other dictionaries, we can have the MW display programs generate a tooltip for <lang>Gk.</lang> using the mwab_input tooltips.

For <lang>English</lang> in lang.tags.txt, the word 'English' is not an abbreviation, so it currently does not appear in mwab_input. For uniformity, we could nonetheless add an entry in mwab_input. There are a few others similar to English.

The above outlines a way to make use of lang.tags.txt in cdsl.
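A minimal sketch of the retagging step described above, under stated assumptions: the set of language abbreviations (`LANG_ABBREVS`) and the function name are hypothetical, and the real list would come from AB's lang tags file rather than being hard-coded. Tags such as <lang n="greek">...</lang> are left untouched because only <ab>X</ab> elements are matched.

```python
import re

# Hypothetical subset of language abbreviations, for illustration only.
LANG_ABBREVS = {"Gk.", "Lat.", "Germ."}

# Matches <ab>X</ab> with no attributes.
AB_RE = re.compile(r"<ab>([^<]*)</ab>")


def retag_languages(line, lang_abbrevs=LANG_ABBREVS):
    """Change <ab>X</ab> to <lang>X</lang> when X is a known language abbreviation."""
    def repl(match):
        abbr = match.group(1)
        if abbr in lang_abbrevs:
            return f"<lang>{abbr}</lang>"
        return match.group(0)  # not a language: keep the <ab> markup
    return AB_RE.sub(repl, line)


sample = ('[<ab>cf.</ab> <ab>Gk.</ab> <lang n="greek">καρκίνος</lang>; '
          '<ab>Lat.</ab> <s>cancer</s>.]')
print(retag_languages(sample))
```

General abbreviations such as <ab>cf.</ab> stay as they are, so the existing tooltips for them are not affected.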

@Andhrabharati Anything else I need to take into account before proceeding to implement?

Andhrabharati commented 3 months ago

No, except that you also need to expand the abbr. forms in these tags appropriately.

We can always have the expansions revised/reviewed sometime later.

Andhrabharati commented 3 months ago

Also, I would suggest that counts of the tags be prepared (to compare against my file to find any missing ones).

Andhrabharati commented 3 months ago

Probably, a quick look at mw-dev issue no. 21 would open up some more options for Jim's revising.

funderburkjim commented 3 months ago

TODO DONE italic greek text in mw

[See this comment below for a solution.]

In reviewing #21, noticed that in current display, the Greek text appears NOT to be marked with italics. By contrast, from the screen shot of #21, the display at that time (2020 display) appears to show Italic Greek text.

2020 - italics 👍

image

Current - no italic 👎

image

funderburkjim commented 3 months ago

tags that use mwab_input.txt

In any dictionary xxx, <ab>X</ab> markup generates a tooltip based on the xxxab_input.txt table for the dictionary.

In some dictionaries, other tags are converted (at time of display) into 'ab' tags by the basicadjust.php component. Then xxxab_input.txt is used for tooltips.

example

dictionary 'gra', entry 'aMsa'.
gra.txt (and gra.xml): <lang>go.</lang>
graab_input.txt: go. <id>go.</id> <disp>Gothic</disp>
display shows 'Gothic' tooltip:

image

The plan is to use the same model for tooltips of the <lang>X</lang> tags in mw.txt.

It is possible that some values of X for a dictionary xxx could occur in xxx.txt as both <ab>X</ab> (general abbreviation) and <lang>X</lang> (a language abbreviation) and that the 'expansion' (tooltip) would be different for the two markups.

This situation is expected to be rare. If it occurs the 'local abbreviation' markup could be used, or the tooltip in xxxab_input.txt could mention both possibilities.
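The tooltip model described in this comment can be sketched as follows. This is an illustration in Python, not the actual basicadjust.php logic: the function names are hypothetical, the `title` attribute is only an assumed way of carrying the tooltip, and the input lines are assumed to contain `<id>` and `<disp>` elements as in the examples quoted above.

```python
import re
from html import escape

# One entry per line, e.g. "<id>Gk.</id> <disp>Greek</disp>" (assumed layout).
ENTRY_RE = re.compile(r"<id>(.*?)</id>\s*<disp>(.*?)</disp>")
# Only attribute-less lang tags; <lang n="greek">...</lang> is not matched.
LANG_RE = re.compile(r"<lang>([^<]*)</lang>")


def load_tooltips(lines):
    """Build an abbreviation -> tooltip table from xxxab_input-style lines."""
    tips = {}
    for line in lines:
        match = ENTRY_RE.search(line)
        if match:
            tips[match.group(1)] = match.group(2)
    return tips


def lang_to_ab(text, tips):
    """Convert <lang>X</lang> to an 'ab' element carrying the tooltip for X."""
    def repl(match):
        abbr = match.group(1)
        tip = tips.get(abbr)
        if tip is None:
            return match.group(0)  # no tooltip known: leave markup unchanged
        return f'<ab title="{escape(tip, quote=True)}">{abbr}</ab>'
    return LANG_RE.sub(repl, text)


tips = load_tooltips(["<id>go.</id> <disp>Gothic</disp>"])
print(lang_to_ab("<lang>go.</lang>", tips))
```

In this sketch a value with no entry in the table (such as 'English' before an entry is added) is simply passed through without a tooltip.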

gasyoun commented 3 months ago

> This situation is expected to be rare.

Exactly

funderburkjim commented 2 months ago

lang tag counts

Changes made to working version of cdsl mw.txt regarding the 'lang.tags.txt' file of AB.
@Andhrabharati The cdsl counts are lang_tags_count.txt.

Note that 4 tags have count = 0 (not found in cdsl mw.txt); I couldn't find these.

Once AB has reviewed the counts, the next step is the revision of mwab_input.txt to provide tooltips for the <lang>x</lang> elements.
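A minimal sketch of how such tag counts could be produced and compared, assuming each version is available as a list of text lines. The function names are hypothetical and this is not the script actually used to create lang_tags_count.txt.

```python
import re
from collections import Counter

# Attribute-less lang tags only, e.g. <lang>Gk.</lang>.
LANG_RE = re.compile(r"<lang>[^<]*</lang>")


def count_lang_tags(lines):
    """Count every distinct <lang>X</lang> element over all lines."""
    counts = Counter()
    for line in lines:
        counts.update(LANG_RE.findall(line))
    return counts


def compare_counts(cdsl, ab):
    """Return {tag: (cdsl_count, ab_count)} for tags whose counts differ."""
    diffs = {}
    for tag in sorted(set(cdsl) | set(ab)):
        pair = (cdsl.get(tag, 0), ab.get(tag, 0))
        if pair[0] != pair[1]:
            diffs[tag] = pair
    return diffs


cdsl_lines = ["<lang>Gk.</lang> x <lang>Lat.</lang>", "<lang>Lat.</lang>"]
ab_lines = ["<lang>Gk.</lang> x <lang>Lat.</lang>", "<lang>Zd.</lang>"]
for tag, (n_cdsl, n_ab) in compare_counts(
        count_lang_tags(cdsl_lines), count_lang_tags(ab_lines)).items():
    print(f"{tag}\t{n_cdsl}\t{n_ab}\t{n_cdsl - n_ab}")
```

A tag with count 0 on one side shows up in the comparison, which corresponds to the 'not found' cases mentioned above.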

Andhrabharati commented 2 months ago

Here is the quick response reg. the not-found tag places: [I shall look at the other tags counts sometime tomorrow.]

image

line-592437 image

line-669216 image

line-345151 image

line-312487 image

line-156840 image

It is noted in my revision work that the MW text has been 'altered' at too many places wrt the print, (a) expansions done directly, (b) punctuation marks skipped or changed, (c) sequence of items around the tag places (esp. the <lex>f.</lex>) interchanged, ... ... ...

[I feel bad to say this, but it appears that more damage is done to the MW text than improvement during last 20-25 years, and the web is abounding with its copies all-around!!]

Andhrabharati commented 2 months ago

Here is a comparison of counts wrt my present version-- lang_tags_count (CDSL vs AB).txt

There are 39 tags differing in counts, after the above 4 are excluded.

funderburkjim commented 2 months ago

@Andhrabharati A copy of your version of mw is needed for me to resolve the count differences.

Andhrabharati commented 2 months ago

My version has TOO many differences (incl. major format changes), to do any comparison by others.

Pl. wait for a day, to let me check it myself.

funderburkjim commented 2 months ago

greek text italics in mw

The above two commits accomplish this.

image

Andhrabharati commented 2 months ago

Here is the result of a major check done against the lang tag count differences-- lang_tags_checking (CDSL vs AB).txt

I have now separated Sanskṛt-Tibetan and Sanskṛt-Persian into different languages (Sanskṛt, Tibetan and Persian), because these are not composite (derived) languages like Anglo-Saxon and Combro-Briton.

A few tags (Prākṛt, Ved., Lat., Zd.) are yet to be checked fully, and now I would like to ask Jim to post his 'marked' file once looking at this file. [It is easier for me to compare and check with my revised version data.]

Finally I would like to mention that a full-stop normally would have a following space in the abbr. etc.; but when the latter word is connected with a hyphen to the former word there shouldn't be a space after the full-stop. That's the reason for having A.-S., A.S. etc. but not O.H.G.

funderburkjim commented 2 months ago

temp_mw_5.zip

Here are the remaining differences in counts (based on AB previous work) (cdslrev is based on temp_mw_5, and excludes the 'duplicated group entry' lines)

TAG     CDSL    CDSLREV AB      CDSLREV-AB
<lang>Apabhraṃśa</lang> 7       6       5       1
<lang>Class.</lang>     35      33      32      1
<lang>Germ.</lang>      218     210     209     1
<lang>Lat.</lang>       539     526     523     3
<lang>Prākṛ.</lang>     3       11      9       2
<lang>Prākṛt</lang>     233     227     231     -4
<lang>Ved.</lang>       634     632     604     28
<lang>Zd.</lang>        149     140     139     1
<lang>Zend</lang>       12      20      21      -1

wrong line numbers in lang_tags_checking.CDSL.vs.AB.txt
  342487 <L>101804<pc>519,1<k1>DUsara<k2>DUsara<e>1B
  347649 <LEND>

line 505970  <L>150486<pc>755,3<k1>BAzA   to mark as lang? Avanti

<lang>W.</lang> -> <ls>W.</ls> (WILSON) ? (karkara)  Otherwise, what is W.?

Andhrabharati commented 2 months ago

> <lang>W.</lang> -> <ls>W.</ls> (WILSON) ? (karkara) Otherwise, what is W.?

It is the Welsh language in which the word 'careg' means 'stone'.

Andhrabharati commented 2 months ago

Here are the resolutions of diff. counts-- resolving lang diff. counts.txt

Also, I had changed the <ab>ep.</ab> to <lang>ep.</lang>, as it stands for the "Epic Sanskrit" language.

[Note: In a school of thought, the Skt. language is divided as (a) Vedic, (b) Brahmanic (and Upanishadic), (c) Epic, (d) classic and (e) later period types.]

Andhrabharati commented 2 months ago

I recall mentioning long back, while talking about my <lang and <cog tagging, about <s> tags to be easily identified as non-Skt. language terms near the <lang tags.

Here are three such cases now in the cdsl mw.txt

(156672): ¦ [<ab>cf.</ab> <lang>Gk.</lang> <lang n="greek">καρκίνος</lang>; <lang>Lat.</lang> <s>cancer</s>.]<info lex="inh"/>

;; <s> tag to be changed as <etym> tag as per current cdsl way.

(263223): <s>jarA/yu</s> ¦ <lex>n.</lex> the cast-off skin of a serpent, <lang n="greek">γῆρας</lang> <s>pas</s>, <ls>AV. i, 27, 1</ls><info lex="n"/>

;; <s>pas</s> to be deleted

(566277): <hom>1.</hom> <s>ya</s> ¦ the 1st semivowel (corresponding to the vowels <s>i</s> and <s>I</s>, and having the sound of the <lang>English</lang> <s>y</s>, in Bengal usually pronounced <s>j</s>).

;; <s>y</s> to be changed as <i>y</i> and <s>j</s> as <i>j</i> as they are not belonging to Skt.

funderburkjim commented 2 months ago

resolving.lang.diff.counts.txt file does not explain all lang diffs. E.g. for Apabhraṃśa,

<lang>Apabhraṃśa</lang> 6   5   1
;; line 703494 deleted (being a dupl. entry) in AB version
jim: 7 matches for "<lang>Apabhraṃśa</lang>" in buffer: temp_mw_5.txt
    703494 AB-deleted.  So Jim still has 1 more

Thus, to get at all the lang differences, we need more than counts. Here's an idea how.

Assumption: the AB mw.txt file and CDSL mw.txt file have the same number of lines, and the lines of a given line-number correspond (although the AB version has marked some entries as 'deleted'; see the ab_lnums_del.txt note below).

From AB's MW file, AB can construct a file with 2 columns, lnum and <lang>X</lang> (let's say the filename is lnum_lang_ab.txt).

Note: for a given lnum, there often will be multiple lines in the lnum_lang file, one for each <lang>X</lang>, and the order within the lnum_lang file should be the same as in the line 'lnum' of mw.txt.

Jim can also construct such a file (lnum_lang_cdsl.txt) using the cdsl file (e.g. temp_mw_5). And then Jim can remove the lines whose lnums belong to the AB 'deleted' lines (ab_lnums_del.txt), thereby constructing lnum_lang_cdsl1.txt.

When the AB version and cdsl versions (re lang tag) are identical, these two files (lnum_lang_ab.txt and lnum_lang_cdsl1.txt) will be identical.

And when these two files are different (as now), the diff will point to the precise differences.
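The idea can be sketched as below. This is an illustration under stated assumptions, not the script actually used: the function names are hypothetical, the two versions are assumed to be lists of lines with matching line numbers, and the 'deleted' line numbers are passed as a set standing in for ab_lnums_del.txt.

```python
import difflib
import re

# Attribute-less lang tags only, e.g. <lang>Gk.</lang>.
LANG_RE = re.compile(r"<lang>[^<]*</lang>")


def lnum_lang_rows(lines, deleted_lnums=frozenset()):
    """Return 'lnum<TAB><lang>X</lang>' rows, in order of occurrence.

    Lines whose 1-based number is in deleted_lnums are skipped.
    """
    rows = []
    for lnum, line in enumerate(lines, start=1):
        if lnum in deleted_lnums:
            continue
        for tag in LANG_RE.findall(line):
            rows.append(f"{lnum}\t{tag}")
    return rows


def diff_rows(ab_rows, cdsl_rows):
    """Return only the differing rows, prefixed with '-' (AB) or '+' (CDSL)."""
    diff = difflib.unified_diff(
        ab_rows, cdsl_rows, "lnum_lang_ab.txt", "lnum_lang_cdsl1.txt",
        n=0, lineterm="")
    return [d for d in diff
            if d.startswith(("+", "-")) and not d.startswith(("+++", "---"))]


ab = ["a <lang>Gk.</lang> b <lang>Lat.</lang>", "(deleted)", "<lang>Zend</lang>"]
cdsl = ["a <lang>Gk.</lang> b <lang>Lat.</lang>", "<lang>Gk.</lang>", "<lang>Zd.</lang>"]
for row in diff_rows(lnum_lang_rows(ab), lnum_lang_rows(cdsl, {2})):
    print(row)
```

Because each row carries its line number, every reported difference points directly at a line of mw.txt, which the tag counts alone cannot do.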

@Andhrabharati What do you think? If you construct the AB file, I'll construct the CDSL file and analyze the diffs, making changes to cdsl as needed.

Andhrabharati commented 2 months ago

Pl. note that the addition of Slav. at 558566 will make the count 114 in cdsl. There is a deletion of a dupl. entry at 352669 (in the AB file) to resolve the same.

> line 505970  <L>150486<pc>755,3<k1>BAzA   to mark as lang? Avanti

Yes, it is to be made as <lang>Avantī</lang>

> wrong line numbers in lang_tags_checking.CDSL.vs.AB.txt
>   342487 <L>101804<pc>519,1<k1>DUsara<k2>DUsara<e>1B
>   347649 <LEND>

342487 > 312487 ;; noted that cdsl has the intended correction of E. there now.
347649 > 347349 ;; noted that AB has the intended correction of Prākṛ. there.


> <lang>Apabhraṃśa</lang> 6 5 1 ;; line 703494 deleted (being a dupl. entry) in AB version

to be read as
<lang>Apabhraṃśa</lang> 7 6 1 ;; line 703494 deleted (being a dupl. entry) in AB version

Andhrabharati commented 2 months ago

> When the AB version and cdsl versions (re lang tag) are identical, these two files (lnum_lang_ab.txt and lnum_lang_cdsl1.txt) will be identical.
>
> And when these two files are different (as now), the diff will point to the precise differences.
>
> @Andhrabharati What do you think? If you construct the AB file, I'll construct the CDSL file and analyze the diffs, making changes to cdsl as needed.

Yes @funderburkjim , this indeed is a great idea; and can be used in future comparisons.

Andhrabharati commented 2 months ago

Final changes suggested--

<lang>Saṃskṛt</lang> -> <lang>Sanskṛt</lang> at 7 places
<lang>Sanskṛt-Persian</lang> -> <lang>Sanskṛt</lang>-<lang>Persian</lang> at 413630
<lang>Sanskṛt-Tibetan</lang> -> <lang>Sanskṛt</lang>-<lang>Tibetan</lang> at 540128
<lang>Prākr.</lang> -> <lang>Prākṛ.</lang> at 354182

Pl. consider changing the above in cdsl.

funderburkjim commented 2 months ago

I intend to construct lnum_lang_cdsl1.txt tomorrow and upload. Hope you will post lnum_lang_ab.txt.

Andhrabharati commented 2 months ago

I had checked the temp_mw_5 file for the corrections mentioned in my above posts and files, and noted that

  1. Māgadhī in line 519281 is tagged as <s1, but to be done in line 207634 as well.
  2. Avantī in line 63845 is wrongly tagged as <lang, instead of in line 505970 (as suggested above).
  3. Jim appears yet to consider the posts 1 and 2 above [so I am not talking about these again].

I have spent some time in analyzing the two versions, and here is the resulting file-- lnum_lang analysis (CDSL vs AB).txt

Hope this would enable Jim to do the final corrections in the cdsl version.

------------------------------------

In case Jim decides to change the 345 instances of <ab>ep.</ab> in the cdsl version, they would correspond to 328 instances of <lang>ep.</lang> in the AB version (the rest falling in the deleted dup. entries); we don't have to spend time again in cross-checking those.

Andhrabharati commented 2 months ago

> I intend to construct lnum_lang_cdsl1.txt tomorrow and upload. Hope you will post lnum_lang_ab.txt.

Sorry for doing more work in my above file than you asked for; hope you'd find it useful enough to "derive" what you were aiming at.

funderburkjim commented 2 months ago

temp_mw_6.zip

This resolves all lang markup differences, based on lnum_lang analysis (CDSL vs AB).txt

change_mw_5_6.txt documents the changes made in 28 lines.


temp_mw_7.txt implements several odds and ends suggested by AB in recent comments. Details in issue65/readme.txt at cp temp_mw_6.txt temp_mw_7.txt ff. (includes <lang>ep.</lang>, and a few others) Request @Andhrabharati to mention if I've overlooked anything.


I think the <lang> markup is now resolved in the cdsl version, and is consistent with the hidden AB version.

My next step will be to revise the tooltip file (mwab_input.txt in csl-pywork) in light of the <lang> element revisions.

Andhrabharati commented 2 months ago

> • 3 exceptions at [lla/diffgroups_lla3.txt] Request AB change 'Class.' to 'class.' in his mw.txt for these three cases.

DONE.

> temp_mw_7.txt implements several odds and ends suggested by AB in recent comments. Details in [issue65/readme.txt] at cp temp_mw_6.txt temp_mw_7.txt ff. (includes <lang>ep.</lang>, and a few others)

These details are not seen in the readme file.

> Request @Andhrabharati to mention if I've overlooked anything.

Pl. see the points 1 & 3 at this post

And finally, are you implementing this?-

> 04-10-2024 post zip of temp_mw_5.txt to issue at AB request.
>
> TODO: https://github.com/sanskrit-lexicon/mw-dev/issues/21 AB suggests using '<gk>X</gk>' instead of <lang n="Greek">x</lang> Similarly for Arabic , <ar>

funderburkjim commented 2 months ago

cp temp_mw_6.txt temp_mw_7.txt

This now found in the readme.txt file.

funderburkjim commented 2 months ago

temp_mwab_input_2_lang.txt

These are the current tooltips (with counts) for the 112 lang abbreviations in temp_mw_7.txt. @Andhrabharati request you edit these and improve the tips.


@gasyoun and @drdhaval2785 - if you happen to notice this, you also might have some good suggestions for improving the language tooltips

Andhrabharati commented 2 months ago

(CDSL): 254101 old Prākṛt
(AB): 254101 Old Prākṛt

(CDSL): 254116 old Prākṛt
(AB): 254116 ---------- ;; deleted (dup. entry)

Though Jim has changed "old Prākṛt" to "Old Prākṛt" at 254101, he forgot to do the same at 254116!

And, here is the updated mwab_input_2_lang file--

mwab_input_2_lang (AB).txt

Andhrabharati commented 2 months ago

@funderburkjim

You might consider making these changes also--

54627:  <lang n="Hindūstānī">Hind.</lang> ;; with this, **Hind.** stands just for **Hindī** (in the mwab list)

413624: <lang n="Persian">P°</lang>

413633: <lang n="Persian">P°</lang>
413633: <lang n="Sanskṛt">S°</lang>

413636: <lang n="Persian">P°</lang>
413636: <lang n="Sanskṛt">S°</lang>

413639: <lang n="Persian">P°</lang>
413639: <lang n="Sanskṛt">S°</lang>
413639: <lang>Arab</lang>

funderburkjim commented 2 months ago

temp_mw_9.zip

This version has all changes since version 7. There are accompanying mw change files change_7_8.txt and change_8_9.txt.

I aimed to make use of

Everything installed at cologne server.

Perhaps this issue is now closeable? Will wait for @Andhrabharati review.

Andhrabharati commented 2 months ago

met. <disp>?</disp> <count>ab,30</count> ;; met. stands for metaphorically
p.p.p. <disp>past participle</disp> <count>ab,1</count> ;; p.p.p. stands for perfect passive participle

And some plural forms like ss.vv., qq.vv., qq.v. need corrections, as they are having the singular form expansions only now.
qq.vv. = qq.v. = quae vide
ss.vv. = sub vocibus

Also noted that at many places in mw.txt, the comp. stands for compar. and not for "compound".

It appears that all necessary changes are done now, pertaining to this issue; so this issue can be closed now.

funderburkjim commented 2 months ago

Revise mwab_input per previous comment.

Incidental revisions to mw ( ,; ). See change_9_10.txt.

Closing issue. Thanks, @Andhrabharati !

gasyoun commented 2 months ago

> might have some good suggestions for improving the language tooltips

@funderburkjim thanks for reminding. I do have some.

Technical first.

1) Old HGerm. High German lang,1 --> Old HGerm. Old High German lang,1

2) Should the first "Slavonic or Slavonian" be without a dot at the end?
Sl. Slavonic or Slavonian. lang,3
Slav. Slavonic or Slavonian lang,113

3) I do not understand why Irish remains empty:
Irish ? lang,2

4) @Andhrabharati are there such Indian prakrits?
Avantī ? lang,1
Prācyā ? lang,1
Tulu ? lang,1
Śākārī ? lang,1

5) Why are these prakrits shown with "?"
Apabhraṃśa ? lang,7
Hindūstānī ? lang,2
Jaina Prākṛt ? lang,5
Kanarese ? lang,1
Mahratta ? lang,1
Mahrattī ? lang,1
Paiśācī ? lang,1
Pāli ? lang,32

Mahrattī - Indic language spoken mainly in the western Indian state of Maharashtra, from Maratha (= Mahratta). I believe we can add additional hints in the display?

6) stats: should there be a place where the total over slight variations is counted (for example 2+1+135 for Anglo-Saxon)?

Class. Classical Sanskṛt lang,34
class. Classical Sanskṛt lang,8
class. Sanskṛt Classical Sanskṛt lang,6
Class. Sanskṛt Classical Sanskṛt lang,1
classical Sanskṛt Classical Sanskṛt lang,17

or

Angl.-Sax. Anglo-Saxon lang,2
Angl.S. Anglo-Saxon lang,1
Angl.Sax. Anglo-Saxon lang,135

Andhrabharati commented 2 months ago

Tulu is not a Prakrt variant. It is one of the 7 major Dravidian languages (having their own scripts).

Andhrabharati commented 2 months ago

@funderburkjim

Just noticed that my file had lagauge (instead of language) at 5 places, which is taken by you "as is".

funderburkjim commented 2 months ago

> Just noticed that my file had lagauge (instead of language)

@Andhrabharati I find no lagauge in current mw

funderburkjim commented 2 months ago

> might have some good suggestions for improving the language tooltips

@gasyoun I believe your comments are based on an outdated version of mwab_input.txt. The only item which I changed is the one for the period in Sl.

You may want to review again using current mwab_input.txt.


Regarding your comment at 6. stats :
mwab_tipcount.txt is an aggregation on 'tooltip'. This is a csv file with TAB as separator. It has 311 lines vs. the 422 lines of mwab_input.txt.
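An aggregation on tooltip of this kind can be sketched as follows. This is only an illustration: the function name is hypothetical, the entries are given as (abbreviation, tooltip, count) tuples rather than parsed from mwab_input.txt, and the sample figures are the ones quoted in item 6 above; it is not claimed to be how mwab_tipcount.txt is actually generated.

```python
from collections import defaultdict


def aggregate_by_tooltip(entries):
    """Sum the counts of all abbreviations that share the same tooltip."""
    totals = defaultdict(int)
    for _abbr, tip, count in entries:
        totals[tip] += count
    return dict(totals)


# (abbreviation, tooltip, count) triples taken from the figures quoted above.
entries = [
    ("Angl.-Sax.", "Anglo-Saxon", 2),
    ("Angl.S.", "Anglo-Saxon", 1),
    ("Angl.Sax.", "Anglo-Saxon", 135),
    ("Class.", "Classical Sanskṛt", 34),
    ("class.", "Classical Sanskṛt", 8),
]

# One TAB-separated output line per distinct tooltip.
for tip, total in sorted(aggregate_by_tooltip(entries).items()):
    print(f"{tip}\t{total}")
```

Since several spellings map to one tooltip, the aggregated table is shorter than the input table, consistent with the smaller line count mentioned above.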

Andhrabharati commented 2 months ago

> > Just noticed that my file had lagauge (instead of language)
>
> @Andhrabharati I find no lagauge in current mw

They are in the mwab_input file.

funderburkjim commented 2 months ago

> They are in the mwab_input file.

got it. Done.