Closed gasyoun closed 2 months ago
I had tagged all the languages (whether European, Asian or Indic) appropriately in my review working file; but @funderburkjim seems not that interested in any such markups.
Probably AB's markup includes <lang>Hindī</lang> ... .
(Incidentally, I only find two instances of Hindi in mw.txt. These can also be found via Advanced Search Display for MW.)
This particular markup (where lang tag is used without an 'n' attribute) is not valid relative to current mw.dtd; so allowing this markup in cdsl requires associated changes elsewhere in the codebase.
I am not currently interested in duplicating this (and related) markup changes in current mw.txt and associated displays.
I don't have a definite opinion yet on how these markup changes fit together and what benefits they provide.
There is some discussion of 'lang' tag and 'cog' tag (and others) at https://github.com/sanskrit-lexicon/MWS/issues/153.
AB's work can be followed at https://github.com/sanskrit-lexicon/mw-dev/ repository.
I don't have a definite opinion yet on how these markup changes fit together and what benefits they provide.
It would make possible different additional indexes to MW.
@funderburkjim / @gasyoun,
Is this issue closable now?
I had tagged all the languages (whether European, Asian or Indic) appropriately in my review working file;
@Andhrabharati Is it time for you to share this lang-tagging portion of your work, and for us (probably me) to find a way to make use of it?
Here is the list of <lang> tagged words in my file--
Current situation in MW for lang tag.
There are no instances of the form <lang>X</lang>.
<lang n="greek">Y</lang> : 1162 instances
<lang script="Arabic" n="Arabic">Y</lang> : 56 instances
<lang script="Arabic" n="Hindustani">Y</lang> : 3 instances
<lang script="Arabic" n="Persian">Y</lang> : 40 instances
<lang script="Arabic" n="Turkish">T</lang> : 5 instances
The displays of these forms show Y; no tooltips involved.
The examples in AB's lang.tags file are of the form <lang>X</lang>, e.g. <lang>Gk.</lang>. The current markup is, for example, <ab>Gk.</ab>, and there is a tooltip in mwab_input.txt: Gk. <id>Gk.</id> <disp>Greek</disp>.
If we change <ab>Gk.</ab> to <lang>Gk.</lang>, then the markup will be more informative. We can, in MW as in some other dictionaries, have the display programs generate a tooltip for <lang>Gk.</lang> using the mwab_input tooltips.
For <lang>English</lang> in lang.tags.txt, the word 'English' is not an abbreviation, so it currently does not appear in mwab_input. For uniformity, we could nonetheless add an entry in mwab_input. There are a few others similar to English.
The above outlines a way to make use of lang.tags.txt in cdsl.
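To make the outline concrete, here is a minimal Python sketch of the tooltip lookup. It assumes each mwab_input line carries <id> and <disp> elements, as in the Gk. example above. The function names, the 'English' entry, and the use of a title attribute are hypothetical illustrations, not the actual cdsl display code.

```python
import re
from html import escape

# Assumed line shape, taken from the Gk. example: <id>Gk.</id> <disp>Greek</disp>
ENTRY_RE = re.compile(r"<id>(.*?)</id>\s*<disp>(.*?)</disp>")
LANG_RE = re.compile(r"<lang>([^<]*)</lang>")


def load_tips(lines):
    """Build {abbreviation: tooltip} from mwab_input-style lines."""
    tips = {}
    for line in lines:
        m = ENTRY_RE.search(line)
        if m:
            tips[m.group(1)] = m.group(2)
    return tips


def add_lang_tooltips(text, tips):
    """Attach a title attribute to each <lang>X</lang> whose X has a tooltip."""
    def repl(m):
        abbr = m.group(1)
        tip = tips.get(abbr)
        if tip is None:
            return m.group(0)  # unknown abbreviation: leave the markup unchanged
        return f'<lang title="{escape(tip, quote=True)}">{abbr}</lang>'
    return LANG_RE.sub(repl, text)


# Hypothetical sample: the 'English' entry is the kind of non-abbreviation
# entry that could be added to mwab_input for uniformity.
sample_tips = load_tips([
    "Gk. <id>Gk.</id> <disp>Greek</disp>",
    "English <id>English</id> <disp>English</disp>",
])
print(add_lang_tooltips("<ab>cf.</ab> <lang>Gk.</lang> <lang>Lat.</lang>", sample_tips))
```

Tags with no entry in the table (Lat. in this sample) pass through unchanged, which makes missing tooltips easy to spot.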
@Andhrabharati Anything else I need to take into account before proceeding to implement?
No, except that you also need to expand the abbr. forms in these tags appropriately.
We can always have the expansions revised/reviewed sometime later.
Also, I would suggest that counts of the tags be prepared (to compare against my file to find any missing ones).
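One possible way to prepare such counts, as a minimal Python sketch. It only counts the attribute-less <lang>X</lang> form; the function names are hypothetical, and the two input lists stand in for the lines of the cdsl and AB files.

```python
import re
from collections import Counter

# Matches only the attribute-less form <lang>X</lang>.
LANG_RE = re.compile(r"<lang>([^<]*)</lang>")


def count_lang_tags(lines):
    """Count occurrences of each X in <lang>X</lang> over the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(LANG_RE.findall(line))
    return counts


def compare_counts(cdsl, ab):
    """Return {tag: (cdsl_count, ab_count)} for tags whose counts differ."""
    tags = sorted(set(cdsl) | set(ab))
    return {t: (cdsl[t], ab[t]) for t in tags if cdsl[t] != ab[t]}


# Hypothetical sample lines, for illustration only.
cdsl_lines = ["[<ab>cf.</ab> <lang>Gk.</lang> x; <lang>Lat.</lang> y]", "<lang>Gk.</lang> z"]
ab_lines = ["[<ab>cf.</ab> <lang>Gk.</lang> x; <lang>Lat.</lang> y]", "z"]
print(compare_counts(count_lang_tags(cdsl_lines), count_lang_tags(ab_lines)))
```

A Counter returns 0 for a tag it has never seen, so tags present in only one file show up in the comparison with a zero on the other side.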
Probably, a quick look at mw-dev issue no. 21 would open up some more options for Jim's revising.
[See this comment below for a solution.]
In reviewing #21, noticed that in current display, the Greek text appears NOT to be marked with italics. By contrast, from the screen shot of #21, the display at that time (2020 display) appears to show Italic Greek text.
2020 - italics 👍
Current - no italic 👎
In any dictionary xxx, <ab>X</ab> markup generates a tooltip based on the xxxab_input.txt table for the dictionary.
In some dictionaries, other tags are converted (at time of display) into 'ab' tags by the basicadjust.php component. Then xxxab_input.txt is used for tooltips.
dictionary 'gra', entry 'aMsa'.
gra.txt (and gra.xml): <lang>go.</lang>
graab_input.txt go. <id>go.</id> <disp>Gothic</disp>
display shows 'Gothic' tooltip:
The plan is to use the same model for tooltips of the <lang>X</lang> tags in mw.txt.
It is possible that some values of X for a dictionary xxx could occur in xxx.txt as both <ab>X</ab> (general abbreviation) and <lang>X</lang> (a language abbreviation) and that the 'expansion' (tooltip) would be different for the two markups.
This situation is expected to be rare. If it occurs the 'local abbreviation' markup could be used, or the tooltip in xxxab_input.txt could mention both possibilities.
This situation is expected to be rare.
Exactly
counts
Changes made to working version of cdsl mw.txt regarding the 'lang.tags.txt' file of AB.
@Andhrabharati The cdsl counts are lang_tags_count.txt.
Note that (4) have count = 0 (not found in cdsl mw.txt) - I couldn't find these.
Once AB has reviewed these, the next step is revision of mwab_input.txt to provide tooltips for the <lang>x</lang> elements.
Here is the quick response reg. the not-found tag places: [I shall look at the other tags counts sometime tomorrow.]
line-592437
line-669216
line-345151
line-312487
line-156840
It is noted in my revision work that the MW text has been 'altered' at too many places wrt the print: (a) expansions done directly, (b) punctuation marks skipped or changed, (c) sequence of items around the tag places (esp. the <lex>f.</lex>) interchanged, ... ... ...
[I feel bad to say this, but it appears that more damage is done to the MW text than improvement during last 20-25 years, and the web is abounding with its copies all-around!!]
Here is a comparison of counts wrt my present version-- lang_tags_count (CDSL vs AB).txt
There are 39 tags differing in counts, after the above 4 are excluded.
@Andhrabharati A copy of your version of mw is needed for me to resolve the count differences.
My version has TOO many differences (incl. major format changes), to do any comparison by others.
Pl. wait for a day, to let me check it myself.
The above two commits accomplish this.
Here is the result of a major check done against the lang tag count differences-- lang_tags_checking (CDSL vs AB).txt
I have now separated Sanskṛt-Tibetan and Sanskṛt-Persian into different languages (Sanskṛt, Tibetan and Persian), because these are not composite (derived) languages like Anglo-Saxon and Combro-Briton.
A few tags (Prākṛt, Ved., Lat., Zd.) are yet to be checked fully, and now I would like to ask Jim to post his 'marked' file once looking at this file. [It is easier for me to compare and check with my revised version data.]
Finally I would like to mention that a full-stop normally would have a following space in the abbr. etc.; but when the latter word is connected with a hyphen to the former word there shouldn't be a space after the full-stop. That's the reason for having A.-S., A.S. etc. but not O.H.G.
Here are the remaining differences in counts (based on AB previous work) (cdslrev is based on temp_mw_5, and excludes the 'duplicated group entry' lines)
TAG CDSL CDSLREV AB CDSLREV-AB
<lang>Apabhraṃśa</lang> 7 6 5 1
<lang>Class.</lang> 35 33 32 1
<lang>Germ.</lang> 218 210 209 1
<lang>Lat.</lang> 539 526 523 3
<lang>Prākṛ.</lang> 3 11 9 2
<lang>Prākṛt</lang> 233 227 231 -4
<lang>Ved.</lang> 634 632 604 28
<lang>Zd.</lang> 149 140 139 1
<lang>Zend</lang> 12 20 21 -1
wrong line numbers in lang_tags_checking.CDSL.vs.AB.txt
342487 <L>101804<pc>519,1<k1>DUsara<k2>DUsara<e>1B
347649 <LEND>
line 505970 <L>150486<pc>755,3<k1>BAzA to mark as lang? Avanti
<lang>W.</lang> -> <ls>W.</ls> (WILSON) ? (karkara) Otherwise, what is W.?
<lang>W.</lang> -> <ls>W.</ls> (WILSON) ? (karkara) Otherwise, what is W.?
It is the Welsh language in which the word 'careg' means 'stone'.
Here are the resolutions of diff. counts-- resolving lang diff. counts.txt
Also, I had changed the <ab>ep.</ab> to <lang>ep.</lang>, as it stands for the "Epic Sanskrit" language.
[Note: In a school of thought, the Skt. language is divided as (a) Vedic, (b) Brahmanic (and Upanishadic), (c) Epic, (d) classic and (e) later period types.]
I recall mentioning long back, while talking about my <lang> and <cog> tagging, that <s> tags near the <lang> tags can easily be identified as marking non-Skt. language terms.
Here are three such cases now in the cdsl mw.txt
(156672): ¦ [<ab>cf.</ab> <lang>Gk.</lang> <lang n="greek">καρκίνος</lang>; <lang>Lat.</lang> <s>cancer</s>.]<info lex="inh"/>
;; <s> tag to be changed to <etym> tag, as per the current cdsl convention.
(263223): <s>jarA/yu</s> ¦ <lex>n.</lex> the cast-off skin of a serpent, <lang n="greek">γῆρας</lang> <s>pas</s>, <ls>AV. i, 27, 1</ls><info lex="n"/>
;; <s>pas</s> to be deleted
(566277): <hom>1.</hom> <s>ya</s> ¦ the 1st semivowel (corresponding to the vowels <s>i</s> and <s>I</s>, and having the sound of the <lang>English</lang> <s>y</s>, in Bengal usually pronounced <s>j</s>).
;; <s>y</s> to be changed to <i>y</i> and <s>j</s> to <i>j</i>, as they do not belong to Skt.
resolving.lang.diff.counts.txt file does not explain all lang diffs. E.g. for Apabhraṃśa,
<lang>Apabhraṃśa</lang> 6 5 1
;; line 703494 deleted (being a dupl. entry) in AB version
jim: 7 matches for "<lang>Apabhraṃśa</lang>" in buffer: temp_mw_5.txt
703494 AB-deleted. So Jim still has 1 more
Thus, to get at all the lang differences, we need more than counts. Here's an idea how.
Assumption: the AB mw.txt file and CDSL mw.txt file have the same number of lines, and the lines of a given line-number correspond (although the AB version has marked some entries as 'deleted'; see the ab_lnums_del.txt note below).
From AB's MW file, AB can construct a file with 2 columns (let's say the filename is lnum_lang_ab.txt): lnum and <lang>X</lang>.
Note: for a given lnum, there often will be multiple lines in the lnum_lang file, one for each <lang>X</lang>, and the order within the lnum_lang file should be the same as in the line 'lnum' of mw.txt.
Jim can also construct such a file (lnum_lang_cdsl.txt) using the cdsl file (e.g. temp_mw_5). And then Jim can remove the lines whose lnums belong to the AB 'deleted' lines (ab_lnums_del.txt), thereby constructing lnum_lang_cdsl1.txt.
When the AB version and cdsl versions (re lang tag) are identical, these two files (lnum_lang_ab.txt and lnum_lang_cdsl1.txt) will be identical.
And when these two files are different (as now), the diff will point to the precise differences.
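The idea above could be sketched in Python roughly as follows. This is only an illustration under the stated assumption that line numbers correspond between the two files. The function names, the TAB separator, and the ';; deleted' placeholder line are hypothetical.

```python
import difflib
import re

LANG_RE = re.compile(r"<lang>[^<]*</lang>")


def lnum_lang(lines, deleted=frozenset()):
    """Return rows 'lnum<TAB><lang>X</lang>', one per tag, in order of occurrence.

    Lines are numbered from 1; line numbers in `deleted` are skipped,
    which corresponds to the step from lnum_lang_cdsl to lnum_lang_cdsl1.
    """
    rows = []
    for lnum, line in enumerate(lines, start=1):
        if lnum in deleted:
            continue
        for tag in LANG_RE.findall(line):
            rows.append(f"{lnum}\t{tag}")
    return rows


def diff_rows(ab_rows, cdsl_rows):
    """Unified diff of the two row lists; empty when the lang markup agrees."""
    return list(difflib.unified_diff(
        ab_rows, cdsl_rows,
        "lnum_lang_ab.txt", "lnum_lang_cdsl1.txt",
        lineterm="", n=0,
    ))


# Hypothetical three-line files; line 2 is a duplicate entry deleted by AB.
ab_lines = ["<lang>Gk.</lang> a <lang>Lat.</lang>", ";; deleted", "<lang>Zend</lang>"]
cdsl_lines = ["<lang>Gk.</lang> a <lang>Lat.</lang>", "<lang>Gk.</lang>", "<lang>Zd.</lang>"]
for row in diff_rows(lnum_lang(ab_lines), lnum_lang(cdsl_lines, deleted={2})):
    print(row)
```

Because each row carries its line number, a difference in the diff points directly at the line of mw.txt to inspect, which counts alone cannot do.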
@Andhrabharati What do you think? If you construct the AB file, I'll construct the CDSL file and analyze the diffs, making changes to cdsl as needed.
Pl. note that the addition of Slav. at 558566 will make the count 114 in cdsl. There is a deletion of a dupl. entry at 352669 (in the AB file) to resolve the same.
line 505970 <L>150486<pc>755,3<k1>BAzA to mark as lang? Avanti
Yes, it is to be made as <lang>Avantī</lang>
wrong line numbers in lang_tags_checking.CDSL.vs.AB.txt
342487 <L>101804<pc>519,1<k1>DUsara<k2>DUsara<e>1B
347649 <LEND>
342487 > 312487 ;; noted that cdsl has the intended correction of E. there now.
347649 > 347349 ;; noted that AB has the intended correction of Prākṛ. there.
<lang>Apabhraṃśa</lang> 6 5 1
;; line 703494 deleted (being a dupl. entry) in AB version
to be read as
<lang>Apabhraṃśa</lang> 7 6 1
;; line 703494 deleted (being a dupl. entry) in AB version
When the AB version and cdsl versions (re lang tag) are identical, these two files (lnum_lang_ab.txt and lnum_lang_cdsl1.txt) will be identical.
And when these two files are different (as now), the diff will point to the precise differences.
@Andhrabharati What do you think? If you construct the AB file, I'll construct the CDSL file and analyze the diffs, making changes to cdsl as needed.
Yes @funderburkjim , this indeed is a great idea; and can be used in future comparisons.
Final changes suggested--
<lang>Saṃskṛt</lang> -> <lang>Sanskṛt</lang> at 7 places
<lang>Sanskṛt-Persian</lang> ;; <lang>Sanskṛt</lang>-<lang>Persian</lang> at 413630
<lang>Sanskṛt-Tibetan</lang> ;; <lang>Sanskṛt</lang>-<lang>Tibetan</lang> at 540128
<lang>Prākr.</lang> -> <lang>Prākṛ.</lang> at 354182
Pl. consider changing the above in cdsl.
I intend to construct lnum_lang_cdsl1.txt tomorrow and upload. Hope you will post lnum_lang_ab.txt.
I had checked the temp_mw_5 file for the corrections mentioned in my above posts and files, and noted that
I have spent some time in analyzing the two versions, and here is the resulting file-- lnum_lang analysis (CDSL vs AB).txt
Hope this would enable Jim to do the final corrections in the cdsl version.
------------------------------------
In case Jim decides to change the 345 instances of <ab>ep.</ab> in the cdsl version, they would correspond to 328 instances of <lang>ep.</lang> in the AB version (the rest falling in the deleted dup. entries); we don't have to spend time again in cross-checking those.
I intend to construct lnum_lang_cdsl1.txt tomorrow and upload. Hope you will post lnum_lang_ab.txt.
Sorry for doing more work in my above file than you asked for; hope you'd find the same useful enough to "derive" what you were aiming at.
This resolves all lang markup differences, based on lnum_lang analysis (CDSL vs AB).txt
The lines marked ';; deleted' by AB are in lla/ab_lnums_del2.txt.
change_mw_5_6.txt documents the changes made in 28 lines.
temp_mw_7.txt implements several odds and ends suggested by AB in recent comments. Details in issue65/readme.txt at 'cp temp_mw_6.txt temp_mw_7.txt' ff. (includes <lang>ep.</lang>, and a few others)
Request @Andhrabharati to mention if I've overlooked anything.
I think the <lang> markup is now resolved in the cdsl version, and consistent with the hidden AB version.
My next step will be to revise the tooltip file (mwab_input.txt in csl-pywork) in light of the \
- 3 exceptions at [lla/diffgroups_lla3.txt] Request AB change 'Class.' to 'class.' in his mw.txt for these three cases.
DONE.
temp_mw_7.txt implements several odds and ends suggested by AB in recent comments. Details in [issue65/readme.txt] at 'cp temp_mw_6.txt temp_mw_7.txt' ff. (includes <lang>ep.</lang>, and a few others)
These details are not seen in the readme file.
Request @Andhrabharati to mention if I've overlooked anything.
Pl. see the points 1 & 3 at this post
And finally, are you implementing this?-
04-10-2024 post zip of temp_mw_5.txt to issue at AB request.
TODO: https://github.com/sanskrit-lexicon/mw-dev/issues/21 AB suggests using '<gk>X</gk>' instead of <lang n="Greek">x</lang>. Similarly for Arabic, <ar>
cp temp_mw_6.txt temp_mw_7.txt
This is now found in the readme.txt file.
These are the current tooltips (with counts) for the 112 lang abbreviations in temp_mw_7.txt. @Andhrabharati request you edit these and improve the tips.
@gasyoun and @drdhaval2785 - if you happen to notice this, you also might have some good suggestions for improving the language tooltips
(CDSL): 254101
old Prākṛt (AB): 254101Old Prākṛt (CDSL): 254116
old Prākṛt (AB): 254116 ---------- ;; deleted (dup. entry)
Though Jim has changed "old Prākṛt" to "Old Prākṛt" at 254101, he forgot to do the same at 254116!
And, here is the updated mw_input_2_lang file--
@funderburkjim
You might consider making these changes also--
54627: <lang n="Hindūstānī">Hind.</lang> ;; with this, **Hind.** stands just for **Hindī** (in the mwab list)
413624: <lang n="Persian">P°</lang>
413633: <lang n="Persian">P°</lang>
413633: <lang n="Sanskṛt">S°</lang>
413636: <lang n="Persian">P°</lang>
413636: <lang n="Sanskṛt">S°</lang>
413639: <lang n="Persian">P°</lang>
413639: <lang n="Sanskṛt">S°</lang>
413639: <lang>Arab</lang>
This version has all changes since version 7. There are accompanying mw change files change_7_8.txt and change_8_9.txt.
I aimed to make use of <ab>, <lang>, <lex>
<gk>X</gk> and <arab>X</arab> tagging for text in Greek script or Arabic script. This replaces the former <lang n="greek">X</lang>, and similarly for arab text.
<arab lang="X">Y</arab>. <arab lang="Turkish">خان</arab>. Also araGawwa.
Everything installed at cologne server.
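For readers following along, the script-tag change could be expressed as two regex substitutions. This is a minimal Python sketch based only on the old and new forms quoted in this thread; the function name is hypothetical, and it is an assumption that every Arabic-script tag (including n="Arabic") maps to <arab lang="...">.

```python
import re

# Old forms as listed earlier in this thread.
GREEK_RE = re.compile(r'<lang n="greek">(.*?)</lang>')
ARABIC_RE = re.compile(r'<lang script="Arabic" n="([^"]*)">(.*?)</lang>')


def convert_script_tags(line):
    """Rewrite Greek-script and Arabic-script lang tags to the gk/arab forms."""
    line = GREEK_RE.sub(r"<gk>\1</gk>", line)
    # Assumption: the old n attribute becomes the lang attribute of <arab>.
    line = ARABIC_RE.sub(r'<arab lang="\1">\2</arab>', line)
    return line


print(convert_script_tags('<lang>Gk.</lang> <lang n="greek">καρκίνος</lang>'))
print(convert_script_tags('<lang script="Arabic" n="Turkish">خان</lang>'))
```

Attribute-less tags such as <lang>Gk.</lang> match neither pattern, so the language-abbreviation markup is left alone.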
Perhaps this issue is now closeable? Will wait for @Andhrabharati review.
met. <disp>?</disp> <count>ab,30</count>
;; met. stands for metaphorically
p.p.p. <disp>past participle</disp> <count>ab,1</count>
;; p.p.p. stands for perfect passive participle
And some plural forms like ss.vv., qq.vv., qq.v. need corrections, as they currently have only the singular-form expansions.
qq.vv. = qq.v. = quae vide
ss.vv. = sub vocibus
Also noted that at many places in mw.txt, the comp. stands for compar. and not for "compound".
It appears that all necessary changes are done now, pertaining to this issue; so this issue can be closed now.
Revise mwab_input per previous comment.
Incidental revisions to mw ( ,; ). See change_9_10.txt.
Closing issue. Thanks, @Andhrabharati !
might have some good suggestions for improving the language tooltips
@funderburkjim thanks for reminding. I do have some.
Technical first.
1) Old HGerm.
2) Sl. dot at the end?
3) do not understand why Irish remains empty: Irish
4) @Andhrabharati are there such Indian prakrits? Avantī
5) prakrits why with ? Apabhraṃśa
Mahrattī - Indic language spoken mainly in the western Indian state of Maharashtra, from Maratha (= Mahratta). Believe we can add additional hints in the display?
6) stats: Should there be a place where the total of Class. or Angl.-Sax. slight variations is counted? 2+1+135?
Tulu is not a Prakrt variant. It is one of the 7 major Dravidian languages (having own scripts).
@funderburkjim
Just noticed that my file had lagauge (instead of language) at 5 places, which is taken by you "as is".
Just noticed that my file had lagauge (instead of language)
@Andhrabharati I find no lagauge in current mw
might have some good suggestions for improving the language tooltips
@gasyoun I believe your comments are based on an outdated version of mwab_input.txt. The only item which I changed is the one for the period in Sl.
You may want to review again using current mwab_input.txt.
Regarding your comment at 6. stats:
mwab_tipcount.txt is an aggregation on 'tooltip'. This is a CSV file with TAB as separator. It has 311 lines vs. the 422 lines of mwab_input.txt.
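To illustrate what aggregating on tooltip means for the totals asked about in 6., here is a minimal Python sketch. It is not the actual script behind mwab_tipcount.txt. The function names, the output column order, and the sample counts are hypothetical.

```python
def aggregate_by_tooltip(entries):
    """Group (abbreviation, tooltip, count) triples by tooltip.

    Returns {tooltip: ([abbreviations], total_count)}, keeping first-seen order.
    """
    totals = {}
    for abbr, tip, count in entries:
        abbrs, total = totals.get(tip, ([], 0))
        totals[tip] = (abbrs + [abbr], total + count)
    return totals


def to_tsv(totals):
    """One TAB-separated row per tooltip: tooltip, abbreviations, total."""
    return [
        "\t".join([tip, ",".join(abbrs), str(total)])
        for tip, (abbrs, total) in totals.items()
    ]


# Hypothetical sample: variant spellings sharing one tooltip collapse to one row.
sample = [
    ("A.-S.", "Anglo-Saxon", 2),
    ("A.S.", "Anglo-Saxon", 1),
    ("Angl.-Sax.", "Anglo-Saxon", 135),
    ("Gk.", "Greek", 10),
]
for row in to_tsv(aggregate_by_tooltip(sample)):
    print(row)
```

Collapsing variants in this way is why an aggregated file has fewer lines than the input table it is built from.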
Just noticed that my file had lagauge (instead of language)
@Andhrabharati I find no lagauge in current mw
They are in the mwab_input file.
They are in the mwab_input file.
got it. Done.
Hindi is mentioned 18 times in MW. But this quote has no markup in 180704: <p><b>In_Hindi_this_root_often_means_<quote>_to_begin._</quote>