words in different languages

funderburkjim commented 11 months ago

The revisions to bhs.txt discussed in #1 provide markup which identifies the language of various phrases. The summary (from bhs-meta2.txt) is

<fr>   254  : text in French language  (254 instances)
<ger>   61  : text in German language
<tib> 3193  : text in Tibetan language
<gk>     3  : text in Greek language
<lat>    2  : text in Latin language
<toch>   4  : text in Tocharian language

In the displays, such text is shown in 'brown' color, and marked with tooltip (e.g. 'French language' for <fr>X</fr>).
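A minimal sketch of how counts like those in the summary could be reproduced, assuming the tags appear literally as `<fr>`, `<ger>`, etc. in the text. The function name `count_lang_tags` and the sample string are hypothetical, not taken from the repository scripts.

```python
import re
from collections import Counter

# Tag names come from the summary above; everything else is illustrative.
LANG_TAGS = ("fr", "ger", "tib", "gk", "lat", "toch")

def count_lang_tags(text):
    """Count opening tags such as <fr> for each language tag."""
    pattern = re.compile(r"<(%s)>" % "|".join(LANG_TAGS))
    return Counter(m.group(1) for m in pattern.finditer(text))

# Toy sample, not a real bhs.txt line.
sample = "{%<fr>ouvre ta bouche</fr>%} <tib>kha ḥbyed pa</tib> <fr>en soi</fr>"
counts = count_lang_tags(sample)
```

Run against the full bhs.txt, the same function would give one count per language tag.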

In #1, @Andhrabharati suggested

a programmatic approach to mark the french and german words (using these lists) in the BHS.txt? [I think, I had marked all the Tibetan words in my posted file.]

A few possible problems with this markup have been noticed by random observation, e.g., under Anantarya, unmittelbare Folge should be marked with <ger>

This issue opened as reminder of this idea to enhance bhs digitization.

funderburkjim commented 11 months ago

@Andhrabharati Here are words in bhs.txt that are

in <ger>X</ger>
not found as German words.

Jmd  (3 times) -- probably German
cucullus   Latin?
trāyin  Sanskrit?
admonitio  Latin?
admonere   Latin?
Śakti Sanskrit?
urgā Sanskrit?
balustrade  English

The remaining 164 words in <ger>X</ger> were found to be German.
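A check of this kind could look roughly like the sketch below. It assumes a German word list loaded into a Python set; the function name `unknown_words_in_tag` and the toy word list are hypothetical.

```python
import re

def unknown_words_in_tag(text, tag, known_words):
    """Return words inside <tag>...</tag> that are absent from known_words."""
    spans = re.findall(r"<%s>(.*?)</%s>" % (tag, tag), text, flags=re.DOTALL)
    unknown = []
    for span in spans:
        span = re.sub(r"<[^>]+>", " ", span)  # drop nested tags such as <ab>
        for word in re.findall(r"[^\W\d_]+", span):  # runs of letters only
            if word.lower() not in known_words and word not in unknown:
                unknown.append(word)
    return unknown

# Toy word list; a real run would load a full German word list from a file.
german = {"zusprechen", "zureden", "gute", "aufnahme"}
text = "<ger>Jmd zusprechen, zureden, admonere</ger> and <ger>gute Aufnahme</ger>"
result = unknown_words_in_tag(text, "ger", german)
```

An abbreviation such as Jmd is reported as unknown because it is not in the word list, so the output is a list of candidates to review, not a list of errors.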

funderburkjim commented 11 months ago

Instances of Jmd:

hw: Avarjayati
sich Jmd geneigt machen, für sich gewinnen

hw: pariBAzati
Jmd zusprechen, zureden, admonere

hw: saMpratigraha
gute Aufnahme, Vorliebe für Jmd

@maltenth or @fxru Can you tell what 'Jmd' means in these? Is it a German word ?

Andhrabharati commented 11 months ago

Jmd is an abbr. for Jemand (= someone, somebody); and, it is a german word.

Andhrabharati commented 11 months ago

Also I had tried to mark the Tibetan text-strings within italics. Similar exercise was started for French and German text-strings, but is not done fully yet. If this markup makes some sense (and has any benefit), we can resume this part and complete in a short time.

This is what I had mentioned about the fr- and ger- marking (limited to italic strings) at the very beginning.

A few possible problems with this markup have been noticed by random observation, e.g., under Anantarya, unmittelbare Folge should be marked with <ger>

I knew very well that more italic strings need to be marked yet; and I've noticed quite a few non-italic strings as well, that belong to other languages. Hence was my request to you to try the programmatic approach using the spellchecker lists.

"I see that you had used some word-lists of German and French, in this BHS repo for some analytics."

a programmatic approach to mark the french and german words (using these lists) in the BHS.txt?

Here I meant continue marking the words using the spellchecker_french.txt and spellchecker_german.txt under the eng_error_lang folder.

But, I presume that you now have started checking the words I had marked so far; I have marked the full italic string as a single lang., though it contained other language words, say like the ones you have pointed above [cucullus, admonitio, admonere : latin; balustrade : english; and trāyin, Śakti, Durgā (not urgā) : Sanskrit.]

funderburkjim commented 11 months ago

additional German, French markup

Work done in issues/issue3 directory.

additional German text markup. 60 instances. Refer change_1_ger.txt
additional French text markup. 42 instances. Refer change_2_fr.txt

@Andhrabharati anything else to do regarding this issue?

Andhrabharati commented 11 months ago

@funderburkjim

I had looked at the readme file and then the change_1_ger file.

I think, your work is apparently limited to the italic strings alone (as seen in the change_1_ger file). Then, I made several quick checks and found more strings that were 'un-caught' by you. some more ger markings.txt

So, the "MarvinJWendt" word-list is also not quite complete, like the spellchecker list!!

This is just the result of a quick search and, I think, more could be lying in the text. [I haven't yet looked at your change_2_fr file, but most probably that also would be in the similar state as the change_1_ger file.]

Andhrabharati commented 11 months ago

So, you need to "traverse" in some different path to identify the words fully [I cannot and should not dare giving you tips and tricks!!], or leave the task to me to take up at sometime later.

funderburkjim commented 11 months ago

Found some more french/german italicized text. See check3a_edit.txt.

This based on examination of italicized text containing non-English word(s): check3.txt.

@Andhrabharati Have I missed any?

Andhrabharati commented 11 months ago

@funderburkjim

Glad that my post is taken by you in good spirit; I had felt later that my wordings are somewhat in 'negative shade'. [Let me also have a closer look for the ger & fr words in the italic portion once.]

I guess, there could be few more botanical (latin) names (you had listed/marked 10 now).

And, pl. do the similar exercise with non-italic text too, for completeness.

Andhrabharati commented 11 months ago

And, would you pl. post your latest file?

Just found that you had got werden in line 5531- {%beklommen werden%}, but missed it in line 55371- {%werden%}!
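Misses of this kind could be searched for mechanically. The sketch below is only an illustration, assuming italics are written as `{%...%}` and that a word already tagged somewhere is a candidate wherever it appears untagged in italics; the function name `unmarked_repeats` and the sample lines are hypothetical.

```python
import re

def unmarked_repeats(lines, tag):
    """Return (line_number, italic_text) for untagged italic strings whose
    words all occur inside <tag>...</tag> somewhere in the text."""
    tag_pat = re.compile(r"<%s>(.*?)</%s>" % (tag, tag))
    word_pat = re.compile(r"[^\W\d_]+")
    tagged_words = set()
    for line in lines:
        for span in tag_pat.findall(line):
            span = re.sub(r"<[^>]+>", " ", span)  # ignore nested tags
            tagged_words.update(w.lower() for w in word_pat.findall(span))
    hits = []
    for n, line in enumerate(lines, start=1):
        for italic in re.findall(r"\{%(.*?)%\}", line):
            if "<" in italic:
                continue  # italic string already carries some markup
            words = [w.lower() for w in word_pat.findall(italic)]
            if words and all(w in tagged_words for w in words):
                hits.append((n, italic))
    return hits

# Toy lines, not the real lines 5531 and 55371.
lines = ["{%<ger>beklommen werden</ger>%}", "some text {%werden%} more", "{%come on!%}"]
hits = unmarked_repeats(lines, "ger")
```

The hits are only suggestions: a word tagged as German in one place may still be English elsewhere, so each one needs a manual look.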

Andhrabharati commented 11 months ago

recueillements is marked at {%<fr>recueillements</fr>%}, but is it so in {%abstract meditations, trances, <fr>recueillements</fr>?%} also?

And did you get {%<fr>par connexion</fr>%} at line 70249, where {%<fr>en soi</fr>%} is marked?

Andhrabharati commented 11 months ago

I guess, there could be few more botanical (latin) names (you had listed/marked 10 now).

Just an example-- You had marked Agati grandiflora, but left Aeschynomene grandiflora in the same line (43471).

And let's have these marked with <bot> & <zoo> tags (as the case may be), and not with <lat>, as at other CDSL works.

funderburkjim commented 11 months ago

oversights handled

See 'Additional changes' in check3a_edit.txt. These were missed by me in first review of check3.txt.

{%Aeschynomene grandiflora%} was not flagged in check3.txt because the English word list I used had both these words!
For a similar reason, 'par connexion' was not flagged since both words appear in the english word list.
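The effect can be reproduced with a small sketch. It assumes the flagging rule was "an italic string is flagged if it contains at least one word missing from the English word list"; the function name `flag_italics` and the toy word list are hypothetical and not the actual script behind check3.txt.

```python
import re

def flag_italics(text, english_words):
    """Return italic strings {%...%} containing a word not in english_words."""
    flagged = []
    for italic in re.findall(r"\{%(.*?)%\}", text, flags=re.DOTALL):
        plain = re.sub(r"<[^>]+>", " ", italic)  # ignore any tags inside
        words = re.findall(r"[^\W\d_]+", plain)
        if any(w.lower() not in english_words for w in words):
            flagged.append(italic)
    return flagged

# Toy English list that happens to include both 'par' and 'connexion'.
english = {"par", "connexion", "come", "on"}
text = "{%par connexion%} {%unmittelbare Folge%} {%come on!%}"
flagged = flag_italics(text, english)
```

With this rule, 'par connexion' passes silently because every word is in the English list, which is the kind of miss described above.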

Andhrabharati commented 11 months ago

For a similar reason, 'par connexion' was not flagged since both words appear in the english word list.

Yes, I've seen some more words being in English borrowed from other languages 'as is', and it is debatable whether to mark them as the 'parent' language words!!

One way that I feel a sure 'proper' manner is to decide by the context-- if occurring in the other language work (identifiable by the author's name and/or the work), it could be treated as the foreign 'parent' word.

Andhrabharati commented 11 months ago

I thought I should do the tagging again, and here are the names with their counts [unique (total)]--

Andhrabharati commented 11 months ago

Note esp. that the ls, ab and lang tags are now increased further.

funderburkjim commented 11 months ago

Please upload your bhs_ab_2 version so I can resolve the differences. e.g., My latest version has 309 `<fr>` tags, compared to your 314.
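One way to see where two versions disagree is to count opening tags in each and list only the tags whose counts differ. This is a sketch under that assumption; the function names `tag_counts` and `count_differences` and the sample strings are hypothetical.

```python
import re
from collections import Counter

def tag_counts(text):
    """Count opening tags by name, e.g. <fr> or <ls n="...">."""
    return Counter(re.findall(r"<([A-Za-z]+)(?:\s[^>]*)?>", text))

def count_differences(text_a, text_b):
    """Map tag name -> (count in a, count in b) for tags whose counts differ."""
    a, b = tag_counts(text_a), tag_counts(text_b)
    return {t: (a[t], b[t]) for t in sorted(set(a) | set(b)) if a[t] != b[t]}

# Toy stand-ins for the two versions of the text.
version_a = "<fr>a</fr> <fr>b</fr> <ger>c</ger>"
version_b = "<fr>a</fr> <ger>c</ger> <lat>d</lat>"
diff = count_differences(version_a, version_b)
```

Equal counts do not prove the markup is identical, so a line-by-line comparison is still needed afterwards.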

Andhrabharati commented 11 months ago

Here it is, @funderburkjim -- BHS-AB_2.zip

And, pl. be noted that I have done some addl. corrections too, apart from updating the taggings.

funderburkjim commented 11 months ago

Thanks. I'll focus on the tag counts of your table for now.

Andhrabharati commented 11 months ago

Once you are done with this phase, pl. post your file, and probably close the issue.

Then I can take-up resolving the (latest) unidentified (or doubtful) ab- and ls- tags [as updated by you, using my AB_2 file], in another issue.

funderburkjim commented 11 months ago

additional revisions

temp_bhs_ab_3.zip contains the end result.

Work done in compare sub-directory.

Generally, the abbreviation markup changes of bhs.ab.2 were accepted; My additional changes (of temp_bhs_ab_3.txt) are documented in changes_bhs_ab_3.txt.

After resolving the abbreviation changes, I also identified and applied the remaining differences. These are documented in compare_texts_notes.txt.

temp_bhs_ab_3.txt is now the latest csl-orig version for bhs, and is the basis of the displays.

The 'tooltip' files (for general abbreviations and literary source abbreviations) were also modified to be consistent with temp_bhs_ab_3.txt markup. Versions with 'counts' are

tagcount_ab.txt general abbreviations
tagcount_ls.txt literary source abbreviations.

Many of these (esp. for ls) are currently only 'placeholders', with '?' as the tooltip. These need to be resolved. I'll open another issue for this tooltip revision.

@Andhrabharati If you accept temp_bhs_ab_3.txt, we can close this issue 3.

Andhrabharati commented 11 months ago

Generally, the abbreviation markup changes of bhs.ab.2 were accepted; My additional changes (of temp_bhs_ab_3.txt) are documented in changes_bhs_ab_3.txt.

------------------------------------------------------------ CHANGES for tags other than '<ls>' ------------------------------------------------------------

See how odd the modifier apostrophe looks at these places (of course, this is a font dependent issue!); we never see such forms in any french print! The caron-forms are what are seen in print.

As such, I suggest using ď (U+010F), ľ (U+013E) and Ľ (U+013D) at these places. -----------------------

AB: <fr>a fortiori</fr> -> <lat>a fortiori</lat> -----------------------

<L>1171<pc>040,1<k1>antaHSalya old: <ger>inner dart</ger> new: inner dart

AB: agreed, I had erroneously marked this as german. -----------------------

<L>3163<pc>115,1<k1>indrapawa old: <ger>so <ab>v.a.</ab></ger> new: so <ab>v.a.</ab>

<L>10517<pc>386,1<k1>pravicAraRa old: <ger>so <ab>v.a.</ab></ger> new: so <ab>v.a.</ab>

This is purely a german form [occurred 4500+ times in pwk and 6500+ times in PWG], and I suggest changing both the places where it occurred thus (which were picked up from the resp. german Wörterbuch)-- <L>3163<pc>115,1<k1>indrapawa new: <ger>{%Luftgewand%}, so <ab>v.a.</ab> {%Nacktheit%}</ger>

<L>10517<pc>386,1<k1>pravicAraRa new: ‘<ger>{%Unterscheidung%}, so <ab>v.a.</ab> {%Art%}</ger>’

PS. The expansion of <ab>v.a.</ab> may be seen in the tagcount_ab file in the other issue (#4). -----------------------

<L>6914<pc>251,2<k1>tAyin old: <ger>wohl nur fehlerhaft für</ger> trāyin new: <ger>wohl nur fehlerhaft für trāyin</ger>

AB: agreed -----------------------

<L>9612<pc>347,2<k1>purasta ? Cannot find Ledder as German word

Ledder is a Low German form, and I find this https://wordsense.eu site quite useful in identifying the words and languages. -----------------------

<L>10125<pc>369,1<k1>prativiza wolfsbane is English common name of plant old: wolfsbane new: wolfsbane

AB: agreed -----------------------

global change <lat>ibidem</lat> -> <ab>ibidem</ab> 6 <lat>et alibi</lat> -> <ab>et alibi</ab> 83 <lat>et passim</lat> -> <ab>et passim</ab> 25 <lat>passim</lat> -> <ab>passim</ab> 22

Firstly, this list has missed <lat>et cetera</lat> 3, <lat>ipso facto</lat> 2 and <lat>vice versa</lat> 9 which are also of the same nature.

I suggest retaining all these with lat-tagging; these are all latin phrases (that were brought into English language as is), not abbr.s in any manner. I had followed the point that I mentioned above in marking these thus. -----------------------

<L>4564<pc>171,2<k1>kalambukA old: {%convolvulus repens?%} new: {%<bot>convolvulus repens?</bit<%}

AB: agreed; and as I do in manual marking, you had also erred here </bit<! -----------------------
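Slips like `</bit<` could be caught with a simple balance check on each line. The sketch below assumes tags do not span lines and only looks at a fixed set of tag names; the function name `find_tag_problems` and the tag list are hypothetical.

```python
import re

def find_tag_problems(line, tags=("bot", "lat", "fr", "ger", "tib", "ab", "ls")):
    """Return descriptions of unbalanced tags (from `tags`) in one line."""
    problems = []
    stack = []
    for m in re.finditer(r"<(/?)([A-Za-z]+)(?:\s[^>]*)?>", line):
        closing, name = m.group(1) == "/", m.group(2)
        if name not in tags:
            continue  # other markup such as <L> or <pc> is ignored
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            problems.append("unexpected </%s>" % name)
    problems.extend("unclosed <%s>" % name for name in stack)
    return problems

bad = find_tag_problems("{%<bot>convolvulus repens?</bit<%}")
good = find_tag_problems("{%<bot>convolvulus repens?</bot>%}")
```

The malformed `</bit<` is never recognised as a closing tag, so the opening `<bot>` is reported as unclosed.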

<L>5940<pc>219,1<k1>grAmeluka old: <lang n="Māgadhi">Mg.</lang> new: <ab n="Māgadhi">Mg.</ab> Reason: cdsl interprets Y : Y is text in language X

I had earlier marked it properly as <lang>Mg.</lang>, it being a language (listed by Edgerton himself), but it had conflicted with <ab>Mg.</ab> (that denotes 'Meaning').

BTW, just noticed that I had missed the ending letter long ī at this tagging.

I think it is appropriate to mark it somehow as a language; but is not a big deal to break the heads over.

------------------------------------------------------------ CHANGES FOR <ls> ------------------------------------------------------------

<L>6220<pc>229,1<k1>cAru old: Caraka new: <ls>Caraka</ls>

AB: disagree; here 'Caraka' is not referring to the legendary proponent of Ayurveda (that is ls-tagged), but to some king. No tagging needed here. -----------------------

<L>3933<pc>149,1<k1>ullumpati old: <ls>BR.</ls> new: <ls>BR</ls>

[Same for the next two as well; so, not elaborating them.]

AB: not a big point to disagree; but just like to mention that there are many cases of ab- and ls- entities occurring with and without a dot followed throughout the text. We should treat it as the author's style, instead of trying to 'normalise' them! -----------------------

<L>11786<pc>424,1<k1>mahABIzma old: <ls>Mahāsamāj., [Page424-b] Waldschmidt, Kl. Skt. Texte 4</ls> new: [Page424-b] <ls>Mahāsamāj., Waldschmidt, Kl. Skt. Texte 4</ls>

<L>3858<pc>146,1<k1>upAnaha old: <ls n="Śāṅkh. Gṛhy. Sūt.">ŚGS</ls> new: <ls>ŚGS</ls> Note: 'n' attribute serves another purpose for 'ls' element

AB: agree for these two changes.

funderburkjim commented 11 months ago

Re French apostrophe

I disagree with use of Latin Small Letter D With Caron (U+010F), and the other two. I think some form of apostrophe should be used after d, l, and L in French, just as it is used in cʼest several times in bhs. Ref: https://www.frenchtoday.com/blog/french-pronunciation/elision/

The U+010E has some other purpose, I think. Literally meaning ‘little hook,’ caron (č) represents a rising tone (Ref). This seems to be used in Czech and Slovak, not French.

Similarly, https://en.wiktionary.org/wiki/%C4%BE says that ľ Latin Small Letter L With Caron (U+013E) is in Slovak alphabet. Also see 'Orthography' section of https://en.wikipedia.org/wiki/Slovak_language for this character.

I think in our work, the ʼ U+02BC MODIFIER LETTER APOSTROPHE is used in cʼest and also this is the apostrophe used for many (1500+) other purposes

Andhrabharati commented 11 months ago

OK, @funderburkjim; now, I see that Google shows plenty of french pages with d̕ etc. [LATIN SMALL LETTER D WITH COMMA ABOVE RIGHT, 0064 + 0315].

Of course, Unicode chart itself recommends using 02BC instead of this--

Why not approach someone more knowledgeable in French to confirm and conclude the matter, say Sampada or Odile?

funderburkjim commented 11 months ago

Have sent email requesting help from Odile:

Hi, Odile --

My colleague Andhrabharati and I have been working on the Cologne digitization of the Buddhist Hybrid Sanskrit Dictionary. This dictionary has words in many languages, including French. And we need your opinion as French-language expert!

Maybe I can phrase the question as: How is the apostrophe typically entered in French?

For example, in this fragment from Burnouf, `d'ici. || c'est pourquoi` we have used the simple apostrophe character `D APOSTROPHE I C I`.

Is this the common practice with French text which we should follow? Or are there special unicode characters (such as Latin Small Letter D With Caron (U+010F)) that we should use?

By the way, here is part of the BHS discussion: https://github.com/sanskrit-lexicon/BHS/issues/3#issuecomment-1686697683

funderburkjim commented 11 months ago

In the cdsl versions of Burnouf and Stchoupak, the simple apostrophe U+0027 is used. e.g., d'ici. || c'est pourquoi in BUR under headword atas.

Odile has contributed extensively to these digitizations.

funderburkjim commented 11 months ago

Here is a way to retain the <lat> tag, and also get tooltips (for these, I think the tooltip is normally important): We have used similar coding <ger>... <ab>v.a.</ab> ... </ger>.

global change With the changes below, the display: a) prints the text in the 'language' color (brown) b) Provides a tooltip (from bhsab_input)

<ab>ibidem</ab> -> <lat><ab>ibidem</ab></lat> 6
<ab>et alibi</ab> -> <lat><ab>et alibi</ab></lat> 83
<ab>et passim</ab> -> <lat><ab>et passim</ab></lat> 25
<ab>passim</ab> -> <lat><ab>passim</ab></lat> 22
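A global change of this kind could be scripted along the following lines. This is a sketch only, assuming the phrases appear exactly as `<ab>PHRASE</ab>`; the function name `wrap_latin_abbreviations` and the sample text are hypothetical.

```python
import re

# Phrases taken from the list above.
PHRASES = ["ibidem", "et alibi", "et passim", "passim"]

def wrap_latin_abbreviations(text):
    """Wrap <ab>PHRASE</ab> in <lat>...</lat>; return (new_text, counts).

    Occurrences already preceded by <lat> are left alone, so the change
    can be run twice without double wrapping."""
    counts = {}
    for phrase in PHRASES:
        pattern = re.compile(r"(?<!<lat>)<ab>%s</ab>" % re.escape(phrase))
        text, n = pattern.subn("<lat><ab>%s</ab></lat>" % phrase, text)
        counts[phrase] = n
    return text, counts

sample = "<ls>SP</ls> 〔1.6〕 <ab>et passim</ab>; <ab>passim</ab>; <lat><ab>ibidem</ab></lat>"
new_text, counts = wrap_latin_abbreviations(sample)
```

The per-phrase counts returned by the function can be compared with the figures listed above (6, 83, 25, 22) as a sanity check.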

funderburkjim commented 11 months ago

comment on the meaning of the `<lat>` tag

The <ger> and <fr> and <tib> tags in bhs mean that

the enclosed text is from the given language,
AND (usually, I think) the text is a meaning (sense, gloss) of some Sanskrit text.

E.g. under aYja ,

Addressed by Brahmā to the Buddha, urging him to preach the law; 
presumed to mean perhaps {%come on!%} 
But <lang>Tib.</lang> seems to have had a quite different reading: 
<tib>kha ḥbyed pa</tib>,  << a gloss in Tibetan language
{%mouth open%}               
(<ls>Foucaux</ls>, {%<fr>ouvre ta bouche</fr>%};   << a gloss in French language

But for Latin, the text might NOT be a gloss of other text, but rather a 'meta' comment in Latin that the author expects to be understood by a classically educated European reader. For instance, under headword arhant,

the ideal personage in Hīnayāna Buddhism, 
fourth and last stage in religious development (see {@srota-āpanna@}), 
<ls>SP</ls> 〔1.6〕 
<lat><ab>et passim</ab></lat>

'et passim' is NOT a Latin gloss of some Sanskrit text, but rather a comment in Latin which probably means that there are several instances of arhant in SP (Saddharmapuṇḍarīka) in addition to the one at location '1.6'.

In such a circumstance, it seems appropriate to provide a tooltip for the Latin text. Note that 'et passim' has been marked as both Latin and an abbreviation (for tooltip); I think the lat markup is not needed, but it does no particular damage to be present.

Another observation shows that sometimes at least the author considers a latin text to actually be part of his English:

like <lang>Eng.</lang> {%<lat>et cetera</lat>%}   
    << author considers 'et cetera' English!

end of rant!

funderburkjim commented 11 months ago

Here is Odile's reply:

Hello Jim, nice to hear from you. Of course like you thought, we don't use any special unicode characters (such as Latin Small Letter D With Caron (U+010F)) including the apostrophe, maybe your colleague is influenced by the generalization of use of ligatures in Indian scripts. As ligature we use only œ I think. (I am not sure "ligature" is the proper English word)
According to French MWord, the unicode for the apostrophe is 2019. The 0027 is the one we have access to directly on the keyboard (like here, ') but actually is not the French one, which should be inclined to the right.
So:
"U+2019 (French GUILLEMET APOSTROPHE; Engl. RIGHT SINGLE QUOTATION MARK)"

note that in French we "make the apostrophe by writing from the upper right to the lower left, ... basically backwards from the way we do it in the US."
(after reading that I understood why your apostrophes in this email were all looking strange to me, as i was used to see only vertical ones in English text)

I note that in your example " cʼest ", you use the 02BC code, but this is not the code to be used in French (in brief, because "c'est" is not a single word). (And also, contrary to what the blog you mentioned says, "c" in "c'est" stands for "cela", but that is a detail she probably didn't want to mention, to make things simple.)

About, "OK, @funderburkjim; now, I see that Google shows plenty of french pages with d̕ etc. [LATIN SMALL LETTER D WITH COMMA ABOVE RIGHT, 0064 + 0315]."

I think this comes from the digitization software, which is international, and not from a specific choice.

funderburkjim commented 11 months ago

Also see https://en.wikipedia.org/wiki/Right_single_quotation_mark.

My conclusion: a separate apostrophe (such as in cʼest) is what we should use. The particular unicode code point to use for this apostrophe in French is not definite.

Currently, we are using u02bc for the apostrophe (1523 instances) in bhs, both within French and English.
I think Odile is preferring u+2019 apostrophe for French. Nevertheless, I think it is ok for us to use u02bc throughout, including for French. Obviously this is a subtle (and relatively small) point, probably with no universally accepted answer.

Let's keep the apostrophe with u02bc. @Andhrabharati can you agree?
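For reference, a small sketch of how the apostrophe-like code points in a text could be tallied, using only the standard `unicodedata` module. The function name `apostrophe_census` and the sample string are hypothetical.

```python
import unicodedata
from collections import Counter

# The three code points discussed in this thread.
APOSTROPHES = ("\u0027", "\u02bc", "\u2019")

def apostrophe_census(text):
    """Map 'U+XXXX NAME' -> number of occurrences in text."""
    counts = Counter(ch for ch in text if ch in APOSTROPHES)
    return {"U+%04X %s" % (ord(ch), unicodedata.name(ch)): n
            for ch, n in counts.items()}

# Toy sample mixing the three characters.
sample = "c\u02bcest pourquoi, d\u0027ici, l\u2019un, c\u02bcest"
census = apostrophe_census(sample)
```

A census like this only reports what is present; it does not decide which character is right for French.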

Andhrabharati commented 11 months ago

I do agree, @funderburkjim !

I have some reservation in using the right_single_quotation_mark, as it conflicts with my matching_pairs 'logic' (for the same reason, I had resorted to the '〉' in place of closing parenthesis mark ')' though it is present thus in the print).

Let's stick to the u02bc, as I did in GRA, pw set etc. recently.

funderburkjim commented 11 months ago

Great!

further revisions

I think all the items mentioned in comment above have been handled. See changes_bhs_ab_3a.txt for details of the changes made to bhs.ab.3.

@Andhrabharati Please check if I've missed anything.

funderburkjim commented 11 months ago

Please note correction made to documentation 'changes_bhs_ab_3a' for kaqambA.

Andhrabharati commented 11 months ago

At L-3163 and L-10517, the tagging you used makes "so v.a." also to be italic, which is not the case as per the print notation; whereas my marking conforms with the print.

Is there any reason behind your marking thus?

funderburkjim commented 11 months ago

Apparently I was careless. Have updated changes_bhs_ab_3a.txt. These changes are reflected in local installation displays (not yet in Cologne displays).

Andhrabharati commented 11 months ago

Good; so, it's time to close this issue?

funderburkjim commented 11 months ago

Revisions now installed at Cologne. Closing issue.

Andhrabharati commented 11 months ago

@funderburkjim

You need to update the meta2 file in the download sets.

Just seen that the details on it are somewhat obsolete now, namely the tag counts and some of the extended characters/counts.

gasyoun commented 11 months ago

Tibetan language

So many, would never expect
