Closed funderburkjim closed 11 months ago
@Andhrabharati Here are words in bhs.txt that are
<ger>X</ger>
Jmd (3 times) -- probably German
cucullus Latin?
trāyin Sanskrit?
admonitio Latin?
admonere Latin?
Śakti Sanskrit?
urgā Sanskrit?
balustrade English
The remaining 164 words in <ger>X</ger>
were found to be German.
Instances of Jmd:
hw: Avarjayati
sich Jmd geneigt machen, für sich gewinnen
hw: pariBAzati
Jmd zusprechen, zureden, admonere
hw: saMpratigraha
gute Aufnahme, Vorliebe für Jmd
@maltenth or @fxru Can you tell what 'Jmd' means in these? Is it a German word ?
Jmd is an abbr. for Jemand (= someone, somebody), and it is a German word.
Also I had tried to mark the Tibetan text-strings within italics. Similar exercise was started for French and German text-strings, but is not done fully yet. If this markup makes some sense (and has any benefit), we can resume this part and complete in a short time.
This is what I had mentioned about the fr- and ger- marking (limited to italic strings) at the very beginning.
A few possible problems with this markup have been noticed by random observation, e.g., under Anantarya, unmittelbare Folge should be marked with <ger>.
I knew very well that more italic strings need to be marked yet; and I've noticed quite a few non-italic strings as well, that belong to other languages. Hence was my request to you to try the programmatic approach using the spellchecker lists.
"I see that you had used some word-lists of German and French, in this BHS repo for some analytics."
a programmatic approach to mark the french and german words (using these lists) in the BHS.txt?
Here I meant continue marking the words using the spellchecker_french.txt and spellchecker_german.txt under the eng_error_lang folder.
But, I presume that you now have started checking the words I had marked so far; I have marked the full italic string as a single lang., though it contained other language words, say like the ones you have pointed above [cucullus, admonitio, admonere : latin; balustrade : english; and trāyin, Śakti, Durgā (not urgā) : Sanskrit.]
Work done in issues/issue3 directory.
@Andhrabharati anything else to do regarding this issue?
@funderburkjim
I had looked at the readme file and then the change_1_ger file.
I think, your work is apparently limited to the italic strings alone (as seen in the change_1_ger file). Then, I made several quick checks and found more strings that were 'un-caught' by you. some more ger markings.txt
So, the "MarvinJWendt" word-list is also not quite complete, like the spellchecker list!!
This is just the result of a quick search and, I think, more could be lying in the text. [I haven't yet looked at your change_2_fr file, but most probably that also would be in the similar state as the change_1_ger file.]
So, you need to "traverse" in some different path to identify the words fully [I cannot and should not dare giving you tips and tricks!!], or leave the task to me to take up at sometime later.
Found some more french/german italicized text. See check3a_edit.txt.
This is based on examination of italicized text containing non-English word(s): check3.txt.
@Andhrabharati Have I missed any?
@funderburkjim
Glad that my post is taken by you in good spirit; I had felt later that my wordings are somewhat in 'negative shade'. [Let me also have a closer look for the ger & fr words in the italic portion once.]
I guess, there could be few more botanical (latin) names (you had listed/marked 10 now).
And, pl. do the similar exercise with non-italic text too, for completeness.
And, would you pl. post your latest file?
Just found that you had got werden in line 5531- {%beklommen werden%}, but missed it in line 55371- {%werden%}!
recueillements is marked at {%<fr>recueillements</fr>%}, but is it so in {%abstract meditations, trances, <fr>recueillements</fr>?%} also?
And did you get {%<fr>par connexion</fr>%} at line 70249, where {%<fr>en soi</fr>%} is marked?
I guess, there could be few more botanical (latin) names (you had listed/marked 10 now).
Just an example-- You had marked Agati grandiflora, but left Aeschynomene grandiflora in the same line (43471).
And let's have these marked with <bot> & <zoo> tags (as the case may be), and not with <lat>, as at other CDSL works.
See 'Additional changes' in check3a_edit.txt. These were missed by me in first review of check3.txt.
{%Aeschynomene grandiflora%} was not flagged in check3.txt because the English word list I used had both these words!
For a similar reason, 'par connexion' was not flagged since both words appear in the english word list.
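A minimal sketch of the kind of check described here, assuming italic text is delimited by `{%...%}` as in bhs.txt. The function name `flag_italics`, the tiny word list and the sample lines are hypothetical and are not taken from the actual check3 program. It shows why 'par connexion' slips through: an italic span is flagged only when at least one of its words is absent from the English list.

```python
import re

ITALIC = re.compile(r"\{%(.*?)%\}")   # italic spans, e.g. {%come on!%}
TAG = re.compile(r"<[^>]+>")          # any inline tag such as <fr> or </fr>
WORD = re.compile(r"[^\W\d_]+")       # runs of (Unicode) letters

def flag_italics(lines, english_words):
    """Yield (line_number, italic_text, unknown_words) for each italic span
    that contains at least one word not found in english_words."""
    for lineno, line in enumerate(lines, start=1):
        for m in ITALIC.finditer(line):
            text = TAG.sub(" ", m.group(1))
            unknown = [w for w in WORD.findall(text)
                       if w.lower() not in english_words]
            if unknown:
                yield lineno, m.group(1), unknown

# Hypothetical miniature English word list and sample lines.
english = {"par", "connexion", "come", "on"}
sample = ["{%par connexion%} and {%unmittelbare Folge%}", "{%come on!%}"]
flags = list(flag_italics(sample, english))
```

With this sample only the German span is reported; `{%par connexion%}` passes silently because both of its words are in the English list, which is the blind spot discussed above.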
Yes, I've seen some more words being in English borrowed from other languages 'as is', and it is debatable whether to mark them as the 'parent' language words!!
One way that I feel a sure 'proper' manner is to decide by the context-- if occurring in the other language work (identifiable by the author's name and/or the work), it could be treated as the foreign 'parent' word.
I thought I should do the tagging again, and here are the names with their counts [unique (total)]--
Note esp. that the ls, ab and lang tags are now increased further.
Please upload your bhs_ab_2 version so I can resolve the differences. E.g., my latest version has 309 `fr` tags, compared to your 314.
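A rough sketch of how 'unique (total)' counts per tag could be computed so that two versions can be compared. The tag list and the function name `tag_counts` are assumptions made for illustration. A tag nested inside another matched tag (such as `<ab>` inside `<ger>`) is not counted separately by this simple regex.

```python
import re
from collections import defaultdict

# Tags of interest; the back-reference \1 requires a matching close tag.
TAGGED = re.compile(r"<(fr|ger|lat|tib|bot|zoo|ab|ls|lang)\b[^>]*>(.*?)</\1>")

def tag_counts(text):
    """Return {tag: (unique_count, total_count)} for the tagged strings."""
    seen = defaultdict(list)
    for m in TAGGED.finditer(text):
        seen[m.group(1)].append(m.group(2))
    return {tag: (len(set(vals)), len(vals)) for tag, vals in seen.items()}

# Hypothetical sample text.
sample = ("<fr>en soi</fr> ... <fr>en soi</fr> "
          "<fr>par connexion</fr> <ger>Art</ger>")
counts = tag_counts(sample)
```

Running this on both files and comparing the two dictionaries would show where the totals (e.g. for `fr`) diverge.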
Here it is, @funderburkjim -- BHS-AB_2.zip
And, pl. be noted that I have done some addl. corrections too, apart from updating the taggings.
Thanks. I'll focus on the tag counts of your table for now.
Once you are done with this phase, pl. post your file, and probably close the issue.
Then I can take-up resolving the (latest) unidentified (or doubtful) ab- and ls- tags [as updated by you, using my AB_2 file], in another issue.
temp_bhs_ab_3.zip contains the end result.
Work done in compare sub-directory.
Generally, the abbreviation markup changes of bhs.ab.2 were accepted; My additional changes (of temp_bhs_ab_3.txt) are documented in changes_bhs_ab_3.txt.
After resolving the abbreviation changes, I also identified and applied the remaining differences. These are documented in compare_texts_notes.txt.
temp_bhs_ab_3.txt is now the latest csl-orig version for bhs, and is the basis of the displays.
The 'tooltip' files (for general abbreviations and literary source abbreviations) were also modified to be consistent with temp_bhs_3.txt markup. Versions with 'counts' are
Many of these (esp. for ls) are currently only 'placeholders', with '?' as the tooltip. These need to be resolved. I'll open another issue for this tooltip revision.
@Andhrabharati If you accept temp_bhs_ab_3.txt, we can close this issue 3.
Generally, the abbreviation markup changes of bhs.ab.2 were accepted; My additional changes (of temp_bhs_ab_3.txt) are documented in changes_bhs_ab_3.txt.
------------------------------------------------------------
CHANGES for tags other than '<ls>'
------------------------------------------------------------
See how odd the modifier apostrophe looks at these places (of course, this is a font dependent issue!); we never see such forms in any french print! The caron-forms are what are seen in print.
As such, I suggest using ď (U+010F), ľ (U+013E) and Ľ (U+013D) at these places.
-----------------------
AB: <fr>a fortiori</fr>
-> <lat>a fortiori</lat>
-----------------------
<L>1171<pc>040,1<k1>antaHSalya
old: <ger>inner dart</ger>
new: inner dart
AB: agreed, I had erroneously marked this as german.
-----------------------
<L>3163<pc>115,1<k1>indrapawa
old: <ger>so <ab>v.a.</ab></ger>
new: so <ab>v.a.</ab>
<L>10517<pc>386,1<k1>pravicAraRa
old: <ger>so <ab>v.a.</ab></ger>
new: so <ab>v.a.</ab>
This is purely a German form [occurred 4500+ times in pwk and 6500+ times in PWG], and I suggest changing both the places where it occurred thus (which were picked up from the resp. German Wörterbuch)--
<L>3163<pc>115,1<k1>indrapawa
new: <ger>{%Luftgewand%}, so <ab>v.a.</ab> {%Nacktheit%}</ger>
<L>10517<pc>386,1<k1>pravicAraRa
new: ‘<ger>{%Unterscheidung%}, so <ab>v.a.</ab> {%Art%}</ger>’
PS. The expansion of <ab>v.a.</ab> may be seen in the tagcount_ab file in the other issue (#4).
-----------------------
<L>6914<pc>251,2<k1>tAyin
old: <ger>wohl nur fehlerhaft für</ger> trāyin
new: <ger>wohl nur fehlerhaft für trāyin</ger>
AB: agreed
-----------------------
<L>9612<pc>347,2<k1>purasta ? Cannot find Ledder as German word
Ledder is a Low German form , and I find this https://wordsense.eu site quite useful in identifying the words and languages.
-----------------------
<L>10125<pc>369,1<k1>prativiza
wolfsbane is English common name of plant old:wolfsbane new: wolfsbane
AB: agreed
-----------------------
global change
<lat>ibidem</lat> -> <ab>ibidem</ab> 6
<lat>et alibi</lat> -> <ab>et alibi</ab> 83
<lat>et passim</lat> -> <ab>et passim</ab> 25
<lat>passim</lat> -> <ab>passim</ab> 22
Firstly, this list has missed <lat>et cetera</lat> 3, <lat>ipso facto</lat> 2 and <lat>vice versa</lat> 9, which are also of the same nature.
I suggest retaining all these with lat-tagging; these are all latin phrases (that were brought into English language as is), not abbr.s in any manner. I had followed the point that I mentioned above in marking these thus.
-----------------------
<L>4564<pc>171,2<k1>kalambukA
old: {%convolvulus repens?%}
new: {%<bot>convolvulus repens?</bit<%}
AB: agreed; and as I do in manual marking, you had also erred here: </bit< !
-----------------------
<L>5940<pc>219,1<k1>grAmeluka
old: <lang n="Māgadhi">Mg.</lang>
new: <ab n="Māgadhi">Mg.</ab>
Reason: cdsl interprets <lang n="X">Y</lang> : Y is text in language X
I had earlier marked it properly as <lang>Mg.</lang>, it being a language (listed by Edgerton himself), but it had conflicted with <ab>Mg.</ab> (that denotes 'Meaning').
BTW, just noticed that I had missed the ending letter long ī at this tagging.
I think it is appropriate to mark it somehow as a language; but is not a big deal to break the heads over.
------------------------------------------------------------
CHANGES FOR <ls>
------------------------------------------------------------
<L>6220<pc>229,1<k1>cAru
old: Caraka
new: <ls>Caraka</ls>
AB: disagree; here 'Caraka' is not referring to the legendary proponent of Ayurveda (that is ls-tagged), but to some king. No tagging needed here.
-----------------------
<L>3933<pc>149,1<k1>ullumpati
old: <ls>BR.</ls>
new: <ls>BR</ls>
[Same for the next two as well; so, not elaborating them.]
AB: not a big point to disagree; but just like to mention that there are many cases of ab- and ls- entities occurring with and without a dot followed throughout the text. We should treat it as the author's style, instead of trying to 'normalise' them!
-----------------------
<L>11786<pc>424,1<k1>mahABIzma
old: <ls>Mahāsamāj., [Page424-b] Waldschmidt, Kl. Skt. Texte 4</ls>
new: [Page424-b] <ls>Mahāsamāj., Waldschmidt, Kl. Skt. Texte 4</ls>
<L>3858<pc>146,1<k1>upAnaha
old: <ls n="Śāṅkh. Gṛhy. Sūt.">ŚGS</ls>
new: <ls>ŚGS</ls>
Note: 'n' attribute serves another purpose for 'ls' element
AB: agree for these two changes.
Re French apostrophe
I disagree with use of Latin Small Letter D With Caron (U+010F), and the other two.
I think some form of apostrophe should be used after d, l, and L in French, just as it is used in cʼest several times in bhs.
Ref: https://www.frenchtoday.com/blog/french-pronunciation/elision/
The U+010F has some other purpose, I think. Literally meaning ‘little hook,’ caron (č) represents a rising tone (Ref). This seems to be used in Czech and Slovak, not French.
Similarly, https://en.wiktionary.org/wiki/%C4%BE says that ľ Latin Small Letter L With Caron (U+013E) is in Slovak alphabet.
Also see 'Orthography' section of https://en.wikipedia.org/wiki/Slovak_language for this character.
I think in our work, the ʼ U+02BC MODIFIER LETTER APOSTROPHE is used in cʼest and also this is the apostrophe used for many (1500+) other purposes.
OK, @funderburkjim; now, I see that Google shows plenty of french pages with d̕ etc. [LATIN SMALL LETTER D WITH COMMA ABOVE RIGHT, 0064 + 0315].
Of course, Unicode chart itself recommends using 02BC instead of this--
Why not approach someone more knowledgeable in French to confirm and conclude the matter, say Sampada or Odile?
Have sent email requesting help from Odile:
Hi, Odile --
My colleague Andhrabharati and I have been working on the Cologne digitization of
the Buddhist Hybrid Sanskrit Dictionary. This dictionary has words in many languages,
including French. And we need your opinion as French-language expert!
Maybe I can phrase the question as: How is the apostrophe typically entered in French ?
For example, in this fragment from Burnouf, `d'ici. || c'est pourquoi` we have used the
simple apostrophe character `D APOSTROPHE I C I` .
Is this the common practice with French text which we should follow?
Or are there special unicode characters (such as Latin Small Letter D With Caron (U+010F))
that we should use.
By the way, here is part of the BHS discussion: https://github.com/sanskrit-lexicon/BHS/issues/3#issuecomment-1686697683
In the cdsl versions of Burnouf and Stchoupak, the simple apostrophe U+0027 is used.
e.g., d'ici. || c'est pourquoi in BUR under headword atas.
Odile has contributed extensively to these digitizations.
Here is a way to retain the <lat> tag, and also get tooltips (for these, I think the tooltip is normally important):
We have used similar coding <ger>... <ab>v.a.</ab> ... </ger>.
global change
With the changes below, the display:
a) prints the text in the 'language' color (brown)
b) provides a tooltip (from bhsab_input)
<ab>ibidem</ab> -> <lat><ab>ibidem</ab></lat> 6
<ab>et alibi</ab> -> <lat><ab>et alibi</ab></lat> 83
<ab>et passim</ab> -> <lat><ab>et passim</ab></lat> 25
<ab>passim</ab> -> <lat><ab>passim</ab></lat> 22
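A hedged sketch of how such a global change could be applied and counted. The function name `wrap_ab_in_lat` and the sample text are hypothetical, and this is not the script actually used for bhs.txt. The negative lookbehind skips occurrences that are already wrapped, so the change can be re-run safely.

```python
import re

PHRASES = ["ibidem", "et alibi", "et passim", "passim"]

def wrap_ab_in_lat(text, phrases=PHRASES):
    """Wrap <ab>PHRASE</ab> in <lat>...</lat>; return (new_text, counts)."""
    counts = {}
    for phrase in phrases:
        # (?<!<lat>) : do not touch an <ab> that already follows <lat>.
        pattern = re.compile(r"(?<!<lat>)<ab>" + re.escape(phrase) + r"</ab>")
        text, n = pattern.subn("<lat><ab>" + phrase + "</ab></lat>", text)
        counts[phrase] = n
    return text, counts

# Hypothetical sample line.
sample = "<ls>SP</ls> 1.6 <ab>et passim</ab>; <ab>passim</ab>; <ab>ibidem</ab>"
changed, counts = wrap_ab_in_lat(sample)
again, recounts = wrap_ab_in_lat(changed)   # second run changes nothing
```

The per-phrase counts returned here correspond to the numbers listed beside each change, which makes it easy to confirm that every expected instance was touched.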
<lat> tag
The <ger>, <fr> and <tib> tags in bhs mean that the enclosed text is a gloss in that language.
E.g. under aYja ,
Addressed by Brahmā to the Buddha, urging him to preach the law;
presumed to mean perhaps {%come on!%}
But <lang>Tib.</lang> seems to have had a quite different reading:
<tib>kha ḥbyed pa</tib>, << a gloss in Tibetan language
{%mouth open%}
(<ls>Foucaux</ls>, {%<fr>ouvre ta bouche</fr>%}; << a gloss in French language
But for Latin, the text might NOT be a gloss of other text, but rather a 'meta' comment in Latin that the author expects to be understood by a classically educated European reader. For instance, under headword arhant,
the ideal personage in Hīnayāna Buddhism,
fourth and last stage in religious development (see {@srota-āpanna@}),
<ls>SP</ls> 〔1.6〕
<lat><ab>et passim</ab></lat>
'et passim' is NOT a Latin gloss of some Sanskrit text, but rather a comment in Latin which probably means that there are several instances of arhant in the SP (Saddharmapuṇḍarīka) reference in addition to the one at location '1.6'.
In such a circumstance, it seems appropriate to provide a tooltip for the Latin text. Note that 'et passim' has been marked as both Latin and an abbreviation (for the tooltip); I think the lat markup is not needed, but it does no particular damage to be present.
Another observation shows that sometimes at least the author considers a latin text to actually be part of his English:
like <lang>Eng.</lang> {%<lat>et cetera</lat>%}
<< author considers 'et cetera' English!
end of rant!
Here is Odile's reply:
Hello Jim, nice to hear from you. Of course like you thought, we don't use any special unicode characters (such as Latin Small Letter D With Caron (U+010F)) including the apostrophe, maybe your colleague is influenced by the generalization of use of ligatures in Indian scripts. As ligature we use only œ I think. (I am not sure "ligature" is the proper English word)
According to French MWord, the unicode for the apostrophe is 2019. The 0027 is the one we have access to directly on the keyboard (like here, ') but actually is not the French one, which should be inclined to the right.
So:
"U+2019 (French GUILLEMET APOSTROPHE; Engl. RIGHT SINGLE QUOTATION MARK)"
note that in French we "make the apostrophe by writing from the upper right to the lower left, ... basically backwards from the way we do it in the US."
(after reading that I understood why your apostrophes in this email were all looking strange to me, as i was used to see only vertical ones in English text)
I note that in your example " cʼest ", you use the 02BC code, but this is not the code to be used in French (in brief, because "c'est" is not a single word). (and also contrary to what the blog you mentioned says, "c" in "c'est" is standing for "cela", but that is a detail she probably didn't want to mention, to make things simple)
About, "OK, @funderburkjim; now, I see that Google shows plenty of french pages with d̕ etc. [LATIN SMALL LETTER D WITH COMMA ABOVE RIGHT, 0064 + 0315]."
I think this comes from the digitization software, which is international, and not from a specific choice.
Also see https://en.wikipedia.org/wiki/Right_single_quotation_mark.
My conclusion: a separate apostrophe (such as in cʼest) is what we should use. The particular unicode code point to use for this apostrophe in French is not definite.
Currently, we are using u02bc for the apostrophe (1523 instances) in bhs, both within French and English.
I think Odile is preferring u+2019 apostrophe for French.
Nevertheless, I think it is ok for us to use u02bc throughout, including for French.
Obviously this is a subtle (and relatively small) point, probably with no universally accepted answer.
Let's keep the apostrophe with u02bc. @Andhrabharati can you agree?
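A small sketch of how the apostrophe variants could be counted and, if wanted, unified to U+02BC. The helper names `count_apostrophes` and `normalize_elisions` are assumptions, not existing tools in the repository. Only apostrophes standing between two letters are replaced, so a closing quotation mark ’ that pairs with ‘ is left alone.

```python
import re
from collections import Counter

APOSTROPHES = {
    "\u0027": "APOSTROPHE",
    "\u2019": "RIGHT SINGLE QUOTATION MARK",
    "\u02bc": "MODIFIER LETTER APOSTROPHE",
}

def count_apostrophes(text):
    """Count each apostrophe-like code point in text."""
    return Counter(ch for ch in text if ch in APOSTROPHES)

def normalize_elisions(text, target="\u02bc"):
    """Replace U+0027 / U+2019 by target when flanked by letters (d'ici)."""
    return re.sub(r"(?<=[^\W\d_])[\u0027\u2019](?=[^\W\d_])", target, text)

# Hypothetical sample mixing all three code points.
sample = "d'ici. || c\u2019est pourquoi, c\u02bcest"
before = count_apostrophes(sample)
after = count_apostrophes(normalize_elisions(sample))
```

Counting before and after gives a quick check that no stray U+0027 or U+2019 elisions remain once the text is unified.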
I do agree, @funderburkjim !
I have some reservation in using the right_single_quotation_mark, as it conflicts with my matching_pairs 'logic' (for the same reason, I had resorted to the '〉' in place of closing parenthesis mark ')' though it is present thus in the print).
Let's stick to the u02bc, as I did in GRA, pw set etc. recently.
Great!
I think all the items mentioned in comment above have been handled. See changes_bhs_ab_3a.txt for details of the changes made to bhs.ab.3.
@Andhrabharati Please check if I've missed anything.
Please note correction made to documentation 'changes_bhs_ab_3a' for kaqambA.
At L-3163 and L-10517, the tagging you used makes "so
Is there any reason behind your marking thus?
Apparently I was careless. Have updated changes_bhs_ab_3a.txt. These changes reflected in local installation displays (not yet in Cologne displays):
Good; so, it's time to close this issue?
Revisions now installed at Cologne. Closing issue.
@funderburkjim
You need to update the meta2 file in the download sets.
Just seen that the details on it are somewhat obsolete now, namely the tag counts and some of the extended characters/counts.
Tibetan language
So many, would never expect
The revisions to bhs.txt discussed in #1 provide markup which identifies the language of various phrases. The summary (from bhs-meta2.txt) is
In the displays, such text is shown in 'brown' color, and marked with tooltip (e.g. 'French language' for <fr>X</fr>).
In #1, @Andhrabharati suggested
A few possible problems with this markup have been noticed by random observation, e.g., under Anantarya, unmittelbare Folge should be marked with <ger>
This issue is opened as a reminder of this idea to enhance the bhs digitization.