Closed Andhrabharati closed 1 year ago
The metalines have quite a few entries with = in the k2-field.
These need to be appropriately corrected.
The abbr and ls list is missing in the digitisation.
The same is done now (from the part-1 of the work) and posted here. BHS Grammar_Front pages.txt
Many entries seem to have composite words listed under the entry, without any such notation/marking.
All those composite words could be made "prominent".
Just as in BUR, many grouped entries (OR, AND, ALSO, CSV lists) are seen in this BHS as well, which could appropriately be "handled" as in MW.
Now I've started doing all these and many more corrections/changes in the BHS text.
Shall be posting the file, once I am done with the work.
Sterling
Good catch
Done, @Andhrabharati ?
@funderburkjim
Here is my version of BHS, marking various tags like abbr.s, acr.s and ls entities.
[I guess you would find this file usable at CDSL as is.]
Also I had tried to mark the Tibetan text-strings within italics. Similar exercise was started for French and German text-strings, but is not done fully yet. If this markup makes some sense (and has any benefit), we can resume this part and complete in a short time.
PS. I had earlier posted above the abbr.s and ls entities listed in the first volume of Edgerton (The BHS Grammar), which is common with the second volume (The BHS Dictionary) as well.
@Andhrabharati What is 'acr' element ?
466 matches in 427 lines for "
acronym.
I had marked the words that appeared as "informal" type (non-standard ?) as acronyms, though I know that they aren't acronyms "by definition".
<ms> is Manuscript?
Yes.
<!ELEMENT tib (#PCDATA) > <!-- Tibetan text, bhs-->
<!ELEMENT ger (#PCDATA) > <!-- German text, bhs-->
<!ELEMENT fr (#PCDATA | i | ab)* > <!-- French text, bhs-->
<!ELEMENT ed (#PCDATA) > <!-- edition bhs-->
<!ELEMENT ms (#PCDATA) > <!-- manuscript bhs-->
<!ELEMENT lat (#PCDATA) > <!-- latin text bhs-->
<!ELEMENT toch (#PCDATA) > <!-- Tocharian text bhs-->
<!ELEMENT acr (#PCDATA) > <!-- acronym bhs-->
NOTE: removed per discussion below
There are 6 cases where a given (global) abbreviation is marked in two ways in bhs
check_dups found dup abbrev: "acc."
OLD ab acc.::932
NEW lex acc.::615
check_dups found dup abbrev: "accs."
OLD ab accs.::1
NEW lex accs.::6
check_dups found dup abbrev: "conj."
OLD ab conj.::1
NEW lex conj.::4
check_dups found dup abbrev: "derivs."
OLD ab derivs.::1
NEW lex derivs.::24
check_dups found dup abbrev: "n."
OLD ab n.::6542
NEW lex n.::45
check_dups found dup abbrev: "nom."
OLD ab nom.::2
NEW lex nom.::110
check_dups found 6 duplicate abbrevations
Should the file be changed to use just one markup?
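The duplicate report above could be reproduced along the following lines. This is only a minimal sketch, not the actual check_dups program: the function name `find_dup_abbrevs`, the tag list, and the sample text are assumptions for illustration, and tags carrying attributes (such as <ab n="...">) are deliberately not matched.

```python
import re
from collections import Counter, defaultdict

# Tags assumed to carry abbreviations; attribute-bearing tags are not matched here.
TAG_RE = re.compile(r"<(ab|lex|lang)>([^<]*)</\1>")

def find_dup_abbrevs(text):
    """Return {abbrev: Counter({tag: count})} for abbreviations marked with more than one tag."""
    counts = defaultdict(Counter)
    for tag, abbrev in TAG_RE.findall(text):
        counts[abbrev][tag] += 1
    return {abbrev: by_tag for abbrev, by_tag in counts.items() if len(by_tag) > 1}

# Hypothetical sample text, not taken from bhs.txt.
sample = "<ab>acc.</ab> to X; <lex>acc.</lex> sg.; <ab>n.</ab> of a king; <lex>m.</lex>"
dups = find_dup_abbrevs(sample)
for abbrev, by_tag in sorted(dups.items()):
    print(abbrev, dict(by_tag))
```

An abbreviation reported this way is not necessarily an error: as the replies below note, the two markups can stand for two different expansions.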
Should the file be changed to use just one markup?
I did this markup long back, before I ventured upon looking FULLY at any other work; in recent times, I started marking the abbr.s as local and global types, if they have different expansions.
OLD ab acc.::932 --> this denotes according NEW lex acc.::615 --> this denotes accusative
OLD ab accs.::1 --> this can be changed to lex type NEW lex accs.::6
OLD ab conj.::1 --> this can be changed to lex type NEW lex conj.::4
OLD ab derivs.::1 --> this can be changed to lex type NEW lex derivs.::24
OLD ab n.::6542 --> this denotes name NEW lex n.::45 --> this denotes nominative
[But, the n. in <ab>n. act.</ab>, <ab>n. ag.</ab>, <ab>n. pl.</ab>, <ab>n. pr.</ab>, <ab>n. sg.</ab> is to be expanded as nominative.]
OLD ab nom.::2 --> this can be changed to lex type NEW lex nom.::110
@funderburkjim
Just noted that this repo has "eng_error_lang" folder having german, french and tibetan words listed in separate files. [I did my markups independently earlier.]
Thinking of resuming the <fr> and <ger> markup using your lists, I had a quick look at the above 3 files.
I see that the French list has für 5 (which is a German word and is in the German list as well), and svastika 1 (which is a Sanskrit word).
My Tibetan marking covered many more words than are listed in your file.
Do you see any point in continuing this markup, as I had asked above?
My Tibetan marking covered many more words than are listed in your file.
Where is yours?
Inside the file.
Work materials present in the issues/issue1/ folder.
The displays for bhs have been modified to utilize the various tooltips. Current dev1 version of displays at https://sanskrit-lexicon.uni-koeln.de/work/bhs-dev/dev1/web/
The main thing I had to do was match the abbreviation material of the BHS front matter with the markup of the revised bhs.txt. Two files show the matching:
<ab>, <lex>, <lang> tags. In the displays, the lex and lang tag tooltips are considered part of the general abbreviations 'ab' tag tooltips. These files are tsv (tab-separated-values) files with 3 fields:
{AbhidhK.} means use tooltip for abbreviation AbhidhK., which is "Abhidharmakośa, transl. LaVallée Poussin ...".
The main thing remaining to be done is to examine the tips with a '?'. There are many (450+) of these abbreviations for ls, although these represent only 1150+ instances in bhs; by contrast, 47115 ls instances have an assigned tooltip -- so these unassigned abbreviations represent less than 2% of the instances.
@Andhrabharati If you decide to examine those '?', the simplest way to communicate the results back to me would be for you to edit (copies of) the match_X_final.txt files, from which a program extracts the tooltip file used in displays.
@Andhrabharati If you make revisions in your bhs.txt digitization, please take into account the changes which I have made in the version you uploaded. These changes are in file change_1_ab.txt (23 lines changed). A small number of possible changes are mentioned at 'Possible mis-labeling' in litsrc/readme.txt.
unassigned abbreviations represent less than 2% of the instances.
Not a big issue after all.
@Andhrabharati If you decide to examine those '?', the simplest way to communicate the results back to me would be for you to edit (copies of) the match_X_final.txt files, from which a program extracts the tooltip file used in displays.
Sure @funderburkjim, I can have a look at those, but probably after a day or two more. (Currently I am making the PWG in the same format as pwk and pwkvn, so that all three could go together "as a set" in a uniform manner; at the moment these three are in three different formats!)
unassigned abbreviations represent less than 2% of the instances.
Not a big issue after all.
You are wrong @gasyoun !
Jim didn't put his statement properly to 'reach' you. 450 (though many are just variants of a few) items [leave aside their occurrences!!] out of 1026 is NOT a small percentage.
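The two views can be put side by side with a little arithmetic. The sketch below uses only the figures quoted in this thread, treating "450+" and "1150+" as lower bounds; the variable names are illustrative.

```python
# Figures quoted in the comments above ("+" values taken as lower bounds).
unassigned_instances = 1150   # ls instances whose tooltip is still '?'
assigned_instances = 47115    # ls instances with an assigned tooltip
unassigned_abbrevs = 450      # distinct ls abbreviations with a '?' tooltip
total_abbrevs = 1026          # distinct ls abbreviations, per the comment above

instance_share = unassigned_instances / (unassigned_instances + assigned_instances)
abbrev_share = unassigned_abbrevs / total_abbrevs

print(f"unassigned share of instances:     {instance_share:.1%}")
print(f"unassigned share of abbreviations: {abbrev_share:.1%}")
```

On these numbers the unassigned tooltips cover roughly 2.4% of the ls instances but about 44% of the distinct abbreviations, which is why the issue looks small by one measure and large by the other.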
A small number of possible changes are mentioned at 'Possible mis-labeling' in litsrc/readme.txt.
Here are my responses point-wise--
1 <lang>JM</lang> -> <ls>JM</ls> Jaina Māhārāṣṭrī.
41 <lang>JM.</lang> -> <ls>JM.</ls> Jaina Māhārāṣṭrī.
;;AB this is language only, being the Jaina variant of the Māhārāṣṭrī Prākrit.
3 <lang>Mg.</lang> -> <ab>Mg.</ab> meaning
;; AB under L-5940, this indicates the language Māgadhi; while at the other two places it is <ab>Mg.</ab>: meaning, which could be made as <ab>mg.</ab> (1000+ times)
2 <ab>Bodhicaryāv.</ab> -> <ls>Bodhicaryāv.</ls>
;; AB agreed; it was an error on my side.
1 <ab>Dh</ab> -> Dh (no markup)
;;AB it should be made as <ls>Dh</ls> at L-12899 (which again occurs at L-12899 in such form), whereas <ls>Dh.</ls> occurs at all other places (38 times)
2 <ab>P.K.</ab> -> <ab n="Pūraṇa Kāśyapa">P.K.</ab>
;;AB agreed
antareṇa , instr. of n. used as adv. and prep. neither 'name' nor 'nominative' tooltip makes sense ?
;;AB here n. could be taken as 'noun'; see the term 'noun' being used to describe it in this L-1154 entry (and in the prev. one L-1153).
[In grammar, the instrumental case (abbreviated INS or INSTR) is a grammatical case used to indicate that a noun is the instrument or means by or with which the subject achieves or accomplishes an action. The noun may be either a physical object or an abstract concept.]
So, can we mark it as <ab n="noun">n.</ab> (a local abbr.) instead?
The displays for bhs have been modified to utilize the various tooltips. Current dev1 version of displays at https://sanskrit-lexicon.uni-koeln.de/work/bhs-dev/dev1/web/
@funderburkjim
Just checked for the <acr> cases, and they are not 'used'.
<acr>altho</acr>: although (21)
<acr>Altho</acr>: Although (1)
<acr>qy</acr>: query (10)
<acr>Sktism</acr>: Sanskritism (34)
<acr>Sktization</acr>: Sanskritization (61)
<acr>Sktized</acr>: Sanskritized (38)
<acr>tho</acr>: though (222)
<acr>thru</acr>: through (79)
Have you opted to ignore them?
<acr> cases, and they are not 'used'.
Unfortunately, I forgot about your introduction of this new markup tag, as still another abbreviation. Now I see that you mentioned acr in this comment above.
I will attend to this.
The 'acr' examples are NOT acronyms:
An acronym is an abbreviation that is formed by taking the initial letters of the words in a phrase and creating a new word that is pronounceable.
Some common acronyms include ASAP (As Soon As Possible), BAE (Before Anyone Else), FOMO (Fear Of Missing Out), GIF (Graphics Interchange Format), LOL (Laughing Out Loud), PIN (Personal Identification Number), TTYL (Talk To You Later) and YOLO (You Only Live Once). These acronyms are used in various contexts such as pop culture, chat, military and government.
Source: Conversation with Bing, 8/6/2023
If we choose to mark them in some way that will permit tooltips, why not just use the 'ab' tag?
The 'acr' examples are NOT acronyms
I knew this and mentioned the same in my above post, @funderburkjim . [Seeing such forms in a formal dictionary is a bit awkward; so I thought these should be "marked" somehow.]
These are not even slang words(!!), but are short forms used in informal language; seen that these are appearing in the urbandictionary.com.
Yes, we sure can put them under ab-tag.
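Moving the eight <acr> forms under the ab-tag could be done mechanically. The following is a minimal sketch, assuming the expansions listed above and the local-abbreviation form <ab n="...">X</ab> already used elsewhere in this thread; the function name `acr_to_ab` is hypothetical, and the forms could equally be added as global entries in the tooltip file instead.

```python
import re

# Expansions as listed in the comment above.
ACR_EXPANSIONS = {
    "altho": "although", "Altho": "Although", "qy": "query",
    "Sktism": "Sanskritism", "Sktization": "Sanskritization",
    "Sktized": "Sanskritized", "tho": "though", "thru": "through",
}

ACR_RE = re.compile(r"<acr>([^<]*)</acr>")

def acr_to_ab(text):
    """Rewrite <acr>X</acr> as <ab n="expansion">X</ab>; unknown forms are left untouched."""
    def repl(match):
        short = match.group(1)
        expansion = ACR_EXPANSIONS.get(short)
        if expansion is None:
            return match.group(0)  # keep for manual review
        return f'<ab n="{expansion}">{short}</ab>'
    return ACR_RE.sub(repl, text)

print(acr_to_ab("<acr>tho</acr> the <acr>Sktized</acr> form"))
```

Leaving unknown forms as they are means that any <acr> not in the list remains visible for a manual check rather than being silently converted.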
@funderburkjim
Hope you had noted my opening post here, to correct while making this work 'public'. Probably this BHS work could go 'public' now itself, without waiting for my resolving the pending '?' tooltips.
I think these take into account the details mentioned since this comment above.
Revised display url: https://sanskrit-lexicon.uni-koeln.de/work/bhs-dev/dev1a/web/ Revised bhs.txt: temp_bhs_ab_1.zip
The current dev markup and displays could be installed now at Cologne: they improve the current cdsl displays.
@Andhrabharati You raised several questions regarding possible additional work on bhs.
<fr>, <ger>, and <tib> markup. I do think (1) and (2) should be completed sometime: such markup will be useful to the analysis of bhs by some future scholar.
Whether these revisions are done now or later depends on your schedule. If you decide to do this in the near future (by modifying temp_bhs_1_ab.txt), then I'll install that revision into cdsl. Otherwise, I'll install the current temp_bhs_1_ab.txt into cdsl.
(3) has less immediate importance. It can be done later.
One other comment for possible future revision will be mentioned in another comment below.
In the revised temp_bhs_1_ab.txt version, the line breaks have been removed - the text for each entry is joined into one text 'blob'. We have generally tried to preserve line-break information in those digitizations which originally honored line breaks. This has some utility in tracking down user corrections.
I think it is ok to forget printed line breaks, as has been done here in bhs.
For instance, it makes identification of patterns (e.g. ls references) easier.
See #2 for an idea how to address the long-lines currently in bhs digitization.
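The point about pattern identification can be illustrated with a small sketch. It assumes, purely for illustration, that a line-broken digitization could split an <ls> reference across two printed lines; the sample entry and its citation numbers are hypothetical.

```python
import re

LS_RE = re.compile(r"<ls>([^<]*)</ls>")

# Hypothetical entry text with a printed line break inside the first reference.
broken = ["cf. <ls>SP", "12.3</ls>; also <ls>SP 45.6</ls>"]
joined = " ".join(broken)

# Searching line by line misses the reference that straddles the break.
per_line = [ref for line in broken for ref in LS_RE.findall(line)]
# Searching the joined text finds both references.
whole = LS_RE.findall(joined)

print(per_line)
print(whole)
```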
I would like to suggest showing the tib-, fr- and ger- strings differently from the rest (eng-) of the text [maybe in a different colour/background, say yellow]; this would immediately catch/attract the user's eye.
Without such rendering, the markup just lies hidden inside the text file (which NOT many users would ever 'know'); and I do not see much benefit taking up the work-(2) to 'completion'.
Pl. push this file to cdsl, as I am still "meddling" with PWG; and it might take a few more days to close this work to my satisfaction.
one small correction-- the tooltip for Skt. is now Sanskri Language, with t missing.
@Andhrabharati You raised several questions regarding possible additional work on bhs.
I had mentioned about marking the composite words as well, in addition to the grouped words.
Revised display url: https://sanskrit-lexicon.uni-koeln.de/work/bhs-dev/dev1b/web/.
No changes to temp_bhs_ab_1.
<ed> markup: See litsrc/readme.txt at '08-08-2023 revision' / 'match_ab_final.txt'. Note AnSS tooltip not known.
Now appears in brown color. Noticed a couple of unmarked German text fragments.
cdsl (and the prior dev1a displays) use oldstandard_i font for italic text (<i>X</i>). However, this displays as not only italic but also bold. I changed this so oldstandard_i font is NOT used, so italic text is just italic. I think this looks better.
Note: This will also apply to other dictionaries.
@Andhrabharati <ms>Ḱ</ms> occurs 57 times (no other <ms>). Do you know what Ḱ is? Does it need a tooltip?
Here is where it was mentioned--
SP = Saddharmapuṇḍarīka, ed. Kern and Nanjio, St. Petersburg, 1912, abbreviated KN; supplementary references to ed. of Wogihara and Tsuchida, Tokyo, 1934—35, abbreviated WT; fragments of ‘Kashgar’ or Central Asiatic recension, ed. Thomas and Lüders, ap. Hoernle, MR 133 ff., 144 ff.; others, ed. LaVallée Poussin, JRAS 1911, 1070 ff.; transl. Burnouf (Lotus de la Bonne Loi, Paris, 1852), and Kern (SBE 21, Oxford, 1884). Tibetan citations chiefly from block-print in my possession, partly from WT. When my work was practically ready for print, my colleague Professor Rahder received, and lent to me, the photostatic reproduction of the ms. referred to by WT as Ḱ. It has been cited a very few times. The quotations from it in WT seem to be very inaccurate.
I don't think we need to go much further (like looking into the WT's ed. of SP for knowing the actual detail)! [If, at all, a tooltip is to be provided, it could be "WT's Ḱ manuscript", with this info at hand.]
Thanks. Using this tooltip:
Ḱ ms,1,FR1 Wogihara and Tsuchidaʼs Ḱ manuscript of Saddharmapuṇḍarīka
I do think (1) and (2) should be completed sometime: such markup will be useful to the analysis of bhs by some future scholar.
@funderburkjim
I see that you had used some word-lists of German and French, in this BHS repo for some analytics.
Would you pl. try a programmatic approach to mark the french and german words (using these lists) in the BHS.txt? [I think, I had marked all the Tibetan words in my posted file.]
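A programmatic pass of the kind asked for here might look like the sketch below. It is an assumption-laden illustration, not an existing tool: the function name `mark_words` and the sample inputs are hypothetical, the word lists would be read from the repo's list files, and text inside any existing element is conservatively left alone. Since the lists are known to contain misfiled items (für, svastika), and any listed word that is also an English word would be over-marked, the output would still need manual review.

```python
import re

def mark_words(text, words, tag):
    """Wrap whole-word occurrences of `words` in <tag>...</tag>, skipping existing markup.

    Intended to be applied to one entry line at a time.
    """
    if not words:
        return text
    # Longest words first so that longer list items win over their prefixes.
    alternation = "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))
    token_re = re.compile(
        r"(<(\w+)\b[^>]*>.*?</\2>|<[^>]+>)"    # group 1: an existing element or a lone tag
        rf"|(?<!\w)({alternation})(?!\w)"      # group 3: a listed word standing alone
    )

    def repl(match):
        if match.group(1) is not None:
            return match.group(1)              # existing markup: leave untouched
        return f"<{tag}>{match.group(3)}</{tag}>"

    return token_re.sub(repl, text)

# Hypothetical sample; the real lists would come from the word-list files.
german = {"für", "und"}
print(mark_words("cf. für and <i>für</i> und so", german, "ger"))
```

Skipping text inside existing elements avoids nesting new tags into <ls>, <ab> and similar markup, at the cost of missing foreign words that sit inside italics; that trade-off could be reversed if italic spans turn out to hold most of the French and German strings.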
Revise bhs-meta2.txt and bhsheader.xml.
Ref: issues/issue1/meta directory. Also litsrc directory (08-09-2023 notes).
Revised display url: https://sanskrit-lexicon.uni-koeln.de/work/bhs-dev/dev1c/web/.
Revised version of bhs now installed at cdsl.
Time to close this issue.
The xml header file has
<docAuthor>by Franklin Edgerton, Serling Professor of
which should've been
<docAuthor>by Franklin Edgerton, Sterling Professor of