BHS issues- Andhrabharati

Andhrabharati commented 2 years ago

The xml header file has `<docAuthor>by Franklin Edgerton, Serling Professor of`, which should have been `<docAuthor>by Franklin Edgerton, Sterling Professor of`.

Andhrabharati commented 2 years ago

The metalines have quite many entries with = in k2-field. These need to be appropriately corrected.
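A minimal sketch of how such metalines could be listed for review. It assumes the usual CDSL metaline layout (`<L>…<pc>…<k1>…<k2>…`); the function name and the sample lines are made up for illustration and are not taken from bhs.txt.

```python
import re

# Assumption: a metaline starts with "<L>" and carries a "<k2>" field
# that runs up to the next "<" or the end of the line.
K2_FIELD = re.compile(r"<k2>([^<]*)")

def metalines_with_equals_in_k2(lines):
    """Return (line_number, line) pairs whose k2 field contains '='."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith("<L>"):
            continue  # entry body or <LEND>, not a metaline
        match = K2_FIELD.search(line)
        if match and "=" in match.group(1):
            hits.append((number, line))
    return hits

# Hypothetical sample lines.
sample = [
    "<L>1<pc>001,1<k1>aMSa<k2>aMSa",
    "body text that happens to contain an = sign",
    "<LEND>",
    "<L>2<pc>001,1<k1>akaniRWa<k2>akaniRWa = aGaniRWa",
]
hits = metalines_with_equals_in_k2(sample)
for number, line in hits:
    print(number, line)
```

The output is only a worklist; deciding how each `=` should be corrected still needs a look at the printed entry.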

Andhrabharati commented 2 years ago

The abbr and ls list is missing in the digitisation.

The same is done now (from the part-1 of the work) and posted here. BHS Grammar_Front pages.txt

Andhrabharati commented 2 years ago

Many entries seem to have composite words listed under the entry, without any such notation/marking.

All those composite words could be made "prominent".

Andhrabharati commented 2 years ago

Just as in BUR, many grouped entries (OR, AND, ALSO, CSV lists) are seen in this BHS as well, which could appropriately be "handled" as in MW.

Andhrabharati commented 2 years ago

Now I've started doing all these and many more corrections/changes in the BHS text.

Shall be posting the file, once I am done with the work.

gasyoun commented 2 years ago

Sterling

Good catch

drdhaval2785 commented 2 years ago

Done, @Andhrabharati ?

Andhrabharati commented 1 year ago

@funderburkjim

Here is my version of BHS, marking various tags like abbr.s, acr.s and ls entities.

BHS-AB (Tib done).zip

[I guess you would find this file usable at CDSL as is.]

Also I had tried to mark the Tibetan text-strings within italics. Similar exercise was started for French and German text-strings, but is not done fully yet. If this markup makes some sense (and has any benefit), we can resume this part and complete in a short time.

PS. I had earlier posted above the abbr.s and ls entities listed in the first volume of Edgerton (The BHS Grammar), which is common with the second volume (The BHS Dictionary) as well.

funderburkjim commented 1 year ago

@Andhrabharati What is 'acr' element ?

466 matches in 427 lines for "<acr>" in buffer: temp_bhs_ab_1.txt

Andhrabharati commented 1 year ago

acronym.

Andhrabharati commented 1 year ago

I had marked the words that appeared as "informal" type (non-standard ?) as acronyms, though I know that they aren't acronyms "by definition".

funderburkjim commented 1 year ago

<ms> is Manuscript?

Andhrabharati commented 1 year ago

Yes.

funderburkjim commented 1 year ago

Additions to dtd for bhs

<!ELEMENT tib (#PCDATA) > <!-- Tibetan text, bhs-->
<!ELEMENT ger (#PCDATA) > <!-- German text, bhs-->
<!ELEMENT fr (#PCDATA | i | ab)* > <!-- French text, bhs-->
<!ELEMENT ed (#PCDATA) > <!-- edition bhs-->
<!ELEMENT ms (#PCDATA) > <!-- manuscript bhs-->
<!ELEMENT lat (#PCDATA) > <!-- latin text bhs-->
<!ELEMENT toch (#PCDATA) > <!-- Tocharian text bhs-->

<!ELEMENT acr (#PCDATA) >  NOTE: removed per discussion below
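One way to cross-check such dtd additions against the markup actually present in the text is sketched below. This is only an illustration under assumptions: the helper names are hypothetical, the dtd is read with a simple regular expression rather than a real DTD parser, and the sample body text is invented.

```python
import re

# Element names declared in a dtd fragment, e.g. "<!ELEMENT tib (#PCDATA) >".
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([A-Za-z][\w.-]*)")
# Names of opening (or self-closing) tags used in the text.
OPEN_TAG = re.compile(r"<([A-Za-z][\w.-]*)[\s/>]")

def declared_elements(dtd_text):
    """Set of element names declared in the dtd text."""
    return set(ELEMENT_DECL.findall(dtd_text))

def undeclared_tags(body_text, dtd_text):
    """Sorted list of tag names used in body_text but not declared in dtd_text."""
    used = set(OPEN_TAG.findall(body_text))
    return sorted(used - declared_elements(dtd_text))

dtd_fragment = """
<!ELEMENT tib (#PCDATA) > <!-- Tibetan text, bhs-->
<!ELEMENT ger (#PCDATA) > <!-- German text, bhs-->
"""
# Invented body text, for illustration only.
body = "<tib>x</tib> and <ger>y</ger>, <acr>tho</acr> rare"
missing = undeclared_tags(body, dtd_fragment)
print(missing)
```

A tag reported here either needs a dtd declaration or should be replaced by an existing tag.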

funderburkjim commented 1 year ago

duplicate lex/ab markup

There are 6 cases where a given (global) abbreviation is marked in two ways in bhs

check_dups found dup abbrev: "acc."
  OLD  ab acc.::932
  NEW  lex acc.::615
check_dups found dup abbrev: "accs."
  OLD  ab accs.::1
  NEW  lex accs.::6
check_dups found dup abbrev: "conj."
  OLD  ab conj.::1
  NEW  lex conj.::4
check_dups found dup abbrev: "derivs."
  OLD  ab derivs.::1
  NEW  lex derivs.::24
check_dups found dup abbrev: "n."
  OLD  ab n.::6542
  NEW  lex n.::45
check_dups found dup abbrev: "nom."
  OLD  ab nom.::2
  NEW  lex nom.::110
check_dups found 6 duplicate abbrevations

Should the file be changed to use just one markup?
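For readers who want to reproduce this kind of report, here is a rough sketch of a duplicate check. It is not the actual check_dups program; the function names and the sample string are hypothetical, and it simply counts every abbreviation that appears under both the `ab` and the `lex` tag.

```python
import re
from collections import Counter

# Matches <ab>X</ab> and <lex>X</lex>, allowing attributes such as <ab n="...">.
TAGGED = re.compile(r"<(ab|lex)(?: [^>]*)?>([^<]*)</\1>")

def count_tagged(text):
    """Count occurrences of each (tag, abbreviation) pair."""
    counts = Counter()
    for tag, abbrev in TAGGED.findall(text):
        counts[(tag, abbrev)] += 1
    return counts

def find_dups(text):
    """Map each abbreviation marked both ways to its (ab_count, lex_count)."""
    counts = count_tagged(text)
    ab_forms = {abbrev for (tag, abbrev) in counts if tag == "ab"}
    lex_forms = {abbrev for (tag, abbrev) in counts if tag == "lex"}
    return {
        abbrev: (counts[("ab", abbrev)], counts[("lex", abbrev)])
        for abbrev in sorted(ab_forms & lex_forms)
    }

# Invented sample text.
sample_text = (
    "<lex>n.</lex> of a king; <ab>n.</ab> of a place; "
    "<lex>acc.</lex> sg.; <ab>acc.</ab> to Tib.; <ab>etc.</ab>"
)
dups = find_dups(sample_text)
for abbrev, (ab_count, lex_count) in dups.items():
    print(f'dup abbrev "{abbrev}": ab={ab_count} lex={lex_count}')
```

A duplicate is not necessarily an error: the same printed form may carry two different expansions, which is the case a local/global distinction is meant to handle.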

Andhrabharati commented 1 year ago

Should the file be changed to use just one markup?

I did this markup long back, before I ventured upon looking FULLY at any other work; in recent times, I started marking the abbr.s as local and global types, if they have different expansions.

OLD ab acc.::932 --> this denotes according
NEW lex acc.::615 --> this denotes accusative

OLD ab accs.::1 --> this can be changed to lex type
NEW lex accs.::6

OLD ab conj.::1 --> this can be changed to lex type
NEW lex conj.::4

OLD ab derivs.::1 --> this can be changed to lex type
NEW lex derivs.::24

OLD ab n.::6542 --> this denotes name
NEW lex n.::45 --> this denotes nominative

[But, the n. in <ab>n. act.</ab>, <ab>n. ag.</ab>, <ab>n. pl.</ab>, <ab>n. pr.</ab>, <ab>n. sg.</ab> is to be expanded as nominative.]

OLD ab nom.::2 --> this can be changed to lex type
NEW lex nom.::110

Andhrabharati commented 1 year ago

@funderburkjim

Just noted that this repo has "eng_error_lang" folder having german, french and tibetan words listed in separate files. [I did my markups independently earlier.]

Thinking of resuming the <fr> and <ger> markup using your lists, I had a quick look at the above 3 files.

I see that the French list has für 5 (which is a German word and is in the German list as well), and svastika 1 (which is a Sanskrit word).
My Tibetan marking covered much more words than listed in your file.

Do you see any point in continuing this markup, as I had asked above?

gasyoun commented 1 year ago

My Tibetan marking covered much more words than listed in your file.

Where is yours?

Andhrabharati commented 1 year ago

Inside the file.

funderburkjim commented 1 year ago

abbreviation resolution/display

Work materials present in the issues/issue1/ folder.

The displays for bhs have been modified to utilize the various tooltips. Current dev1 version of displays at https://sanskrit-lexicon.uni-koeln.de/work/bhs-dev/dev1/web/

The main thing I had to do was match the abbreviation material of the BHS front matter with the markup of the revised bhs.txt. Two files show the matching:

match_ab_final.txt for the <ab>, <lex>, <lang> tags. In the displays, the lex and lang tags tooltips are considered part of the general abbreviations 'ab' tag tooltips.
match_ls_final.txt

These files are tsv (tab-separated-values) files with 3 fields:

- abbreviation
- info, which has 3 subfields:
  - tag
  - count: number of instances in bhs.txt
    - in ls, count may be 0, indicating a front-matter abbreviation with no bhs.txt instances.
  - source of tooltip:
    - FR0 from the front-matter of printed text
    - FR1 inferred
- tooltip
  - NOTE: in ls, the tips marked with curly brackets are 'pointers' to other abbreviations. e.g. {AbhidhK.} means use the tooltip for abbreviation `AbhidhK.`, which is `Abhidharmakośa, transl. LaVallée Poussin ...`.
  - In ls, the tip may be '?' -- these are not resolved.
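The layout described above can be read with a few lines of Python. This is a sketch, not the program that builds the tooltip file: it assumes the three subfields of `info` are comma-separated (as in a value like `ms,1,FR1`), and the sample rows, counts and helper names are hypothetical.

```python
from typing import NamedTuple

class MatchRow(NamedTuple):
    abbreviation: str
    tag: str
    count: int
    source: str   # FR0 (front matter) or FR1 (inferred)
    tooltip: str

def parse_match_line(line):
    """Split one tab-separated line into its fields and info subfields."""
    abbreviation, info, tooltip = line.rstrip("\n").split("\t", 2)
    tag, count, source = info.split(",")  # assumed comma-separated
    return MatchRow(abbreviation, tag, int(count), source, tooltip)

def unresolved(rows):
    """Rows whose tooltip is still '?'."""
    return [row for row in rows if row.tooltip == "?"]

def resolve_pointer(row, by_abbrev):
    """Follow a {pointer} tip one step; return the tip unchanged otherwise."""
    tip = row.tooltip
    if tip.startswith("{") and tip.endswith("}"):
        target = by_abbrev.get(tip[1:-1])
        if target is not None:
            return target.tooltip
    return tip

# Hypothetical rows; the counts and the second and third abbreviations are invented.
sample = [
    "AbhidhK.\tls,10,FR0\tAbhidharmakośa, transl. LaVallée Poussin",
    "AbhidhK. LaV-P.\tls,0,FR1\t{AbhidhK.}",
    "Xyz.\tls,3,FR1\t?",
]
rows = [parse_match_line(line) for line in sample]
by_abbrev = {row.abbreviation: row for row in rows}
print(resolve_pointer(rows[1], by_abbrev))
print([row.abbreviation for row in unresolved(rows)])
```

Listing the `?` rows this way gives a compact worklist that can be edited in a copy of the match file and sent back.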

The main thing remaining to be done is to examine the tips with a '?'. There are many (450+) of these abbreviations for ls, although these represent only 1150+ instances in bhs; by contrast, 47115 ls instances have an assigned tooltip -- so these unassigned abbreviations represent less than 2% of the instances.

@Andhrabharati If you decide to examine those '?', the simplest way to communicate the results back to me would be for you to edit (copies of) the match_X_final.txt files, from which a program extracts the tooltip file used in displays.

funderburkjim commented 1 year ago

@Andhrabharati If you make revisions in your bhs.txt digitization, please take into account the changes which I have made in the version you uploaded. These changes are in file change_1_ab.txt (23 lines changed). A small number of possible changes are mentioned at 'Possible mis-labeling' in litsrc/readme.txt.

gasyoun commented 1 year ago

unassigned abbreviations represent less than 2% of the instances.

Not a big issue after all.

Andhrabharati commented 1 year ago

@Andhrabharati If you decide to examine those '?', the simplest way to communicate the results back to me would be for you to edit (copies of) the match_X_final.txt files, from which a program extracts the tooltip file used in displays.

Sure @funderburkjim, I can have a look at those, but probably after a day or two more. (Currently I am making the PWG in the same format as pwk and pwkvn, so that all three "as a set" could go together in a uniform manner; at the moment these three are in three different formats!)

Andhrabharati commented 1 year ago

unassigned abbreviations represent less than 2% of the instances.

Not a big issue after all.

You are wrong @gasyoun !

Jim didn't put his statement properly to 'reach' you. 450 (though many are just variants of a few) items [leave aside their occurrences!!] out of 1026 is NOT a small percentage.
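The two ways of counting can be put side by side with a small calculation. The inputs are the rounded figures quoted in this thread ("450+" and "1150+" are taken at their lower bounds), so the results are approximate.

```python
def share(part, whole):
    """Percentage of whole represented by part."""
    return 100.0 * part / whole

unresolved_items = 450        # distinct ls abbreviations with a '?' tip ("450+")
total_items = 1026            # total number of ls abbreviations quoted above
unresolved_instances = 1150   # occurrences of those abbreviations ("1150+")
resolved_instances = 47115    # ls occurrences that already have a tooltip

item_share = share(unresolved_items, total_items)
instance_share = share(
    unresolved_instances, unresolved_instances + resolved_instances
)
print(f"unresolved items:     {item_share:.1f}%")
print(f"unresolved instances: {instance_share:.1f}%")
```

Counted by distinct abbreviation the unresolved share is about 44%, while counted by occurrence it is in the region of 2%; both statements describe the same data.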

Andhrabharati commented 1 year ago

A small number of possible changes are mentioned at 'Possible mis-labeling' in litsrc/readme.txt.

Here are my responses point-wise--

1 <lang>JM</lang> -> <ls>JM</ls> Jaina Māhārāṣṭrī.
41 <lang>JM.</lang> -> <ls>JM.</ls> Jaina Māhārāṣṭrī.

;;AB this is language only, being the Jaina variant of the Māhārāṣṭrī Prākrit.

3 <lang>Mg.</lang> -> <ab>Mg.</ab> meaning

;; AB under L-5940, this indicates the language Māgadhi; while at the other two places it is <ab>Mg.</ab> : meaning, which could be made as <ab>mg.</ab> (1000+ times)

2 <ab>Bodhicaryāv.</ab> -> <ls>Bodhicaryāv.</ls>

;; AB agreed; it was an error on my side.

1 <ab>Dh</ab> -> Dh (no markup)

;;AB it should be made as <ls>Dh</ls> at L-12899 (which again occurs at L-12899 in such form), whereas <ls>Dh.</ls> occurs at all other places (38 times)

2 <ab>P.K.</ab> -> <ab n="Pūraṇa Kāśyapa">P.K.</ab>

;;AB agreed

antareṇa , instr. of n. used as adv. and prep. neither 'name' nor 'nominative' tooltip makes sense ?

;;AB here n. could be taken as 'noun'; see the term 'noun' being used to describe it in this L-1154 entry (and in the prev. one L-1153).

[In grammar, the instrumental case (abbreviated INS or INSTR) is a grammatical case used to indicate that a noun is the instrument or means by or with which the subject achieves or accomplishes an action. The noun may be either a physical object or an abstract concept.]

So, can we mark it as <ab n="noun">n.</ab> (a local abbr.) instead?

Andhrabharati commented 1 year ago

The displays for bhs have been modified to utilize the various tooltips. Current dev1 version of displays at https://sanskrit-lexicon.uni-koeln.de/work/bhs-dev/dev1/web/

@funderburkjim

Just checked for the <acr> cases and, they are not 'used'.

<acr>altho</acr>: although (21)
<acr>Altho</acr>: Although (1)
<acr>qy</acr>: query (10)
<acr>Sktism</acr>: Sanskritism (34)
<acr>Sktization</acr>: Sanskritization (61)
<acr>Sktized</acr>: Sanskritized (38)
<acr>tho</acr>: though (222)
<acr>thru</acr>: through (79)

Have you opted to ignore them?

funderburkjim commented 1 year ago

<acr> cases and, they are not 'used'.

Unfortunately, I forgot about your introduction of this new markup tag, as still another abbreviation. Now I see that you mentioned acr in this comment above.

I will attend to this.

funderburkjim commented 1 year ago

The 'acr' examples are NOT acronyms:

An acronym is an abbreviation that is formed by taking the initial letters of the words in a phrase and creating a new word that is pronounceable.

Some common acronyms include ASAP (As Soon As Possible), BAE (Before Anyone Else), FOMO (Fear Of Missing Out), GIF (Graphics Interchange Format), LOL (Laughing Out Loud), PIN (Personal Identification Number), TTYL (Talk To You Later) and YOLO (You Only Live Once). These acronyms are used in various contexts such as pop culture, chat, military and government.

Source: Conversation with Bing, 8/6/2023

If we choose to mark them in some way that will permit tooltips, why not just use the 'ab' tag?

Andhrabharati commented 1 year ago

The 'acr' examples are NOT acronyms

I knew this and mentioned the same in my above post, @funderburkjim . [Seeing such forms in a formal dictionary is a bit awkward; so I thought these should be "marked" somehow.]

These are not even slang words(!!), but are short forms used in informal language; seen that these are appearing in the urbandictionary.com.

Yes, we sure can put them under ab-tag.
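A possible way to carry out that change mechanically is sketched here. The expansions are the ones listed earlier in this thread; the function names and the sample sentence are hypothetical, and the sketch only retags the text without touching the tooltip files.

```python
import re

# Short forms and expansions as listed earlier in this thread.
EXPANSIONS = {
    "altho": "although",
    "Altho": "Although",
    "qy": "query",
    "Sktism": "Sanskritism",
    "Sktization": "Sanskritization",
    "Sktized": "Sanskritized",
    "tho": "though",
    "thru": "through",
}

ACR = re.compile(r"<acr>([^<]*)</acr>")

def acr_to_ab(text):
    """Retag every <acr>X</acr> as <ab>X</ab>."""
    return ACR.sub(r"<ab>\1</ab>", text)

def unknown_acr_forms(text):
    """Short forms marked with <acr> that have no listed expansion."""
    return sorted({form for form in ACR.findall(text) if form not in EXPANSIONS})

# Invented sample sentence.
sample_text = "<acr>tho</acr> the <acr>Sktized</acr> form occurs <acr>thruout</acr>"
print(acr_to_ab(sample_text))
print(unknown_acr_forms(sample_text))
```

Running the unknown-form check before the retagging would show whether any short form still needs a tooltip entry.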

Andhrabharati commented 1 year ago

@funderburkjim

Hope you had noted my opening post here, to correct while making this work 'public'. Probably this BHS work could go 'public' now itself, without waiting for my resolving the pending '?' tooltips.

funderburkjim commented 1 year ago

minor revisions

I think these take into account the details mentioned since this comment above.

Revised display url: https://sanskrit-lexicon.uni-koeln.de/work/bhs-dev/dev1a/web/
Revised bhs.txt: temp_bhs_ab_1.zip

The current dev markup and displays could be installed now at Cologne: they improve the current cdsl displays.

@Andhrabharati You raised several questions regarding possible additional work on bhs.

1. Resolution of tooltips for those 450+ ls abbreviations (1150+ instances) currently marked with '?'.
2. Completion of the <fr>, <ger>, and <tib> markup.
3. Just as in BUR, many grouped entries (OR, AND, ALSO, CSV lists) are seen in this BHS as well, which could appropriately be "handled" as in MW.

I do think (1) and (2) should be completed sometime: such markup will be useful to the analysis of bhs by some future scholar.

Whether these revisions are done now or later depends on your schedule. If you decide to do this in the near future (by modifying temp_bhs_ab_1.txt), then I'll install that revision into cdsl. Otherwise, I'll install the current temp_bhs_ab_1.txt into cdsl.

(3) has less immediate importance. It can be done later.

One other comment for possible future revision will be mentioned in another comment below.

funderburkjim commented 1 year ago

line-breaks absent

In the revised temp_bhs_ab_1.txt version, the line breaks have been removed - the text for each entry is joined into one text 'blob'. We have generally tried to preserve line-break information in those digitizations which originally honored line breaks. This has some utility in tracking down user corrections.

I think it is ok to forget printed line breaks, as has been done here in bhs.
For instance, it makes identification of patterns (e.g. ls references) easier.

See #2 for an idea how to address the long-lines currently in bhs digitization.

Andhrabharati commented 1 year ago

I would like to suggest showing the tib-, fr- and ger- strings differently from the rest (eng-) of the text [maybe in a different colour/background, say yellow]; this would immediately catch/attract the user's eye.

Without such rendering, the markup just lies hidden inside the text file (which NOT many users would ever 'know'); and I do not see much benefit taking up the work-(2) to 'completion'.

Andhrabharati commented 1 year ago

Pl. push this file to cdsl, as I am still "meddling" with PWG, and it might take a few more days to close this work to my satisfaction.

Andhrabharati commented 1 year ago

one small correction-- the tooltip for Skt. is now Sanskri Language, with t missing.

Andhrabharati commented 1 year ago

@Andhrabharati You raised several questions regarding possible additional work on bhs.

I had mentioned about marking the composite words as well, in addition to the grouped words.

funderburkjim commented 1 year ago

revisions, continued

Revised display url: https://sanskrit-lexicon.uni-koeln.de/work/bhs-dev/dev1b/web/.

No changes to temp_bhs_ab_1.

Corrected tooltip mentioned above.

Add 'ab' tooltips for the `<ed>` markup:

See litsrc/readme.txt at '08-08-2023 revision' / 'match_ab_final.txt'. Note: AnSS tooltip not known.

display of tib, ger, fr

Now appears in brown color. Noticed a couple of unmarked German text fragments.

italic text font

cdsl (and the prior dev1a displays) use oldstandard_i font for italic text (<i>X</i>). However, this displays as not only italic but also bold. I changed this so oldstandard_i font is NOT used, so italic text is just italic. I think this looks better. Note: This will also apply to other dictionaries.

funderburkjim commented 1 year ago

@Andhrabharati <ms>Ḱ</ms> occurs 57 times (no other <ms>), Do you know what Ḱ is? Does it need a tooltip?

Andhrabharati commented 1 year ago

Here is where it was mentioned--

SP = Saddharmapuṇḍarīka, ed. Kern and Nanjio, St. Petersburg, 1912, abbreviated KN; supplementary references to ed. of Wogihara and Tsuchida, Tokyo, 1934—35, abbreviated WT; fragments of ‘Kashgar’ or Central Asiatic recension, ed. Thomas and Lüders, ap. Hoernle, MR 133 ff., 144 ff.; others, ed. LaVallée Poussin, JRAS 1911, 1070 ff.; transl. Burnouf (Lotus de la Bonne Loi, Paris, 1852), and Kern (SBE 21, Oxford, 1884). Tibetan citations chiefly from block-print in my possession, partly from WT. When my work was practically ready for print, my colleague Professor Rahder received, and lent to me, the photostatic reproduction of the ms. referred to by WT as Ḱ. It has been cited a very few times. The quotations from it in WT seem to be very inaccurate.

I don't think we need to go much further (like looking into the WT's ed. of SP for knowing the actual detail)! [If, at all, a tooltip is to be provided, it could be "WT's Ḱ manuscript", with this info at hand.]

funderburkjim commented 1 year ago

Thanks. Using this tooltip:

Ḱ ms,1,FR1 Wogihara and Tsuchidaʼs Ḱ manuscript of Saddharmapuṇḍarīka

Andhrabharati commented 1 year ago

I do think (1) and (2) should be completed sometime: such markup will be useful to the analysis of bhs by some future scholar.

@funderburkjim

I see that you had used some word-lists of German and French, in this BHS repo for some analytics.

Would you pl. try a programmatic approach to mark the french and german words (using these lists) in the BHS.txt? [I think, I had marked all the Tibetan words in my posted file.]
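A programmatic pass of this kind might look like the sketch below. It is only an illustration under assumptions: the function name, the tiny word list and the sample sentence are hypothetical, words are matched one at a time, and anything already inside another tag is left untouched.

```python
import re

TAG = re.compile(r"(<[^>]+>)")   # any markup tag
WORD = re.compile(r"\w+")        # a run of word characters (Unicode-aware)

def mark_words(text, words, tag):
    """Wrap listed words in <tag>...</tag> wherever they occur outside existing markup."""
    wordset = set(words)

    def wrap(match):
        word = match.group(0)
        return f"<{tag}>{word}</{tag}>" if word in wordset else word

    out = []
    depth = 0  # number of currently open tags
    for piece in TAG.split(text):
        if TAG.fullmatch(piece):
            if piece.startswith("</"):
                depth -= 1
            elif not piece.endswith("/>"):
                depth += 1
            out.append(piece)
        elif depth == 0:
            out.append(WORD.sub(wrap, piece))
        else:
            out.append(piece)  # already inside some markup, leave as is
    return "".join(out)

# Hypothetical word list and sample text.
german_words = {"für", "das"}
sample_text = "see <i>für</i> and für das <fr>le</fr> für"
marked = mark_words(sample_text, german_words, "ger")
print(marked)
```

Such a pass is only as good as its word list: short words shared with English, or entries filed under the wrong language (like the für and svastika cases noted above), would be tagged wrongly, so the result would still need a manual review.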

funderburkjim commented 1 year ago

Revise bhs-meta2.txt and bhsheader.xml.

Ref: issues/issue1/meta directory. Also litsrc directory (08-09-2023 notes).

Revised display url: https://sanskrit-lexicon.uni-koeln.de/work/bhs-dev/dev1c/web/.

funderburkjim commented 1 year ago

Revised version of bhs now installed at cdsl.

Time to close this issue.

sanskrit-lexicon / BHS