sanskrit-lexicon / AP90

Research work on Apte Sanskrit-English Dictionary of 1890
GNU General Public License v3.0

Misc. markup corrections #15

Open funderburkjim opened 3 years ago

funderburkjim commented 3 years ago

While working with the Devanagari markup changes (#14), I noticed other similar potential changes elsewhere in ap90.txt. This issue is devoted to such cleanup.

funderburkjim commented 3 years ago

Missing bold markup on sense '1'

When there are multiple senses for an entry, the usual convention of ap90 is that they are indicated in the printed text by sequential bold numerals, and all but the first are preceded by a long dash. [image]

In a fairly small number of cases (such as this), the ap90.txt digitization is missing the bold markup for the '1':

<L>5258<pc>0201-a<k1>aSizWa<k2>aSizWa
{#aSizWa#}¦ {%<ab>a.</ab>%} 1 Eating much. {@--2@}
<><ab>Ved.</ab> Reaching very far. {#--zWaH#} Fire.
<LEND>

The change to be made in this example is to apply the current ap90 convention for bold markup, {@X@}:

{#aSizWa#}¦ {%<ab>a.</ab>%} {@1@} Eating much. {@--2@}

Now there are many other 1 (space+1+space) instances elsewhere which we don't want to mark up, such as with verb entries, where '1' indicates the conjugation class.

{#aMh#}¦ 1 <ab>A.</ab> {#aMhate, aMhituM#}

One way to get at most of the 1's which need to be marked bold is to use the context of the aSizWa example. \(¦.*%}\) 1 -> \1 {@1@}.
[Note the \(...\) syntax is peculiar to Emacs; Python would not require the escape \ before the capturing parentheses.]

This changes 78 cases.
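
For readers working in Python rather than Emacs, the same substitution might look like the sketch below. The function name `mark_first_sense` is hypothetical, and the lookahead for a trailing space is an assumption based on the "space+1+space" description above; this is not the script actually used for the 78 changes.

```python
import re

# Emacs \(¦.*%}\) 1  becomes  (¦.*%\}) 1  in Python; capturing parentheses are unescaped.
# The lookahead (?= ) assumes the '1' is followed by a space, as described above.
BOLD_ONE = re.compile(r"(¦.*%\}) 1(?= )")

def mark_first_sense(line: str) -> str:
    """Wrap an unmarked sense number 1 in {@...@} when it follows '¦ ... %}'."""
    return BOLD_ONE.sub(r"\1 {@1@}", line, count=1)

sense_line = "{#aSizWa#}¦ {%<ab>a.</ab>%} 1 Eating much. {@--2@}"
verb_line = "{#aMh#}¦ 1 <ab>A.</ab> {#aMhate, aMhituM#}"
print(mark_first_sense(sense_line))
print(mark_first_sense(verb_line))  # no '%}' before the 1, so the line is left alone
```

The verb line is untouched because the pattern requires a `%}` between the broken bar and the numeral, which is what separates the conjugation-class 1 from the sense-number 1 in these two examples.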

Andhrabharati commented 3 years ago

I would just like to inform you that I made all these corrections, and many more, on the AP90 data during the last 3-4 days.

Most probably my work will be over in another few days, and the final format would be similar to my earlier work on AP57 (DSAL) data.

However, it is in a drastically different form than the Cologne style!!

Andhrabharati commented 3 years ago

(Fully split and expanded, similar to the way MW99 is done).

Pure unicode text, and no Cologne style tagging.

funderburkjim commented 3 years ago

Will be interested to see your work -- what you started with and what you end up with.

funderburkjim commented 3 years ago

A few additional bold 1 cases:

additional_bold_1.txt

Commit.

gasyoun commented 3 years ago

[Note the \(...\) syntax is peculiar to Emacs; Python would not require the escape \ before the capturing parentheses.]

:robot:

Pure unicode text, and no Cologne style tagging.

Right, no tags. But actually the bolding can be converted to tags as well.

Andhrabharati commented 3 years ago

Will be interested to see your work -- what you started with and what you end up with.

I started with the AP90 XML file downloaded in 2018.

The <ls> entry comparison is given in the file below. AP90_ls_comparision.txt The AB count is higher where the ending dot was either missed or typed as a comma (and has been corrected). I need to see why @funderburkjim got higher counts for some items. (At the few places that I checked, they are far off from the XML file counts.)

And here are the Abbr.s culled out from my work. AP90_abbr.txt

[I did not work on Appendix I (Prosody), and it might give some more additions to these lists.]

And my observations while I was on the task are as below.

Data errors

  1. missing/extra मात्रा
  2. missing बिन्दु
  3. त-न, थ-य, ध-व, प-ष, ब-व, भ-म, ... errors
  4. δ Virginis typed as Ś viriginis at the entry 'ap'.

Book errors

  1. "Comp." header missed at some places
  2. Square bracket and round parenthesis pairing missed at some places

I would like to proofread the HWs portion next, as they have the major role in search and comparison with other dictionaries.

gasyoun commented 3 years ago

HWs portion next, as they have the major role in search and comparison with other dictionaries.

Totally for it, thanks.

Andhrabharati commented 3 years ago

Missed the notes data in the ls comparison file posted above. [image]

Andhrabharati commented 3 years ago

There is probably a need for a Devanagari abbr. file too in this bilingual work!

Some are listed in the file below- Head-abbr (Skt).txt

Andhrabharati commented 3 years ago

This is a sample section from the file I am working on- [image]

The column "Comp. HW (Book)" is for reference till the HWs are fully expanded, and would be removed finally as the same gets into the expanded column.

The 31K L-entries of Cologne data have now become 70K+ with the expansion of primary words and comp. word(group)s. [The groups are the comma separated lists as are visible in the above screenshot, which could be separated out further, optionally.]

On the whole, I am doing a little extra work now, as compared to the work done previously on AP57.

Andhrabharati commented 3 years ago

@funderburkjim I recall asking you a similar query in MW work earlier and remember you saying "there is no distinction between the two (at present)".

The point is about an abbr. and an <ls> sharing the same string.

Pl. see the below under घृताची f.

  1. N. of an apsaras; N. 2. 109

The first N. is the abbr. for Name and the 2nd N. is an <ls> for Naiṣadha.

There are many such places where the distinction is clearly called for. [This applies to all the works, not just MW and AP90!!]

Hope you would understand the need and do something about this.

gasyoun commented 3 years ago

till the HWs are fully expanded

I believe you will become part of the Reverse Dictionary of Sanskrit project.

The 31K L-entries of Cologne data have now become 70K+ with the expansion of primary words and comp. word(group)s.

For 7 years I was waiting for @Andhrabharati to come. You made me wait for seven long years.

Andhrabharati commented 3 years ago

What is this "Reverse Dictionary of Sanskrit project"?

gasyoun commented 3 years ago

Reverse Dictionary of Sanskrit project

All headwords from all Cologne + other dictionaries in reverse order.

1095

Andhrabharati commented 3 years ago

OK; this is something similar to the traditional Skt. (verse) Dictionaries, which arrange the entries in the order of ending letters; and some go further in arranging them in starting letter order and into letter counts as well.
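
As a minimal sketch of what ordering by ending letters means in practice, assuming headwords in SLP1 transliteration (one character per sound, as in the k1 fields of ap90.txt): the function name `reverse_order` is hypothetical, and plain character-code comparison is used here instead of Sanskrit alphabetical order, which a real reverse dictionary would need.

```python
def reverse_order(headwords):
    """Sort headwords by their final letter, then the one before it, and so on."""
    # Reversing each word makes an ordinary sort compare endings first.
    return sorted(headwords, key=lambda word: word[::-1])

print(reverse_order(["rAma", "deva", "agni", "kAma"]))
```

Words sharing an ending such as -Ama come out next to each other, which is the point of the arrangement.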

I have no idea to do any such work, or be part of such a project, if it exists (or on-going) somewhere.

gasyoun commented 3 years ago

I have no idea to do any such work

I do.

if it exists (or on-going) somewhere

There are two attempts known.

similar to the traditional Skt. (verse) Dictionaries, which arrange the entries in the order of ending letters

Exactly.

funderburkjim commented 3 years ago

N. of an apsaras; N. 2. 109

There are two sets of 'abbreviations'. For ap90, these are in files

In your example, the 'N. 2. 109' is incorrectly marked in ap90.txt:

OLD
<>Sarasvatī. {@--3@} <ab>N.</ab> of an {%apsaras%}; <ab>N.</ab>
<>2. 109 (the following are the <lbinfo n="prin+cipal"/>
NEW
<>Sarasvatī. {@--3@} <ab>N.</ab> of an {%apsaras%}; <lbinfo n="ls:N.+2."/>
<><ls>N. 2. 109</ls> (the following are the <lbinfo n="prin+cipal"/>

Note this is an 'isolated' case -- there are many correct markings of N. as literary source, such as in apahnu, where 'N. 1. 49.' is marked in ap90.txt as <ls>N. 1. 49.</ls> and displays show the tooltip Naiṣadhacarita.

I'll correct the N. 2. 109 as above. I've corrected many cases where a literary source occurs at a line break, but obviously missed this one, as they are hard to identify; for instance, there are 300+ similar 'N.' at end of line, and the vast majority are of the <ab> variety. But I'll look for others.
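
One rough way to shortlist such line-break cases is sketched below. It is an illustrative heuristic only, not the procedure used for ap90.txt: it assumes that a mis-marked source shows up as `<ab>N.</ab>` at the end of a line followed by a continuation line that begins with `<>` and a number, as in the OLD example above. The name `find_suspect_breaks` is hypothetical.

```python
import re

AB_N_AT_EOL = re.compile(r"<ab>N\.</ab>\s*$")   # 'N.' tagged as abbreviation at end of line
STARTS_WITH_NUMBER = re.compile(r"^<>\d+\.")     # continuation line opening with e.g. '2.'

def find_suspect_breaks(lines):
    """Return indexes of lines whose trailing <ab>N.</ab> may really be a literary source."""
    return [
        i
        for i in range(len(lines) - 1)
        if AB_N_AT_EOL.search(lines[i]) and STARTS_WITH_NUMBER.match(lines[i + 1])
    ]

sample = [
    "<>Sarasvatī. {@--3@} <ab>N.</ab> of an {%apsaras%}; <ab>N.</ab>",
    '<>2. 109 (the following are the <lbinfo n="prin+cipal"/>',
]
print(find_suspect_breaks(sample))
```

Any hit would still need checking by eye against the printed page.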

Please do comment on other such mis-markings as you notice them.

funderburkjim commented 3 years ago

Found one other case like above. See commit 9256711.

There was also a questionable case, that was not changed:

; <L>6071<pc>0232-c<k1>AtatAyin<k2>AtatAyin  ??
;53298 old <>{#kzetradAraharaScEtAn zaq vidyAdAtatAyinaH ..#} Śukra <ab>N.</ab>
;53298 new <>{#kzetradAraharaScEtAn zaq vidyAdAtatAyinaH ..#} Śukra <ab>N.</ab>
;
;53299 old <>{#ºtA, --tvaM#} murdering, stealing, <lbinfo n="destroy+ing"/>
;53299 new <>{#ºtA, --tvaM#} murdering, stealing, <lbinfo n="destroy+ing"/>

This is odd, in that AP57 does not have the 'N.' at all.

gasyoun commented 3 years ago

This is odd, in that AP57 does not have the 'N.' at all.

In this particular case?

Andhrabharati commented 3 years ago

He is talking about this particular case only.

I guess it should be there as well; it is denoting Śukra Nītisāra. [It is marked by me as <Śukra N.> in my working.]

Andhrabharati commented 3 years ago

Here are the final versions of abbr. and ls (works and persons) files of AP90. [Now the abbr.s list is covering the strings like Dr. and Mr. also]

In addition to a space and comma instead of a dot, there are ; and ) also at the end of strings in the typed data. [This also is taken care of now.]

AP90_abbr (AB).txt AP90_ls (AB).txt

[Still the App. I is not done yet.]

The reason for the higher counts in Jim's version seems to be his "padding" the strings to the seq. of numbers in the following text.

Andhrabharati commented 3 years ago

And here is the AP90 data as of now. AP90_ v.1 (for Cologne).txt I feel this is in a good readable form, by machine and/or human eyes.

Hope this would give some ideas to @funderburkjim, even if he does not like to use this as is (with minor modifications to bring it into Cologne style.)

Notes for AB splits and markings.

  1. To match the book content, ṣ is changed to sh and ṛ to ṛi, which is the general usage in the literature.
  2. The abbr. are marked as {...}.
  3. The ls names and works are marked as <...>.
  4. It may be noted that under the verbal and dhātu entries, the prefixed comp. words are separated (marked −With xxx), as are the suffixed comp. words separated in the substantive entries.
  5. The citations are demarcated with a semicolon (almost) consistently when there is a work or section change; and with a comma or a dash when the citation is within the section.
  6. The meaning numbers are consistently marked −nn. [the first one having no preceding −]; though the book is not having the dot, felt it would look better with a dot.
  7. The Dhātu class numbers are not followed by a dot.
  8. The different flavor of meanings for substantives or a different class (or A., P., U.) for verbals are marked with Roman numbers followed by a dot.
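
Notes 2 and 3 suggest one of the "minor modifications" toward Cologne style could be mechanical. The following is only a sketch under assumptions: it treats every {...} as an abbreviation and every <...> as a literary source, except for the markers <p>, <+>, <c>, <x> and <b> mentioned in this thread. The function name `ab_to_cologne` and the exclusion set are hypothetical, and the actual file may contain cases this does not cover.

```python
import re

# Markers assumed to be structural rather than literary sources.
NOT_A_SOURCE = {"p", "+", "c", "x", "b", "/b"}

def ab_to_cologne(text: str) -> str:
    """Convert {abbr.} to <ab>abbr.</ab> and <source> to <ls>source</ls>."""
    def wrap_ls(match):
        inner = match.group(1)
        return match.group(0) if inner in NOT_A_SOURCE else f"<ls>{inner}</ls>"

    # Sources first, so the <ab> tags created in the second step are not rewrapped.
    text = re.sub(r"<([^<>]+)>", wrap_ls, text)
    return re.sub(r"\{([^{}]+)\}", r"<ab>\1</ab>", text)

print(ab_to_cologne("<Śukra N.> {N.} <p>"))
```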

MW style adaptation by Apte.

  1. It may be seen that in a few places, when the same entry is present as both a subst. and a verb, the two are marked with Roman numbers in the book, but not in the Cologne data. This is inserted at some places while "reading" the matter. [MW had followed similar style almost everywhere; but Apte did not do so fully.]
  2. The entry order is not strictly alphabetic, but dhātu based as in MW.

New style by Apte.

  1. The Parent group of entries [a. m. f. n. ind.] are given first, followed by adverbial entries (either marked as such or not) and then the comp. words (in reduced forms) are given. [In the AB work, the primary group of words are marked as <p> and the comp. words by <+>; and one could contextually guess what the other two markings (<c> and <x>) stand for.]
  2. For some comp. entries, the expanded form is also given (when it was felt some spl. attention is needed to form them).
  3. As done in Vacaspatyam, new citations are culled out from the classic and later works, as the students come across them more frequently than the Vedic literature. (Incidentally, majority of these are taken from Vacaspatyam!)

The next action required is to join the word breaks due to non-hyphenated line-breaks, and to proofread at least the HW portions.

Now, I take a small break from this Apte's work. [Though I started the work with an intention to expand the comp. words, just that part of work is not done as the need for proofing the HWs is seen.]

gasyoun commented 3 years ago

though the book is not having the dot, felt it would look better with a dot.

Never say that to Jim ))

[MW had followed similar style almost everywhere; but Apte did not do so fully.]

Valuable comment, was not aware of Apte's approach.

started the work with an intention to expand the comp. words, just that part of work is not done as the need for proofing the HWs is seen

Understood. I'm thrilled to see what's done. A labor of love.

Andhrabharati commented 3 years ago

6. The meaning numbers are consistently marked −nn. [the first one having no preceding −]; though the book is not having the dot, felt it would look better with a dot.

Pl. see the following text- [image]

There are several places in the book where the meanings are referred to with a dot; this has prompted me to use the dot throughout for the meanings.

@funderburkjim, don't you think we should have some mechanism to highlight these cross-referred meaning numbers, say rendering them bold as done at the actual meaning places?

And I am marking such statements with //, indicating that they are to be displayed on the next line, as-

−6. {N.} of one of the Ādityas.//The senses of ‘party’, ‘a share of booty’, ‘earnest money’, which are said to occur in the Veda are traceable to <b>1.</b> above.

Now the display would be more meaningfully read, as

−6. {N.} of one of the Ādityas.

The senses of ‘party’, ‘a share of booty’, ‘earnest money’, which are said to occur in the Veda are traceable to 1. above.
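
A display routine honouring the // marker could be as small as the sketch below. The name `display_lines` is hypothetical, and how the <b>...</b> cross-reference is rendered is left to whatever display layer is in use.

```python
def display_lines(sense_text: str) -> list[str]:
    """Split a sense on the '//' marker so each statement is shown on its own line."""
    return [part.strip() for part in sense_text.split("//")]

marked = "−6. {N.} of one of the Ādityas.//The senses are traceable to <b>1.</b> above."
for line in display_lines(marked):
    print(line)
```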

What do you feel about this?