CAE meta-line/IAST conversion

funderburkjim commented 6 years ago

This issue is devoted to details regarding the conversion of the Cologne digitization (cae.txt) of Cappeller's Sanskrit-English Dictionary.

funderburkjim commented 6 years ago

alternate headwords pre-coded

Thomas apparently identified a certain pattern which indicates what we would call an alternate headword, and he generated separate entries for these.

Here is the first example, in the original representation:

.{#aMSakalpanA#}1{#aMSakalpanA#} ·f.  allotment of shares.
.{#aMSaprakalpanA#}1{#aMSa°prakalpanA#} ·f. allotment of shares.

With this coding, it would take some fragile programming to detect that the original text showed these as alternate spellings. By that I mean we could compare the text of consecutive digitization entries; if they are identical, or almost identical, then the pair likely appears in the printed text as alternates connected by an ampersand (&). But the coding of the next instance shows that entries are not always exactly identical, although they are close:

.{#aMSaBAgin#}1{#aMSaBAgin#} ·a.  having a share, partaking of (gen. or
--°).
.{#aMSaBAj#}1{#aMSa°BAj#} ·a. having a share, partaking of (gen. or --°).

I'll make no attempt here to do this reverse engineering. After all, the main utility (to have the entry searchable by both alternate spellings) is present in the coding Thomas has given us.
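For reference only, the comparison of consecutive entries described above could be sketched roughly as follows. This is not something that was run on cae.txt; the helper names (`definition_text`, `looks_like_alternates`) and the 0.9 threshold are assumptions made for illustration, and the headword is stripped simply by cutting at the last `#}`.

```python
import re
from difflib import SequenceMatcher


def definition_text(entry):
    """Return the text after the last '#}' with whitespace collapsed.

    Assumption: everything up to the last '#}' is headword coding.
    """
    body = entry.rsplit("#}", 1)[-1]
    return re.sub(r"\s+", " ", body).strip()


def looks_like_alternates(a, b, threshold=0.9):
    """Guess whether two consecutive entries share (nearly) the same definition.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    ratio = SequenceMatcher(None, definition_text(a), definition_text(b)).ratio()
    return ratio >= threshold


first = ".{#aMSaBAgin#}1{#aMSaBAgin#} ·a.  having a share, partaking of (gen. or\n--°)."
second = ".{#aMSaBAj#}1{#aMSa°BAj#} ·a. having a share, partaking of (gen. or --°)."
print(looks_like_alternates(first, second))
```

For this pair the two definitions differ only in spacing and a line break, so they compare as equal once whitespace is collapsed. Other pairs may differ in wording, which is why such a check would remain fragile.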

funderburkjim commented 6 years ago

Alternate headword and 'soft hyphen'

From prior work on extended Ascii characters in cae.txt (in cae-meta.txt), there is a rare unicode character that identifies the first of two alternate headwords. This is \u00ad , a SOFT HYPHEN. It doesn't show up in the default font used in the browser, so you can't see it in the two examples above.

There are 2674 lines that have this special character, so that would be the best way to identify alternate spelling pairs, if someone ever needed to do so.
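A minimal sketch of how such lines could be located, assuming the file is read as Unicode text. The function name and the position of the soft hyphen inside the sample line are assumptions; the real position in cae.txt is not shown in this thread.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def lines_with_soft_hyphen(lines):
    """Return (line_number, line) pairs for lines containing U+00AD."""
    return [(n, line) for n, line in enumerate(lines, start=1)
            if SOFT_HYPHEN in line]


# Sample data: the soft hyphen placement here is hypothetical.
sample = [
    ".{#aMSakalpanA#}1{#aMSakalpanA#}\u00ad ·f.  allotment of shares.",
    ".{#aMSaprakalpanA#}1{#aMSa°prakalpanA#} ·f. allotment of shares.",
]
hits = lines_with_soft_hyphen(sample)
print(len(hits), [n for n, _ in hits])
```

Run over the whole file, the length of the returned list would be the count of marked lines.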

Because the functionality of this is 'meta' information about an entry, I'm removing that character, and noting its former presence by the <e>firstalt addendum to the meta line.

For example,

<L>4<pc>001<k1>aMSakalpanA<k2>aMSakalpanA<e>firstalt
{#aMSakalpanA#}¦ ·f. allotment of shares.
<LEND>
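
The step just described (remove the character, record it on the meta line) might look like the sketch below. The function `mark_firstalt` is a hypothetical name, not the conversion program actually used, and the sample body places the soft hyphen at an assumed position.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN


def mark_firstalt(meta, body):
    """Strip soft hyphens from body; if any were present, tag the meta line.

    Returns the (meta, body) pair, unchanged when no soft hyphen is found.
    """
    if SOFT_HYPHEN not in body:
        return meta, body
    return meta + "<e>firstalt", body.replace(SOFT_HYPHEN, "")


meta, body = mark_firstalt(
    "<L>4<pc>001<k1>aMSakalpanA<k2>aMSakalpanA",
    "{#aMSakalpanA#}¦\u00ad ·f. allotment of shares.",  # hypothetical placement
)
print(meta)
print(body)
```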

gasyoun commented 6 years ago

After all, the main utility (to have the entry searchable by both alternate spellings) is present in the coding Thomas has given us.

Exactly.

Because the functionality of this is 'meta' information about an entry, I'm removing that character, and noting its former presence by the <e>firstalt addendum to the meta line.

And add it to a readme, right?

funderburkjim commented 6 years ago

add it to a readme,

Will mention it in cae-meta2.txt

gasyoun commented 6 years ago

in cae-meta2.txt

More than enough if one knows where to look.

funderburkjim commented 6 years ago

Abbreviations in Cappeller

funderburkjim commented 6 years ago

Conversion finished.

The main differences visible in the displays:

abbreviation tooltips. Cappeller's dictionary makes heavy use of abbreviations. The ones in the table above appear with markup <ab>X</ab> in cae.txt. And there is a table of abbreviation expansions which the displays use to generate tooltips. There are a few other abbreviations, not appearing in the table, and not currently marked. There will also be some abbreviations that are unmarked despite being in the table. Such instances can be amended when noticed.
divisions on verb prefixes. Most verbs are identified in cae.txt with a <vlex type="root"/> markup. This does not currently appear in the displays, but can be used for filtering; about 1000 such entries are marked as roots in this way. In 550 or so of these root entries, there are sub-entries pertaining to the root with various prefixes. Using certain contextual regular expressions made it possible to add markup <div n="p"> just before these prefixes. Thus far, this has been used to provide some semantic improvement in the displays (line breaking on these prefix subentries). They could be used later to add search terms (e.g. the prefix upa under gam would yield additional subword search term upagam).

IAST was used sparingly in the text, but the AS coding of such has been changed to modern IAST.

funderburkjim commented 6 years ago

A candidate for beginner dictionary

My impression after working a bit with Cappeller's Sanskrit English dictionary is that it may have merit as a dictionary for early Sanskrit students. The entries are generally short, and not burdened with literary source references and arcane inflected forms.

Review the alternate entries

As mentioned above, this digitization has often duplicated entries for headwords showing alternate spellings. The <e>firstalt markup (of current meta-line format) can be used to identify these. There are about 2700 cases.

The first aspect of such a review would be to re-introduce into each alternate entry all the alternate spellings. Consider the first example:

<L>4<pc>001<k1>aMSakalpanA<k2>aMSakalpanA<e>firstalt
{#aMSakalpanA#}¦ <lex>f.</lex>  allotment of shares.
<LEND>

<L>5<pc>001<k1>aMSaprakalpanA<k2>aMSa°prakalpanA
{#aMSa°prakalpanA#}¦ <lex>f.</lex> allotment of shares.
<LEND>

The first entry does not mention aMSaprakalpanA and the second doesn't mention aMSakalpanA. So there is information loss which should be remedied.

It might be better and simpler to coalesce such pairs of alternate entries into their original one-entry form, and to express the alternate spellings via the 'alternate headword' technique.
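
As a starting point for such a review, the <e>firstalt marker could be used to list each alternate pair. The sketch below assumes that the second spelling is always the entry immediately following the marked one, which matches the examples in this thread but has not been verified for all cases. The regular expression and function names are illustrative only.

```python
import re

# Meta line layout as seen in the examples: <L>..<pc>..<k1>..<k2>..[<e>..]
META_RE = re.compile(
    r"<L>(?P<L>[^<]+)<pc>(?P<pc>[^<]+)<k1>(?P<k1>[^<]+)<k2>(?P<k2>[^<]+)(?P<rest>.*)$"
)


def parse_entries(text):
    """Split meta-line formatted text into a list of entry dicts."""
    entries, current = [], None
    for line in text.splitlines():
        if line.startswith("<L>"):
            m = META_RE.match(line)
            if m is None:
                raise ValueError("unexpected meta line: " + line)
            current = {**m.groupdict(), "body": []}
        elif line.startswith("<LEND>"):
            if current is not None:
                entries.append(current)
            current = None
        elif current is not None:
            current["body"].append(line)
    return entries


def alternate_pairs(entries):
    """Pair each <e>firstalt entry with the entry that follows it (assumption)."""
    return [(a["k1"], b["k1"])
            for a, b in zip(entries, entries[1:])
            if "<e>firstalt" in a["rest"]]


sample = """<L>4<pc>001<k1>aMSakalpanA<k2>aMSakalpanA<e>firstalt
{#aMSakalpanA#}¦ <lex>f.</lex>  allotment of shares.
<LEND>

<L>5<pc>001<k1>aMSaprakalpanA<k2>aMSa°prakalpanA
{#aMSa°prakalpanA#}¦ <lex>f.</lex> allotment of shares.
<LEND>
"""
print(alternate_pairs(parse_entries(sample)))
```

The resulting pairs could then drive either approach: adding the missing spelling to each entry, or merging the two entries into one.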

gasyoun commented 6 years ago

Using certain contextual regular expressions made it possible to add markup <div n="p"> just before these prefixes.

perfect

They could be used later to add search terms (e.g. the prefix upa under gam would yield additional subword search term upagam).

Too hard to do right now?

My impression after working a bit with Cappeller's Sanskrit English dictionary is that it may have merit as a dictionary for early Sanskrit students.

That's why it has become the basis for the Sanskrit-Russian dictionary of 1978 as well.

It might be better and simpler to coalesce such pairs of alternate entries into their original one-entry form, and to express the alternate spellings via the 'alternate headword' technique.

So we remain aware that there is more? But it's not convenient in the search.

funderburkjim commented 6 years ago

Here is cae-meta2

funderburkjim commented 6 years ago

So we remain aware that there is more?

So that a user is aware that the author viewed two different spellings as essentially the same, as synonyms.

Sanskrit-Russian dictionary of 1978

Interesting. Is that the Kochergina dictionary you've mentioned previously? Is there a link to the 1978 dictionary?

funderburkjim commented 6 years ago

Too hard to do right now?

The main technical obstacle that comes to mind is the derivation of the spelling of the prefixed verb from (a) the prefix and (b) the root. Work on this was done by Peter and me back in 2008; a sample of the analysis is (from MW prefixed verbs):

3260    atizkand    ati-zkand   skand   ati ati+skand

The similar analysis results for all 6100+ prefixed verb entries from MW are in this file: verb-prep4-gati2-complete.txt

(note to ejf: local path of this file is C:\ejf\LexFund\verb\preverb-2008)

I don't know whether this analysis is uploaded anywhere on Github, I don't think it is.

Anyway, that file would let us derive correct spellings of prefixed verbs from prefix and verb, even for tricky cases like atizkand. Since the file contains 6100+ prefixed verbs from MW, it should be able to handle most prefixed verbs from CAE, as well as PW, PWG. There will be differences in root spelling to contend with, of course.
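
A rough sketch of how such a file could serve as a lookup table, based only on the one sample line shown above. The column meanings (record number, spelling, hyphenated form, root, prefix, prefix+root analysis) are guesses from that sample, and records with a different number of fields are not handled here.

```python
def parse_preverb_line(line):
    """Parse one whitespace-separated record; column meanings are assumed."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    lnum, spelling, hyphenated, root, prefix, analysis = fields
    return {"L": lnum, "spelling": spelling, "hyphenated": hyphenated,
            "root": root, "prefix": prefix, "analysis": analysis}


def build_lookup(lines):
    """Map 'prefix+root' analysis strings to the attested spelling."""
    table = {}
    for line in lines:
        rec = parse_preverb_line(line)
        table[rec["analysis"]] = rec["spelling"]
    return table


lookup = build_lookup(["3260    atizkand    ati-zkand   skand   ati ati+skand"])
print(lookup["ati+skand"])
```

With such a table, a prefix found under a root entry could be combined with that root and looked up directly, instead of re-deriving the sandhi.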

So, yes, I think it is eminently doable from this technical viewpoint.

funderburkjim commented 6 years ago

But it's not convenient in the search.

Not sure what this means. Please elaborate.

gasyoun commented 6 years ago

I don't know whether this analysis is uploaded anywhere on Github, I don't think it is.

It's not, but you emailed it to me. Here is how I used it: https://yadi.sk/i/BXpyMcLybqvjo

eminently doable from this technical viewpoint

What sacrifice is required from me? You name it. It's the single most wanted addition since 2014.

funderburkjim commented 6 years ago

I think that when the meta-line conversion of PWG is in place, we'll have the new form for most of the important Sanskrit-X dictionaries (PD may be an exception, but it is not a complete dictionary; CCS, BOP, GRA are also general-purpose dictionaries). After that I will mentally put the meta-line conversion on the back burner of the stove, from its current front burner position.

Then, we can take as an early project this 'subword' task, and the root correlation problem, and many other interesting improvements.

gasyoun commented 6 years ago

Then, we can take as an early project this 'subword' task, and the root correlation problem, and many other interesting improvements.

That sounds like a plan. Off the stove!

drdhaval2785 commented 3 years ago

https://github.com/sanskrit-lexicon/COLOGNE/issues/191#issuecomment-340896553 seems to work. But current hack splits a single entry into two. Necessary @funderburkjim ? Can it be handled via the meta line only?
