Open funderburkjim opened 6 years ago
Thomas apparently identified a certain pattern which indicates what we would call an alternate headword. And, he generated separate entries for these.
Here is the first example, in the original representation:
.{#aMSakalpanA#}1{#aMSakalpanA#} ·f. allotment of shares.
.{#aMSaprakalpanA#}1{#aMSa°prakalpanA#} ·f. allotment of shares.
With this coding, it requires some fragile programming to possibly detect that the original text showed these as alternate spellings. [I mean that we could check the text of consecutive digitization entries, and if they are identical, or almost identical, then the pair likely would appear in the text as alternates connected by an ampersand &.] But the coding of the next instance shows that entries are not exactly identical, although they are close:
.{#aMSaBAgin#}1{#aMSaBAgin#} ·a. having a share, partaking of (gen. or
--°).
.{#aMSaBAj#}1{#aMSa°BAj#} ·a. having a share, partaking of (gen. or --°).
I'll make no attempt here to do this reverse engineering. After all, the main utility (to have the entry searchable by both alternate spellings) is present in the coding Thomas has given us.
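For anyone who did want to try the reverse engineering, here is a rough sketch of the "identical, or almost identical" check in Python. The function names, the 0.9 threshold, and the rule that the gloss starts after the last `#}` are all assumptions made for illustration; nothing like this exists in the conversion code.

```python
import difflib
import re

def definition_text(entry):
    """Return the gloss of a raw entry with whitespace collapsed.

    Assumption: the gloss is whatever follows the last '#}' of the
    headword markup, as in the two examples above.
    """
    gloss = entry.rsplit("#}", 1)[-1]
    return re.sub(r"\s+", " ", gloss).strip()

def looks_like_alternates(entry_a, entry_b, threshold=0.9):
    """True when the two glosses are identical or nearly so.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    matcher = difflib.SequenceMatcher(
        None, definition_text(entry_a), definition_text(entry_b)
    )
    return matcher.ratio() >= threshold

first = ".{#aMSaBAgin#}1{#aMSaBAgin#} ·a. having a share, partaking of (gen. or\n--°)."
second = ".{#aMSaBAj#}1{#aMSa°BAj#} ·a. having a share, partaking of (gen. or --°)."
print(looks_like_alternates(first, second))
```

Collapsing whitespace first handles the line break in the aMSaBAgin entry. A check like this would still need manual review, since unrelated neighbouring entries can share a short gloss.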
From prior work on extended ASCII characters in cae.txt (in cae-meta.txt), there is a rare Unicode character that identifies the first of two alternate headwords. This is \u00ad , a SOFT HYPHEN. It doesn't show up in the default font used in the browser, so you can't see it in the two examples above.
There are 2674 lines that have this special character, so that would be the best way to identify alternate spelling pairs, if someone ever needed to do so.
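A minimal sketch of how such lines could be located, in Python. The helper name is hypothetical, and the position of the soft hyphen in the sample line is an assumption (the character is invisible in the examples above, so its exact place is not known from this thread).

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

def lines_with_soft_hyphen(lines):
    """Return (line_number, line) pairs for lines containing U+00AD."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if SOFT_HYPHEN in line
    ]

# Sample lines; where the soft hyphen sits in the first one is a guess.
sample = [
    ".{#aMSakalpanA#}1{#aMSakalpanA#}\u00ad ·f. allotment of shares.",
    ".{#aMSaprakalpanA#}1{#aMSa°prakalpanA#} ·f. allotment of shares.",
]
hits = lines_with_soft_hyphen(sample)
print(len(hits))
```

Run over the lines of cae.txt (read as UTF-8), the length of the result would be the count of marked lines.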
Because the functionality of this is 'meta' information about an entry, I'm removing that character, and noting its former presence by the <e>firstalt addendum to the meta line.
For example,
<L>4<pc>001<k1>aMSakalpanA<k2>aMSakalpanA<e>firstalt
{#aMSakalpanA#}¦ ·f. allotment of shares.
<LEND>
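The step can be sketched roughly as follows in Python. This is an illustration only, not the actual conversion script: the function name is hypothetical, and it assumes the meta line has already been built and that the soft hyphen sits somewhere in the body lines of the entry.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

def move_marker_to_meta(meta_line, body_lines):
    """Strip U+00AD from the body and record it as <e>firstalt.

    Entries without the character are returned unchanged.
    """
    if not any(SOFT_HYPHEN in line for line in body_lines):
        return meta_line, body_lines
    cleaned = [line.replace(SOFT_HYPHEN, "") for line in body_lines]
    if not meta_line.endswith("<e>firstalt"):
        meta_line += "<e>firstalt"
    return meta_line, cleaned

# The placement of the soft hyphen in this body line is assumed.
meta, body = move_marker_to_meta(
    "<L>4<pc>001<k1>aMSakalpanA<k2>aMSakalpanA",
    ["{#aMSakalpanA#}\u00ad¦ ·f. allotment of shares."],
)
print(meta)
```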
After all, the main utility (to have the entry searchable by both alternate spellings) is present in the coding Thomas has given us.
Exactly.
Because the functionality of this is 'meta' information about an entry, I'm removing that character, and noting its former presence by the firstalt addendum to the meta line.
And add it to a readme, right?
add it to a readme,
Will mention it in cae-meta2.txt
in cae-meta2.txt
More than enough if one knows where to look.
The main differences visible in the displays:
Abbreviations are marked as <ab>X</ab> in cae.txt, and there is a table of abbreviation expansions which the displays use to generate tooltips. There are a few other abbreviations, not appearing in the table, and not currently marked. There will also be some abbreviations that are unmarked despite being in the table. Such instances can be amended when noticed.
Roots are identified by <vlex type="root"/> markup. This does not currently appear in the displays, but can be used for filtering; about 1000 entries are marked as roots in this way. In 550 or so of these root entries, there are sub-entries pertaining to the root with various prefixes. Using certain contextual regular expressions made it possible to add markup <div n="p"> just before these prefixes. Thus far, this has been used to provide some semantic improvement in the displays (line breaking on these prefix subentries). They could be used later to add search terms (e.g. the prefix upa under gam would yield additional subword search term upagam).
IAST was used sparingly in the text, but the AS coding of such has been changed to modern IAST.
My impression after working a bit with Cappeller's Sanskrit English dictionary is that it may have merit as a dictionary for early Sanskrit students. The entries are generally short, and not burdened with literary source references and arcane inflected forms.
As mentioned above, this digitization has often duplicated entries for headwords showing alternate spellings. The <e>firstalt markup (of current meta-line format) can be used to identify these. There are about 2700 cases.
The first aspect of such a review would be to re-introduce into each alternate entry all the alternate spellings. Consider the first example:
<L>4<pc>001<k1>aMSakalpanA<k2>aMSakalpanA<e>firstalt
{#aMSakalpanA#}¦ <lex>f.</lex> allotment of shares.
<LEND>
<L>5<pc>001<k1>aMSaprakalpanA<k2>aMSa°prakalpanA
{#aMSa°prakalpanA#}¦ <lex>f.</lex> allotment of shares.
<LEND>
The first entry does not mention aMSaprakalpanA and the second doesn't mention aMSakalpanA. So there is information loss which should be remedied.
It might be better and simpler to coalesce such pairs of alternate entries into their original one-entry form, and to express the alternate spellings via the 'alternate headword' technique.
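As a starting point for either approach, the pairs have to be found. Here is a hedged Python sketch that reads meta-line records and pairs each <e>firstalt entry with the entry right after it. The function names are hypothetical, and the sketch assumes that the alternate always immediately follows the marked entry, as in the example above; whether that holds for all of the roughly 2700 cases would need checking.

```python
import re

# Meta line layout as shown in the examples: <L>..<pc>..<k1>..<k2>..[<e>..]
META = re.compile(
    r"<L>(?P<L>[^<]+)<pc>(?P<pc>[^<]+)<k1>(?P<k1>[^<]+)<k2>(?P<k2>[^<]+)(?P<rest>.*)"
)

def parse_entries(lines):
    """Collect entries delimited by a <L>... meta line and <LEND>."""
    entries, current = [], None
    for line in lines:
        if line.startswith("<LEND>"):
            if current is not None:
                entries.append(current)
            current = None
        elif line.startswith("<L>"):
            match = META.match(line)
            if match is None:
                raise ValueError("unrecognized meta line: " + line)
            current = {
                "L": match.group("L"),
                "k1": match.group("k1"),
                "firstalt": "<e>firstalt" in match.group("rest"),
                "body": [],
            }
        elif current is not None:
            current["body"].append(line)
    return entries

def alternate_pairs(entries):
    """Pair each firstalt entry with the entry immediately after it."""
    return [
        (first["k1"], second["k1"])
        for first, second in zip(entries, entries[1:])
        if first["firstalt"]
    ]

sample = [
    "<L>4<pc>001<k1>aMSakalpanA<k2>aMSakalpanA<e>firstalt",
    "{#aMSakalpanA#}¦ <lex>f.</lex> allotment of shares.",
    "<LEND>",
    "<L>5<pc>001<k1>aMSaprakalpanA<k2>aMSa°prakalpanA",
    "{#aMSa°prakalpanA#}¦ <lex>f.</lex> allotment of shares.",
    "<LEND>",
]
pairs = alternate_pairs(parse_entries(sample))
print(pairs)
```

With the pairs in hand, one could either add a cross-mention to each body or merge the two records into one.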
Using certain contextual regular expressions made it possible to add markup just before these prefixes.
perfect
They could be used later to add search terms (e.g. the prefix upa under gam would yield additional subword search term upagam).
Too hard to do right now?
My impression after working a bit with Cappeller's Sanskrit English dictionary is that it may have merit as a dictionary for early Sanskrit students.
That's why it has become the basis for the Sanskrit-Russian dictionary of 1978 as well.
It might be better and simpler to coalesce such pairs of alternate entries into their original one-entry form, and to express the alternate spellings via the 'alternate headword' technique.
So we remain aware that there is more? But it's not convenient in the search.
Here is cae-meta2
So we remain aware that there is more?
So that a user is aware that the author viewed two different spellings as essentially the same, as synonyms.
Sanskrit-Russian dictionary of 1978
Interesting. Is that the Kochergina dictionary you've mentioned previously? Is there a link to the 1978 dictionary?
Too hard to do right now?
The main technical obstacle that comes to mind is the derivation of the spelling of the prefixed verb from (a) the prefix and (b) the root. Work on this was done by Peter and me back in 2008; a sample of the analysis is (from MW prefixed verbs):
3260 atizkand ati-zkand skand ati ati+skand
The similar analysis results for all 6100+ prefixed verb entries from MW are in this file: verb-prep4-gati2-complete.txt
(note to ejf: local path of this file is C:\ejf\LexFund\verb\preverb-2008)
I don't know whether this analysis is uploaded anywhere on Github, I don't think it is.
Anyway, that file would let us derive correct spellings of prefixed verbs from prefix and verb, even for tricky cases like atizkand. Since the file contains 6100+ prefixed verbs from MW, it should be able to handle most prefixed verbs from CAE, as well as PW, PWG. There will be differences in root spelling to contend with, of course.
So, yes, I think it is eminently doable from this technical viewpoint.
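To make the idea concrete, here is a small Python sketch of a lookup built from lines like the sample above. The column meanings (number, spelling, hyphenated form, root, prefix, prefix+root) are inferred from that single sample line and are an assumption, as are the function names; the real file may contain other layouts, for example for multiple prefixes.

```python
def load_prefixed_verbs(lines):
    """Map (prefix, root) to the attested prefixed-verb spelling.

    Assumed columns, guessed from the one sample line:
    number, spelling, hyphenated form, root, prefix, prefix+root.
    Lines with any other number of fields are skipped.
    """
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue
        _number, spelling, _hyphenated, root, prefix, _analysis = fields
        table[(prefix, root)] = spelling
    return table

def subword_term(table, prefix, root):
    """Return the attested spelling, or None when the pair is unknown."""
    return table.get((prefix, root))

table = load_prefixed_verbs(["3260 atizkand ati-zkand skand ati ati+skand"])
print(subword_term(table, "ati", "skand"))
```

Returning None for an unknown pair, instead of naively concatenating prefix and root, avoids generating wrong search terms for cases where sandhi changes the spelling.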
But it's not convenient in the search.
Not sure what this means. Please elaborate.
I don't know whether this analysis is uploaded anywhere on Github, I don't think it is.
It's not, but you emailed it to me; here is how I used it: https://yadi.sk/i/BXpyMcLybqvjo
eminently doable from this technical viewpoint
What sacrifice is required from me? You name it. It's the single most wanted addition since 2014.
I think that when the meta-line conversion of PWG is in place, we'll have the new form for most of the important Sanskrit-X dictionaries (PD may be an exception, but it is not a complete dictionary; CCS, BOP, GRA are also general purpose dictionaries). After that I will mentally put the meta-line conversion on the back burner of the stove, from its current front burner position.
Then, we can take as an early project this 'subword' task, and the root correlation problem, and many other interesting improvements.
Then, we can take as an early project this 'subword' task, and the root correlation problem, and many other interesting improvements.
That sounds like a plan. Off the stove!
https://github.com/sanskrit-lexicon/COLOGNE/issues/191#issuecomment-340896553 seems to work. But current hack splits a single entry into two. Necessary @funderburkjim ? Can it be handled via the meta line only?
This issue is devoted to details regarding the conversion of the Cologne digitization (cae.txt) of the Cappeller Sanskrit-English Dictionary.