Closed drdhaval2785 closed 7 years ago
This is the current analysis of the 96 + 4 entries in the above-mentioned lines. By unambiguous_catalogue_tags I mean the combinations which are highly unlikely to occur in any Sanskrit text, so a global replace will serve the purpose. No need to dig deeper. By ambiguous_catalogue_tags I mean the tags which are not that easy to extract and which will require some manual disambiguation. The reason for each is also noted below.
This is the comment part of the pywork/correctionwork/issue-cologne-148/catalogue_tag_addition.py script.
# Segregated manually from catalogue_names.txt (output of list_catalogues.py)
# `Copenh` is written as `Cop` in references.
# `Pet` competes with `Peters`.
# Kh, K compete with Khn.
# Bik competes with Bikzu - Common appendage to names
# Rādh occurs in name Rādhā...
# There are three Oudh in the list. Need to make some sense out of it to tag them properly. - There is nothing to be gained by separating them, after a long thought.
# BA competes with SLP1 'BA' as in BAskara.
# Gu competes with SLP1 as in 'laGu'
# Bh competes with Bhk and Bhr.
# Kāśīn competes with Kāśīnātha.
# There are two Lahore in the list. One is Lahore and another is Lahore 1882.
# SB clashes with ASB
# Devīpr competes with Devīprasāda.... - Common name
# Hz clashes with catuHzazwivAda
# Adyar Library is used like Adyar Libr in practice.
# AK competes with SLP1 AKyA etc.
# AS competes with SLP1 ASA etc
# BC has some abnormal patterns like ABC, ABCD, ABCE etc. Not sure of importance.
# Cr competes with SLP1 kfcCra...
# Edinburgh University Library is used with different short forms like Edinburgh Univ., Edinburgh Un. etc. Only 6-8 such items.
# Hpr competes with CandaHprakASa.
# Rep competes with Report and gaRepaSaYcANga.
# Śg is wrongly written as Cg in the preface.
# Tod competes with arToddyotanikA
"""
# Additional short forms from prefaces -
1. Skm - Sūktikarṇāmṛta by Śrīdharadāsa,
2. Sbhv - Subhāṣitāvali by Vallabhadeva,
3. Śp - Śārṅgadharapaddhati in Vol. 27 (1873) of the Zeitschrift of the German Oriental Society,
4. Rāyamukuṭa - author's Paper on his Padacandrikā, ibid. Vol. 28 (1874) p. 109
"""
unambiguous_catalogue_tags = [u'Jones',u'Mack',u'Cop',u'Peters',u'IO',u'Oxf',u'Cambr',u'Paris',u'Hall',u'Khn',u'Report',u'Ben',u'Lgr',u'Tüb',u'Haug',u'Kāṭm',u'Pheh',u'NW',u'NP',u'Brl',u'Burnell',u'Bl',u'Mysore',u'Bhk',u'Bhr',u'Poona',u'Lahore 1882',u'Bonn',u'Jac',u'Vienna',u'Taylor',u'Oppert',u'Rice',u'BP',u'Bühler',u'ASB',u'Sūcīpattra',u'Bhau Dāji',u'BL',u'Cs',u'CU. add',u'Fl',u'GB',u'Goldstücker',u'Gov. Or. Libr. Madras',u'Lund',u'Oudh',u'Rgb',u'Stein',u'Ulwar',u'Weber',u'Adyar Libr',u'Ashburner',u'Bd',u'CS',u'IL',u'Jl',u'Lz',u'Tb',u'Śg',u'Whish',u'Skm',u'Sbhv',u'Śp',u'Rāyamukuṭa',]
#ambiguous_catalogue_tags = [u'Pet',u'W',u'L',u'K',u'Kh',u'B',u'Bik',u'Rādh',u'BA',u'Gu',u'Bh',u'P',u'Kāśīn',u'Lahore',u'H',u'SB',u'D',u'Devīpr',u'Hz',u'AK',u'AS',u'BC',u'Cr',u'Edinburgh',u'Hpr',u'Tod']
Total 64480 literary resources tagged. Even if we discount some 480 odd accidental taggings, there are 64000 taggings which are of fairly good quality.
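Analyse the ambiguous_catalogue_tags categorywise. I see three major categories.
- Single letter abbreviations e.g. L
- Substring of some other abbreviations e.g. Pet is subset of Peters
- Abbreviations that also occur inside SLP1 words e.g. Cr - global replace will replace kfcCra.. too.
The tentative paths for solutions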
\W+[letter]\W+ [0-9pIVXLC]
or something of that sort. This will mostly reduce accidental tagging. These codes have yet to be devised. Just thinking aloud.
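A minimal sketch of that idea, assuming Python 3. It tags a short abbreviation only when it is not glued to a preceding letter and is followed by an optional period, a space, and one of the characters from the pattern above. The function name `tag_ambiguous` is hypothetical and not part of catalogue_tag_addition.py.

```python
import re


def tag_ambiguous(line, abbreviation):
    """Wrap `abbreviation` in <ls> only where it looks like a catalogue reference."""
    pattern = re.compile(
        r'(?<!\w)'                # not preceded by a letter or digit (rules out kfcCra)
        + re.escape(abbreviation)
        + r'(?=\.? [0-9pIVXLC])'  # optional period, one space, then a likely locator
    )
    return pattern.sub('<ls>' + abbreviation + '</ls>', line)


print(tag_ambiguous('<ab type="subj">tantr</ab>. B. 4, 252.', 'B'))
print(tag_ambiguous('<>L. 2269.', 'L'))
print(tag_ambiguous('{#kfcCra#}', 'Cr'))
```

The lookahead still lets through false positives such as a single letter followed by an ordinary word beginning with `p`, so the output would need a manual check.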
@funderburkjim and @gasyoun I would like to see your responses.
Just to show you the current output to elicit some response
[Page1-001-a+ 36]
<H>CATALOGUS CATALOGORUM.
<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA
{#aMSadaSA#}¦ <ab type="subj">jy</ab>. <ls>Rice</ls> 28.
<LEND>
<L>2<pc>1-001,1<k1>aMSuDara<k2>aMSuDara
{#aMSuDara#}¦ <ab type="pers">poet</ab> <ls>Skm</ls>.
<LEND>
<L>3<pc>1-001,1<k1>aMSumatkASyapIya<k2>aMSumatkASyapIya
{#aMSumatkASyapIya#}¦ <ab type="subj">archit</ab>. <ls>Taylor</ls> 1, 314.
<LEND>
<L>4<pc>1-001,1<k1>aMSumadBedasaMgraha<k2>aMSumadBedasaMgraha
{#aMSumadBedasaMgraha#}¦ <ab type="subj">vedānta</ab>, ascribed to Kaśyapa. <ls>Oppert</ls> 5875.
<LEND>
<L>5<pc>1-001,1<k1>aMSumAnakalpa<k2>aMSumAnakalpa
{#aMSumAnakalpa#}¦ <ab type="subj">śilpa</ab>. <ls>Burnell</ls> 62^b.
<LEND>
<L>6<pc>1-001,1<k1>akaqamacakracitra<k2>akaqamacakracitra
{#akaqamacakracitra#}¦ <ab type="subj">tantr</ab>. B. 4, 252.
<LEND>
<L>7<pc>1-001,1<k1>akArAdiniGaRwu<k2>akArAdiniGaRwu
{#akArAdiniGaRwu#}¦ <ab type="subj">vocabulary</ab>. <ls>Oppert</ls> 4969.
<LEND>
<L>8<pc>1-001,1<k1>akAlajalada<k2>akAlajalada
{#akAlajalada#}¦ <ab type="pers">poet</ab>, great grandfather of Rājaśekhara. <ls>Śp</ls>.
<>p. 4. <ls>Peters</ls>. 2, 63.
<LEND>
<L>9<pc>1-001,1<k1>akAlaBAskara<k2>akAlaBAskara
{#akAlaBAskara#}¦ <ab type="subj">dh</ab>. composed in 1715, by Śambhunātha.
<>L. 2269.
<LEND>
<L>10<pc>1-001,1<k1>akulAgamatantra<k2>akulAgamatantra
{#akulAgamatantra#}¦ <ab type="subj">tantra</ab>. B. 4, 252. <ls>Peters</ls>. 3, 399.
<HI1>Akulāgamatantre Yogasārasamuccaya. <ls>Bhr</ls>. 396.
<LEND>
<L>11<pc>1-001,1<k1>akzatAdilakzapUjAviDi<k2>akzatAdilakzapUjAviDi
{#akzatAdilakzapUjAviDi#}¦ <ab type="subj">dh</ab>. <ls>Burnell</ls> 146^b.
<LEND>
<L>12<pc>1-001,1<k1>akzapAda<k2>akzapAda
{#akzapAda#}¦ or {#akzacaraRa,#} a name of Gautama, the <ab type="pers">philo-
<>sopher</ab>, <ls>Hall</ls> p. 20.
<LEND>
<L>13<pc>1-001,1<k1>akzamAlApratizWA<k2>akzamAlApratizWA
{#akzamAlApratizWA#}¦ <ab type="subj">dh</ab>. <ls>Burnell</ls> 148^b.
<LEND>
<L>14<pc>1-001,1<k1>akzamAlikopanizad<k2>akzamAlikopanizad
{#akzamAlikopanizad#}¦ <ls>IO</ls>. 3183. L. 436. <ls>Brl</ls>. 59. <ls>Haug</ls>
<>44. <ls>Bhr</ls>. 487.
<LEND>
<L>15<pc>1-001,1<k1>akzayatftIyAvratakaTA<k2>akzayatftIyAvratakaTA
{#akzayatftIyAvratakaTA#}¦ from Bhaviṣyottarapurāṇa. <ls>Ben</ls>. 55.
<LEND>
Single letter abbreviations e.g. L
Done now. 78555 tags now.
No need to dig deeper.
Agree.
64480 literary resources tagged
Holy macaroni!
The tentative paths for solutions
Agree, had thought about the same. And "keep the longer tag first" works well, tested so many times.
- Substring of some other abbreviations e.g. Pet is subset of Peters
Done. Total 80150 items tagged.
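The "longer tag first" advice can be sketched as a single alternation sorted by length, assuming Python 3 and input that has not been tagged yet. The names `build_tagger` and `tag_line` are hypothetical. With word boundaries in place, Pet is already kept apart from Peters. The ordering matters for pairs such as Lahore and Lahore 1882, where the shorter tag is followed by a space.

```python
import re


def build_tagger(tags):
    """Compile one regex that tries longer abbreviations before shorter ones."""
    ordered = sorted(tags, key=len, reverse=True)
    alternation = '|'.join(re.escape(tag) for tag in ordered)
    # The lookarounds keep a tag from matching inside a longer word.
    return re.compile(r'(?<!\w)(' + alternation + r')(?!\w)')


def tag_line(line, tagger):
    """Wrap every matched abbreviation in <ls>...</ls>."""
    return tagger.sub(r'<ls>\1</ls>', line)


tagger = build_tagger(['Lahore', 'Lahore 1882', 'Pet', 'Peters'])
print(tag_line('Lahore 1882, 5. Lahore 12. Peters. 2, 63.', tagger))
# <ls>Lahore 1882</ls>, 5. <ls>Lahore</ls> 12. <ls>Peters</ls>. 2, 63.
```

Running it twice over the same text would nest the tags, so it suits a single pass over untagged input.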
Interesting work, Dhaval, to add the 'ls' tagging.
One distinction from other such tagging (MW, PW, PWG) is the scope of the tag.
MARKUP SHOWN ABOVE
{#aMSadaSA#}¦ <ab type="subj">jy</ab>. <ls>Rice</ls> 28.
{#aMSuDara#}¦ <ab type="pers">poet</ab> <ls>Skm</ls>.
{#aMSumatkASyapIya#}¦ <ab type="subj">archit</ab>. <ls>Taylor</ls> 1, 314.
ALTERNATE MARKUP (includes information pertaining to location with source)
{#aMSadaSA#}¦ <ab type="subj">jy</ab>. <ls>Rice 28</ls>.
{#aMSuDara#}¦ <ab type="pers">poet</ab> <ls>Skm</ls>.
{#aMSumatkASyapIya#}¦ <ab type="subj">archit</ab>. <ls>Taylor 1, 314</ls>.
From the printed material shown above, generate a list of the officially sanctioned literary source abbreviations like {%Rice%}. Make these into a file acc_ls.txt, and generate official abbreviations from this.
Do one phase where you restrict attention to literary source patterns which only match abbreviations from the official list --- Maybe you've already done this, it wasn't clear to me. Develop statistics for these : Rice 55, Taylor 27, or whatever the instance frequencies turn out to be.
Then ask what likely literary source patterns remain unmarked (e.g., you mentioned 'Pet' as shortening of Peters). How many of these are there?
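A rough sketch of the first two steps, assuming Python 3 and assuming the preface entries look like `<P>4. {%Pet.%} ...`. The function names are hypothetical, and writing the result to acc_ls.txt is left out.

```python
import re
from collections import Counter

# Assumed shape of a preface entry: <P>number. {%Abbreviation.%} description
OFFICIAL = re.compile(r'^<P>\d+\. \{%(.+?)%\}')
LS = re.compile(r'<ls>(.*?)</ls>')


def official_abbreviations(preface_lines):
    """Collect the abbreviations printed as {%...%} at the start of preface entries."""
    found = []
    for line in preface_lines:
        match = OFFICIAL.match(line)
        if match:
            found.append(match.group(1).rstrip('.'))
    return found


def official_counts(text_lines, official):
    """Count <ls> instances, restricted to abbreviations on the official list."""
    allowed = set(official)
    counts = Counter()
    for line in text_lines:
        for name in LS.findall(line):
            if name in allowed:
                counts[name] += 1
    return counts
```

Anything inside `<ls>` that is not on the official list is skipped here. Collecting those skipped names separately would give a starting point for the question of what remains unmarked.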
It started as a game and now it's the middle of a swamp :sake:
Catalogue tagging over. Added orig/acc5.txt. Only official abbreviations are added as of yet. Analysis of the occurrence of these resources is still pending.
Regarding my comment above about the scope of <ls>, I think your choice of scope is fine. In contrast to normal dictionaries, where we might want the specific verse available from Rig Veda for linking, there is no possibility of digital linking for the catalogues of ACC that are being marked with <ls>.
It would be useful to have a table, perhaps of form X:count:expansion, for each of the X appearing in <ls>X</ls>.
As with subject/person abbreviations, such a table would permit tooltips in current displays, and perhaps have other uses also.
At least for the X that occur in the preface, it should be possible to make an abbreviated expansion.
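One way such a table could be produced, as a sketch assuming Python 3. The function name `abbreviation_table` is hypothetical, the counts in the usage lines are placeholders rather than real frequencies, and the Skm expansion is taken from the preface short forms listed earlier.

```python
def abbreviation_table(counts, expansions):
    """Return X:count:expansion rows, most frequent abbreviation first."""
    rows = []
    for name in sorted(counts, key=lambda n: (-counts[n], n)):
        # An abbreviation without a known expansion gets an empty third field.
        rows.append('%s:%d:%s' % (name, counts[name], expansions.get(name, '')))
    return rows


# Placeholder counts, for illustration only.
counts = {'Rice': 55, 'Skm': 12}
expansions = {'Skm': 'Sūktikarṇāmṛta by Śrīdharadāsa'}
for row in abbreviation_table(counts, expansions):
    print(row)
```

An expansion that itself contains a colon would break the three-field format, so a different separator may be needed in practice.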
Another possibility would be to recode the above listing ({%Jones%} etc.) as an HTML file with anchors <a name="Jones"/> and have display logic to open the HTML file at the anchor, as is done for <ls> links in MW, PW.
Still another possibility would be to have the table be of form `X:count:pdfpage`, and have displays link to the scanned image containing the page, e.g. for 'Jones' have a link to http://www.sanskrit-lexicon.uni-koeln.de/scans/ACCScan/2014/web/pdfpages/pg1_801.pdf
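A small sketch of how displays might turn such rows into links, assuming Python 3 and assuming the pdfpage field holds the file name without its extension (e.g. `pg1_801`). The function name `scan_links` is hypothetical and the count in the sample row is a placeholder.

```python
BASE_URL = 'http://www.sanskrit-lexicon.uni-koeln.de/scans/ACCScan/2014/web/pdfpages/'


def scan_links(rows):
    """Map each abbreviation in X:count:pdfpage rows to the URL of its scanned page."""
    links = {}
    for row in rows:
        name, _count, pdfpage = row.split(':')
        links[name] = BASE_URL + pdfpage + '.pdf'
    return links


# The count field here is a placeholder, not a measured frequency.
print(scan_links(['Jones:1:pg1_801'])['Jones'])
```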
Regarding official catalogue codes
These are in catalogue_names.txt ?
acc5 only adds markup to the official ones in catalogue_names?
You say above "Only official abbreviations are added as of yet". Still true?
Is there a frequency count for the 'unofficial' ones, not yet marked in acc5?
Pet competes with Peters. This seems like a red herring, since both abbreviations occur in the official list:
<P>4. {%Pet.%} Verzeichniss der auf Indien bezüglichen Handschriften und Holzdrucke im Asiatischen Museum,
<>von Otto Böhtlingk.
...
<P>50. {%Peters.%} From these we turn with pleasure to three volumes published by Professor Peterson.
Given your TODO list, I am uncertain whether you view acc5 as ready to install at Cologne.
My impression is that Peters appears in the lists for all three volumes, but with a different meaning in each.
Thus in volume 2
<L>31313<pc>2-001,1<k1>akzaracintAmaRi<k2>akzaracintAmaRi
{#akzaracintAmaRi#}¦ <ab type="subj">jy</ab>. <ls>Peters</ls>. 4, 33. <ls>Stein</ls> 156.
Peters refers to a different catalog than in volume 1
<L>8<pc>1-001,1<k1>akAlajalada<k2>akAlajalada
{#akAlajalada#}¦ <ab type="pers">poet</ab>, great grandfather of Rājaśekhara. <ls>Śp</ls>.
<>p. 4. <ls>Peters</ls>. 2, 63.
<LEND>
And similarly for volume 3
<L>41694<pc>3-001,1<k1>akzaracintAmaRi<k2>akzaracintAmaRi
{#akzaracintAmaRi#}¦ <ab type="subj">jy</ab>. <ls>AK</ls> 847. <ls>AS</ls> p. 1. <ls>Peters</ls>. 6, 401.
<LEND>
These could be distinguished by adding an attribute to ls, but only in cases like Peters, which are duplicates.
I don't know whether there are any other duplicates besides Peters.
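A sketch of how the volume could be attached, assuming Python 3. It reads the volume from the leading digit of `<pc>` and carries it forward to the following lines of the entry. The attribute name `n`, the `DUPLICATES` set and the function name `mark_volume` are all hypothetical choices for illustration.

```python
import re

PC = re.compile(r'<pc>(\d)-')   # volume is the digit before the hyphen in <pc>2-001,1
DUPLICATES = {'Peters'}         # abbreviations reused with a different meaning per volume


def mark_volume(lines):
    """Add the volume as an attribute to <ls> tags of duplicated abbreviations."""
    volume = None
    marked = []
    for line in lines:
        match = PC.search(line)
        if match:
            volume = match.group(1)
        if volume is not None:
            for name in DUPLICATES:
                line = line.replace('<ls>%s</ls>' % name,
                                    '<ls n="%s">%s</ls>' % (volume, name))
        marked.append(line)
    return marked


entry = [
    '<L>31313<pc>2-001,1<k1>akzaracintAmaRi<k2>akzaracintAmaRi',
    '{#akzaracintAmaRi#}¦ <ab type="subj">jy</ab>. <ls>Peters</ls>. 4, 33. <ls>Stein</ls> 156.',
]
print(mark_volume(entry)[1])
```

Tags such as Stein are left untouched, so only the duplicated abbreviations carry the extra attribute.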
Some details about tags as requested above
Wilson Cordier Burnouf Proceed. ASB Sbhv Skm Hr. Notices Hultzsch Edinburgh Pandit Colebrooke Misc. Essays Hr Würzburg Lahore 1882 Cop Kāvyamālā Catal. IO Bendall Kielhorn Thomas Adyar Libr Śg H. H. Wilson Vs Rāyamukuṭa Gov. Or. Libr.Madras Śp Colebrooke
Oudh XX Rep Cg Edinburgh University Library Adyar Library Copenh
Edinburgh Würzburg Copenh
All in 1 list? Places and people together?
I think, I will call it a day. If new catalogues come to notice, a new issue can be raised.
First part preface
Second part preface
Third part preface