sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

Subject tagging in ACC #142

Closed drdhaval2785 closed 7 years ago

drdhaval2785 commented 7 years ago

The regex ¦ [^ A-Z]+[.,] finds the possible candidates for subjects in ACC, e.g. jy, poet, archit, vedānta, śilpa, tantr, vocabulary, dh.

<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA
{#aMSadaSA#}¦ jy. Rice 28.
<LEND>
<L>2<pc>1-001,1<k1>aMSuDara<k2>aMSuDara
{#aMSuDara#}¦ poet Skm.
<LEND>
<L>3<pc>1-001,1<k1>aMSumatkASyapIya<k2>aMSumatkASyapIya
{#aMSumatkASyapIya#}¦ archit. Taylor 1, 314.
<LEND>
<L>4<pc>1-001,1<k1>aMSumadBedasaMgraha<k2>aMSumadBedasaMgraha
{#aMSumadBedasaMgraha#}¦ vedānta, ascribed to Kaśyapa. Oppert 5875.
<LEND>
<L>5<pc>1-001,1<k1>aMSumAnakalpa<k2>aMSumAnakalpa
{#aMSumAnakalpa#}¦ śilpa. Burnell 62^b.
<LEND>
<L>6<pc>1-001,1<k1>akaqamacakracitra<k2>akaqamacakracitra
{#akaqamacakracitra#}¦ tantr. B. 4, 252.
<LEND>
<L>7<pc>1-001,1<k1>akArAdiniGaRwu<k2>akArAdiniGaRwu
{#akArAdiniGaRwu#}¦ vocabulary. Oppert 4969.
<LEND>
<L>8<pc>1-001,1<k1>akAlajalada<k2>akAlajalada
{#akAlajalada#}¦ poet, great grandfather of Rājaśekhara. Śp.
<>p. 4. Peters. 2, 63.
<LEND>
<L>9<pc>1-001,1<k1>akAlaBAskara<k2>akAlaBAskara
{#akAlaBAskara#}¦ dh. composed in 1715, by Śambhunātha.
<>L. 2269.
<LEND>
drdhaval2785 commented 7 years ago

A cursory analysis gives the following classification. The items near the top are genuine tags. Some are false positives, such as ibid and i. Those lower down may be typos.

dh:1951
jy:1587
tantr:1463
vedānta:1037
ny:950
poet:863
med:824
kāvya:683
gr:540
śr:417
stotra:382
alaṃk:292
nāṭaka:259
vaid:257
bhakti:249
ibid:204
paur:180
mīm:179
an:172
yoga:147
tantra:138
Āpast:132
lex:123
śaiva:103
vaiś:101
Ṛv:74
(?):67
astrol:61
Śg:59
grammarian:57
archit:55
lexicon:54
music:53
astronomer:53
Āśval:52
metrics:50
i:50
king:47
bhāṇa:44
grammar:38
prayoga:35
campū:33
mantra:25
śilpa:25
prahasana:23
poetess:20
vaiṣṇava:20
glossary:20
augury:19
astron:19
erotic:17
āgama:16
lexicographer:16
mahākāvya:14
gṛhya:13
nīti:13
sāṃkhya:12
nāṭikā:12
vocabulary:11
kāmaśāstra:9
Śāṅkh:9
poetry:9
q:9
philosopher:8
vyāyoga:7
anthology:7
vaidic:6
lawyer:6
cookery:6
palmistry:6
caritra:5
wrote:5
(Ṛv:5
veterinary:5
pl:5
(v:4
i.e:4
sculpture:4
adhy:4
author:4
chāyānāṭaka:4
Āśv:4
laghu:4
Śp:4
tales:4
vāstuśāstra:3
(q:3
svaraśāstra:3
(i:3
vedanta:3
math:3
vedāṅga:3
etc:3
toxicology:3
(which?):3
metries:3
Śabdenduśekharaṭīkā:3
kāvyaṭīkā:3
saṭṭaka:2
algebr:2
oneiromancy:2
brāhmaṇa:2
Āndhra:2
mus:2
poets:2
erotics:2
algebra:2
geometry:2
Śānkh:2
enigmatology:2
poem:2
bṛhat:2
kavya:2
work:2
nītiśāstra:2
miśrabhāṇa:2
chandas:2
śrāddha:2
veter:2
chem:2
composed:1
stuti:1
syntax:1
gaṇita:1
(modern):1
dhātupāṭha:1
saṃgīta:1
grammatical:1
incantations:1
(lex:1
kathā:1
Śrāddhapaddhati:1
Ṣaḍvargaphala:1
vedabhāṣya:1
pañcākṣara:1
śaivabhāṣya:1
Śivarātrivratodyāpaṇam:1
five:1
cer:1
Āraṇyakagānabhāṣya:1
stoma:1
(śv:1
Śivagaurīsaṃvāda:1
Āgamārṇava:1
dig:1
52:1
p:1
nātaka:1
Ānandadīpikā:1
Āsurīkalpaḥ:1
kāvyā:1
tantr.:1
kāṇva:1
physician:1
snānādipūjāntapaddhati:1
pāśupata:1
Āpastambasūtrabhāṣya:1
Śatarudriyajape:1
ākhyāyikā:1
archit.:1
orthographical:1
geom:1
probably:1
mādhyaṃdinaśākhīya:1
alaṃkāra:1
‘prayoga’:1
Śiśupālavadhaṭīkā:1
jugglery:1
seventy:1
pur:1
śv:1
daśadoṣagrantha:1
(jy.):1
Śarabhakavaca:1
khila:1
title:1
aśvalakṣaṇa:1
gajaśāstra:1
or:1
śvet:1
Ādyāvidyāprakaraṇa:1
vihārakārikā:1
(jaina?):1
1572--85:1
alone:1
geogr:1
ethics:1
ancient:1
Āpaduddhārabhairavapañcāṅga:1
buddhistic:1
phonetics:1
strategy:1
Śataślokīṭīkā:1
bhāṇikā:1
a:1
rūpaka:1
modern:1
Āśvalāyanopayogyādhānaprakaraṇa:1
Āgneya:1
mathem:1
gṛhyaprayoga:1
sorcery:1
(yajñīyāni):1
aphrodisiacs:1
prākṛtakāvya:1
padasaṃkhyā:1
Āndhravyākaraṇa:1
Śṛṅgārasāriṇī:1
astrologer:1
minister:1
Śākalokta:1
Śākinīcaritraviṣaya:1
śrauta:1
kriyāpāda:1
(inc.):1
(ch:1
medical:1
śivapūjā:1
metr:1
prāyaścittam:1
khaṇḍakāvya:1
nirukta:1
(p:1
śikṣā:1
kṛṣiśāstra:1
śākuna:1
Ānandalaharī:1
nighaṇṭu:1
smṛti:1
72:1
Āgama:1
land-surveying:1
pra{??}:1
Śāntiśatakaṭīka:1
jy.:1
archery:1
taṅtr:1
seventy-two:1
śaivāgama:1
{??}r:1
mantranighaṇṭu:1
jain:1
(vedānta):1
gasyoun commented 7 years ago

Some are false positives like ibid

Why not mark them as abbreviations? English abbreviations.

drdhaval2785 commented 7 years ago

@funderburkjim,

Q - Where should this subject tagging be done? In acc.txt or acc.xml? Will code accordingly. The module is almost ready.

funderburkjim commented 7 years ago

Are these subject codes enumerated in some preface or appendix of ACC? Does the author specifically mention that he classifies works by subject?

How exactly are you suggesting to mark them.

What about @gasyoun 's suggestion to mark them as abbreviations? Should there be a distinction between 'subjects' and other kinds of abbreviations?

What is the purpose of marking these ?

As to where to add markup, probably this would be best done in acc.txt. In fact, these could perhaps be done by the standard 'correction' method (updateByLine). This would allow very fine-grained control of exceptions.

If the markup were done by <ab>x</ab> or <ab type="subj">x</ab> , then changes would need to be made to disp.php. Would we need to have a separate table of 'abbreviation definitions' to use in display to generate tooltips? Or are the subject codes adequately self-explanatory?

Summary

At this point, this proposed addition still seems experimental. In particular, it should not yet impinge upon the production version of acc. One possibility for exploring this as an experiment would be to make an experimental copy of acc on dev server:

cp -r acc accsubj
chmod -R 0755 accsubj

And then develop to heart's content with the copy, until ideas gel.

Or maybe this could be done with git branches --- I have no experience with these.

gasyoun commented 7 years ago

Are these subject codes enumerated in some preface or appendix of ACC?

No, nowhere. As you are aware even a full list of abbreviations is not always a part of the book, there are additions needed.

Does the author specifically mention that he classifies works by subject?

Never ever.

Should there be a distinction between 'subjects' and other kinds of abbreviations?

I guess yes, at least there could be some subgroups. Topics one side, non-topics - on the other bank.

What is the purpose of marking these ?

It's already there. It is called structured data. Now we have a plain text file. Initially, the book was not plain, actually. So we bring back how it could have been structured, without losing anything that was inside the real book.

If the markup were done by <ab>x</ab> or <ab type="subj">x</ab> , then changes would need to be made to disp.php.

The disp approach is not a good one. It means it will work only on the web. But the web is not the main goal. Android, Windows, Linux: that is where we want to spread, which means disp.php would be the last place to go, and only as a bonus.

Or are the subject codes adequately self-explanatory?

They never are, tooltip always welcome. But what's important now is to code it, the display solution can come after.

In particular, it should not yet impinge upon the production version of acc.

Because... risky?

funderburkjim commented 7 years ago

@gasyoun comments confirm that the right place to do the markup is within acc.txt.

If we structure the acc.txt markup in an xml form (<ab>x</ab>), then that markup will flow smoothly into acc.xml; no change in make_xml.py will be required.

Regarding changes to 'disp': this is needed so that the additional markup within acc.xml will be displayed in some appropriate manner.

Since the printed text does not describe the abbreviations, we should definitely include as part of the task a table of abbreviation definitions.

Because... risky?

Not exactly. Rather, it needs to be done carefully, and there is considerable work to be done before this additional markup is ready for the production version. When this preparatory work is done, then I see no problem in adding this enhancement to acc production version. It makes sense to me to do this preparatory work on an experimental version of acc. Then when the experimental version is ready, it can be merged into the production version.

gasyoun commented 7 years ago

Regarding changes to 'disp': this is needed so that the additional markup within acc.xml will be displayed in some appropriate manner.

That I do understand. But I also understand that making too many different display files is a one-way ticket. We need one universal file, and not 33 different ones, right?

It makes sense to me to do this preparatory work on an experimental version of acc. Then when the experimental version is ready, it can be merged into the production version.

I see.

funderburkjim commented 7 years ago

one universal file, and not 33 different ones

Agree in part. However, there will always be a desire to bring out the full glory of a particular dictionary. However, in so far as feasible, we should use a common markup vocabulary among dictionaries. For instance, the <ab> (abbreviation) markup seems like a good choice, that should be usable among many dictionaries; and that therefore can be also displayed in a universal way.

Thus far, we have made good progress with the 'meta' lines in devising a more uniform structure that should be applicable to all the dictionaries, and that should allow for a uniform treatment of alternate headwords.

We have not yet given the same attention to a 'universal' display (i.e. a universal 'disp.php').

And similarly, with the xml structure (xxx.dtd). We have agreed upon the idea that a universal dtd is desirable, but have not yet worked through the details of making this idea a reality.

gasyoun commented 7 years ago

1)

We have not yet given the same attention to a 'universal' display (i.e. a universal 'disp.php').

Yeah, but that is a CSS field as well, I guess.

2)

And similarly, with the xml structure (xxx.dtd).

@fxru has left us, so?

drdhaval2785 commented 7 years ago

How exactly are you suggesting to mark them.

<ab type="subj">vedAnta</ab> <ab type="pers">astronomer</ab> <ab type="book">Oppert</ab>

There are at least three different types in ACC. First two are relatively straightforward. Third is the reference book item. Other dictionaries already have some lit resource tags. They can also be used.

And one more thing. Abbreviation is a misnomer. Some of them are full words, as I showed. Does it feel idiotic to use the ab tag for them?

I earlier thought about <subj>ABC</subj> tag. But now we should also keep uniform XML and DTD in mind too. So decision of tag name should be fairly universal.

drdhaval2785 commented 7 years ago

Should there be a distinction between 'subjects' and other kinds of abbreviations?

As I showed in above examples, three (maybe more) types carry semantically different data. So distinction is necessary to use it in future.

drdhaval2785 commented 7 years ago

What is the purpose of marking these ?

Let us say a researcher wants to locate all manuscripts on lexicon. With subject tagging his life will become easier. If someone wants to write a piece on astronomers of India, he will benefit.

drdhaval2785 commented 7 years ago

In fact, these could perhaps be done by the standard 'correction' method (updateByLine).

Are you sure we want to do 15000+ lines corrections via this method?

drdhaval2785 commented 7 years ago

Would we need to have a separate table of 'abbreviation definitions' to use in display to generate tooltips? Or are the subject codes adequately self-explanatory?

Subject codes are not that self explanatory. So tooltip will be required. E.g. paur for पौराणिक
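
A definition table for tooltips could be as simple as a dictionary keyed by tag. The sketch below is only an illustration: the names DEFINITIONS and tooltips_for are hypothetical, the entry for paur comes from the comment above, and the entries for dh, ny and tantr come from the author's preface quoted later in this thread. It returns (tag, definition) pairs and leaves rendering to whichever display layer is used.

```python
import re

# Hypothetical definition table; only a few entries, all taken from this thread.
DEFINITIONS = {
    'paur': 'पौराणिक',
    'dh': 'dharma',
    'ny': 'nyāya',
    'tantr': 'tantric',
}

# Matches the proposed subject markup, e.g. <ab type="subj">dh</ab>
AB_SUBJ = re.compile(r'<ab type="subj">(.*?)</ab>')

def tooltips_for(line, definitions=DEFINITIONS):
    """Return (tag, definition) pairs for one line; definition is None if unknown."""
    return [(tag, definitions.get(tag)) for tag in AB_SUBJ.findall(line)]

print(tooltips_for('{#agnikArya#}¦ <ab type="subj">dh</ab>. Burnell 150^b. Taylor 1, 275.'))
```

Tags that come back with None are the ones still missing from the table.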

drdhaval2785 commented 7 years ago

git branches

I have tried it earlier. Not very trustworthy. Not because of git's flaw, but our inabilities. So let us stick to good old copy paste in some folder in correctionwork and work on it till it is ready.

drdhaval2785 commented 7 years ago

table of abbreviation definitions.

Absolutely yes. Once we have culled out all subject tags, we will have to prepare definition file for tooltip and also prepare a log of how many times a particular tag appears for statistical purposes.
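
The occurrence log mentioned here can be sketched with a Counter over the tagged text. This is a minimal sketch, not the project's statistics script: the function name tag_stats is an assumption, and the two sample lines are the acyuta entry shown in the tagging example in this thread.

```python
import re
from collections import Counter

# Any <ab type="...">x</ab> element; the captured group is the tag text x.
AB = re.compile(r'<ab type="[^"]*">(.*?)</ab>')

def tag_stats(lines):
    """Count how often each tag text occurs inside <ab> markup."""
    counts = Counter()
    for line in lines:
        counts.update(AB.findall(line))
    return counts

sample = [
    '{#acyuta,#}¦ <ab type="pers">minister</ab> to Śivasiṃha, <ab type="pers">king</ab> of Mithilā, <ab type="pers">father</ab> of',
    '<>Ratnapāṇi (Kāvyadarpaṇa), <ab type="pers">father</ab> of Ravi (Kāvya-',
]
for tag, n in tag_stats(sample).most_common():
    print('%s:%d' % (tag, n))
```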

funderburkjim commented 7 years ago

Any idea what proportion of entries have the classification items?

By your comment, the classifications are chosen from a particular spot (just after the broken bar) in an entry. Do these classification 'abbreviations' occur elsewhere in records?

Have you already culled out the false positives from the list shown above (e.g. 'i' 50)?

'Some folder in correctionwork' - Ok for some of the work. But, doing the work in a separate copy of the acc repository might allow more flexibility in exploration. We don't want to burden the final repository with lots of extra stuff.

drdhaval2785 commented 7 years ago

@gasyoun comments confirm that the right place to do the markup is within acc.txt.

As a developer I see this as best choice, but as a reader, I actually read acc.txt sometimes. Once we add tags in every line, its readability goes down. But yes, creating a version of acc.txt with tag stripping will serve my purpose. So I also concur on making changes on acc.txt for better downstream usability.
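
A tag-stripped reading copy could be produced with a one-line substitution. The sketch below assumes the markup is always of the form <ab ...>x</ab>; the name strip_ab is hypothetical.

```python
import re

# Opening or closing ab tags only; other markup such as <L>, <pc>, <k1> is untouched.
AB_TAG = re.compile(r'</?ab\b[^>]*>')

def strip_ab(line):
    """Remove <ab ...> and </ab> markers, keeping the text between them."""
    return AB_TAG.sub('', line)

print(strip_ab('{#agnikarman#}¦ <ab type="subj">med</ab>. B. 4, 216.'))
```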

gasyoun commented 7 years ago

I earlier thought about <subj>ABC</subj> tag. But now we should also keep uniform XML and DTD in mind too. So decision of tag name should be fairly universal.

Universal as it might (need to) be, do not see what's bad about it.

If someone wants to write a piece on astronomers of India, he will benefit.

Totally agree. The more if we add stats - how many of each abbreviation occur in dictionary.

I actually read acc.txt sometimes.

So do I.

drdhaval2785 commented 7 years ago

By your comment, the classifications are chosen from a particular spot (just after the broken bar) in an entry. Do these classification 'abbreviations' occur elsewhere in records?

Yes. They do occur. The broken bar area is used to identify the abbreviations. Then \W+abbrv\W+ is used to identify the occurrences of these tags in acc.txt.

drdhaval2785 commented 7 years ago

The subject tagging work is over now. Total 26849 items tagged. The remaining subject tagging can be done as and when they are encountered, just like regular corrections.

@funderburkjim The code and intermediates are in pywork/correctionwork/issue-cologne-142 folder. The output is stored as orig/acc4.txt.

acc4.txt is not yet kept as acc.txt. Once you are OK with the quality of the tagging, we can think of replacing acc.txt with acc4.txt.

drdhaval2785 commented 7 years ago

README for record

Logic

Please see issue 142.

preliminary step (do not redo)

cp ../../manualByLine01.txt prev_manualByLine01.txt

Redo step:

[modify either of subject_tags_changes_step0.txt or manual_examination1.txt as needed]

Then,

cat prev_manualByLine01.txt subject_tags_changes_step0.txt manual_examination1.txt > temp_manualByLine01.txt

Now, as per pywork/update.sh, and in pywork directory:

cp correctionwork/issue-cologne-142/temp_manualByLine01.txt manualByLine01.txt
python updateByLine.py ../orig/acc3.txt manualByLine01.txt ../orig/acc4.txt

Step 1

python subject_tagging.py ../../../orig/acc.txt subject_tagging.txt > log.txt

Finds out cases which match '¦ ([^ A-Z]+)[.,]'.

This is potential identification of tag words.

subject_tagging.txt has headword:tag in each line.

log.txt has details about the frequency of occurrence of the tag word.
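
What Step 1 describes can be sketched as follows. This is not the actual subject_tagging.py; the function name find_candidates and the in-memory sample (entry lines quoted at the top of this issue) are assumptions made for illustration.

```python
import re
from collections import Counter

# Regex from Step 1: directly after the broken bar, a run of characters
# other than space and A-Z, ending in '.' or ','.
CANDIDATE = re.compile(r'¦ ([^ A-Z]+)[.,]')
# Headword as written in the body line, e.g. {#aMSadaSA#}
HEADWORD = re.compile(r'\{#(.*?)#\}')

def find_candidates(lines):
    """Return (headword, tag) pairs and a frequency Counter of the tags."""
    pairs = []
    for line in lines:
        m = CANDIDATE.search(line)
        if not m:
            continue
        hw = HEADWORD.search(line)
        pairs.append((hw.group(1) if hw else '', m.group(1)))
    return pairs, Counter(tag for _, tag in pairs)

sample = [
    '<L>1<pc>1-001,1<k1>aMSadaSA<k2>aMSadaSA',
    '{#aMSadaSA#}¦ jy. Rice 28.',
    '{#aMSumadBedasaMgraha#}¦ vedānta, ascribed to Kaśyapa. Oppert 5875.',
    '{#akAlajalada#}¦ poet, great grandfather of Rājaśekhara. Śp.',
    '{#akAlaBAskara#}¦ dh. composed in 1715, by Śambhunātha.',
]
pairs, freq = find_candidates(sample)
for hw, tag in pairs:
    print(hw + ':' + tag)      # headword:tag, as in subject_tagging.txt
for tag, n in freq.most_common():
    print(tag + ':' + str(n))  # tag:count, as in the frequency list
```

Because the character class excludes spaces, a word that is followed by a space rather than by '.' or ',' is not picked up by this pattern.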

Step 2

Screen out invalid tags manually from log.txt and segregate the subject_tags and person_type_tags manually.

See subject_tag_addition.py variables subject_tags, person_type_tags, non_subject_tags, possible_error_tags for segregation.

tags having single entries are best left out. To be called a subject, at least two books are advisable.
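
The screening rule in Step 2 (drop manually rejected words, set single-occurrence words aside) could be expressed as below. It is a sketch under stated assumptions: the function name screen and the min_count parameter are hypothetical, the manual judgement is represented by a plain set, and the counts are copied from the frequency list earlier in this issue.

```python
from collections import Counter

def screen(freq, rejected, min_count=2):
    """Split candidates into kept tags and singletons that are set aside."""
    kept = [t for t, n in freq.most_common()
            if n >= min_count and t not in rejected]
    set_aside = sorted(t for t, n in freq.items() if n < min_count)
    return kept, set_aside

# Counts taken from the frequency list in this thread.
freq = Counter({'dh': 1951, 'jy': 1587, 'ibid': 204, 'i': 50, 'sorcery': 1})
rejected = {'ibid', 'i'}   # false positives identified by manual review
kept, set_aside = screen(freq, rejected)
print(kept)
print(set_aside)
```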

Step 3

python subject_tag_addition.py ../../../orig/acc.txt acc_with_subject_tags.txt subject_tags_changes_step0.txt

generates the acc_with_subject_tags.txt file with subject tagging.

Also generates a file manual_examination.txt which requires manual examination.

Step 4

cp manual_examination.txt manual_examination1.txt

This step is kept manual, so that accidental overwrite is avoided.

Step 5

After manual examination of manual_examination1.txt , the relevant entries are kept. Rest are deleted.

Note -

There is a tricky issue as regards anthology (an). The word 'an' is a common English word occurring ubiquitously. So a separate script anthology.py was used to identify only the words qualifying to be tagged.
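
The logic of anthology.py is not shown in this issue. One possible way to separate the abbreviation from the English article, given purely as an assumption, is to accept 'an' only when it stands directly after the broken bar and is followed by a period. The name tag_an is hypothetical.

```python
import re

# Assumption: 'an' is an abbreviation only directly after the broken bar and
# before a period; anywhere else it is treated as the English article.
AN_ABBREV = re.compile(r'(¦ )(an)(\.)')

def tag_an(line):
    """Wrap a leading 'an.' in subject markup; leave all other 'an' untouched."""
    return AN_ABBREV.sub(r'\1<ab type="subj">\2</ab>\3', line)

# Made-up lines for illustration, not taken from acc.txt.
print(tag_an('{#x#}¦ an. Oppert 12.'))
print(tag_an('{#y#}¦ an account of kings. B. 1, 2.'))
```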

Step 6

NOTE: Do not do this here. Do the similar step in pywork/update.sh.

Update the manual corrections and generate acc4.txt

python updateByLine.py correctionwork/issue-cologne-142/acc_with_subject_tags.txt correctionwork/issue-cologne-142/manual_examination1.txt ../orig/acc4.txt

drdhaval2785 commented 7 years ago

Example

<L>43<pc>1-002,1<k1>agAravinoda<k2>agAravinoda
{#agAravinoda#}¦ <ab type="subj">archit</ab>. by Durgāśaṅkara. NW. 554.
<LEND>
<L>44<pc>1-002,1<k1>agnikarman<k2>agnikarman
{#agnikarman#}¦ <ab type="subj">med</ab>. B. 4, 216.
<LEND>
<L>45<pc>1-002,1<k1>agnikARqavrAhmaRa<k2>agnikARqavrAhmaRa
{#agnikARqavrAhmaRa#}¦ Oppert II, 4441. <symbol n="C.">C.</symbol> II, 4442. See
<>Agnibrāhmaṇa, Agnirahasyakāṇḍa.
<LEND>
<L>46<pc>1-002,1<k1>agnikArya<k2>agnikArya
{#agnikArya#}¦ <ab type="subj">dh</ab>. Burnell 150^b. Taylor 1, 275.
<LEND>
<L>47<pc>1-002,1<k1>agnikAryaprayoga<k2>agnikAryaprayoga
{#agnikAryaprayoga#}¦ <ab type="subj">śr</ab>. Oppert II, 3951.
<LEND>
<L>48<pc>1-002,1<k1>agnikumAra<k2>agnikumAra,
{#agnikumAra,#}¦ a name of Viṭṭhala, <ab type="pers">son</ab> of Vallabhācārya.
<>Hall p. 147.
<LEND>
......
<L>187<pc>1-005,1<k1>acyuta<k2>acyuta,
{#acyuta,#}¦ <ab type="pers">minister</ab> to Śivasiṃha, <ab type="pers">king</ab> of Mithilā, <ab type="pers">father</ab> of
<>Ratnapāṇi (Kāvyadarpaṇa), <ab type="pers">father</ab> of Ravi (Kāvya-
<>prakāśaṭīkā). Peters. 3, 333.
<LEND>
....
<L>1117<pc>1-030,1<k1>aruRadatta<k2>aruRadatta
{#aruRadatta#}¦ <ab type="pers">lexicographer</ab> and <ab type="pers">grammarian</ab>. Quoted by Ujjva-
<>ladatta and Rāyamukuṭa. See Gaṇaratnamahodadhi
<>p. 119.
<LEND>

Two values pers and subj assigned to the ab tag.
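
With markup of this shape, the research use case raised earlier (finding every entry on a given subject or of a given person type) reduces to building an index. The sketch below assumes the record layout seen in the example, with a meta-line starting with <L> and a closing <LEND>; the function name index_by_tag is hypothetical.

```python
import re

AB = re.compile(r'<ab type="(subj|pers)">(.*?)</ab>')
K1 = re.compile(r'<k1>(.*?)<k2>')

def index_by_tag(lines):
    """Map (type, tag) to the list of k1 headwords whose entry carries that tag."""
    index = {}
    k1 = None
    for line in lines:
        if line.startswith('<L>'):
            m = K1.search(line)
            k1 = m.group(1) if m else None
        elif line.startswith('<LEND>'):
            k1 = None
        elif k1 is not None:
            for typ, tag in AB.findall(line):
                headwords = index.setdefault((typ, tag), [])
                if k1 not in headwords:
                    headwords.append(k1)
    return index

sample = [
    '<L>44<pc>1-002,1<k1>agnikarman<k2>agnikarman',
    '{#agnikarman#}¦ <ab type="subj">med</ab>. B. 4, 216.',
    '<LEND>',
    '<L>46<pc>1-002,1<k1>agnikArya<k2>agnikArya',
    '{#agnikArya#}¦ <ab type="subj">dh</ab>. Burnell 150^b. Taylor 1, 275.',
    '<LEND>',
    '<L>1117<pc>1-030,1<k1>aruRadatta<k2>aruRadatta',
    '{#aruRadatta#}¦ <ab type="pers">lexicographer</ab> and <ab type="pers">grammarian</ab>. Quoted by Ujjva-',
    '<>ladatta and Rāyamukuṭa. See Gaṇaratnamahodadhi',
    '<LEND>',
]
index = index_by_tag(sample)
print(index[('subj', 'med')])
print(index[('pers', 'grammarian')])
```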

funderburkjim commented 7 years ago

problem with acc.txt on dev-server

@drdhaval2785 I pulled your changes from dev server, to investigate the subject tagging work you've done.

In the course of this, I noticed that 'orig/acc.txt' had a different number of lines (2 less) than 'orig/acc3.txt'

$ wc -l acc3.txt
210469 acc3.txt

$ wc -l acc.txt
210467 acc.txt

I checked on server, and the same held there.

So, somehow acc.txt lost two lines: the two blank lines at line# 37552 and line# 37553 that result from removing cintAmaRi headword. When I reran update_sync on my server, this restored the two blank lines in acc.txt. I pushed this change to dev server (commit d6e0511f4e).

I'd love to know the exact reason this happened - Is there some way in git to track the history of acc.txt? If so, this might provide a clue.

funderburkjim commented 7 years ago

minor revision of programs making acc4.txt

A slight reorganization of code makes ALL the 25000+ changes in the form of normal line update changes.

A slight efficiency improvement in one program (subject_tag_addition.py) uncovered a bug: some abbreviations were previously added to the meta-lines.

---  WRONG. now corrected.
> <L>6053<pc>1-155,2<k1><ab type="pers">guru</ab><k2><ab type="pers">guru</ab>
and a few other similar instances.
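
A quick sanity check for this class of bug is to list every meta-line that contains ab markup. This is a hedged sketch, not part of the project's code; the name tagged_meta_lines is hypothetical, and the first sample line is the presumed correct form of the meta-line shown above.

```python
def tagged_meta_lines(lines):
    """Return (line_number, line) for meta-lines that wrongly contain <ab markup."""
    return [(n, line) for n, line in enumerate(lines, 1)
            if line.startswith('<L>') and '<ab ' in line]

sample = [
    '<L>6053<pc>1-155,2<k1>guru<k2>guru',
    '<L>6053<pc>1-155,2<k1><ab type="pers">guru</ab><k2><ab type="pers">guru</ab>',
]
for n, line in tagged_meta_lines(sample):
    print(n, line)
```

An empty result on the whole file would indicate that no headword fields were touched by the tagging.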

What next?

Now that the procedure is repeatable and modifiable, we need to spend some amount of time understanding just what we've got in acc4.txt.

funderburkjim commented 7 years ago

@drdhaval2785 The changes mentioned above have been pushed to dev server.

funderburkjim commented 7 years ago

@drdhaval2785 Have modified update.sh and update_sync.sh to take into account acc4.txt,

but have not yet run update_sync.sh -- so acc.txt, acc.xml, etc. do not reflect acc4.txt. I wanted to ponder the new markup a bit more before pushing to production.

funderburkjim commented 7 years ago

@drdhaval2785 Would you write readme's for issue-cologne-141 and issue-cologne-148?

drdhaval2785 commented 7 years ago

So, somehow acc.txt lost two lines: the two blank lines at line# 37552 and line# 37553 that result from removing cintAmaRi headword.

That may have something to do with the fact that I took acc.txt as it existed before that change and then produced acc4.txt. When inverted, there was some mismatch, obviously because of the change made in acc.txt in interim. I thought that the blank lines are somehow erroneous and I removed them.

Question - in the current update system, you can't delete a line. In the present case too, ideally the lines should have been deleted, but in practice they were replaced by blank lines.

But I do understand that some of the items like manualUpdateByLine do use line number. So removing two blank lines will alter their behaviour.

funderburkjim commented 7 years ago

Ah, Good. That explains it.

updateByLine does not handle inserts or deletes, only changes ... as you said.
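
The change-only behaviour can be illustrated with a small sketch. It is not the real updateByLine.py, whose input format is not shown in this thread; the (line_number, old, new) triples and the name apply_changes are assumptions. Because every change replaces exactly one line, the line count stays fixed, and a 'deleted' line survives as a blank line so that later line numbers remain valid.

```python
def apply_changes(lines, changes):
    """Apply (line_number, old, new) changes; the number of lines never changes."""
    out = list(lines)
    for lnum, old, new in changes:
        if out[lnum - 1] != old:
            raise ValueError('line %d does not match the expected old text' % lnum)
        out[lnum - 1] = new
    return out

lines = ['first', 'second', 'third']
# "Deleting" line 2 means replacing it with an empty line.
updated = apply_changes(lines, [(2, 'second', '')])
print(updated)
```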

gasyoun commented 7 years ago

tags having single entries are best left out. To be called a subject, at least two books are advisable.

I've seen a dictionary where lots of tags were noted in the intro. In real life, some never came up, and 1/3 were never documented.

one option is to ignore the new tag in display, provisionally --- until above step complete.

Why not?

drdhaval2785 commented 7 years ago

Are all the markups reasonable?

They were manually examined and added. There were some which were not worth and rejected.

drdhaval2785 commented 7 years ago

Have we missed any markup?

Yes. A thorough review of acc4.txt will uncover some tags surely. In fact a separate issue is opened to add and discuss additional tags.

drdhaval2785 commented 7 years ago

Is a new abbreviation description table required (so as to provide tooltips)

Yes. Not yet made. Pending.

drdhaval2785 commented 7 years ago

Regarding appropriateness and comprehensiveness of tags, I request Jim to develop some code independently without being influenced by my approach. This will help us compare the output of my method with that of his method and refine both algorithms. This yielded good results in PW, PWG literary source identification.

funderburkjim commented 7 years ago

Jim to develop some code independently ...

I want to get back to the IAST conversion for other dictionaries, and give attention to the simple sanskrit spelling project that Marcis began. So it will be a 'while' before I think about such code.

Actually, I would prefer for you to convince me of the 'appropriateness and comprehensiveness of tags' rather than devising code to convince myself.

gasyoun commented 7 years ago

So it will be a 'while' before I think about such code.

So wise, so wise. :v:

drdhaval2785 commented 7 years ago

Actually, I would prefer for you to convince me of the 'appropriateness and comprehensiveness of tags' rather than devising code to convince myself.

Regarding comprehensiveness of the tags, at the cost of repetition, I write the steps followed for generating these taggings.

Procedure

  1. Searched ¦ ([^ A-Z]+)[.,] in acc3.txt.
  2. This found out total of 243 entities matching this regex.
  3. The reason of ignoring A-Z was that they were mainly person names or book names.
  4. Out of those 243, the entries having more than one book were thought appropriate as tags. Presumption is that there should be more than one entry for it to be called a subject. This presumption may not be correct. It will not be much work to identify the missed tags among these left-out cases if others feel that it is important.
  5. These tags were segregated into four categories manually. Names are self explanatory.
    subject_tags = [u'dh',u'jy',u'tantr',u'vedānta',u'ny',u'med',u'kāvya',u'gr',u'śr',u'stotra',u'alaṃk',u'nāṭaka',u'vaid',u'bhakti',u'paur',u'mīm',u'an',u'yoga',u'tantra',u'Āpast',u'lex',u'śaiva',u'vaiś',u'Ṛv',u'astrol',u'archit',u'lexicon',u'music',u'Āśval',u'metrics',u'bhāṇa',u'grammar',u'prayoga',u'campū',u'mantra',u'śilpa',u'prahasana',u'vaiṣṇava',u'glossary',u'augury',u'astron',u'erotic',u'āgama',u'mahākāvya',u'gṛhya',u'nīti',u'sāṃkhya',u'nāṭikā',u'vocabulary',u'kāmaśāstra',u'poetry',u'vyāyoga',u'anthology',u'vaidic',u'cookery',u'palmistry',u'caritra',u'veterinary',u'pl',u'sculpture',u'adhy',u'chāyānāṭaka',u'Āśv',u'laghu',u'tales',u'vāstuśāstra',u'svaraśāstra',u'math',u'vedāṅga',u'toxicology',u'kāvyaṭīkā',u'saṭṭaka',u'algebr',u'oneiromancy',u'brāhmaṇa',u'mus',u'erotics',u'algebra',u'geometry',u'Śānkh',u'enigmatology',u'poem',u'nītiśāstra',u'miśrabhāṇa',u'chandas',u'śrāddha',u'veter',u'chem',u'stuti',u'syntax',u'gaṇita',u'dhātupāṭha',u'saṃgīta',u'incantations',u'kathā',u'cer',u'stoma']
    person_type_tags = [u'poet',u'grammarian',u'astronomer',u'king',u'poetess',u'lexicographer',u'philosopher',u'lawyer',u'author',u'minister',u'disciple',u'pupil',u'guru',u'son',u'brother',u'uncle',u'nephew',u'father']
    non_subject_tags = [u'ibid',u'(?)',u'Śg',u'i',u'q',u'wrote',u'i.e',u'Śp',u'etc',u'(which?)',u'Śabdenduśekharaṭīkā',u'Āndhra',u'poets',u'bṛhat',u'work',u'(modern)',u'grammatical',u'(lex',u'Śrāddhapaddhati',u'Ṣaḍvargaphala',u'vedabhāṣya',u'',u'śaivabhāṣya',u'five',u'Āraṇyakagānabhāṣya']
    possible_error_tags = [u'(Ṛv',u'(v',u'(q',u'(i',u'vedanta',u'metries',u'kavya']
  6. person_type_tags - some of them were manually added like 'minister', 'grandfather' etc.
  7. Then replacements for such tags were done in all entries with the following regex rep = re.sub(u'([^>\w])('+tag+u')([^\w<])','\g<1><ab type="subj">\g<2></ab>\g<3>',line,re.U).
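
The replacement in the last step can be sketched in Python 3 as follows. It is an illustration under assumptions, not the project's subject_tag_addition.py: the name add_tag, the tag_type parameter, the use of re.escape and the skipping of meta-lines are additions for this sketch. One detail worth noting is that the fourth positional parameter of re.sub is count, so the flag is passed here by keyword.

```python
import re

def add_tag(line, tag, tag_type='subj'):
    """Wrap standalone occurrences of tag in <ab type="..."> markup."""
    # Assumption: meta-lines and record terminators are never tagged.
    if line.startswith('<L>') or line.startswith('<LEND>'):
        return line
    # Same shape as the regex above: not preceded by '>' or a word character,
    # not followed by a word character or '<', so tagged text is not re-tagged.
    pattern = r'([^>\w])(' + re.escape(tag) + r')([^\w<])'
    repl = r'\g<1><ab type="%s">\g<2></ab>\g<3>' % tag_type
    return re.sub(pattern, repl, line, flags=re.UNICODE)

print(add_tag('{#agnikarman#}¦ med. B. 4, 216.', 'med'))
print(add_tag('<L>44<pc>1-002,1<k1>agnikarman<k2>agnikarman', 'med'))
```

Running add_tag again on an already tagged line leaves it unchanged, because the tag text is then preceded by '>'.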

Comprehensiveness

There are three possible lacunae in comprehensiveness.

  1. Presumption that every possible tag fits into ¦ ([^ A-Z]+)[.,] may not be true. If there are tags missed out by this method (e.g. minister / king etc), the tag is not included, unless manually keyed in.
  2. Some of the single entry items may also qualify for tags e.g. sorcery. PENDING to do.
  3. The tags running into two lines may also be missed. There were total of 12 such missed cases. They were identified by analyze_acc4.py and corrections were generated and appended to manualByLine01.txt.

Appropriateness

As can be seen from non_subject_tags and possible_error_tags, the non-appropriate tags were weeded out. Only subject_tags and person_type_tags were further processed. So manual examination was carried out to ensure appropriateness.

Further semantic classification - TODO

This is the person_type_tags as it stands today.

person_type_tags = [u'poet',u'grammarian',u'astronomer',u'king',u'poetess',u'lexicographer',u'philosopher',u'lawyer',u'author',u'minister',u'disciple',u'pupil',u'guru',u'son',u'brother',u'uncle',u'nephew',u'father']

In my opinion, there are two classes here.

person_attribute_tags = [u'poet',u'grammarian',u'astronomer',u'king',u'poetess',u'lexicographer',u'philosopher',u'lawyer',u'author',u'minister']
person_relationship_tags = [u'disciple',u'pupil',u'guru',u'son',u'brother',u'uncle',u'nephew',u'father']

Such fine graining of semantic data is also possible. PENDING to do.

I would like to see the reactions of others before we can finalize this acc4.txt.

As the procedure is repeatable, the next step of catalogue-tagging will not be much affected by the choices we make here.

gasyoun commented 7 years ago

This presumption may not be correct.

It works. The rest is fine tuning that might not occur the next 100 years.

In my opinion, there are two classes here.

Ah, too detailed, Dhaval, leave it :accept: Otherwise the next step after person_relationship_tags will be woman tags inside it etc. It's already good and discovers new worlds and modes to explore MSS.

drdhaval2785 commented 7 years ago

The author's remarks confirm that he made no systematic analysis of abbreviations and gave no list of subject tags. So the onus lies on us.

<P>The abbreviations used are for the most part quite clear. <ab type="subj">an</ab>. anonymous, <ab type="subj">dh</ab>. dharma, fr. fragmentary,
<>gr. grammatical, <ab type="subj">ny</ab>. nyāya, <ab type="subj">tantr</ab>. tantric. Skm. is the Sūktikarṇāmṛta by Śrīdharadāsa, of which I have copied
<>the only two MSS. which hitherto have been discovered. Sbhv. is the Subhāṣitāvali by Vallabhadeva. With Śp.
<>I refer to my analysis of the Śārṅgadharapaddhati in Vol. 27 (1873) of the Zeitschrift of the German Oriental
<>Society, with Rāyamukuṭa to my Paper on his Padacandrikā, ibid. Vol. 28 (1874) p. 109.

gasyoun commented 7 years ago

most part quite clear

There is a devil in this phrase.

drdhaval2785 commented 7 years ago

@funderburkjim I am closing this thread. acc4.txt now has reasonably appropriate tags. Comprehensiveness can be enhanced as and when we come across new tags. Discussions can continue at #152 .

drdhaval2785 commented 7 years ago

For the record, acc4_stats.py is the code and acc4_stats.txt is the file holding the statistical information. It is not too long, so I reproduce it here as well. As of 7 June 2017:

son:4466
dh:2290
jy:2157
inc:1706
tantr:1650
vedānta:1391
father:1270
med:1138
ny:1120
kāvya:1047
poet:946
gr:808
pupil:738
fr:705
nāṭaka:459
stotra:447
śr:435
alaṃk:381
guru:379
author:363
bhakti:330
an:329
king:302
vaid:276
Āpast:231
mīm:228
brother:226
grammar:211
yoga:210
paur:193
lex:179
tantra:155
Ṛv:139
śaiva:133
Āśval:127
vaiś:125
grammarian:115
poem:113
adhy:91
lexicon:88
music:85
astronomer:82
bhāṇa:79
metrics:76
medical:76
glossary:71
astrol:69
archit:64
Prākṛt:59
vaidic:57
campū:51
minister:49
prayoga:43
vocabulary:41
prahasana:38
disciple:38
laghu:36
dharma:36
anthology:32
nephew:30
lexicographer:30
phonetics:29
vaiṣṇava:29
mantra:28
śilpa:28
augury:27
nāṭikā:27
uncle:25
poetess:24
mahākāvya:24
erotic:24
astron:23
poetry:22
gṛhya:21
vyāyoga:20
ceremonies:20
sāṃkhya:18
philosopher:18
āgama:17
astronomical:15
materia medica:15
nīti:15
metres:14
lawyer:14
Kāmaśāstra:13
tales:13
funeral:13
kāmaśāstra:12
rites:12
precious stones:11
cookery:11
alaṃkāra:10
divination:9
conjugation:9
veterinary:9
Vyavahāra:9
śrāddha:9
ceremony:9
physician:8
palmistry:8
math:8
chāyānāṭaka:8
elephants:7
diseases:7
Āśv:7
jain:7
castes:6
play:6
smṛti:6
saṭṭaka:6
syntax:5
caritra:5
declension:5
architecture:5
paradigms of declension:5
accents:5
bhāṇikā:5
pl:5
sculpture:5
chandas:5
pilgrimage:5
vāstuśāstra:4
svaraśāstra:4
ordeals:4
letter-writing:4
vedāṅga:4
geometry:4
Śānkh:4
obsequies:4
Uṇādis:4
algebra:4
ācāra:4
jaina:3
brāhmaṇa:3
dhātupāṭha:3
images:3
mus:3
worship:3
prāyaścitta:3
miśrabhāṇa:3
toxicology:3
śaiva vedānta:3
enigmatology:3
buddhistic:3
śrauta:3
horses:3
sorcery:3
warfare:3
saṃskārāḥ:3
kāvyaṭīkā:3
saṃgīta:2
roots:2
kathā:2
algebr:2
cer:2
dramatic action:2
metals:2
mystic diagrams:2
gender:2
oneiromancy:2
dancing:2
chess-play:2
chem:2
saṃnyāsa:2
erotics:2
inheritance:2
khila:2
drama:2
botany:2
singing:2
chess:2
marriage:2
nītiśāstra:2
compound nouns:2
veter:2
royal polity:2
philosophy:2
nouns:2
military tactics:2
gaṇita:1
gems:1
hunting:1
incantations:1
witchcraft:1
stoma:1
omina:1
ākhyāyikā:1
verbs:1
geogr:1
conjugations:1
astrologer:1
nirukta:1
stuti:1
pregnancy:1
jugglery:1
strategy:1
gṛhyaprayoga:1
aphrodisiacs:1
metr:1
kṛṣiśāstra:1
khaṇḍakāvya:1
land-surveying:1
ethics:1
archery:1
painting:1
funderburkjim commented 6 years ago

@drdhaval2785

I'm having trouble recreating acc4.txt. Have made no changes to the dev server version.

I'm referring to pywork/update.sh and to pywork/correctionwork/issue-cologne-142/readme.md.

update.sh looks fishy, as there are TWO steps which appear to modify orig/acc4.txt.

Also, update.sh instructions seem inconsistent with the instructions under '## Redo step' of readme.md.

Hope you can revise instructions as needed so acc4.txt can be properly reconstructed.

drdhaval2785 commented 6 years ago

@funderburkjim

I will look into it, make the necessary changes, and inform you here on GitHub.