Closed drdhaval2785 closed 7 years ago
A cursory analysis gives the following classification.
The upper items are true. Some are false positives like ibid, i etc.
The lower ones may be typos.
dh:1951
jy:1587
tantr:1463
vedānta:1037
ny:950
poet:863
med:824
kāvya:683
gr:540
śr:417
stotra:382
alaṃk:292
nāṭaka:259
vaid:257
bhakti:249
ibid:204
paur:180
mīm:179
an:172
yoga:147
tantra:138
Āpast:132
lex:123
śaiva:103
vaiś:101
Ṛv:74
(?):67
astrol:61
Śg:59
grammarian:57
archit:55
lexicon:54
music:53
astronomer:53
Āśval:52
metrics:50
i:50
king:47
bhāṇa:44
grammar:38
prayoga:35
campū:33
mantra:25
śilpa:25
prahasana:23
poetess:20
vaiṣṇava:20
glossary:20
augury:19
astron:19
erotic:17
āgama:16
lexicographer:16
mahākāvya:14
gṛhya:13
nīti:13
sāṃkhya:12
nāṭikā:12
vocabulary:11
kāmaśāstra:9
Śāṅkh:9
poetry:9
q:9
philosopher:8
vyāyoga:7
anthology:7
vaidic:6
lawyer:6
cookery:6
palmistry:6
caritra:5
wrote:5
(Ṛv:5
veterinary:5
pl:5
(v:4
i.e:4
sculpture:4
adhy:4
author:4
chāyānāṭaka:4
Āśv:4
laghu:4
Śp:4
tales:4
vāstuśāstra:3
(q:3
svaraśāstra:3
(i:3
vedanta:3
math:3
vedāṅga:3
etc:3
toxicology:3
(which?):3
metries:3
Śabdenduśekharaṭīkā:3
kāvyaṭīkā:3
saṭṭaka:2
algebr:2
oneiromancy:2
brāhmaṇa:2
Āndhra:2
mus:2
poets:2
erotics:2
algebra:2
geometry:2
Śānkh:2
enigmatology:2
poem:2
bṛhat:2
kavya:2
work:2
nītiśāstra:2
miśrabhāṇa:2
chandas:2
śrāddha:2
veter:2
chem:2
composed:1
stuti:1
syntax:1
gaṇita:1
(modern):1
dhātupāṭha:1
saṃgīta:1
grammatical:1
incantations:1
(lex:1
kathā:1
Śrāddhapaddhati:1
Ṣaḍvargaphala:1
vedabhāṣya:1
pañcākṣara:1
śaivabhāṣya:1
Śivarātrivratodyāpaṇam:1
five:1
cer:1
Āraṇyakagānabhāṣya:1
stoma:1
(śv:1
Śivagaurīsaṃvāda:1
Āgamārṇava:1
dig:1
52:1
p:1
nātaka:1
Ānandadīpikā:1
Āsurīkalpaḥ:1
kāvyā:1
tantr.:1
kāṇva:1
physician:1
snānādipūjāntapaddhati:1
pāśupata:1
Āpastambasūtrabhāṣya:1
Śatarudriyajape:1
ākhyāyikā:1
archit.:1
orthographical:1
geom:1
probably:1
mādhyaṃdinaśākhīya:1
alaṃkāra:1
‘prayoga’:1
Śiśupālavadhaṭīkā:1
jugglery:1
seventy:1
pur:1
śv:1
daśadoṣagrantha:1
(jy.):1
Śarabhakavaca:1
khila:1
title:1
aśvalakṣaṇa:1
gajaśāstra:1
or:1
śvet:1
Ādyāvidyāprakaraṇa:1
vihārakārikā:1
(jaina?):1
1572--85:1
alone:1
geogr:1
ethics:1
ancient:1
Āpaduddhārabhairavapañcāṅga:1
buddhistic:1
phonetics:1
strategy:1
Śataślokīṭīkā:1
bhāṇikā:1
a:1
rūpaka:1
modern:1
Āśvalāyanopayogyādhānaprakaraṇa:1
Āgneya:1
mathem:1
gṛhyaprayoga:1
sorcery:1
(yajñīyāni):1
aphrodisiacs:1
prākṛtakāvya:1
padasaṃkhyā:1
Āndhravyākaraṇa:1
Śṛṅgārasāriṇī:1
astrologer:1
minister:1
Śākalokta:1
Śākinīcaritraviṣaya:1
śrauta:1
kriyāpāda:1
(inc.):1
(ch:1
medical:1
śivapūjā:1
metr:1
prāyaścittam:1
khaṇḍakāvya:1
nirukta:1
(p:1
śikṣā:1
kṛṣiśāstra:1
śākuna:1
Ānandalaharī:1
nighaṇṭu:1
smṛti:1
72:1
Āgama:1
land-surveying:1
pra{??}:1
Śāntiśatakaṭīka:1
jy.:1
archery:1
taṅtr:1
seventy-two:1
śaivāgama:1
{??}r:1
mantranighaṇṭu:1
jain:1
(vedānta):1
Some are false positives like ibid
Why not mark them as abbreviations? English abbreviations.
@funderburkjim,
Q - Where should this subject tagging be done? In acc.txt or acc.xml? Will code accordingly. The module is almost ready.
Are these subject codes enumerated in some preface or appendix of ACC? Does the author specifically mention that he classifies works by subject?
How exactly are you suggesting to mark them.
What about @gasyoun 's suggestion to mark them as abbreviations? Should there be a distinction between 'subjects' and other kinds of abbreviations?
What is the purpose of marking these ?
As to where to add markup, probably this would be best done in acc.txt. In fact, these could perhaps be done by the standard 'correction' method (updateByLine). This would allow very fine-grained control of exceptions.
If the markup were done by <ab>x</ab> or <ab type="subj">x</ab>, then changes would need to be made to disp.php. Would we need to have a separate table of 'abbreviation definitions' to use in display to generate tooltips? Or are the subject codes adequately self-explanatory?
At this point, this proposed addition still seems experimental. In particular, it should not yet impinge upon the production version of acc. One possibility for exploring this as an experiment would be to make an experimental copy of acc on dev server:
cp -r acc accsubj
chmod -R 0755 accsubj
And then develop to heart's content with the copy, until ideas gel.
Or maybe this could be done with git branches --- I have no experience with these.
Are these subject codes enumerated in some preface or appendix of ACC?
No, nowhere. As you are aware even a full list of abbreviations is not always a part of the book, there are additions needed.
Does the author specifically mention that he classifies works by subject?
Never ever.
Should there be a distinction between 'subjects' and other kinds of abbreviations?
I guess yes, at least there could be some subgroups: topics on one side, non-topics on the other.
What is the purpose of marking these ?
It's already there. It is called structured data. Now we have a plain text file. Initially, the book was not plain, actually. So we bring back how it could have been structured, without losing anything that was inside the real book.
If the markup were done by <ab>x</ab> or <ab type="subj">x</ab>, then changes would need to be made to disp.php.
The disp approach is not a good one. It means the markup will work only on the web, but the web is not the main goal. Android, Windows, Linux: that is where we want to spread, so disp.php would be the last place to go, and only as a bonus.
Or are the subject codes adequately self-explanatory?
They never are, tooltip always welcome. But what's important now is to code it, the display solution can come after.
In particular, it should not yet impinge upon the production version of acc.
Because... risky?
@gasyoun comments confirm that the right place to do the markup is within acc.txt.
If we structure the acc.txt markup in an xml form (<ab>x</ab>), then that markup will flow smoothly into acc.xml; no change in make_xml.py will be required.
Regarding changes to 'disp': this is needed so that the additional markup within acc.xml will be displayed in some appropriate manner.
Since the printed text does not describe the abbreviations, we should definitely include as part of the task a table of abbreviation definitions.
Because... risky?
Not exactly. Rather, it needs to be done carefully, and there is considerable work to be done before this additional markup is ready for the production version. When this preparatory work is done, then I see no problem in adding this enhancement to acc production version. It makes sense to me to do this preparatory work on an experimental version of acc. Then when the experimental version is ready, it can be merged into the production version.
Regarding changes to 'disp': this is needed so that the additional markup within acc.xml will be displayed in some appropriate manner.
That I do understand. But I also understand that making too many different display files is a one-way ticket. We need one universal file, and not 33 different ones, right?
It makes sense to me to do this preparatory work on an experimental version of acc. Then when the experimental version is ready, it can be merged into the production version.
I see.
one universal file, and not 33 different ones
Agree in part. However, there will always be a desire to bring out the full glory of a particular dictionary.
However, in so far as feasible, we should use a common markup vocabulary among dictionaries. For instance, the <ab> (abbreviation) markup seems like a good choice, that should be usable among many dictionaries; and that therefore can be also displayed in a universal way.
This far, we have made good progress with the 'meta' lines in devising a more uniform structure that should be applicable to all the dictionaries, and that should allow for a uniform treatment of alternate headwords.
We have not yet given the same attention to a 'universal' display (i.e. a universal 'disp.php').
And similarly, with the xml structure (xxx.dtd). We have agreed upon the idea that a universal dtd is desirable, but have not yet worked through the details of making this idea a reality.
1)
We have not yet given the same attention to a 'universal' display (i.e. a universal 'disp.php').
Yeah, but that is a CSS field as well, I guess.
2)
And similarly, with the xml structure (xxx.dtd).
@fxru has left us, so?
How exactly are you suggesting to mark them.
<ab type="subj">vedAnta</ab>
<ab type="pers">astronomer</ab>
<ab type="book">Oppert</ab>
There are at least three different types in ACC. First two are relatively straightforward. Third is the reference book item. Other dictionaries already have some lit resource tags. They can also be used.
And one more thing. Abbreviation is a misnomer. Some of them are full words, as I showed. Does it feel idiotic to use the ab tag for them?
I earlier thought about a <subj>ABC</subj> tag. But now we should also keep uniform XML and DTD in mind too. So decision of tag name should be fairly universal.
Should there be a distinction between 'subjects' and other kinds of abbreviations?
As I showed in above examples, three (maybe more) types carry semantically different data. So distinction is necessary to use it in future.
What is the purpose of marking these ?
Let us say a researcher wants to locate all manuscripts on lexicon. With subject tagging his life will become easier. If someone wants to write a piece on astronomers of India, he will benefit.
In fact, these could perhaps be done by the standard 'correction' method (updateByLine).
Are you sure we want to do 15000+ lines corrections via this method?
Would we need to have a separate table of 'abbreviation definitions' to use in display to generate tooltips? Or are the subject codes adequately self-explanatory?
Subject codes are not that self-explanatory. So a tooltip will be required. E.g. paur for पौराणिक (paurāṇika).
git branches
I have tried it earlier. Not very trustworthy. Not because of any flaw in git, but because of our own inabilities. So let us stick to good old copy-paste into some folder in correctionwork and work on it till it is ready.
table of abbreviation definitions.
Absolutely yes. Once we have culled out all subject tags, we will have to prepare definition file for tooltip and also prepare a log of how many times a particular tag appears for statistical purposes.
Any idea what proportion of entries have the classification items?
By your comment, the classifications are chosen from a particular spot (just after the broken bar) in an entry. Do these classification 'abbreviations' occur elsewhere in records?
Have you already culled out the false positives from the list shown above (e.g. 'i' 50)?
'Some folder in correctionwork' - Ok for some of the work. But, doing the work in a separate copy of the acc repository might allow more flexibility in exploration. We don't want to burden the final repository with lots of extra stuff.
@gasyoun comments confirm that the right place to do the markup is within acc.txt.
As a developer I see this as the best choice, but as a reader, I actually read acc.txt sometimes. Once we add tags in every line, its readability goes down. But yes, creating a version of acc.txt with the tags stripped will serve my purpose. So I also concur on making changes in acc.txt for better downstream usability.
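A tag-stripped reading copy could be produced along these lines. This is only a sketch under assumptions: it targets Python 3, the function name `strip_ab_tags` is made up for illustration, and it removes only `<ab>` markup while leaving every other tag in the line alone.

```python
import re

# Matches an opening <ab ...> or a closing </ab>. The \b keeps the pattern
# from touching other element names that merely start with "ab".
AB_TAG = re.compile(r'</?ab\b[^>]*>')


def strip_ab_tags(line):
    """Return the line with <ab> markup removed and the tagged text kept."""
    return AB_TAG.sub('', line)


sample = '{#agnikarman#}¦ <ab type="subj">med</ab>. B. 4, 216.'
print(strip_ab_tags(sample))
```

Running this over every line of the tagged file would give a plain version for reading, while the tagged file stays the source of record.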
I earlier thought about a <subj>ABC</subj> tag. But now we should also keep uniform XML and DTD in mind too. So decision of tag name should be fairly universal.
Universal as it might (need to) be, do not see what's bad about it.
If someone wants to write a piece on astronomers of India, he will benefit.
Totally agree. The more if we add stats - how many of each abbreviation occur in dictionary.
I actually read acc.txt sometimes.
So do I.
By your comment, the classifications are chosen from a particular spot (just after the broken bar) in an entry. Do these classification 'abbreviations' occur elsewhere in records?
Yes. They do occur. The broken bar area is used to identify the abbreviations. Then \W+abbrv\W+ is used to identify the occurrences of these tags in acc.txt.
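The \W+abbrv\W+ search can be sketched as follows. This is an illustration, not the code used for the tagging: it assumes Python 3, and the helper name `find_occurrences` is hypothetical.

```python
import re


def find_occurrences(line, abbrv):
    """Return the start offsets of abbrv where it is surrounded by
    non-word characters, following the \\W+abbrv\\W+ pattern."""
    pattern = re.compile(r'\W+(' + re.escape(abbrv) + r')\W+')
    return [m.start(1) for m in pattern.finditer(line)]


sample = '{#agnikarman#}¦ med. B. 4, 216.'
for start in find_occurrences(sample, 'med'):
    print(start, sample[start:start + len('med')])
```

Two properties of the pattern are worth keeping in mind. It needs at least one non-word character on each side, so an abbreviation at the very start or end of a line is not found. And because each match consumes its trailing delimiter, two occurrences separated by a single delimiter run can yield only one hit.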
The subject tagging work is over now. Total 26849 items tagged. The remaining subject tagging can be done as and when they are encountered, just like regular corrections.
@funderburkjim The code and intermediates are in pywork/correctionwork/issue-cologne-142 folder. The output is stored as orig/acc4.txt.
acc4.txt is not yet kept as acc.txt. Once you are OK with the quality of the tagging, we can think of replacing acc.txt with acc4.txt.
README for record
Please see issue 142.
cp ../../manualByLine01.txt prev_manualByLine01.txt
[modify either of subject_tags_changes_step0.txt or manual_examination1.txt as needed] Then,
cat prev_manualByLine01.txt subject_tags_changes_step0.txt manual_examination1.txt > temp_manualByLine01.txt
Now, as per pywork/update.sh, and in pywork directory:
cp correctionwork/issue-cologne-142/temp_manualByLine01.txt manualByLine01.txt
python updateByLine.py ../orig/acc3.txt manualByLine01.txt ../orig/acc4.txt
Please see issue 142.
python subject_tagging.py ../../../orig/acc.txt subject_tagging.txt > log.txt
Finds out cases which match '¦ ([^ A-Z]+)[.,]'.
This is potential identification of tag words.
subject_tagging.txt has headword:tag in each line.
log.txt has details about the frequency of occurrence of the tag word.
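The candidate search and frequency log described above can be sketched like this. It is not subject_tagging.py itself: the function name `candidate_tags` is hypothetical, Python 3 is assumed, and the sample lines are the thread's example entries with the markup removed.

```python
import re
from collections import Counter

# The pattern quoted in the README: the word right after the broken bar,
# containing no space and no ASCII capital, followed by '.' or ','.
CANDIDATE = re.compile(r'¦ ([^ A-Z]+)[.,]')


def candidate_tags(lines):
    """Count the candidate tag word that follows the broken bar in each line."""
    counts = Counter()
    for line in lines:
        match = CANDIDATE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


lines = [
    '{#agAravinoda#}¦ archit. by Durgāśaṅkara. NW. 554.',
    '{#agnikarman#}¦ med. B. 4, 216.',
    '{#agnikARqavrAhmaRa#}¦ Oppert II, 4441.',
    '{#agnikArya#}¦ dh. Burnell 150^b. Taylor 1, 275.',
    '{#agnikAryaprayoga#}¦ śr. Oppert II, 3951.',
]
for tag, count in candidate_tags(lines).most_common():
    print('%s:%d' % (tag, count))
```

The entry that begins with Oppert yields no candidate, because the character class excludes ASCII capitals. Capitals outside ASCII, such as the Ā in Āpast, are not excluded, which matches the presence of such items in the frequency list.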
Screen out invalid tags manually from log.txt and segregate the subject_tags and person_type_tags manually.
See subject_tag_addition.py variables subject_tags, person_type_tags, non_subject_tags, possible_error_tags for segregation.
tags having single entries are best left out. To be called a subject, at least two books are advisable.
python subject_tag_addition.py ../../../orig/acc.txt acc_with_subject_tags.txt subject_tags_changes_step0.txt
generates the acc_with_subject_tags.txt file with subject tagging.
Also generates a file manual_examination.txt which requires manual examination.
cp manual_examination.txt manual_examination1.txt
This step is kept manual, so that accidental overwrite is avoided.
After manual examination of manual_examination1.txt , the relevant entries are kept. Rest are deleted.
Note -
There is a tricky issue as regards anthology (an). The word 'an' is a common English word occurring ubiquitously. So a separate script anthology.py was used to identify only the words qualifying to be tagged.
NOTE: Do not do this here. Do the similar step in pywork/update.sh.
Update the manual corrections and generate acc4.txt
python updateByLine.py correctionwork/issue-cologne-142/acc_with_subject_tags.txt correctionwork/issue-cologne-142/manual_examination1.txt ../orig/acc4.txt
Example
<L>43<pc>1-002,1<k1>agAravinoda<k2>agAravinoda
{#agAravinoda#}¦ <ab type="subj">archit</ab>. by Durgāśaṅkara. NW. 554.
<LEND>
<L>44<pc>1-002,1<k1>agnikarman<k2>agnikarman
{#agnikarman#}¦ <ab type="subj">med</ab>. B. 4, 216.
<LEND>
<L>45<pc>1-002,1<k1>agnikARqavrAhmaRa<k2>agnikARqavrAhmaRa
{#agnikARqavrAhmaRa#}¦ Oppert II, 4441. <symbol n="C.">C.</symbol> II, 4442. See
<>Agnibrāhmaṇa, Agnirahasyakāṇḍa.
<LEND>
<L>46<pc>1-002,1<k1>agnikArya<k2>agnikArya
{#agnikArya#}¦ <ab type="subj">dh</ab>. Burnell 150^b. Taylor 1, 275.
<LEND>
<L>47<pc>1-002,1<k1>agnikAryaprayoga<k2>agnikAryaprayoga
{#agnikAryaprayoga#}¦ <ab type="subj">śr</ab>. Oppert II, 3951.
<LEND>
<L>48<pc>1-002,1<k1>agnikumAra<k2>agnikumAra,
{#agnikumAra,#}¦ a name of Viṭṭhala, <ab type="pers">son</ab> of Vallabhācārya.
<>Hall p. 147.
<LEND>
......
<L>187<pc>1-005,1<k1>acyuta<k2>acyuta,
{#acyuta,#}¦ <ab type="pers">minister</ab> to Śivasiṃha, <ab type="pers">king</ab> of Mithilā, <ab type="pers">father</ab> of
<>Ratnapāṇi (Kāvyadarpaṇa), <ab type="pers">father</ab> of Ravi (Kāvya-
<>prakāśaṭīkā). Peters. 3, 333.
<LEND>
....
<L>1117<pc>1-030,1<k1>aruRadatta<k2>aruRadatta
{#aruRadatta#}¦ <ab type="pers">lexicographer</ab> and <ab type="pers">grammarian</ab>. Quoted by Ujjva-
<>ladatta and Rāyamukuṭa. See Gaṇaratnamahodadhi
<>p. 119.
<LEND>
Two values, pers and subj, are assigned to the type attribute of the ab tag.
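To show how the two type values could be read back out of a tagged line, here is a small sketch. It assumes Python 3 and that every `<ab>` element carries exactly one `type` attribute written with double quotes, as in the example above; the name `tags_by_type` is hypothetical.

```python
import re
from collections import defaultdict

# Captures the type attribute value and the tagged text of each <ab> element.
AB = re.compile(r'<ab type="([^"]+)">([^<]*)</ab>')


def tags_by_type(line):
    """Group the tagged words of one line by their type attribute."""
    grouped = defaultdict(list)
    for kind, text in AB.findall(line):
        grouped[kind].append(text)
    return dict(grouped)


sample = ('{#aruRadatta#}¦ <ab type="pers">lexicographer</ab> and '
          '<ab type="pers">grammarian</ab>. Quoted by Ujjva-')
print(tags_by_type(sample))
```

A query such as "all entries tagged as astronomer" then reduces to filtering entries whose pers list contains that word.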
@drdhaval2785 I pulled your changes from dev server, to investigate the subject tagging work you've done.
In the course of this, I noticed that 'orig/acc.txt' had a different number of lines (2 less) than 'orig/acc3.txt'
$ wc -l acc3.txt
210469 acc3.txt
$ wc -l acc.txt
210467 acc.txt
I checked on server, and the same held there.
So, somehow acc.txt lost two lines: the two blank lines at line# 37552 and line# 37553 that result from removing cintAmaRi headword. When I reran update_sync on my server, this restored the two blank lines in acc.txt. I pushed this change to dev server (commit d6e0511f4e).
I'd love to know the exact reason this happened - Is there some way in git to track the history of acc.txt? If so, this might provide a clue.
A slight reorganization of code makes ALL the 25000+ changes in the form of normal line update changes.
A slight efficiency in one program (subject_tag_addition.py) uncovered a bug, in that some abbreviations were previously added to the meta-lines.
--- WRONG. now corrected.
> <L>6053<pc>1-155,2<k1><ab type="pers">guru</ab><k2><ab type="pers">guru</ab>
and a few other similar instances.
Now that the procedure is repeatable and modifiable, we need to spend some amount of time understanding just what we've got in acc4.txt.
<ab> tags will just pass through.
@drdhaval2785 The changes mentioned above have been pushed to dev server.
@drdhaval2785 Have modified update.sh and update_sync.sh to take into account acc4.txt, but have not yet run update_sync.sh --- so acc.txt, acc.xml, etc. do not reflect acc4.txt. I wanted to ponder the new markup a bit more before pushing to production.
@drdhaval2785 Would you write readme's for issue-cologne-141 and issue-cologne-148?
So, somehow acc.txt lost two lines: the two blank lines at line# 37552 and line# 37553 that result from removing cintAmaRi headword.
That may have something to do with the fact that I took acc.txt as it existed before that change and then produced acc4.txt. When inverted, there was some mismatch, obviously because of the change made in acc.txt in interim. I thought that the blank lines are somehow erroneous and I removed them.
Question - in the current update system, you can't delete a line. In the present case too, ideally the two lines should have been deleted, but in practice they were replaced by blank lines.
But I do understand that some of the items like manualUpdateByLine do use line number. So removing two blank lines will alter their behaviour.
Ah, Good. That explains it.
updateByLine does not handle inserts or deletes, only changes ... as you said.
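The point about changes only can be illustrated with a sketch. This is not updateByLine.py and does not use its input format; the function `apply_line_changes` and the dictionary of changes are assumptions made for illustration, in Python 3.

```python
def apply_line_changes(lines, changes):
    """Return a copy of lines in which each 1-based line number listed in
    changes is replaced by its new text. Nothing is inserted or deleted,
    so every later line keeps its number."""
    updated = list(lines)
    for line_number, new_text in changes.items():
        if not 1 <= line_number <= len(updated):
            raise IndexError('line %d is out of range' % line_number)
        updated[line_number - 1] = new_text
    return updated


original = ['first entry', 'entry to drop', 'last entry']
# "Deleting" an entry means blanking its line, not removing it.
result = apply_line_changes(original, {2: ''})
print(len(original), len(result))
print(result)
```

Keeping the line count fixed is what lets corrections keyed by line number stay valid, which is why removing the two blank lines caused a mismatch.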
tags having single entries are best left out. To be called a subject, at least two books are advisable.
I've seen a dictionary where lots of tags were noted in the intro. In real life some never came up, and a third were never documented.
one option is to ignore the new tag in display, provisionally --- until above step complete.
Why not?
Are all the markups reasonable?
They were manually examined and added. There were some which were not worth and rejected.
Have we missed any markup?
Yes. A thorough review of acc4.txt will uncover some tags surely. In fact a separate issue is opened to add and discuss additional tags.
Is a new abbreviation description table required (so as to provide tooltips)
Yes. Not yet made. Pending.
Regarding appropriateness and comprehensiveness of tags, I request Jim to develop some code independently without being influenced by my approach. This will help us compare the output of my method and his method and refine both algorithms. This yielded good results in PW, PWG literary source identification.
Jim to develop some code independently ...
I want to get back to the IAST conversion for other dictionaries, and give attention to the simple sanskrit spelling project that Marcis began. So it will be a 'while' before I think about such code.
Actually, I would prefer for you to convince me of the 'appropriateness and comprehensiveness of tags' rather than devising code to convince myself.
So it will be a 'while' before I think about such code.
So wise, so wise. :v:
Actually, I would prefer for you to convince me of the 'appropriateness and comprehensiveness of tags' rather than devising code to convince myself.
Regarding comprehensiveness of the tags, at the cost of repetition, I write the steps followed for generating these taggings.
Candidates were found by matching ¦ ([^ A-Z]+)[.,] in acc3.txt.
subject_tags = [u'dh',u'jy',u'tantr',u'vedānta',u'ny',u'med',u'kāvya',u'gr',u'śr',u'stotra',u'alaṃk',u'nāṭaka',u'vaid',u'bhakti',u'paur',u'mīm',u'an',u'yoga',u'tantra',u'Āpast',u'lex',u'śaiva',u'vaiś',u'Ṛv',u'astrol',u'archit',u'lexicon',u'music',u'Āśval',u'metrics',u'bhāṇa',u'grammar',u'prayoga',u'campū',u'mantra',u'śilpa',u'prahasana',u'vaiṣṇava',u'glossary',u'augury',u'astron',u'erotic',u'āgama',u'mahākāvya',u'gṛhya',u'nīti',u'sāṃkhya',u'nāṭikā',u'vocabulary',u'kāmaśāstra',u'poetry',u'vyāyoga',u'anthology',u'vaidic',u'cookery',u'palmistry',u'caritra',u'veterinary',u'pl',u'sculpture',u'adhy',u'chāyānāṭaka',u'Āśv',u'laghu',u'tales',u'vāstuśāstra',u'svaraśāstra',u'math',u'vedāṅga',u'toxicology',u'kāvyaṭīkā',u'saṭṭaka',u'algebr',u'oneiromancy',u'brāhmaṇa',u'mus',u'erotics',u'algebra',u'geometry',u'Śānkh',u'enigmatology',u'poem',u'nītiśāstra',u'miśrabhāṇa',u'chandas',u'śrāddha',u'veter',u'chem',u'stuti',u'syntax',u'gaṇita',u'dhātupāṭha',u'saṃgīta',u'incantations',u'kathā',u'cer',u'stoma']
person_type_tags = [u'poet',u'grammarian',u'astronomer',u'king',u'poetess',u'lexicographer',u'philosopher',u'lawyer',u'author',u'minister',u'disciple',u'pupil',u'guru',u'son',u'brother',u'uncle',u'nephew',u'father']
non_subject_tags = [u'ibid',u'(?)',u'Śg',u'i',u'q',u'wrote',u'i.e',u'Śp',u'etc',u'(which?)',u'Śabdenduśekharaṭīkā',u'Āndhra',u'poets',u'bṛhat',u'work',u'(modern)',u'grammatical',u'(lex',u'Śrāddhapaddhati',u'Ṣaḍvargaphala',u'vedabhāṣya',u'',u'śaivabhāṣya',u'five',u'Āraṇyakagānabhāṣya']
possible_error_tags = [u'(Ṛv',u'(v',u'(q',u'(i',u'vedanta',u'metries',u'kavya']
rep = re.sub(u'([^>\w])('+tag+u')([^\w<])','\g<1><ab type="subj">\g<2></ab>\g<3>',line,flags=re.U)
There are three possible lacunae in comprehensiveness.
The assumption that tags match ¦ ([^ A-Z]+)[.,] may not be true. If there are tags missed out by this method (e.g. minister / king etc), the tag is not included, unless manually keyed in.
sorcery. PENDING to do.
As can be seen from non_subject_tags and possible_error_tags, the inappropriate tags were weeded out. Only subject_tags and person_type_tags were further processed. So manual examination was carried out to ensure appropriateness.
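The re.sub step quoted above can be sketched as a standalone function. This is an illustration in Python 3, not subject_tag_addition.py: the name `add_subject_tag` is hypothetical, and the guard for meta lines is an assumption motivated by the meta-line bug reported earlier in this thread. The flags are passed by keyword because the fourth positional argument of re.sub is count, not flags.

```python
import re


def add_subject_tag(line, tag):
    """Wrap standalone occurrences of tag in <ab type="subj">...</ab>."""
    # Assumed guard: leave meta lines (<L>...<k1>...) and <LEND> untouched.
    if line.startswith('<L>') or line.startswith('<LEND>'):
        return line
    # A '>' before or a '<' after the word means it is already inside markup.
    pattern = r'([^>\w])(' + re.escape(tag) + r')([^\w<])'
    replacement = r'\g<1><ab type="subj">\g<2></ab>\g<3>'
    return re.sub(pattern, replacement, line, flags=re.U)


sample = '{#agnikarman#}¦ med. B. 4, 216.'
tagged = add_subject_tag(sample, 'med')
print(tagged)
# A second pass leaves the line unchanged, because the word now follows '>'.
print(add_subject_tag(tagged, 'med') == tagged)
```

Words such as 'an', which also occur as ordinary English, would still need the separate screening mentioned in the README; this sketch makes no attempt at that.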
This is the person_type_tags as it stands today.
person_type_tags = [u'poet',u'grammarian',u'astronomer',u'king',u'poetess',u'lexicographer',u'philosopher',u'lawyer',u'author',u'minister',u'disciple',u'pupil',u'guru',u'son',u'brother',u'uncle',u'nephew',u'father']
In my opinion, there are two classes here.
person_attribute_tags = [u'poet',u'grammarian',u'astronomer',u'king',u'poetess',u'lexicographer',u'philosopher',u'lawyer',u'author',u'minister']
person_relationship_tags = [u'disciple',u'pupil',u'guru',u'son',u'brother',u'uncle',u'nephew',u'father']
Such fine graining of semantic data is also possible. PENDING to do.
I would like to see the reactions of others before we can finalize this acc4.txt.
As the procedure is repeatable, so next step of catalogue-tagging will not be much affected by the choices we make here.
This presumption may not be correct.
It works. The rest is fine tuning that might not occur the next 100 years.
In my opinion, there are two classes here.
Ah, too detailed, Dhaval, leave it :accept: Otherwise the next step after person_relationship_tags
will be woman tags inside it etc. It's already good and discovers new worlds and modes to explore MSS.
Author's remarks confirm that he has not made systematic analysis of abbreviations or list for subject tags. So the onus lies on us.
<P>The abbreviations used are for the most part quite clear. <ab type="subj">an</ab>. anonymous, <ab type="subj">dh</ab>. dharma, fr. fragmentary, <>gr. grammatical, <ab type="subj">ny</ab>. nyāya, <ab type="subj">tantr</ab>. tantric. Skm. is the Sūktikarṇāmṛta by Śrīdharadāsa, of which I have copied <>the only two MSS. which hitherto have been discovered. Sbhv. is the Subhāṣitāvali by Vallabhadeva. With Śp. <>I refer to my analysis of the Śārṅgadharapaddhati in Vol. 27 (1873) of the Zeitschrift of the German Oriental <>Society, with Rāyamukuṭa to my Paper on his Padacandrikā, ibid. Vol. 28 (1874) p. 109.
most part quite clear
There is a devil in this phrase.
@funderburkjim I close this thread. The tags in acc4.txt are reasonably appropriate. Comprehensiveness can be enhanced as and when we come across new tags. Discussions can continue at #152 .
For record, acc4_stats.py is the code and acc4_stats.txt is the file holding the statistical information. Not too long. So reproducing it here also. As on 7 June 2017
son:4466
dh:2290
jy:2157
inc:1706
tantr:1650
vedānta:1391
father:1270
med:1138
ny:1120
kāvya:1047
poet:946
gr:808
pupil:738
fr:705
nāṭaka:459
stotra:447
śr:435
alaṃk:381
guru:379
author:363
bhakti:330
an:329
king:302
vaid:276
Āpast:231
mīm:228
brother:226
grammar:211
yoga:210
paur:193
lex:179
tantra:155
Ṛv:139
śaiva:133
Āśval:127
vaiś:125
grammarian:115
poem:113
adhy:91
lexicon:88
music:85
astronomer:82
bhāṇa:79
metrics:76
medical:76
glossary:71
astrol:69
archit:64
Prākṛt:59
vaidic:57
campū:51
minister:49
prayoga:43
vocabulary:41
prahasana:38
disciple:38
laghu:36
dharma:36
anthology:32
nephew:30
lexicographer:30
phonetics:29
vaiṣṇava:29
mantra:28
śilpa:28
augury:27
nāṭikā:27
uncle:25
poetess:24
mahākāvya:24
erotic:24
astron:23
poetry:22
gṛhya:21
vyāyoga:20
ceremonies:20
sāṃkhya:18
philosopher:18
āgama:17
astronomical:15
materia medica:15
nīti:15
metres:14
lawyer:14
Kāmaśāstra:13
tales:13
funeral:13
kāmaśāstra:12
rites:12
precious stones:11
cookery:11
alaṃkāra:10
divination:9
conjugation:9
veterinary:9
Vyavahāra:9
śrāddha:9
ceremony:9
physician:8
palmistry:8
math:8
chāyānāṭaka:8
elephants:7
diseases:7
Āśv:7
jain:7
castes:6
play:6
smṛti:6
saṭṭaka:6
syntax:5
caritra:5
declension:5
architecture:5
paradigms of declension:5
accents:5
bhāṇikā:5
pl:5
sculpture:5
chandas:5
pilgrimage:5
vāstuśāstra:4
svaraśāstra:4
ordeals:4
letter-writing:4
vedāṅga:4
geometry:4
Śānkh:4
obsequies:4
Uṇādis:4
algebra:4
ācāra:4
jaina:3
brāhmaṇa:3
dhātupāṭha:3
images:3
mus:3
worship:3
prāyaścitta:3
miśrabhāṇa:3
toxicology:3
śaiva vedānta:3
enigmatology:3
buddhistic:3
śrauta:3
horses:3
sorcery:3
warfare:3
saṃskārāḥ:3
kāvyaṭīkā:3
saṃgīta:2
roots:2
kathā:2
algebr:2
cer:2
dramatic action:2
metals:2
mystic diagrams:2
gender:2
oneiromancy:2
dancing:2
chess-play:2
chem:2
saṃnyāsa:2
erotics:2
inheritance:2
khila:2
drama:2
botany:2
singing:2
chess:2
marriage:2
nītiśāstra:2
compound nouns:2
veter:2
royal polity:2
philosophy:2
nouns:2
military tactics:2
gaṇita:1
gems:1
hunting:1
incantations:1
witchcraft:1
stoma:1
omina:1
ākhyāyikā:1
verbs:1
geogr:1
conjugations:1
astrologer:1
nirukta:1
stuti:1
pregnancy:1
jugglery:1
strategy:1
gṛhyaprayoga:1
aphrodisiacs:1
metr:1
kṛṣiśāstra:1
khaṇḍakāvya:1
land-surveying:1
ethics:1
archery:1
painting:1
@drdhaval2785
I'm having trouble recreating acc4.txt. Have made no changes to the dev server version.
I'm referring to pywork/update.sh and to pywork/correctionwork/issue-cologne-142/readme.md.
update.sh looks fishy, as there are TWO steps which appear to modify orig/acc4.txt.
Also, update.sh instructions seem inconsistent with the instructions under '## Redo step' of readme.md.
Hope you can revise instructions as needed so acc4.txt can be properly reconstructed.
@funderburkjim
I will see and make necessary changes and intimate you here on github.
¦ [^ A-Z]+[.,] finds out the possible candidates for subjects in ACC, e.g. jy poet archit vedānta śilpa tantr vocabulary dh