abbreviation preparation

funderburkjim commented 4 years ago

Via email, a user, James, expressed an interest in identifying the abbreviations in the Vacaspatyam dictionary.

In part, he said:

One thing we could try, and would probably be fairly fruitful.  If from your side you could create a 
list of the abbreviations, then I could see if I can crowdsource the names of the referenced texts  
(the expansions you mention) through our Indology list. 
There's an amazing amount of knowhow on the list, one is continually surprised 
by the depth and breadth of responses to really arcane questions.

This issue is devoted to getting started with this.

The basic problem is that there is no known list of abbreviation expansions for VCP.

funderburkjim commented 4 years ago

Identification of abbreviations in VCP

Abbreviations in VCP may be identified as words ending in the digit '0'.

There are about 4000 distinct such abbreviations in our digitization vcp.txt.
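A minimal sketch of this identification rule, not the script actually used for the counts above. It assumes the text is in SLP1, that an abbreviation is a run of ASCII letters followed by '0', and that nothing alphanumeric follows the '0'. The function name and the sample lines are illustrative only.

```python
import re
from collections import Counter

# Assumption: an abbreviation is a run of ASCII letters followed by '0',
# with no letter or digit immediately after the '0'.
ABBREV_RE = re.compile(r"[A-Za-z]+0(?![A-Za-z0-9])")

def count_abbreviations(lines):
    """Count each distinct abbreviation token found in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(ABBREV_RE.findall(line))
    return counts

# Illustrative, hand-transliterated sample lines (not copied from vcp.txt).
sample = [
    "gama gatO BvA0 para0 aniw . gacCati",
    "BU sattAyAm BvA0 para0",
]
counts = count_abbreviations(sample)
for abbrev, n in counts.most_common():
    print(abbrev, n)
```

Plain numbers such as "10" are skipped because the pattern requires at least one letter before the '0'. SLP1 characters outside A-Z and a-z are not handled by this simplified pattern.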

Grammar and Literary Source

Examination of a few examples leads me to think that there are basically 2 types of abbreviations:

Grammatical. For example, 'BvA0' for 'BvAdi' (the 'class 1' verbs).

Mostly, I think, these occur on the first couple of lines of an entry.

For example, with root 'gama', the first two lines are

गम गतौ भ्वा० पर० अनिट् । गच्छति ऌदित् अगमत् ज-
गाम जग्मतुः । गन्ता गम्यात् गमिष्यति । गन्ता गमी

Literary sources. These usually occur later in an entry (after the first two lines).

The 'first-two-lines' rule is just a rough preliminary indication of whether a given abbreviation should be thought of as grammatical or as a literary source.
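The rough rule can be sketched as follows. This is an illustration under stated assumptions, not the code behind the display: each entry is assumed to be available as a list of its lines, the abbreviation pattern is the simplified one (ASCII letters followed by '0'), and the names `split_counts` and `guess_kind` are hypothetical.

```python
import re
from collections import Counter

# Same simplifying assumption as before: letters followed by '0'.
ABBREV_RE = re.compile(r"[A-Za-z]+0(?![A-Za-z0-9])")

def split_counts(entries, head_lines=2):
    """Count abbreviations in the first `head_lines` lines of each entry
    separately from those in the rest of the entry."""
    early, late = Counter(), Counter()
    for entry_lines in entries:
        for i, line in enumerate(entry_lines):
            target = early if i < head_lines else late
            target.update(ABBREV_RE.findall(line))
    return early, late

def guess_kind(abbrev, early, late):
    """Preliminary guess only: 'grammatical' when the abbreviation is seen
    more often in the first lines than later; ties go to 'literary source'."""
    return "grammatical" if early[abbrev] > late[abbrev] else "literary source"

# Illustrative entries (invented for the example).
entries = [
    ["gama gatO BvA0 para0 aniw", "gAma jagmatuH", "iti BA0 pa0"],
    ["BU sattAyAm BvA0", "Bavati", "iti BA0"],
]
early, late = split_counts(entries)
print(guess_kind("BvA0", early, late))
print(guess_kind("BA0", early, late))
```

An abbreviation that appears in both positions gets a single label here, so the result is only a starting point for manual review.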

funderburkjim commented 4 years ago

a first display

Please see abbrev0_roman_100.md.

The file name indicates certain details of the display:

* The Sanskrit words are represented in Roman Unicode (IAST).
  - Other forms could use Devanagari or SLP1.
* The file and its links are written in Markdown.
  - Another option would be HTML.
* Only abbreviations with at least 100 observed instances are included. There are 154 such abbreviations.
  - There are 3900+ distinct abbreviations (at least 1 observed instance).
  - There are 500+ distinct abbreviations with 10 or more instances.

154 abbreviations is a workable number. If we make progress on these most common abbreviations, then we can later tackle the less common ones.
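Selecting abbreviations by a minimum number of instances could look like the sketch below. The helper name `at_least` and the toy counts are invented for illustration; the real figures (154, 500+, 3900+) come from vcp.txt and are not reproduced here.

```python
from collections import Counter

def at_least(counts, threshold):
    """Return abbreviations with at least `threshold` instances,
    most frequent first."""
    return [a for a, n in counts.most_common() if n >= threshold]

# Toy frequencies, invented for the example.
toy_counts = Counter({"BvA0": 120, "para0": 15, "BA0": 3})

for threshold in (100, 10, 1):
    print(threshold, at_least(toy_counts, threshold))
```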

funderburkjim commented 4 years ago

structure of first display

The display is a table whose columns are:

* a sequence number
* the abbreviation
* the number of occurrences of the abbreviation in the first 2 lines of some entry
* the number of occurrences of the abbreviation not in the first 2 lines of some entry
* up to 5 headword links where the abbreviation occurs in the first 2 lines
* up to 5 headword links where the abbreviation occurs after the first 2 lines

The links show one of the Cologne displays for the word in VCP dictionary.
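One row of such a table could be assembled roughly as shown below. This is a sketch only: `BASE_URL` is a placeholder and not the real address of any Cologne display, and the function names and sample headwords are hypothetical.

```python
# Placeholder only: NOT the real Cologne display URL.
BASE_URL = "https://example.org/vcp?key="

def headword_links(headwords, limit=5):
    """Render at most `limit` headwords as markdown links."""
    return " ".join(f"[{hw}]({BASE_URL}{hw})" for hw in headwords[:limit])

def table_row(seq, abbrev, early_count, late_count, early_hws, late_hws):
    """Build one markdown table row with the six columns described above."""
    cells = [
        str(seq),
        abbrev,
        str(early_count),
        str(late_count),
        headword_links(early_hws),
        headword_links(late_hws),
    ]
    return "|" + "|".join(cells) + "|"

# Invented sample data: six headwords given, only five are linked.
row = table_row(1, "BvA0", 6, 0, ["gama", "BU", "paW", "vad", "car", "nI"], [])
print(row)
```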

funderburkjim commented 4 years ago

I've communicated back by email to James. We'll have to see what else might be needed to facilitate crowd-sourcing.

gasyoun commented 4 years ago

The basic problem is that there is no known list of abbreviation expansions for VCP.

There are some hints. Remember the Tirupati edition files? They had some literary sources. abbrev0_roman_100.md is very well done.

154 instances is a workable number. If we make progress on these most common abbreviations, then we can later tackle other less common abbreviations.

Agree.

funderburkjim commented 4 years ago

As a reminder, vac.txt is our copy of the Tirupati edition.

First glance at vac.txt shows markup with tags such as

vkr : vikrama = grammatical information

It would be useful to have:

* a list of all the tags used in the Tirupati markup,
* estimates of what the tags stand for (like 'vkr' stands for 'vikrama') and perhaps how the tags could be made use of.
  - Need help from a Sanskrit grammarian here.
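A tag inventory could be collected with something like the sketch below. It assumes the tags in vac.txt are written in angle brackets (as in `<vkr>`) and counts opening tags only. The tag name `hw` and the sample lines are invented for the example.

```python
import re
from collections import Counter

# Assumption: tags look like <name> or <name ...>; closing tags (</name>)
# are not matched, so each element is counted once.
TAG_RE = re.compile(r"<([A-Za-z][A-Za-z0-9]*)[^>]*>")

def tag_inventory(lines):
    """Count how often each tag name opens in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(TAG_RE.findall(line))
    return counts

# Invented sample lines; only 'vkr' is a tag name mentioned in this thread.
sample_lines = [
    "<hw>gama</hw> <vkr>BvA0 para0</vkr>",
    "<vkr>puM</vkr>",
]
inventory = tag_inventory(sample_lines)
for tag, n in inventory.most_common():
    print(tag, n)
```

The resulting list of tag names is what a grammarian would then need to annotate with meanings.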

I don't see definite markup of the abbreviations in vac.txt.

Andhrabharati commented 3 years ago

First glance at vac.txt shows markup with tags such as
* vkr  :  vikrama  = grammatical information

<vkr> does not stand for विक्रम (vikrama); it stands for व्याकरण (vyākaraṇa, 'grammar').

Andhrabharati commented 3 years ago

I've communicated back by email to James. We'll have to see what else might be needed to facilitate crowd-sourcing.

Any result from this crowd-sourcing?

More often than not, our observation is that nothing gets started once the work is "allotted".

Only abbreviations with at least 100 observed instances are included. There are 154 such abbreviations
* There are 3900+ distinct abbreviations  (at least 1 observed instance)

* There are 500+ distinct abbreviations with 10 or more instances.

May I ask @funderburkjim to post the complete list here (preferably in Devanagari)?

I tried making one such list myself. VCP abbreviations extracted.txt

1. **This shows that many entries have spelling and spacing errors (I did remove some of them, but then stopped).**
2. **Also, quite a few of these could be grouped together as compound abbreviations, instead of being kept as separate ones.**
3. **Many are variant forms of the same "source".**
4. **Finally, quite a few others are without the trailing '0', either in the text or in the print itself.**

Andhrabharati commented 3 years ago

@gasyoun If you can trace your Tirupati CD, can you post a link to download it? I also purchased the CD, but need to locate it.

funderburkjim commented 3 years ago

post the complete list here (preferably in Devanagari)?

A complete Devanagari list (as a GitHub markdown table) is in two parts:

part1
part2

The one-part form is also prepared, but is too big for GitHub to display properly.

There is also a simpler list, with each abbreviation and its frequency, at abbrev0_deva_all.txt.

This should be comparable to VCP.abbreviations.extracted.txt from @Andhrabharati (see a previous comment for link).
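Comparing the two lists could start from a simple set comparison, sketched below. The actual layout of both files is not described here, so the sketch takes plain sequences of abbreviation strings as input, and the sample values are invented.

```python
def compare_lists(ours, theirs):
    """Report which abbreviations are shared and which occur in only one list.
    Surrounding whitespace is stripped before comparing."""
    a = {x.strip() for x in ours if x.strip()}
    b = {x.strip() for x in theirs if x.strip()}
    return {
        "shared": sorted(a & b),
        "only_ours": sorted(a - b),
        "only_theirs": sorted(b - a),
    }

# Invented sample values, for illustration only.
result = compare_lists(["भ्वा०", "पर०", "भा०"], ["भ्वा० ", "पर०", "रामा०"])
print(result)
```

Entries that differ only by spelling or internal spacing still show up as distinct here, so the "only" lists are candidates for manual checking rather than confirmed differences.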

gasyoun commented 3 years ago

I don't see definite markup of the abbreviations in vac.txt

The Tirupati people are known for poor documentation; it is the same with the Tirupati edition of the digital Ramayana.

If you can trace your Tirupati CD, can you post a link to download it?

The file we have at Cologne is based on the CD Usha sent me. That is all I know about it.

Analysis done in 2014.

* f_WX.txt
* Vacaspatyam_15_01_2014_b1.xlsx
* Vachaspatyam.xlsx
* Vachaspatyam_b3_with_dev.xlsx
* Vachaspatyam_b4_without_dev.xlsx
* Vachaspatyam_b5_proof_1673.xlsx
* Vachaspatyam_b6_proof_1673-06-01-14.xlsx

funderburkjim commented 3 years ago

The file we have at Cologne is based on the CD Usha sent me

The Tirupati Vacaspatyam I started with in this repository is vac_input.txt. According to my notes in the readme.org of vcpte-vac:

By some unknown process, Scharf and colleagues reformatted and modified presumably the same original Tirupati edition of Vacaspatyam.

gasyoun commented 3 years ago

Scharf and colleagues reformatted and modified presumably the same original Tirupati edition of Vacaspatyam.

Oh, so you believe the two versions have the same source initially.

funderburkjim commented 3 years ago

The Tirupati version I got from Scharf had already been put into SLP1. I don't know what source Peter started with; but since you mentioned the existence of a CD made by Tirupati, it may be that Peter started from that CD.

gasyoun commented 3 years ago

Tirupati version I got from Scharf had already been put into SLP1

Oh, OK, because when I saw the CD it was in that funny WX encoding. And it contained not only the dictionary file but several additional files, including the Preface.

Andhrabharati commented 3 years ago

I have finally finished the first phase of Vacaspatyam corrections, starting mainly with the abbr. markers, in a focused two-week effort. The summary is in the file below:

Phase-1 of work on Vacaspatyam.txt

Almost all the abbr.s are resolved now!!

It appears that the present Cologne data has missed the dual/variant forms of the HWs (marked in parentheses in the print), and I have also noticed many errors.

Hence it is desirable to correct the HWs portion (before touching the body portion), which I would like to take up in the next few days.

gasyoun commented 3 years ago

HWs portion (before touching the body portion)

Yeah, headwords come first. Thanks for the hard work, @Andhrabharati.

Almost all the abbr.s are resolved now!!

But where to look for them?

funderburkjim commented 3 years ago

@Andhrabharati There are a lot of 'extra' headwords at https://github.com/sanskrit-lexicon/csl-orig/blob/master/v02/vcp/vcp_hwextra.txt

These probably include many of the 'dual/variant forms'.

Andhrabharati commented 3 years ago

Yes @funderburkjim, I've seen this file as well as the Vachaspatyam-Doubles-16.3.15.xlsx file from @gasyoun.

As far as I could see, there are some errors in both files.

So I decided to do it again myself, while looking for HW errors throughout.
