Open funderburkjim opened 4 years ago
Abbreviations in VCP may be identified as words ending in the digit '0'.
There are about 4000 distinct such abbreviations in our digitization vcp.txt.
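A minimal sketch of how such a count could be produced, assuming an abbreviation is any whitespace-delimited token ending in ASCII '0', the Devanagari digit zero, or the Devanagari abbreviation sign. The marker set, the tokenization, and the function names are illustrative assumptions, not the script actually used for vcp.txt.

```python
from collections import Counter

# Assumption: these three characters can mark an abbreviation:
# ASCII '0', Devanagari digit zero (U+0966), Devanagari abbreviation sign (U+0970).
MARKERS = "0\u0966\u0970"


def is_abbreviation(token):
    """Return True if the token ends in a marker and is not a plain number."""
    if len(token) < 2 or token[-1] not in MARKERS:
        return False
    # Skip numbers such as "100", where the character before the marker is a digit.
    return not token[-2].isdigit()


def count_abbreviations(lines):
    """Count each distinct abbreviation found in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            if is_abbreviation(token):
                counts[token] += 1
    return counts
```

Calling `len(count_abbreviations(open("vcp.txt", encoding="utf-8")))` would then give the number of distinct abbreviations under these assumptions; the result depends on how punctuation and spacing errors in the digitization are handled.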
Examination of a few examples leads me to think that there are basically 2 types of abbreviations:
गम गतौ भ्वा० पर० अनिट् । गच्छति ऌदित् अगमत् ज-
गाम जग्मतुः । गन्ता गम्यात् गमिष्यति । गन्ता गमी
The 'first-two-lines' rule is just a rough preliminary indication of whether a given abbreviation should be thought of as grammatical or as a literary source.
Please see abbrev0_roman_100.md.
The file name indicates certain details of the display:
154 instances is a workable number. If we make progress on these most common abbreviations, then we can later tackle other less common abbreviations.
The display is a table whose columns are:
The links show one of the Cologne displays for the word in the VCP dictionary.
I've communicated back by email to James. We'll have to see what else might be needed to facilitate crowd-sourcing.
The basic problem is that there is no known list of abbreviation expansions for VCP.
There are some hints. Remember the Tirupati edition files? They had some literary sources. abbrev0_roman_100.md is very well done.
154 instances is a workable number. If we make progress on these most common abbreviations, then we can later tackle other less common abbreviations.
Agree.
As a reminder, vac.txt is our copy of the Tirupati edition.
First glance at vac.txt shows markup with tags such as
It would be useful to
I don't see definite markup of the abbreviations in vac.txt.
First glance at vac.txt shows markup with tags such as
* vkr : vikrama = grammatical information
<vkr> is not for विक्रम; it is for व्याकरण.
I've communicated back by email to James. We'll have to see what else might be needed to facilitate crowd-sourcing.
Any result from this crowd-sourcing?
More often than not, our observation is that nothing gets started once the work is "allotted".
----------------
Only abbreviations with at least 100 observed instances are included. There are 154 such abbreviations
* There are 3900+ distinct abbreviations (at least 1 observed instance)
* There are 500+ distinct abbreviations with 10 or more instances.
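The frequency bands quoted above can be recomputed from any abbreviation-to-count mapping. This is a small sketch with hypothetical function names; the figures 154, 500+ and 3900+ come from the thread and are not reproduced by this code.

```python
def count_at_least(freqs, threshold):
    """Number of distinct abbreviations with at least `threshold` instances."""
    return sum(1 for n in freqs.values() if n >= threshold)


def frequency_bands(freqs, thresholds=(1, 10, 100)):
    """Map each threshold to the number of abbreviations that reach it."""
    return {t: count_at_least(freqs, t) for t in thresholds}
```

Here `freqs` is assumed to be a dict (or `collections.Counter`) keyed by abbreviation.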
May I ask @funderburkjim to post the complete list here (preferably in Devanagari)?
I tried making one such list myself. VCP abbreviations extracted.txt
1. This shows that many entries have spelling and spacing errors (I removed some of them, but then stopped).
@gasyoun If you can trace your Tirupati CD, can you post a link to download it? I also purchased the CD, but need to locate it.
post the complete list here (preferably in Devanagari)?
A complete Devanagari list (as a GitHub markdown table) is in two parts:
The one-part form is also prepared, but is too big for github to display properly.
There is also a simpler list, with each abbreviation and its frequency, at abbrev0_deva_all.txt.
This should be comparable to VCP.abbreviations.extracted.txt from @Andhrabharati (see a previous comment for link).
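A rough sketch of how the two lists might be compared. It assumes each line of the frequency list holds an abbreviation and a count separated by whitespace, and that the other list is a plain sequence of abbreviations; both formats are assumptions, and the function names are hypothetical.

```python
def parse_frequency_lines(lines):
    """Parse 'abbreviation count' lines into a dict, skipping malformed lines."""
    freqs = {}
    for line in lines:
        parts = line.split()
        # Assumed format: exactly two fields, the second a decimal count.
        if len(parts) != 2 or not parts[1].isdecimal():
            continue
        freqs[parts[0]] = int(parts[1])
    return freqs


def compare_lists(first, second):
    """Report which abbreviations appear in only one list or in both."""
    first, second = set(first), set(second)
    return {
        "only_in_first": sorted(first - second),
        "only_in_second": sorted(second - first),
        "in_both": sorted(first & second),
    }
```

Entries that differ only by spelling or spacing errors would show up as one-sided here, so the output is a starting point for manual review rather than a final reconciliation.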
I don't see definite markup of the abbreviations in vac.txt
Tirupati people are known for poor documentation; it is the same with the Tirupati digital edition of the Ramayana.
If you can trace your Tirupati CD, can you post a link to download it?
The file we have at Cologne is based on the CD Usha sent me. That is all; I know nothing else about it.
Analysis done in 2014.
* f_WX.txt
* Vacaspatyam_15_01_2014_b1.xlsx
* Vachaspatyam.xlsx
* Vachaspatyam_b3_with_dev.xlsx
* Vachaspatyam_b4_without_dev.xlsx
* Vachaspatyam_b5_proof_1673.xlsx
* Vachaspatyam_b6_proof_1673-06-01-14.xlsx
The file we have at Cologne is based on the CD Usha sent me
The Tirupati vacaspatyam I started with in this repository is vac_input.txt. According to my notes in the readme.org of vcpte-vac,
By some unknown process, Scharf and colleagues reformatted and modified presumably the same original Tirupati edition of Vacaspatyam.
Scharf and colleagues reformatted and modified presumably the same original Tirupati edition of Vacaspatyam.
Oh, so you believe the two versions have the same source initially.
The Tirupati version I got from Scharf had already been put into SLP1. I don't know what source Peter started with, but since you mentioned the existence of a CD made by Tirupati, it may be that Peter started from that CD.
Tirupati version I got from Scharf had already been put into SLP1
Oh, OK, because when I saw the CD it was in that funny WX encoding. It contained not only the dictionary file but several additional files, including the Preface.
Finally done with the first phase of Vacaspatyam corrections, starting mainly with the abbreviation markers, in a focused two-week effort; the summary is in the file below:
Phase-1 of work on Vacaspatyam.txt
Almost all the abbr.s are resolved now!!
It appears that the present Cologne data is missing the dual/variant forms of the HWs (marked in parentheses in the print), and I have also noticed many errors.
Hence it is desirable to correct the HWs portion (before touching the body portion), which I would like to take up in the next few days.
HWs portion (before touching the body portion)
Yeah, headwords is what comes first. Thanks for the hard work @Andhrabharati
Almost all the abbr.s are resolved now!!
But where to look for them?
@Andhrabharati There are a lot of 'extra' headwords at https://github.com/sanskrit-lexicon/csl-orig/blob/master/v02/vcp/vcp_hwextra.txt
These probably include many of the 'dual/variant forms'.
Yes @funderburkjim, I've seen this file as well as the Vachaspatyam-Doubles-16.3.15.xlsx file from @gasyoun.
As I saw, there are some errors in both files.
So I decided to do it again myself, while looking for HW errors throughout.
Via email, a user, James, expressed an interest in identifying the abbreviations in the Vacaspatyam dictionary.
In part, he said:
This issue is devoted to getting started with this.
The basic problem is that there is no known list of abbreviation expansions for VCP.