unmarked abbreviations - Githubissues

funderburkjim commented 2 years ago

There are quite a few unmarked abbreviations in pw.txt. Derive a procedure for identifying and marking many of these.

funderburkjim commented 1 year ago

@Andhrabharati Your corrections now installed locally. Nice detective work with 'a. u.' I'll post changes later. These are in change_pw_6.txt and change_pw_ab_6.txt

funderburkjim commented 1 year ago

@Andhrabharati I am beginning work to extract your <is> changes. Question: 424 matches in 416 lines for "</is> <is>" Was this intentional? For instance, 16 matches for "<is>Manu</is> <is>Vaivasvata</is>" Why not code this as <is>Manu Vaivasvata</is> ? And similarly for the rest of the 424?
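A count of this kind (total matches versus lines containing a match) can be reproduced outside an editor. The following is only a minimal sketch: `count_matches` is a hypothetical helper name and the sample lines are made up for illustration, not taken from pw.txt.

```python
import re

def count_matches(lines, pattern):
    """Return (total matches, number of lines with at least one match)."""
    rx = re.compile(re.escape(pattern))  # treat the pattern as literal text
    per_line = [len(rx.findall(line)) for line in lines]
    return sum(per_line), sum(1 for n in per_line if n)

# Made-up sample lines, only to exercise the function.
sample = [
    "<is>Manu</is> <is>Vaivasvata</is> und <is>Manu</is> <is>Vaivasvata</is>",
    "<is>Bṛhat</is> <is>Sāman</is>",
    "<is>Indra</is>",
]
matches, nlines = count_matches(sample, "</is> <is>")
print(f"{matches} matches in {nlines} lines")
```

A line can contain more than one adjacent pair, which is why the match count can exceed the line count.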

Andhrabharati commented 1 year ago

It is intentional @funderburkjim; most of the time the first (or sometimes the second) entity denotes the class/category [Manu, Asura, Apsara etc.] under which the "proper" name is mentioned as the other entity.

Of course, I had clubbed a few of these in my next version (AB v2), like <is>Bṛhat Sāman</is>, <is>Uttara Phalgunī</is> and <is>Viśve Devās</is>, for they do not come under the above criteria!

And in a few cases, they should never be clubbed together, such as when the second entity has a trailing possessive ʼs, or is a <is>gaṇa</is> [a group in the Gaṇapāṭha] of which the first entity is a member.
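The criteria described in this comment could be sketched roughly as follows. This is an illustration under assumptions, not the procedure actually used: `CLASS_NAMES` is a small hypothetical set (the real list of class/category names would have to come from the text), and `merge_adjacent` is a made-up helper name.

```python
import re

# Two adjacent <is> elements separated by a single space.
ADJACENT = re.compile(r"<is>([^<]*)</is> <is>([^<]*)</is>")

# Assumption: illustrative class/category names only.
CLASS_NAMES = {"Manu", "Asura", "Apsara"}

def merge_adjacent(line):
    """Merge adjacent <is> pairs unless one of the stated exceptions applies."""
    def repl(m):
        first, second = m.group(1), m.group(2)
        keep_separate = (
            first in CLASS_NAMES or second in CLASS_NAMES  # class + proper name
            or second.endswith("ʼs")                        # trailing possessive
            or second == "gaṇa"                             # gaṇa of which the first is a member
        )
        return m.group(0) if keep_separate else f"<is>{first} {second}</is>"
    return ADJACENT.sub(repl, line)
```

Because `re.sub` works on non-overlapping matches, a run of three or more adjacent tags would need a second pass.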

Andhrabharati commented 1 year ago

BTW, you may note that the CDSL version has all these marked as individual entities only!

funderburkjim commented 1 year ago

<iw> is a new tag. (26 instances). What is it for?

Andhrabharati commented 1 year ago

You will get the answer at this post, @funderburkjim !

funderburkjim commented 1 year ago

`<is>` and `<iw>` tag corrections

Changes made to cdsl version to agree with the AB version.

change_pw_6a.txt about 2800 lines changed
change_pw_ab_6a.txt Only 12 changes. @Andhrabharati should confirm these.

The cdsl and AB version now agree in both global and local instances of the <is> tags, and the much fewer (25) <iw> tags.

is_glob6a.txt provides a frequency list (about 3100 distinct instances)
is_local6a.txt (about 100 distinct instances).
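A frequency list of this kind can be built with a counter. The sketch below is only an approximation of the idea: it assumes that "local" instances are those carrying an `n` attribute (as in `<is n="...">S.</is>`) and "global" instances are those without one, and `is_frequencies` is a hypothetical name, not the script that produced these files.

```python
import re
from collections import Counter

# <is>text</is> or <is n="expansion">text</is>
IS_TAG = re.compile(r'<is(?: n="([^"]*)")?>([^<]*)</is>')

def is_frequencies(lines):
    """Return (global counter, local counter) keyed by the full tag text."""
    glob, local = Counter(), Counter()
    for line in lines:
        for m in IS_TAG.finditer(line):
            target = glob if m.group(1) is None else local
            target[m.group(0)] += 1
    return glob, local

# Made-up sample lines.
sample = [
    '<is>Manu</is> <is>Vaivasvata</is> <is n="Sarasvatī">S.</is>',
    '<is>Manu</is> und <is n="Sarasvatī">S.</is>',
]
glob, local = is_frequencies(sample)
for tag, n in glob.most_common():
    print(n, tag)
```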

These changes correct spelling and markup errors related to the widely spaced text appearing in the PW text. Thanks to @Andhrabharati for identifying these errors and providing corrections.

funderburkjim commented 1 year ago

italic text

In #95 @Andhrabharati identifies errors in the scope of italic text. I'll take these up next.

Andhrabharati commented 1 year ago

change_pw_ab_6a.txt Only 12 changes. @Andhrabharati should confirm these.

I prefer to differ in 4 cases out of 12, as mentioned below--

(384812, 384871) Uttara is with short a when the constellations Phalgunī and Bhādrapadā are being mentioned (which also have corresp. Pūrva, with a short a). See the cases of Phalgunī at 44781, 72324, 72504, 336638, 370552, 530146. So these two should be treated as print errors.

(483261) Fürstin (Princess) occurs with a possessive form 'der' (of) of a place/State; so "Nepalʼs" is the correct form (a print error), and not "Nepals" (plural form).

(614332) Marut is also a class (whose count is 7) such as Manu, Vasu, Indra, Rudra etc.; so the proper names of those are better mentioned separately. [Of course 614323 could be misleading with the small initial in the name!!]

Andhrabharati commented 1 year ago

is_local6a.txt (about 100 distinct instances).

<is n="Sarasvatiī">S.</is> -> <is n="Sarasvatī">S.</is> and thus, <is n="Sarasvatī">S.</is> count to be changed as 3.

Andhrabharati commented 1 year ago

is_glob6a.txt provides a frequency list (about 3100 distinct instances)

The file has 3700 entries, not 3100.

funderburkjim commented 12 months ago

@Andhrabharati Please look again at (384812, 384871). The short-a appears as a compound: 5 matches for "<is>Uttaraphalgunī". But 384812 and 384871 appear as adjective + noun, so long-a is proper: Uttarā Phalgunī. So I think these two should NOT be changed. Agree?

Andhrabharati commented 12 months ago

@Andhrabharati Please look again at (384812, 384871). ... ... ... So I think these two should NOT be changed. Agree?

Reg. 384812: While pwk has it as

the parent PWG had it as

and the "close follower" MW has it as

Reg. 384871: While pwk has it as

the parent PWG had it as [though this doesn't contain the word Uttara, the preceding (and deciding!) german adj. späteren should be helpful.]

and the "close follower" MW has it as

With the above snippets, you may decide what to keep in pwk, @funderburkjim.

Finally, a general observation from my side: Though pwk is kind of better in having a consistent style of marking and 'framing' the text as compared to PWG, it has quite a few print errors; so one has to be pretty vigilant before taking its text for "granted".

funderburkjim commented 12 months ago

italic text markup

Next versions : temp_pw_7.txt (CDSL) (see csl-orig commit above) and temp_pw_ab_7.txt (Andhrabharati) temp_pw_ab_7.zip

These versions agree in the COUNT of italic text sections for each entry.
ablists/readme.txt provides some guidance as to the method(s).
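The agreement check on italic counts can be pictured with a small sketch. It assumes italic sections are delimited as `{%...%}` and that each version has already been parsed into a dict mapping an entry id to its text; both the parsing step and the names `italic_count` and `count_mismatches` are assumptions for illustration, not the method described in ablists/readme.txt.

```python
import re

# Assumption: italic text is marked as {%...%}; DOTALL lets a span cross line breaks.
ITALIC = re.compile(r"\{%.*?%\}", re.DOTALL)

def italic_count(text):
    """Number of italic sections in one entry's text."""
    return len(ITALIC.findall(text))

def count_mismatches(entries_a, entries_b):
    """Return sorted ids whose italic-section counts differ between two versions."""
    shared = entries_a.keys() & entries_b.keys()
    return sorted(
        i for i in shared
        if italic_count(entries_a[i]) != italic_count(entries_b[i])
    )
```

An empty result for the shared ids would correspond to the two versions agreeing in the count for each entry.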

change_pw_7.txt 12000+ lines changed from version 6a.
change_pw_ab_7.txt approximately 300 lines changed
- The biggest portion (part 2: 266) were confirmed by reference to scans.
- Part 3 includes the additions from AB at this comment above.
  - <is>Uttarā Phalgunī</is> changed to <is>Uttara Phalgunī</is> in temp_pw_ab_7.txt
  - These two changes NOT made in temp_pw_7.txt.
is_local7.txt and is_glob7.txt show minor revisions from 6a versions.

As in previous version, request @Andhrabharati to review change_pw_ab_7.txt.

Version 8 is anticipated to focus on a few punctuation pattern differences between cdsl and AB versions.

funderburkjim commented 12 months ago

Regarding punctuation changes, this comment in issue95 seems a good place to start.

Andhrabharati commented 12 months ago

Part 3 includes the additions from AB at this comment above.
* `<is>Uttarā Phalgunī</is>` changed to `<is>Uttara Phalgunī</is>` in temp_pw_ab_7.txt

* These two changes NOT made in temp_pw_7.txt.

Though I did not explicitly disagree above, I had expected that you'd change these two instances based on my reply, @funderburkjim !!

As in previous version, request @Andhrabharati to review change_pw_ab_7.txt.

I will take a look at these tomorrow morning, as I am just about to retire for the day.

Andhrabharati commented 12 months ago

As in previous version, request @Andhrabharati to review change_pw_ab_7.txt.

I will take a look at these tomorrow morning

Checked a few randomly, and seen that I had deliberately skipped those in a quicker work [probably I also should've taken the step-by-step process as adopted by Jim!] with an intention of doing a full (slower) reading of the text once wrt the print, as it was felt necessary for other reasons.

As such, all these could go "as corrected by Jim" into the file.

gasyoun commented 12 months ago

probably I also should've taken the step-by-step process as adopted by Jim

Good to hear such words finally from you.

Andhrabharati commented 12 months ago

It was just a passing statement, @gasyoun !!

The volume of work vs. time spent (at my end) does not let me take this path.

Whatever is the time taken, errors are bound to be there. [Noticed a few errors in Jim's work of change_pw_ab_7.txt listing just about 300 instances!] So, I give preference to covering more "volume" in lesser "time".

funderburkjim commented 11 months ago

punctuation at end of italics

Except for a few cases (which I am changing), periods come AFTER the close of italic text. e.g '%}.' in AB's version. [And I intend to modify the cdsl version similarly.]

For comma and semicolon, normally the punctuation is shown BEFORE the close of italic text.

3131 matches in 2954 lines for ";%}" in buffer: temp_pw_ab_7.txt
1 match for "%};" in buffer: temp_pw_ab_7.txt

5956 matches in 5671 lines for ",%}" in buffer: temp_pw_ab_7.txt
10 matches for "%}," in buffer: temp_pw_ab_7.txt

I think we should put the comma, semicolon AFTER. @Andhrabharati Agree?
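The counts above compare punctuation just inside the italic close with punctuation just outside it. A minimal sketch of the same tally follows; `punctuation_placement` is a hypothetical helper and the sample string is made up.

```python
def punctuation_placement(text, marks=",;."):
    """For each mark, return (count just inside 'x%}', count just outside '%}x')."""
    return {m: (text.count(m + "%}"), text.count("%}" + m)) for m in marks}

# Made-up sample text.
sample = "{%foo,%} bar {%baz%}, qux {%a;%} {%b%}."
for mark, (inside, outside) in punctuation_placement(sample).items():
    print(repr(mark), "inside:", inside, "outside:", outside)
```

The sketch only counts; whether the mark belongs inside or outside is the question under discussion here.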

funderburkjim commented 11 months ago

few errors in Jim's work of change_pw_ab_7.txt

I'd like to correct such errors -- would you identify the errors you noticed?

gasyoun commented 11 months ago

bhagavat

Space lacking? @funderburkjim

Andhrabharati commented 11 months ago

I think we should put the comma, semicolon AFTER. @Andhrabharati Agree?

No, I feel they should be within the italics only, as the print has those characters 'slanted' at those places mostly [probably some exceptions may be present due to oversight].

Andhrabharati commented 11 months ago

I'd like to correct such errors -- would you identify the errors you noticed?

I do not want to spend any time on this (as this is a trivial issue); but I do stand by my above statement (having noticed them while changing the text in my file as per your correction).

Andhrabharati commented 11 months ago

@funderburkjim

I have finished marking the full pw set (PWG, PWGVN, pwk, pwkvn and sch) all in a similar format, incl. abbr./lex./ls./... taggings. [I had done more than what I did in PWG earlier (in my own format) now in CDSL (close to!) format; you may recall our discussion two years back]

Now I thought of going to the "pure" Skt. works SKD and VCP, and noticed that the conversion/transcoding scripts you had provided for pw family do not work on them.

Would you pl. take some time to make the conversion scripts for these, to enable me to transfer the work I did previously in my (self-converted) files [obviously in a widely differing format!] to CDSL files?

Andhrabharati commented 11 months ago

And I shall post my above files of the pw set, whenever you would be willing to "work" on/with them.

Andhrabharati commented 11 months ago

BTW, wanted to have your opinion about a piece of work in PWG--

abbr. expansion as another abbr.!! Is it alright and acceptable?
- <ab n="Nom.">N.</ab> 6 times under L-238
- <ab n="Voc.">V.</ab> thrice under L-238
- <ab n="vgl.">v.</ab> once under L-46007

or should they be expanded to their 'full' forms?
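Cases like these could be located with a simple heuristic: a local `<ab n="...">` whose expansion itself ends in a period is probably another abbreviation. The sketch below assumes that heuristic (it will also flag any full form that happens to end with a period); `abbreviated_expansions` is a made-up name and the second sample line uses an illustrative full form.

```python
import re

# Local abbreviation: <ab n="expansion">text</ab>
AB_LOCAL = re.compile(r'<ab n="([^"]*)">([^<]*)</ab>')

def abbreviated_expansions(lines):
    """Return (line number, tag) for local abbreviations whose expansion ends with '.'."""
    hits = []
    for lnum, line in enumerate(lines, start=1):
        for m in AB_LOCAL.finditer(line):
            if m.group(1).endswith("."):
                hits.append((lnum, m.group(0)))
    return hits

# Made-up sample lines.
sample = [
    '<ab n="Nom.">N.</ab> und <ab n="Voc.">V.</ab>',
    '<ab n="vergleiche">vgl.</ab>',
]
for lnum, tag in abbreviated_expansions(sample):
    print(lnum, tag)
```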

gasyoun commented 11 months ago

or should they be expanded to their 'full' forms?

Makes sense

Now I thought of going to the "pure" Skt. works SKD and VCP

Amazing. What aspects do you plan to cover?

Andhrabharati commented 11 months ago

On a 2nd thought, probably the print text could be made/changed as the 3 lettered abbr. form from the present single letter, and then have them follow the 'regular' process of ab expansion.

These are the exceptional cases noticed while I was on PWG recently.

funderburkjim commented 11 months ago

punctuation changes

135334 lines of pw changed (temp_pw_8.txt)

See punct/readme.txt for discussion.

19 lines changed: temp_pw_ab_8.zip See also change_pw_ab_8.txt.

funderburkjim commented 11 months ago

BTW, wanted to have your opinion about a piece of work in PWG

My opinion at the moment:

local abbreviations expanded to their 'full' forms.
Use the abbreviation text as in scan.

funderburkjim commented 11 months ago

post my above files of the pw set

There are still several aspects of your revision to PW(K) that remain for me to resolve. When these are done, that will be the time for me to do similar convergence of cdsl to your versions for PWG, PWKVN, and SCH.

gasyoun commented 11 months ago

to do similar convergence of cdsl

How time-consuming are these?

@funderburkjim punctuation cleanup intended of the rest of the family, right? https://github.com/sanskrit-lexicon/PWK/blob/master/pwkissues/issue88/punct/readme.txt

funderburkjim commented 11 months ago

punctuation cleanup intended of the rest of the family, right?

Yes, That seems to be the focus now.

funderburkjim commented 11 months ago

versions 9

Continue with resolution of differences between CDSL and AB versions.

local cdsl version temp_pw_9e.txt, installed at Cologne at commit 576e9ca. About 7% of lines changed (49000+ lines).

AB version temp_pw_ab_9.zip 161 lines changed. @Andhrabharati see change_pw_ab_9.txt for revisions to your version.

Work was done in the zoobot directory.
For a summary of the changes, see readme_zoobot_summary.md

I'll continue this resolution process -- while often tedious, I find this interesting. The work done by AB reminds me of the 'early days' of work with MW.

Andhrabharati commented 11 months ago

@funderburkjim

1. There are still 6 cases of a comma preceded by a space in pw.txt (CDSL).
2. There are 92 cases of ',', 2 cases of ';' and one case of '.' at the beginning of a line in pw.txt (CDSL).
3. In removing the "oder" from the bot tag and splitting it into two portions, I see a small issue: one of the portions remains "partial" (mostly the second part of the botanical name, so the first part may have to be "filled up" appropriately; less frequently the first part remains, with the 2nd part to be "padded" appropriately). However, I have corrected my present file similar to your pw_ab_9 for the time being.
4. I would just like to mention that my revised file has 8290 bot tags (against 8128 of pw.txt) and 66 zoo tags (against 2 of pw.txt); and I had marked all 3 occurrences of Dass. as an abbr. [I had already mentioned earlier that my latest ab list has this additionally.]
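The first two checks (a space before a comma, and a line starting with punctuation) are easy to script. A minimal sketch, assuming the text is read as a list of lines; `punctuation_anomalies` is a hypothetical helper and the sample lines are made up.

```python
import re

SPACE_COMMA = re.compile(r" ,")       # comma preceded by a space
LEADING_PUNCT = re.compile(r"[,;.]")  # used with match(), so anchored at line start

def punctuation_anomalies(lines):
    """Return 1-based line numbers for each kind of anomaly."""
    report = {"space_comma": [], "leading": []}
    for lnum, line in enumerate(lines, start=1):
        if SPACE_COMMA.search(line):
            report["space_comma"].append(lnum)
        if LEADING_PUNCT.match(line):
            report["leading"].append(lnum)
    return report

# Made-up sample lines.
sample = ["foo , bar", ", leading comma", "; leading semicolon", "fine, text."]
print(punctuation_anomalies(sample))
```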

Andhrabharati commented 11 months ago

I'll continue this resolution process -- while often tedious, I find this interesting.

@funderburkjim Glad to see you spending time in the process. I presume it is a worthwhile exercise [and like to remind that I had proposed the same long back, but somehow those days the idea was not entertained for some reason(s)].

to do similar convergence of cdsl

How time-consuming are these?

@gasyoun The way pw changes are happening (I took about 3 weeks for pwk & pwkvn and Jim is still on pwk alone for nearly 3 months now), I estimate the process to take anything from 6 months to over a year at Jim's end. [I took about 6 weeks for PWG and a week on SCH (and left it there "unfinished").]

Just wondering if I should take this much time from Jim, to engage him on a "single" work [contrary to what I have said above, that it could be a worthwhile spending!!].

funderburkjim commented 11 months ago

revise temp_pw_9e version

Based primarily on the 4 points mentioned above. See part 3 of change_pw_9e.txt.

With one exception.

Based on your comment 4, you have discovered new bot and zoo.
Perhaps you could provide your latest file, from which I could harvest these new plants and animals.

Andhrabharati commented 11 months ago

Perhaps you could provide your latest file, from which I could harvest these new plants and animals.

My latest file has over 144k changes wrt the prev. file, so I am afraid you won't like so many changes.

The primary change is in the pc value of the metalines-- the p being split as vol-page (just as in PWG, SKD etc.) and the c being changed to (a,b,c) from (1,2,3) (as in SKD etc.); but this change might require you to correct in multiple places wherever pwk display is involved.
The second change is in ls marking, with quite many corrections and mergers(!), esp. where the Chr. is "padded" to plain number-chains.
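To make the pc change concrete, here is a rough sketch of such a conversion. The input shape is purely an assumption for illustration (one volume digit, three page digits, a hyphen and a column digit, e.g. `1001-1`), as is the output shape `1-001-a`; neither is confirmed here, and `convert_pc` is a made-up name.

```python
import re

# Assumed old form: <vol digit><3 page digits>-<column 1..3>, e.g. "1001-1".
PC_OLD = re.compile(r"^(\d)(\d{3})-([123])$")

def convert_pc(pc):
    """Convert the assumed old pc form to vol-page-column letter; leave others unchanged."""
    m = PC_OLD.match(pc)
    if not m:
        return pc  # not in the assumed form, so do not touch it
    vol, page, col = m.groups()
    return f"{vol}-{page}-{'abc'[int(col) - 1]}"
```

Returning unmatched values unchanged keeps the sketch safe to run over metalines whose pc does not follow the assumed pattern.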

... ... ...

So, I thought I should restrict myself to give just the relevant (additional) bot and zoo strings to mark in your file.

Addl. zoo tag entities: 62

- Acheriris Kokor Zibha (1)
- Antilope cervicapra (1)
- Ardea Argala (1)
- Ardea nivea (20)
- Ardea sibirica (27)
- Coluber Naga (8)
- Lacerta Godica (1)
- Noctua indica (1)
- Tantalus flacinellus (1)
- Unguis odoratus (1)

Hope this is alright for you!!

funderburkjim commented 11 months ago

Additional bot tags also?

Andhrabharati commented 11 months ago

It may be noted that many of the above zoo tags (if not all) are changed from bot tags!

Presuming that you have some mechanism of correlating the pw_ab_9.txt with pw.txt, here are the pw_ab_9 lines containing the addl. bot tags-- Addl. bot tags (in pw_ab_9 lines).txt

funderburkjim commented 11 months ago

zoo tag count differences

I found different counts for:

Antilope cervicapra 1 2  (1 AB above, 2 CDSL)
Ardea nivea 20 21
Lacerta Godica 1 3
Unguis odoratus 1 24

These differences in both pw_9e and pw_ab_9
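A comparison like this can be scripted by parsing a posted "Name (count)" list and counting the tags in the text. The sketch assumes the tags appear as `<zoo>...</zoo>` and uses made-up sample text; `parse_listed_counts`, `count_tags` and `count_differences` are hypothetical helper names.

```python
import re
from collections import Counter

# One "Name (n)" item; names may contain spaces and periods.
LISTED = re.compile(r"(.+?) \((\d+)\)\s*")

def parse_listed_counts(s):
    """Parse 'Name (n) Name (n) ...' into {name: n}."""
    return {name.strip(): int(n) for name, n in LISTED.findall(s)}

def count_tags(text, tag):
    """Count the contents of <tag>...</tag> elements (assumed markup)."""
    return Counter(re.findall(rf"<{tag}>([^<]*)</{tag}>", text))

def count_differences(listed, actual):
    """Return {name: (listed count, actual count)} where the two disagree."""
    return {
        name: (n, actual.get(name, 0))
        for name, n in listed.items()
        if actual.get(name, 0) != n
    }

# Made-up sample data.
listed = parse_listed_counts("Antilope cervicapra (1) Ardea Argala (1)")
text = ("<zoo>Antilope cervicapra</zoo> x <zoo>Antilope cervicapra</zoo> "
        "<zoo>Ardea Argala</zoo>")
print(count_differences(listed, count_tags(text, "zoo")))
```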

Andhrabharati commented 11 months ago

So, I was somewhat careless while doing this portion!!

Andhrabharati commented 11 months ago

A quick relooking into the data at my end resulted in the following zoo tag counts--

- Acheriris Kokor Zibha (1)
- Antilope cervicapra (2)
- Antilope picta (3)
- Ardea Argala (1)
- Ardea Govinda (1)
- Ardea jaculator (2)
- Ardea nivea (21)
- Ardea sibirica (27)
- Ardea virgo (2)
- Coluber Naga (8)
- Coluber naga (2)
- Lacerta Godica (3)
- Musculus cucullaris s. trapezius (1)
- Noctua indica (1)
- Tantalus flacinellus (1)
- Turdus Ginginianus (1)
- Turdus Salica (4)
- Turdus ginginianus H. (1)
- Turdus macrourus (3)
- Unguis odoratus (25)

funderburkjim commented 11 months ago

additional bot/zoo added

Cdsl version (= local temp_pw_9f.txt) and AB version (revised local pw_ab_9.txt) now agree in bot/zoo tags (8259 bot tags, 96 zoo tags).

For changes to pw_ab_9, see change_pw_ab_9.txt, Parts 11-14.
For changes to pw_9f, see change_pw_9f.txt. A few additional 'punctuation' changes were also identified and made.

Upon return to this issue, I intend to harvest the <hom> tag markup from AB's version. Then resolve all differences in Devanagari and italic texts.

Andhrabharati commented 11 months ago

One piece of bad news: all my work over the past 4 years has gone missing, due to a hard disk crash. [This is mostly the cdsl stuff, unfortunately.]

In a way, this is good that I don't have to 'bother' Jim with all my 'sundry' and 'extraneous' works. [I think no one might have pestered him, like I did all these days.]

However, it is somewhat consoling that except the PWG set (that happened during last few months, and a "major" work it was!!) and the MW set (thorough revision that's going on), most of the work has already been posted at github.

Andhrabharati commented 11 months ago

I shall try retrieving the data from the crashed disk, but highly doubtful if I could succeed in this.

funderburkjim commented 11 months ago

Wow! That is a difficult situation. Hope you can retrieve the lost work. If I had a similar problem, I would probably try spinrite (google search spinrite hard drive recovery by Steve Gibson at grc.com).

BTW I don't mind being 'pestered' in a good cause. I hope you will be able to resume your useful contributions to the cdsl work.

gasyoun commented 11 months ago

I hope you will be able to resume your useful contributions to the cdsl work.

So do I. You do not use cloud storage @Andhrabharati ?

Andhrabharati commented 11 months ago

I could get almost all the PWG, pwk and SCH stuff done at my end; but the pwkvn (recent) work is lost completely.

Also got the MW recent work back.

sanskrit-lexicon / PWK

unmarked abbreviations #88

