Closed funderburkjim closed 10 months ago
@Andhrabharati Your corrections now installed locally. Nice detective work with 'a. u.' I'll post changes later. These are in change_pw_6.txt and change_pw_ab_6.txt
@Andhrabharati I am beginning work to extract your <is> changes.
Question: 424 matches in 416 lines for "</is> <is>"
Was this intentional?
For instance, 16 matches for "<is>Manu</is> <is>Vaivasvata</is>"
Why not code this as <is>Manu Vaivasvata</is>?
And similarly for the rest of the 424?
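The two figures in the question (matches versus lines) can be reproduced with a short script. This is only a minimal sketch, assuming the text is available as a list of lines; `count_adjacent_is` is a hypothetical helper name and not part of the CDSL tooling.

```python
import re
from collections import Counter

# Two <is> elements separated by exactly one space, e.g.
# "<is>Manu</is> <is>Vaivasvata</is>". An attribute such as n="..." is allowed.
# The second element sits in a lookahead so that chains like A B C yield
# two matches, the same as counting occurrences of "</is> <is>".
ADJACENT = re.compile(
    r"<is(?: [^>]*)?>([^<]*)</is> (?=<is(?: [^>]*)?>([^<]*)</is>)"
)


def count_adjacent_is(lines):
    """Return (total matches, number of lines with a match, Counter of pairs)."""
    total = 0
    hit_lines = 0
    pairs = Counter()
    for line in lines:
        found = ADJACENT.findall(line)  # list of (first, second) tuples
        if found:
            hit_lines += 1
            total += len(found)
            pairs.update(found)
    return total, hit_lines, pairs
```

The `pairs` counter gives a per-pair tally, so a pair such as Manu + Vaivasvata can be reviewed separately from the rest.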
It is intentional, @funderburkjim; most of the time the first (or sometimes the second) entity denotes the class/category [Manu, Asura, Apsara etc.] under which the "proper" name is mentioned as the other entity.
Of course, I had clubbed a few of these in my next version (AB v2), like <is>Bṛhat Sāman</is>, <is>Uttara Phalgunī</is> and <is>Viśve Devās</is>, for they do not come under the above criteria!
And in a few cases, they should never be clubbed together, such as when the second entity has a trailing possessive ʼs, or is a <is>gaṇa</is> [a group in the Gaṇapāṭha] of which the first entity is a member.
BTW, you may note that the CDSL version has all these marked as individual entities only!
<iw> is a new tag (26 instances). What is it for?
You will get the answer at this post, @funderburkjim !
<is> and <iw> tag corrections
Changes made to cdsl version to agree with the AB version.
The cdsl and AB versions now agree in both global and local instances of the <is> tags, and the much fewer (25) <iw> tags.
These changes correct spelling and markup errors related to the widely spaced text appearing in PW text. Thanks to @Andhrabharati for identifying these errors and providing corrections.
In #95 @Andhrabharati identifies errors in the scope of italic text. I'll take these up next.
- change_pw_ab_6a.txt Only 12 changes. @Andhrabharati should confirm these.
I prefer to differ in 4 cases out of 12, as mentioned below--
(384812, 384871)
Uttara is with short a when the constellations Phalgunī and Bhādrapadā are being mentioned (which also have corresp. Pūrva, with a short a)
See the cases of Phalgunī at 44781, 72324, 72504, 336638, 370552, 530146.
So these two should be treated as print errors.
-----------------------
(483261)
Fürstin (Princess) occurs with a possessive form 'der' (of) of a place/State; so "Nepalʼs" is the correct form (a print error), and not "Nepals" (plural form)
-----------------------
(614332)
Marut is also a class (whose count is 7) such as Manu, Vasu, Indra, Rudra etc.; so the proper names of its members are better mentioned separately.
[Of course 614323 could be misleading with the small initial in the name!!]
- is_local6a.txt (about 100 distinct instances).
<is n="Sarasvatiī">S.</is> -> <is n="Sarasvatī">S.</is>
and thus the count for <is n="Sarasvatī">S.</is> is to be changed to 3.
- is_glob6a.txt provides a frequency list (about 3100 distinct instances)
The file has 3700 entries, not 3100.
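A frequency list of this kind can be rebuilt from the text to check the entry count. This is a sketch under assumptions: it treats `<is>text</is>` as a global instance and `<is n="full">abbr</is>` as a local one, and `is_frequencies` is a hypothetical name. The actual layout of is_glob6a.txt and is_local6a.txt is not shown in this thread.

```python
import re
from collections import Counter

# Optional n="..." attribute (local form), then the tagged text.
IS_TAG = re.compile(r'<is(?: n="([^"]*)")?>([^<]*)</is>')


def is_frequencies(lines):
    """Count global <is>text</is> and local <is n="full">abbr</is> instances."""
    glob, local = Counter(), Counter()
    for line in lines:
        for full, text in IS_TAG.findall(line):
            if full:
                local[(full, text)] += 1  # abbreviation with its expansion
            else:
                glob[text] += 1           # plain widely spaced name
    return glob, local
```

`len(glob)` is then the number of distinct global instances, and `glob.most_common()` gives the list in frequency order.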
@Andhrabharati Please look again at (384812, 384871).
The short-a appears as a compound: 5 matches for "<is>Uttaraphalgunī".
But 384812 and 384871 appear as adjective + noun, so long-a is proper: Uttarā Phalgunī
So I think these two should NOT be changed.
Agree?
@Andhrabharati Please look again at (384812, 384871). ... ... ... So I think these two should NOT be changed. Agree?
Reg. 384812: While pwk has it as
the parent PWG had it as
and the "close follower" MW has it as
Reg. 384871: While pwk has it as
the parent PWG had it as [though this doesn't contain the word Uttara, the preceding (and deciding!) german adj. späteren should be helpful.]
and the "close follower" MW has it as
With the above snippets, you may decide what to keep in pwk, @funderburkjim.
Finally, a general observation from my side: Though pwk is somewhat better than PWG in having a consistent style of marking and 'framing' the text, it has quite many print errors; so one has to be pretty vigilant before taking its text for "granted".
Next versions : temp_pw_7.txt (CDSL) (see csl-orig commit above) and temp_pw_ab_7.txt (Andhrabharati) temp_pw_ab_7.zip
These versions agree in the COUNT of italic text sections for each entry.
ablists/readme.txt provides some guidance as to the method(s).
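The per-entry agreement in italic counts can be checked mechanically. The following is a minimal sketch, not the method described in ablists/readme.txt. It assumes that each entry starts with a metaline beginning with `<L>` and ends with a line beginning with `<LEND>`, and that italic text opens with `{%` (the thread itself only shows the closing `%}`). The function names are hypothetical.

```python
import re

L_NUM = re.compile(r"<L>([^<]+)")


def italic_counts(lines):
    """Map entry id -> number of italic openers '{%' in that entry's body."""
    counts = {}
    current = None
    for line in lines:
        if line.startswith("<L>"):
            current = L_NUM.match(line).group(1)
            counts[current] = 0
        elif line.startswith("<LEND>"):
            current = None
        elif current is not None:
            counts[current] += line.count("{%")
    return counts


def italic_count_diffs(cdsl_lines, ab_lines):
    """Return the entry ids whose italic counts differ between two versions."""
    a, b = italic_counts(cdsl_lines), italic_counts(ab_lines)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
```

An empty result from `italic_count_diffs` would correspond to the agreement stated above.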
<is>Uttarā Phalgunī</is> changed to <is>Uttara Phalgunī</is> in temp_pw_ab_7.txt.
As in previous version, request @Andhrabharati to review change_pw_ab_7.txt.
Version 8 is anticipated to focus on a few punctuation pattern differences between cdsl and AB versions.
Regarding punctuation changes, this comment in issue95 seems a good place to start.
Part 3 includes the additions from AB at this comment above.
* `<is>Uttarā Phalgunī</is>` changed to `<is>Uttara Phalgunī</is>` in temp_pw_ab_7.txt
* These two changes NOT made in temp_pw_7.txt.
Though I did not explicitly disagree above, I had expected that you'd change these two instances based on my reply, @funderburkjim !!
As in previous version, request @Andhrabharati to review change_pw_ab_7.txt.
I will take a look at these tomorrow morning, as I am just about to retire for the day.
As in previous version, request @Andhrabharati to review change_pw_ab_7.txt.
I will take a look at these tomorrow morning
Checked a few at random, and saw that I had deliberately skipped those in my quicker pass [probably I also should've taken the step-by-step process as adopted by Jim!], intending to do a full (slower) reading of the text against the print later, as that was felt necessary for other reasons.
As such, all these could go "as corrected by Jim" into the file.
probably I also should've taken the step-by-step process as adopted by Jim
Good to hear such words finally from you.
It was just a passing statement, @gasyoun !!
The volume of work vs. time spent (at my end) does not let me take this path.
Whatever is the time taken, errors are bound to be there. [Noticed a few errors in Jim's work of change_pw_ab_7.txt listing just about 300 instances!] So, I give preference to covering more "volume" in lesser "time".
Except for a few cases (which I am changing), periods come AFTER the close of italic text. e.g '%}.' in AB's version. [And I intend to modify the cdsl version similarly.]
For comma and semicolon, normally the punctuation is shown BEFORE the close of italic text.
3131 matches in 2954 lines for ";%}" in buffer: temp_pw_ab_7.txt
1 match for "%};" in buffer: temp_pw_ab_7.txt
5956 matches in 5671 lines for ",%}" in buffer: temp_pw_ab_7.txt
10 matches for "%}," in buffer: temp_pw_ab_7.txt
I think we should put the comma, semicolon AFTER. @Andhrabharati Agree?
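Counts like the ones quoted above can be collected in one pass. This is a small sketch, assuming the text is a list of lines and that `%}` closes italic text as in the examples; `punctuation_placement` is a hypothetical name.

```python
def punctuation_placement(lines, marks=",;."):
    """For each mark, count occurrences just before and just after '%}'."""
    result = {}
    for mark in marks:
        inside = sum(line.count(mark + "%}") for line in lines)   # e.g. ",%}"
        outside = sum(line.count("%}" + mark) for line in lines)  # e.g. "%},"
        result[mark] = {"inside": inside, "outside": outside}
    return result
```

Comparing the `inside` and `outside` figures per mark shows which placement is the dominant convention and which cases are the exceptions to review.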
few errors in Jim's work of change_pw_ab_7.txt
I'd like to correct such errors -- would you identify the errors you noticed?
Space lacking? @funderburkjim
I think we should put the comma, semicolon AFTER. @Andhrabharati Agree?
No, I feel they should be within the italics only, as the print has those characters 'slanted' at those places mostly [probably some exceptions may be present due to oversight].
I'd like to correct such errors -- would you identify the errors you noticed?
I do not want to spend any time on this (as this is a trivial issue); but I do stand by my above statement (having noticed them while changing the text in my file as per your correction).
@funderburkjim
I have finished marking the full pw set (PWG, PWGVN, pwk, pwkvn and sch) all in a similar format, incl. abbr./lex./ls./... taggings. [I had done more than what I did in PWG earlier (in my own format) now in CDSL (close to!) format; you may recall our discussion two years back]
Now I thought of going to the "pure" Skt. works SKD and VCP, and noticed that the conversion/transcoding scripts you had provided for pw family do not work on them.
Would you please take some time to make the conversion scripts for these, to enable me to 'transfer' the work I did previously in my (self-converted) files [obviously in a widely differing format!] to CDSL files?
And I shall post my above files of the pw set, whenever you would be willing to "work" on/with them.
BTW, wanted to have your opinion about a piece of work in PWG--
abbr. expansion as another abbr.!! is it alright and acceptable?
<ab n="Nom.">N.</ab> 6 times under L-238
<ab n="Voc.">V.</ab> thrice under L-238
<ab n="vgl.">v.</ab> once under L-46007
or should they be expanded to their 'full' forms?
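Cases like these can be listed automatically for review. The sketch below uses a simple heuristic that is an assumption of this example: an expansion in the `n` attribute that itself ends in a period is treated as "another abbreviation". `abbreviated_expansions` is a hypothetical name.

```python
import re
from collections import Counter

AB_TAG = re.compile(r'<ab n="([^"]*)">([^<]*)</ab>')


def abbreviated_expansions(lines):
    """Count <ab> tags whose n= expansion itself ends in a period."""
    hits = Counter()
    for line in lines:
        for expansion, shown in AB_TAG.findall(line):
            if expansion.endswith("."):
                hits[(shown, expansion)] += 1
    return hits
```

The heuristic can over-report, since a full expansion may legitimately end in a period, so the output is a list of candidates rather than confirmed errors.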
or should they be expanded to their 'full' forms?
Makes sense
Now I thought of going to the "pure" Skt. works SKD and VCP
Amazing. What aspects do you plan to cover?
On second thought, the print text could probably be changed from the present single-letter form to the three-letter abbr. form, and then follow the 'regular' process of ab expansion.
These are the exceptional cases noticed while I was on PWG recently.
135334 lines of pw changed (temp_pw_8.txt)
See punct/readme.txt for discussion.
19 lines changed: temp_pw_ab_8.zip See also change_pw_ab_8.txt.
BTW, wanted to have your opinion about a piece of work in PWG
My opinion at the moment:
post my above files of the pw set
There are still several aspects of your revision to PW(K) that remain for me to resolve. When these are done, that will be the time for me to do similar convergence of cdsl to your versions for PWG, PWKVN, and SCH.
to do similar convergence of cdsl
How time-consuming are these?
@funderburkjim punctuation cleanup intended of the rest of the family, right? https://github.com/sanskrit-lexicon/PWK/blob/master/pwkissues/issue88/punct/readme.txt
punctuation cleanup intended of the rest of the family, right?
Yes, That seems to be the focus now.
Continue with resolution of differences between CDSL and AB versions.
local cdsl version temp_pw_9e.txt, installed at Cologne at commit 576e9ca. About 7% of lines changed (49000+ lines).
AB version temp_pw_ab_9.zip 161 lines changed. @Andhrabharati see change_pw_ab_9.txt for revisions to your version.
Work was done in the zoobot directory.
For a summary of the changes, see readme_zoobot_summary.md
I'll continue this resolution process -- while often tedious, I find this interesting. The work done by AB reminds me of the 'early days' of work with MW.
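The changed-line figures quoted above (a line count and a percentage) can be computed by comparing two versions line by line. This is a minimal sketch assuming both versions have the same number of lines and correspond line for line; the function names are hypothetical, and the format of the change_pw files is not reproduced here.

```python
def changed_lines(old_lines, new_lines):
    """Return (1-based line number, old, new) for each line that differs."""
    if len(old_lines) != len(new_lines):
        raise ValueError("versions must have the same number of lines")
    return [
        (i, old, new)
        for i, (old, new) in enumerate(zip(old_lines, new_lines), start=1)
        if old != new
    ]


def percent_changed(old_lines, new_lines):
    """Share of lines that differ, as a percentage of the old line count."""
    return 100.0 * len(changed_lines(old_lines, new_lines)) / len(old_lines)
```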
@funderburkjim
I'll continue this resolution process -- while often tedious, I find this interesting.
@funderburkjim Glad to see you spending time in the process. I presume it is a worthwhile exercise [and like to remind that I had proposed the same long back, but somehow those days the idea was not entertained for some reason(s)].
to do similar convergence of cdsl
How time-consuming are these?
@gasyoun The way pw changes are happening (I took about 3 weeks for pwk & pwkvn and Jim is still on pwk alone for nearly 3 months now), I estimate the process to take anything from 6 months to over a year at Jim's end. [I took about 6 weeks for PWG and a week on SCH (and left it there "unfinished").]
Just wondering if I should take this much time from Jim, to engage him on a "single" work [contrary to what I have said above, that it could be a worthwhile spending!!].
Based primarily on the 4 points mentioned above. See part 3 of change_pw_9e.txt.
Based on your comment 4, you have discovered new bot and zoo.
Perhaps you could provide your latest file, from which I could harvest these new plants and animals.
Perhaps you could provide your latest file, from which I could harvest these new plants and animals.
My latest file has over 144k changes wrt the prev. file, so I am afraid you won't like so many changes.
The primary change is in the pc value of the metalines-- the p being split as vol-page (just as in PWG, SKD etc.) and the c being changed to (a,b,c) from (1,2,3) (as in SKD etc.); but this change might require you to correct in multiple places wherever pwk display is involved.
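To illustrate the kind of pc rewrite described here, the following sketch converts a single value. Both layouts are assumptions made for this example only, since the thread does not show them: the old value is taken to be one volume digit, three page digits, a hyphen and a column digit (a hypothetical `1001-1`), and the new value is written as volume-page-column letter. `convert_pc` is a hypothetical name.

```python
import re

# Assumed old layout: <volume digit><3 page digits>-<column 1|2|3>
PC_OLD = re.compile(r"^(\d)(\d{3})-([123])$")
COLUMN = {"1": "a", "2": "b", "3": "c"}


def convert_pc(pc):
    """Split the page part into vol-page and map column 1,2,3 to a,b,c."""
    m = PC_OLD.match(pc)
    if m is None:
        raise ValueError(f"unexpected pc value: {pc!r}")
    vol, page, col = m.groups()
    return f"{vol}-{page}-{COLUMN[col]}"
```

Raising on unexpected values, rather than passing them through, makes it easier to spot metalines that do not fit the assumed layout.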
The second change is in ls marking, with quite many corrections and mergers(!), esp. where the Chr. is "padded" to plain number-chains.
... ... ...
So, I thought I should restrict myself to give just the relevant (additional) bot and zoo strings to mark in your file.
Addl. zoo tag entities: 62
Acheriris Kokor Zibha (1)
Antilope cervicapra (1)
Ardea Argala (1)
Ardea nivea (20)
Ardea sibirica (27)
Coluber Naga (8)
Lacerta Godica (1)
Noctua indica (1)
Tantalus flacinellus (1)
Unguis odoratus (1)
Hope this is alright for you!!
Additional bot tags also?
It may be noted that many of the above zoo tags (if not all) are changed from bot tags!
Presuming that you have some mechanism of correlating the pw_ab_9.txt with pw.txt, here are the pw_ab_9 lines containing the addl. bot tags-- Addl. bot tags (in pw_ab_9 lines).txt
I found different counts for:
Antilope cervicapra 1 2 (1 AB above, 2 CDSL)
Ardea nivea 20 21
Lacerta Godica 1 3
Unguis odoratus 1 24
These differences in both pw_9e and pw_ab_9
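A count comparison like the one above can be generated directly from the two files. This is a sketch assuming the entities are marked as `<zoo>...</zoo>` (and `<bot>...</bot>`) with no attributes; `tag_counts` and `count_differences` are hypothetical names.

```python
import re
from collections import Counter


def tag_counts(lines, tag):
    """Count each distinct text found inside <tag>...</tag>."""
    pattern = re.compile(rf"<{tag}>([^<]*)</{tag}>")
    counts = Counter()
    for line in lines:
        counts.update(pattern.findall(line))
    return counts


def count_differences(counts_a, counts_b):
    """Return {entity: (count_a, count_b)} for entities whose counts differ."""
    keys = counts_a.keys() | counts_b.keys()
    return {
        k: (counts_a[k], counts_b[k])  # a Counter yields 0 for a missing key
        for k in sorted(keys)
        if counts_a[k] != counts_b[k]
    }
```

Running `count_differences(tag_counts(ab_lines, "zoo"), tag_counts(cdsl_lines, "zoo"))` would list only the entities that still need attention.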
So, I was somewhat careless while doing this portion!!
A quick relooking into the data at my end resulted in the following zoo tag counts--
Acheriris Kokor Zibha (1)
Antilope cervicapra (2)
Antilope picta (3)
Ardea Argala (1)
Ardea Govinda (1)
Ardea jaculator (2)
Ardea nivea (21)
Ardea sibirica (27)
Ardea virgo (2)
Coluber Naga (8)
Coluber naga (2)
Lacerta Godica (3)
Musculus cucullaris s. trapezius (1)
Noctua indica (1)
Tantalus flacinellus (1)
Turdus Ginginianus (1)
Turdus Salica (4)
Turdus ginginianus H. (1)
Turdus macrourus (3)
Unguis odoratus (25)
Cdsl version (= local temp_pw_9f.txt) and AB version (revised local pw_ab_9.txt) now agree in bot/zoo tags (8259 bot tags, 96 zoo tags).
Upon return to this issue, I intend to harvest the <hom> tag markup from AB's version.
Then resolve all differences in Devanagari and italic texts.
One piece of bad news: all my work over the past 4 years has gone missing, due to a hard disk crash. [This is mostly the cdsl stuff, unfortunately.]
In a way, this is good that I don't have to 'bother' Jim with all my 'sundry' and 'extraneous' works. [I think no one might have pestered him, like I did all these days.]
However, it is somewhat consoling that except the PWG set (that happened during last few months, and a "major" work it was!!) and the MW set (thorough revision that's going on), most of the work has already been posted at github.
I shall try retrieving the data from the crashed disk, but highly doubtful if I could succeed in this.
Wow! That is a difficult situation. Hope you can retrieve the lost work. If I had a similar problem, I would probably try spinrite (google search spinrite hard drive recovery by Steve Gibson at grc.com).
BTW I don't mind being 'pestered' in a good cause. I hope you will be able to resume your useful contributions to the cdsl work.
I hope you will be able to resume your useful contributions to the cdsl work.
So do I. You do not use cloud storage @Andhrabharati ?
I could get almost all the PWG, pwk and SCH stuff done at my end; but the pwkvn (recent) work is lost completely.
Also got the MW recent work back.
There are quite a few unmarked abbreviations in pw.txt. Derive a procedure for identifying and marking many of these.
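One possible starting point, offered only as a sketch and not as the procedure eventually adopted: harvest the abbreviations that are already marked with `<ab>` somewhere in the file, then look for the same strings where they occur without markup. The function name and the matching rules are assumptions of this example.

```python
import re
from collections import Counter

AB_TAG = re.compile(r"<ab(?: [^>]*)?>([^<]*)</ab>")


def unmarked_abbreviations(lines):
    """Count occurrences of already-known abbreviations that lack <ab> markup."""
    known = {text for line in lines for text in AB_TAG.findall(line)}
    if not known:
        return Counter()
    # Longest first, so a longer abbreviation wins over a shorter prefix of it.
    ordered = sorted(known, key=len, reverse=True)
    alternatives = "|".join(re.escape(k) for k in ordered)
    # Must not be preceded by a word character or a period.
    candidate = re.compile(rf"(?<![\w.])(?:{alternatives})")
    found = Counter()
    for line in lines:
        bare = AB_TAG.sub(" ", line)  # drop text that is already marked
        found.update(candidate.findall(bare))
    return found
```

The result is a candidate list only: a string such as `Adj.` at the end of a sentence, or inside another tag, would still need manual review before marking.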