Andhrabharati closed this issue 1 year ago
Some more corrections for "[^ \r\n,;:.([<]{%" also need to be done. These also include the space and transliteration (& tag) corrections.
More often than not, the "[ăĕĭŏŭ]" characters need correction.
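A minimal sketch of how the two patterns above could be used to list candidate lines. The function name and the reason labels are hypothetical, and the brackets inside the character class are escaped only for readability; this is not the script actually used for md.txt.

```python
import re

# A character directly before '{%' that is NOT a space, line break,
# or one of , ; : . ( [ <  (character class taken from the comment above).
BEFORE_ITALIC = re.compile(r'[^ \r\n,;:.(\[<]\{%')
# Vowels carrying a breve, which often need a transliteration correction.
BREVE = re.compile(r'[ăĕĭŏŭ]')


def find_suspects(lines):
    """Return (line_number, reason, text) for every line matching a pattern.

    Line numbers are 1-based, which is an assumption about how the
    change files count lines.
    """
    hits = []
    for lnum, line in enumerate(lines, start=1):
        if BEFORE_ITALIC.search(line):
            hits.append((lnum, 'char-before-italic', line))
        if BREVE.search(line):
            hits.append((lnum, 'breve', line))
    return hits
```

A line such as `sprout{%s;%} combined with ({%in.%}).` is reported because `t` directly precedes `{%`, while `({%in.%})` alone is not, since `(` is in the excluded set.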
@AnnaRybakovaT
I think our next primary task should be adding abbreviation markup to dictionaries, starting with MD.
However, I see that last year @Andhrabharati made several suggestions as to miscellaneous corrections to MD. So, let's address these first in md.txt. Then we'll get to the abbreviation markup.
Instructions to get you started are in https://github.com/sanskrit-lexicon/MD/tree/master/mdissues/issue8 directory, in the 'readme.txt' file.
Just like to say that my current work at MW has 'unearthed' quite a few new abbr.s (either left unmarked, or completely unidentified thus far).
@Andhrabharati When you get lists of these, they can be added to cdsl mw.txt.
Some may be 'local' (i.e. to be marked as <ab n="tooltip">xyz</ab>).
Some may be 'global' (i.e. to be marked as <ab>xyz</ab>, where xyz is added to mwab_input.txt).
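The local/global distinction could be sketched roughly as follows. The function name is hypothetical, and `global_abbrs` is only a stand-in set for whatever abbreviations are listed in mwab_input.txt (the format of that file is not shown in this thread).

```python
from xml.sax.saxutils import escape, quoteattr


def ab_markup(abbr, global_abbrs, tooltip=None):
    """Return <ab> markup for one abbreviation.

    'Global' abbreviations (assumed to be known from mwab_input.txt) get a
    bare <ab> tag; 'local' ones carry their own tooltip in the n attribute.
    """
    if abbr in global_abbrs:
        return f'<ab>{escape(abbr)}</ab>'
    if tooltip is None:
        raise ValueError(f'local abbreviation {abbr!r} needs a tooltip')
    # quoteattr adds the surrounding quotes and escapes the value.
    return f'<ab n={quoteattr(tooltip)}>{escape(abbr)}</ab>'
```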
Dear Jim, I have started examining few.more.corrections.in.MD.txt. There are a lot of cases involving Sanskrit words. Just to be sure, please check some of my solutions:
Line 26718: {%= ri%}ta-jāta.
Anna: ṛta-jāta.
Line 22506: {%of the Hot%}ri
Anna: {%of the Hotṛ%}
Line 39465: company; {@-śrāddha@}, {%n. kind of s%}rāddha.
Anna: company; {@-śrāddha@}, {%n. kind of śrāddha.
I think our next primary task should be adding abbreviation markup to dictionaries, starting with MD.
I would also like to remind you of a task regarding the Greek text (BOP and BUR).
@AnnaRybakovaT
If you look at my posted file (at the top), you will see that there are two varieties of corrections in the first lot:
1) Transliteration (& tag ?) correction
2) Missing space after %}
remind a task regarding Greek text ...
@AnnaRybakovaT please provide link(s) to issue comments
@AnnaRybakovaT
Your example 'few more' lines are in the right direction. Please generate a preliminary 'change_fewmore.txt' file (see readme), and push. That will help me determine if any details regarding your change method need revision.
please provide link(s) to issue comments
https://github.com/sanskrit-lexicon/BOP/issues/1 https://github.com/sanskrit-lexicon/BUR/issues/2
@AnnaRybakovaT There are about 1400 Greek text instances in bop.txt and about 700 Greek text instances in bur.txt.
One workflow for you might be:
How does this plan sound to you?
Dear Jim, The plan sounds great.
MW has 'unearthed' quite a few new abbr.s (either left unmarked, or completely unidentified thus far).
@Andhrabharati where?
generate a preliminary 'change_fewmore.txt' file (see readme),
Dear Jim, Please, check the change_fewmore.txt. All corresponding changes were made in temp_mw_1.txt.
I would like you to double check some cases:
Line 84140: enjoy carnally ({%ac. of pers. or s%}arīra); receive
Line 124385: {#SOri#}¦śaur-i, {%m. pat.%} ({%fr. s%}uāra) {%of%} Vasudeva
Line new (scan has "śūra"): {#SOri#}¦śaur-i, {%m. pat.%} ({%fr.%} śūra) {%of%} Vasudeva
Line 90291: {#mAtf#}¦mā-tṛ́, {%f.%} [{%perhaps%}former {%or children%}:
Line new (scan has "of children"): {#mAtf#}¦mā-tṛ́, {%f.%} [{%perhaps%} former {%of children%}:
And I have added two extra lines:
Line 98606: {%and sam%} ca yoś ca; {%V.%}).
Line new: {%and%} śáṃ ca yoś ca; {%V.%}).
Line 19797: be studied in solitudes).
Line new: {%be studied in solitudes%}).
@AnnaRybakovaT Have now processed corrections, starting with your 'change_few.more.txt'. I had to make a few preliminary adjustments (described in readme.txt). End result is 'change_fewmore.txt' (note slight difference in spelling).
Agreed with most of your changes (including those above that you asked me to review). The places in change_fewmore.txt where I altered your solutions are indicated by searching for ';x'.
These changes were applied to temp_md_0.txt to get temp_md_1.txt, and this has been installed in csl-orig, etc.
Take a look at these and mention if there are any that need further change. Otherwise, this batch of changes can be considered finished.
@AnnaRybakovaT
@Andhrabharati mentioned the likelihood of more corrections with unexpected characters preceding the '{%' italic text markup. Starting with his regex, I found about 200 such cases. The file change_2.txt has these cases.
Request you to examine these cases, making necessary changes to the 'new' lines in change_2.txt.
Take a look at these and mention if there are any that need further change.
Dear Jim, All altered solutions are correct.
All altered solutions are correct
@AnnaRybakovaT I'm confused. I presume you are referring to items in change_2.txt ?
But if so, there are many in change_2.txt that need to be corrected. For example, here are the first two (where I have, for this comment, made the corrections in the 'new' line.)
; <L>340<pc>003-3<k1>aN<k2>aN
1618 old {#aN#}¦a-ṅ, {%aor.%} {%suffix%} -a ({%in%} a-gam-a-t); kṛt{%suffix%} <lbinfo n="8"/>
;
1618 new {#aN#}¦a-ṅ, {%aor.%} {%suffix%} -a ({%in%} a-gam-a-t); kṛt-{%suffix%} <lbinfo n="8"/>
;---------------------------------------------------
; <L>353<pc>004-1<k1>aNkurita<k2>aNkurita
1682 old sprout{%s;%} combined with ({%in.%}).
;
1682 new sprouts; combined with ({%in.%}).
The intended task for you is to make all such required changes in change_2.txt.
If you have already made these changes in your copy of change_2.txt, you need to add/commit/push change_2.txt to github so I can install the changes.
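For orientation, here is a rough sketch of how a change file in the format shown above (comment lines starting with ';', then 'NNNN old ...' and 'NNNN new ...' pairs) could be parsed and applied. The function names are hypothetical, line numbers are assumed to be 1-based, and this is not the project's actual update script.

```python
import re

# '<line number> old <text>' or '<line number> new <text>'
ENTRY = re.compile(r'^(\d+) (old|new) (.*)$')


def parse_changes(change_lines):
    """Return {line_number: (old_text, new_text)} from change-file lines."""
    olds, news = {}, {}
    for raw in change_lines:
        line = raw.rstrip('\r\n')
        if line.startswith(';') or not line.strip():
            continue  # comment or blank line
        m = ENTRY.match(line)
        if not m:
            raise ValueError(f'unrecognized change line: {line!r}')
        lnum, kind, text = int(m.group(1)), m.group(2), m.group(3)
        (olds if kind == 'old' else news)[lnum] = text
    if olds.keys() != news.keys():
        raise ValueError('every old line needs a matching new line')
    return {n: (olds[n], news[n]) for n in sorted(olds)}


def apply_changes(text_lines, changes):
    """Replace each changed line, checking first that the old text matches."""
    out = list(text_lines)
    for lnum, (old, new) in changes.items():
        if out[lnum - 1] != old:
            raise ValueError(f'line {lnum} does not match the old text')
        out[lnum - 1] = new
    return out
```

Checking the 'old' text before replacing guards against applying a change file to the wrong version of the text.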
@AnnaRybakovaT Now I see why I was confused 😕 No doubt you were referring to 'change_fewmore.txt'.
No doubt you were referring to 'change_fewmore.txt'.
Dear Jim, yes, exactly.
Request you to examine these cases, making necessary changes to the 'new' lines in change_2.txt.
Dear Jim, Please check the file change2.txt. I have some comments:
1) There are some cases with asterisks. I didn't change them: {%in., %} {%m.,%} {%n.%} {%g.%}
2) I have noticed that there is no single standard for the numbers used to number centuries and classes. Compare (¤6¤{%th century%} and {%9th century%}
3) Hyphenation. Please pay attention to these cases; it is probably necessary to make some extra changes:
;
--
;
--
;
-
;
@AnnaRybakovaT Am beginning review of change_2. Will present responses to your comments/questions.
I'm not sure of the significance of the asterisk in md dictionary. Suggest you carefully read the front matter where there may be some mention of the asterisk. md front matter.
You no doubt noticed that {%X%} in md.txt represents italic text.
Also, the abbreviations in md.txt are presented in italic text (see front matter for the author's list of abbreviations).
We will later add 'ab' markup to abbreviations (so that users of the dictionary displays may have tooltips).
Thus, we will recode {%in.%} to {%<ab>in.</ab>%} since 'in.' is the abbreviation for 'instrumental.'
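As a rough illustration of that recoding, the sketch below wraps the content of an italic span in 'ab' tags when the whole span is a known abbreviation. The function name and the `abbreviations` set are hypothetical; the real list would come from the author's list in the front matter.

```python
import re

# One italic span: {% ... %}
ITALIC = re.compile(r'\{%(.*?)%\}')


def add_ab_markup(line, abbreviations):
    """Wrap italic spans whose entire content is a known abbreviation."""
    def repl(match):
        inner = match.group(1)
        if inner in abbreviations:
            return '{%<ab>' + inner + '</ab>%}'
        return match.group(0)  # leave every other italic span untouched
    return ITALIC.sub(repl, line)
```

With `abbreviations = {'in.'}`, the span `{%in.%}` becomes `{%<ab>in.</ab>%}`, while a span such as `{%in., %}` is left alone because its content is not exactly `in.`.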
To add this markup, it will be advantageous for trailing punctuation to be outside the ending italic markup.
For example, {%in., %} should be changed to '{%in.%}, '.
I made such changes in change_2.txt at lines 28423, 28721.
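A minimal sketch of this kind of change, assuming that only a trailing comma, semicolon, or colon (plus any following spaces) should move out of the italic span, and that a final period stays inside because it may belong to the abbreviation. The function name is hypothetical, and each result would still need review by eye.

```python
import re

# Italic span ending in , ; or : (optionally followed by whitespace).
# [^%]+? keeps the match inside a single {%...%} span.
TRAILING = re.compile(r'\{%([^%]+?)([,;:]\s*)%\}')


def move_trailing_punct(line):
    """Move trailing , ; : (and following spaces) after the closing %}."""
    return TRAILING.sub(r'{%\1%}\2', line)
```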
I don't see a problem with the '*' preceding {%.
@maltenth mentions (at top of md.txt) the use of ¤X¤ markup.
There are 3390 instances of ¤ character.
I am not sure of the meaning of this markup.
It often occurs within italic markup {%x¤y¤z%}, so one speculation is that it represents non-italic text within italic text. This is just a guess.
The make_xml.py process removes this character, so the ¤ character has no impact on current displays of md.
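The sketch below shows one way to count the ¤ characters, list the ¤X¤ spans, and strip the marker. It only mirrors the effect described above; it is not the actual make_xml.py code, and the function names are hypothetical.

```python
import re

# Text enclosed between a pair of ¤ markers.
MARKED = re.compile(r'¤([^¤]*)¤')


def count_marker(lines):
    """Total number of ¤ characters in the given lines."""
    return sum(line.count('¤') for line in lines)


def marked_spans(line):
    """Return the texts X found in ¤X¤ markup on one line."""
    return MARKED.findall(line)


def strip_marker(line):
    """Drop every ¤ character, keeping the enclosed text."""
    return line.replace('¤', '')
```

Listing the marked spans per line may help when checking what the markup is actually used for before deciding how to treat it.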
@Andhrabharati Any idea on this?
These are end-of-line hyphenations. I changed the previous (or next) line also. You can examine change_2.txt for these items to see the idea.
You can see my few changes to your change_2.txt in the 2a03c56 commit link above.
@AnnaRybakovaT there are always more things that we could do; but let's stick to our plan above and call this issue DONE!
Next step for you will be the proofread of BUR dictionary Greek. I'll start a new issue with details of how to get started.
I am not sure of the meaning of this markup. ……… @Andhrabharati Any idea on this?
@funderburkjim
There appear to be multiple indications for this marking; it is something like the exercise I did in ls-orphan identification in MW recently (leading to multiple new tags)!!
Before we discuss more on this, there are 25 lines in the present CDSL md.txt having numbers without this markup; and all of them need some correction. md lines having numbers without the ¤ symbol marking.txt
BTW, you seem to have removed the marking at the HW level while making the meta-lines; this markup (incl. the numbers) is missing at all those homonym number places!!
They all need to be present in the header lines as well, even if in the meta-lines, just like in the mw.txt
Yes, I had noticed that missing homonym problem. Thanks for reminding me of it. I'll correct that problem.
You also need to correct the 25 lines as given in my file above, @funderburkjim !
What did we miss?
Are you talking about the "numbers without symbol..." file?
yes.
Corrections made based on md.lines.... file.
@funderburkjim Elsewhere in the file the fractions are enclosed within round brackets, so I also followed the same style.
I see that your present corrections did not consider the same. Any specific reason??
I'm not sure of the significance of the asterisk in md dictionary.
Dear Jim, The asterisk is mentioned in the preface - 5
These are end-of-line hyphenations. I changed the previous (or next) line also. You can examine change_2.txt for these items to see the idea.
Dear Jim, Could you kindly explain what the numbers 6 and 3 mean in, for example:
<lbinfo n="6"/>
<lbinfo n="3"/>
{#purUcI#}¦purūc-ī, {%a. f.%} ({%of%} *puru‡a{%nk,%} extending <lbinfo n="6"/>
{%attendant on Śiva%}; {@-mathana,@} {%a.%} ({@ī@}) tormenting, <lbinfo n="3"/>
@AnnaRybakovaT Thanks for that info on asterisk. When we add abbreviation markup to md.txt, we may also want to add markup to those asterisks.
The lbinfo tag is used in relation to hyphenation at end of lines.
The value of the 'n' attribute (in current md.txt markup) is the number of characters in the hyphenated word preceding the hyphenation '-'. Example: Suppose the printed text is
Here is an example of hyph-
enation before lbinfo markup.
And here are those two lines with lbinfo markup.
Here is an example of hyphenation <lbinfo n="4"/>
before lbinfo markup.
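The convention just described can be sketched in both directions. The function names are hypothetical, the input is assumed to be two consecutive printed lines (or two consecutive marked-up lines), and `len()` counts Unicode code points, which may or may not agree with how 'n' was counted in md.txt for letters with diacritics.

```python
import re


def join_hyphenated(line1, line2):
    """Join a word hyphenated at the end of line1 and record the break."""
    if not line1.endswith('-'):
        raise ValueError('line1 does not end with a hyphen')
    head = line1[:-1].split(' ')[-1]      # part of the word before '-'
    tail, _, rest = line2.partition(' ')  # part of the word on the next line
    return f'{line1[:-1]}{tail} <lbinfo n="{len(head)}"/>', rest


LBINFO = re.compile(r'^(.*\s)?(\S+) <lbinfo n="(\d+)"/>$')


def split_hyphenated(line1, line2):
    """Inverse of join_hyphenated: restore the printed line break."""
    m = LBINFO.match(line1)
    if not m:
        raise ValueError('line1 does not end with numeric lbinfo markup')
    prefix, word, n = m.group(1) or '', m.group(2), int(m.group(3))
    return f'{prefix}{word[:n]}-', f'{word[n:]} {line2}'
```

Applied to the two printed lines above, `join_hyphenated` yields the two lbinfo lines shown, and `split_hyphenated` turns them back into the printed form, which is what makes comparison with the scan possible.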
Also appearing in md.txt is markup like <lbinfo n="[]"/>. First example:
<L>4<pc>001-1<k1>aMSakalpanA<k2>aMSakalpanA
{#aMSakalpanA#}¦aṃśa-kalpanā, {%f.%} arrangement <lbinfo n="[]"/>
of shares.
<LEND>
A similar markup convention is also seen in burnouf.
In ap90, lbinfo is used in a different way.
<L>1<pc>0001-a<k1>a<k2>a
{#a#}¦ The first letter of the Nāgarī
Alphabet. {#--aH#} [{#avati, atati#} <lbinfo n="sAta#tvena"/>
{#sAtatvena tizWatIti vA; av-at vA, qa#} <ls>Tv.</ls>] {@1@} <ab>N.</ab>
of Viṣṇu, the first of the three
sounds constituting the sacred
syllable {#om#}; {#akAro vizRuruddizwa#} <lbinfo n="ukAra#stu"/>
{#ukArastu maheSvaraH . makArastu smfto brahmA praRavastu#}
{#trayAtmakaH ..#}; for more explanation of
the three syllables {#a, u, m#} see {#om#}. {@--2@}
<ab>N.</ab> of Śiva, Brahmā, Vāyu, or <lbinfo n="Vaiśvā+nara"/>
Vaiśvānara. {%--<ab>ind.</ab>%} {@1@} A prefix corresponding
to Latin {%in%}, <ab>Eng.</ab> {%in%} or {%un%}, <ab>Gr.</ab> {%a%} or
{%an%}, and joined to nouns, adjectives,
indeclinables (or even to verbs) as
a substitute for the negative <lbinfo n="parti+cle"/>
particle {#naY#}, and changed to {#an#} before
vowels except in the word {#a-fRin#}.
The senses of {#na#} usually <lbinfo n="enumerat+ed"/>
enumerated are six- ({%a%}) {#sAdfSya#} ‘likeness’ or
Markup to represent linebreaks is used only in a few of the Cologne digitizations. For example, it does not occur in the mw.txt digitization. Currently lbinfo markup is ignored in displays.
lbinfo markup is only possible for those dictionaries for which the original digitization by @maltenth preserved the original lines of the printed text.
The adherence to printed lines (with attendant line breaks) in a digitization is not without controversy. Such adherence is useful for purposes of correcting digitization mistakes, since the comparison with printed text is direct.
But such adherence is an obstacle to the objective of optimizing the user experience of a dictionary. Starting with this comment, @vvasuki points out advantages possible by removing this adherence to printing conventions.
Here is another lot that needs addressing, @funderburkjim. few more corrections in MD.txt
You might get this also done through @AnnaRybakovaT, by guiding her.