More corrections-- iast related etc.

Andhrabharati commented 2 years ago

Here is another lot that needs addressing, @funderburkjim. few more corrections in MD.txt

You might get this also done through @AnnaRybakovaT, by guiding her.

Andhrabharati commented 2 years ago

Some more corrections for "[^ \r\n,;:.([<]{%" also need to be done. These also include the space and transliteration (& tag) corrections.
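
A minimal sketch of how that pattern might be run over md.txt in Python. Only the regex comes from the comment above; the function name `find_suspect_italics` and the way the file is loaded are illustrative assumptions.

```python
import re

# Pattern from the comment above: any character other than space, line end,
# common punctuation, '(', '[' or '<', directly followed by the italic opener '{%'.
SUSPECT = re.compile(r'[^ \r\n,;:.(\[<]\{%')

def find_suspect_italics(lines):
    """Return (line_number, line) pairs that match SUSPECT (1-based numbering)."""
    return [(n, line) for n, line in enumerate(lines, start=1)
            if SUSPECT.search(line)]

if __name__ == "__main__":
    # Assumes md.txt is UTF-8 and in the current directory.
    with open("md.txt", encoding="utf-8") as f:
        for n, line in find_suspect_italics(f.read().splitlines()):
            print(n, line)
```

A `{%` at the very start of a line is not reported, because the pattern requires one preceding character.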

Andhrabharati commented 2 years ago

More often than not, the "[ăĕĭŏŭ]" characters need correction.
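
A small Python sketch for listing the lines that contain these breve vowels. It assumes the text may hold either precomposed letters or a base vowel followed by a combining breve, so each line is normalized to NFC first; the function name `lines_with_breve` is hypothetical.

```python
import re
import unicodedata

# Precomposed a, e, i, o, u with breve, written as escapes to avoid
# any ambiguity about how this source file itself is normalized.
BREVE_VOWELS = re.compile('[\u0103\u0115\u012d\u014f\u016d]')

def lines_with_breve(lines):
    """Return (line_number, [breve vowels found]) for each line that has any."""
    hits = []
    for n, line in enumerate(lines, start=1):
        # NFC turns 'a' + U+0306 (combining breve) into the precomposed letter.
        found = BREVE_VOWELS.findall(unicodedata.normalize('NFC', line))
        if found:
            hits.append((n, found))
    return hits
```

Each reported line still has to be checked against the scan; the sketch only locates candidates.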

funderburkjim commented 1 year ago

@AnnaRybakovaT

I think our next primary task should be adding abbreviation markup to dictionaries, starting with MD.

However, I see that last year @Andhrabharati made several suggestions as to miscellaneous corrections to MD. So, let's address these first in md.txt. Then we'll get to the abbreviation markup.

Instructions to get you started are in https://github.com/sanskrit-lexicon/MD/tree/master/mdissues/issue8 directory, in the 'readme.txt' file.

Andhrabharati commented 1 year ago

Just like to say that my current work at MW has 'unearthed' quite a few new abbr.s (either left unmarked, or completely unidentified thus far).

funderburkjim commented 1 year ago

@Andhrabharati When you get lists of these, they can be added to cdsl mw.txt. Some may be 'local' (i.e. to be marked as <ab n="tooltip">xyz</ab>)

Some may be 'global' (i.e. to be marked as <ab>xyz</ab>, where xyz is added to mwab_input.txt)
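
The local/global distinction could be sketched in Python as below. The function name `mark_abbreviation` is hypothetical, and the set of global abbreviations merely stands in for whatever is listed in mwab_input.txt, whose format is not described here.

```python
from xml.sax.saxutils import quoteattr

def mark_abbreviation(abbr, global_abbrs, local_tooltips):
    """Wrap abbr in <ab> markup.

    global_abbrs:   set of abbreviations assumed to be defined globally.
    local_tooltips: dict mapping an abbreviation to its one-off tooltip text.
    """
    if abbr in global_abbrs:
        return f"<ab>{abbr}</ab>"
    if abbr in local_tooltips:
        # quoteattr adds the surrounding quotes and escapes the value.
        return f"<ab n={quoteattr(local_tooltips[abbr])}>{abbr}</ab>"
    return abbr  # unknown abbreviation: leave untouched
```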

AnnaRybakovaT commented 1 year ago

Instructions to get you started are in

Dear Jim, I have started examining few.more.corrections.in.MD.txt. There are a lot of cases regarding Sanskrit words. Just to be sure, please check some of my solutions:

Line 26718: {%= ri%}ta-jāta.
Anna: ṛta-jāta.

Line 22506: {%of the Hot%}ri
Anna: {%of the Hotṛ%}

Line 39465: company; {@-śrāddha@}, {%n. kind of s%}rāddha.
Anna: company; {@-śrāddha@}, {%n. kind of śrāddha.

AnnaRybakovaT commented 1 year ago

I think our next primary task should be adding abbreviation markup to dictionaries, starting with MD.

I would also like to remind you of a task regarding the Greek text (BOP and BUR).

Andhrabharati commented 1 year ago

@AnnaRybakovaT

If you look at my posted file (at the top), you will see that there are two kinds of corrections in the first lot: transliteration (& tag ?) corrections, and a missing space after %}.

funderburkjim commented 1 year ago

remind a task regarding Greek text ...

@AnnaRybakovaT please provide link(s) to issue comments

funderburkjim commented 1 year ago

@AnnaRybakovaT

Your example 'few more' lines are in the right direction. Please generate a preliminary 'change_fewmore.txt' file (see readme), and push. That will help me determine if any details regarding your change method need revision.

AnnaRybakovaT commented 1 year ago

please provide link(s) to issue comments

https://github.com/sanskrit-lexicon/BOP/issues/1 https://github.com/sanskrit-lexicon/BUR/issues/2

funderburkjim commented 1 year ago

@AnnaRybakovaT There are about 1400 Greek text instances in bop.txt and about 700 Greek text instances in bur.txt.

One workflow for you might be:

Finish the current MD correction work (of this issue, which you have begun)
Proofread the Greek text in bur
Proofread the Greek text in bop
Then (maybe!) we can address the abbreviations in MD

How does this plan sound to you?

AnnaRybakovaT commented 1 year ago

How does this plan sound to you?

Dear Jim, The plan sounds great.

gasyoun commented 1 year ago

MW has 'unearthed' quite a few new abbr.s (either left unmarked, or completely unidentified thus far).

@Andhrabharati where?

AnnaRybakovaT commented 1 year ago

generate a preliminary 'change_fewmore.txt' file (see readme),

Dear Jim, Please, check the change_fewmore.txt. All corresponding changes were made in temp_mw_1.txt.

I would like you to double check some cases:

Line 84140: enjoy carnally ({%ac. of pers. or s%}arīra); receive
Line new (scan has "śarīram"): enjoy carnally ({%ac. of pers. or%} śarīram); receive

Line 124385: {#SOri#}¦śaur-i, {%m. pat.%} ({%fr. s%}uāra) {%of%} Vasudeva
Line new (scan has "śūra"): {#SOri#}¦śaur-i, {%m. pat.%} ({%fr.%} śūra) {%of%} Vasudeva

Line 90291: {#mAtf#}¦mā-tṛ́, {%f.%} [{%perhaps%}former {%or children%}:
Line new (scan has "of children"): {#mAtf#}¦mā-tṛ́, {%f.%} [{%perhaps%} former {%of children%}:

And I have added two extra lines:

Line 98606: {%and sam%} ca yoś ca; {%V.%}).
Line new: {%and%} śáṃ ca yoś ca; {%V.%}).

Line 19797: be studied in solitudes).
Line new: {%be studied in solitudes%}).

funderburkjim commented 1 year ago

@AnnaRybakovaT Have now processed corrections, starting with your 'change_few.more.txt'. I had to make a few preliminary adjustments (described in readme.txt). End result is 'change_fewmore.txt' (note slight difference in spelling).

Agreed with most of your changes (including those above that you asked me to review). The places in change_fewmore.txt where I altered your solutions are indicated by searching for ';x'.

These changes were applied to temp_md_0.txt to get temp_md_1.txt, and this has been installed in csl-orig, etc.

Take a look at these and mention if there are any that need further change. Otherwise, this batch of changes can be considered finished.

funderburkjim commented 1 year ago

@AnnaRybakovaT

@Andhrabharati mentioned the likelihood of more corrections with unexpected characters preceding the '{%' italic text markup. Starting with his regex, I found about 200 such cases. The file change_2.txt has these cases.

Request you to examine these cases, making necessary changes to the 'new' lines in change_2.txt.

AnnaRybakovaT commented 1 year ago

Take a look at these and mention if there are any that need further change.

Dear Jim, All altered solutions are correct.

funderburkjim commented 1 year ago

All altered solutions are correct

@AnnaRybakovaT I'm confused. I presume you are referring to items in change_2.txt ?

But if so, there are many in change_2.txt that need to be corrected. For example, here are the first two (where I have, for this comment, made the corrections in the 'new' line.)

; <L>340<pc>003-3<k1>aN<k2>aN
1618 old {#aN#}¦a-ṅ, {%aor.%} {%suffix%} -a ({%in%} a-gam-a-t); kṛt{%suffix%} <lbinfo n="8"/>
;
1618 new {#aN#}¦a-ṅ, {%aor.%} {%suffix%} -a ({%in%} a-gam-a-t); kṛt-{%suffix%} <lbinfo n="8"/>
;---------------------------------------------------
; <L>353<pc>004-1<k1>aNkurita<k2>aNkurita
1682 old sprout{%s;%} combined with ({%in.%}).
;
1682 new sprouts; combined with ({%in.%}).

The intended task for you is to make all such required changes in change_2.txt.
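
For orientation, the old/new record layout shown above can be read and applied with a few lines of Python. This is only a sketch under the assumption that every record is an `N old ...` line followed later by an `N new ...` line, with `;` lines as comments; the function names are hypothetical and the project's own update script may work differently.

```python
import re

RECORD = re.compile(r'^(\d+) (old|new) (.*)$')

def parse_changes(change_lines):
    """Return a list of (line_number, old_text, new_text) tuples."""
    pending = {}
    changes = []
    for raw in change_lines:
        line = raw.rstrip('\r\n')
        if line.startswith(';') or not line.strip():
            continue  # comment, separator or blank line
        m = RECORD.match(line)
        if not m:
            raise ValueError(f"unrecognized change line: {line!r}")
        lnum, kind, text = int(m.group(1)), m.group(2), m.group(3)
        if kind == 'old':
            pending[lnum] = text
        elif lnum in pending:
            changes.append((lnum, pending.pop(lnum), text))
        else:
            raise ValueError(f"'new' without 'old' for line {lnum}")
    return changes

def apply_changes(lines, changes):
    """Return a copy of lines with each change applied (1-based line numbers)."""
    out = list(lines)
    for lnum, old, new in changes:
        if out[lnum - 1] != old:
            raise ValueError(f"line {lnum} does not match the 'old' text")
        out[lnum - 1] = new
    return out
```

Checking the 'old' text before replacing guards against applying a change file to the wrong version of the dictionary text.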

funderburkjim commented 1 year ago

If you have already made these changes in your copy of change_2.txt, you need to add/commit/push change_2.txt to github so I can install the changes.

funderburkjim commented 1 year ago

@AnnaRybakovaT Now I see why I was confused 😕 No doubt you were referring to 'change_fewmore.txt'.

AnnaRybakovaT commented 1 year ago

No doubt you were referring to 'change_fewmore.txt'.

Dear Jim, yes, exactly.

AnnaRybakovaT commented 1 year ago

Request you to examine these cases, making necessary changes to the 'new' lines in change_2.txt.

Dear Jim, Please check the file change_2.txt. Here are some comments:

1) There are some cases with asterisks. I didn't change them: {%in., %} {%m.,%} {%n.%} {%g.%}

2) I have noticed that there is no single standard for the numbers used to number centuries and classes. Compare (¤6¤{%th century%} and {%9th century%}

3) Hyphenation. Please pay attention to these cases; some extra changes are probably necessary:

; <L>11451<pc>159-3<k1>pAtaYjala<k2>pAtaYjala
68015 old {#pAtaYjala#}¦pātañjala, {%a.%} composed by Pata{%ñ %}
;
68015 new {#pAtaYjala#}¦pātañjala, {%a.%} composed by Patañ-
;---------------------------------------------------
; <L>13300<pc>208-1<k1>BUtakaraRa<k2>BUtakaraRa
84591 old gñ{%as to be performed daily by the householder%};
;
84591 new {%jñas to be performed daily by the householder%};
;---------------------------------------------------
; <L>14166<pc>228-1<k1>mitaBAzitf<k2>mitaBAzitf
91627 old {@-ana@}{%a.%} sparing in diet; {@-mati,@} {%a.%} having a
;
91627 new {@ana@} {%a.%} sparing in diet; {@-mati,@} {%a.%} having a
;---------------------------------------------------
; <L>16620<pc>284-3<k1>vinawana<k2>vinawana
111812 old d{%a, Aruṇa. etc.%}: {@-tanayā,@} {%f.%} daughter of
;
111812 new {%ḍa, Aruṇa. etc.%}: {@-tanayā,@} {%f.%} daughter of

funderburkjim commented 1 year ago

@AnnaRybakovaT Am beginning review of change_2. Will present responses to your comments/questions.

1. Asterisk

I'm not sure of the significance of the asterisk in md dictionary. Suggest you carefully read the front matter where there may be some mention of the asterisk. md front matter.

You no doubt noticed that {%X%} in md.txt represents italic text. Also, the abbreviations in md.txt are presented in italic text (see front matter for the author's list of abbreviations).

We will later add 'ab' markup to abbreviations (so that users of the dictionary displays may have tooltips). Thus, we will recode {%in.%} to {%<ab>in.</ab>%} since 'in.' is the abbreviation for 'instrumental.'

To add this markup, it will be advantageous for trailing punctuation to be outside the ending italic markup. For example, {%in., %} should be changed to {%in.%}, .
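
A hedged Python sketch of that punctuation move. It assumes only a trailing comma, semicolon or colon (plus any following spaces) should leave the italic span, and that a final period stays inside because it usually belongs to the abbreviation; the function name is hypothetical.

```python
import re

# Italic span whose content ends in ',', ';' or ':' and optional whitespace.
TRAILING = re.compile(r'\{%([^%]*?)([,;:]\s*)%\}')

def move_trailing_punctuation(text):
    """Move one trailing ',', ';' or ':' (and spaces) outside the closing '%}'."""
    return TRAILING.sub(r'{%\1%}\2', text)

if __name__ == "__main__":
    for sample in ["{%in., %}", "{%m.,%}", "{%n.%}"]:
        print(repr(sample), "->", repr(move_trailing_punctuation(sample)))
```

Punctuation in the middle of an italic span is left alone, so each result should still be reviewed before it goes into a change file.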

I made such changes in change_2.txt at lines 28423, 28721.

I don't see a problem with the '*' preceding {%.

funderburkjim commented 1 year ago

markup using ¤

@maltenth mentions (at top of md.txt) the use of ¤X¤ markup.
There are 3390 instances of ¤ character.

I am not sure of the meaning of this markup.

It often occurs within italic markup {%x¤y¤z%}, so one speculation is that it represents non-italic text within italic text. This is just a guess.

The make_xml.py process removes this character, so the ¤ character has no impact on current displays of md.
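
To inspect what the ¤ markup encloses before deciding on its meaning, something like the following Python sketch could help. It is not the code from make_xml.py; `strip_marks` only mimics the removal described above, and both function names are assumptions.

```python
import re

MARK = '\u00a4'  # the ¤ character
MARKED_SPAN = re.compile(MARK + '([^' + MARK + ']*)' + MARK)

def marked_spans(line):
    """Return the text found between each pair of ¤ characters."""
    return MARKED_SPAN.findall(line)

def strip_marks(line):
    """Drop every ¤ character, keeping the enclosed text."""
    return line.replace(MARK, '')
```

Collecting the output of `marked_spans` over the whole file (for example in a `collections.Counter`) would show which kinds of text carry the markup.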

@Andhrabharati Any idea on this?

funderburkjim commented 1 year ago

3. hyphenation

These are end-of-line hyphenations. I changed the previous (or next) line also. You can examine change_2.txt for these items to see the idea.

funderburkjim commented 1 year ago

change_2 corrections installed.

You can see my few changes to your change_2.txt in the 2a03c56 commit link above.

@AnnaRybakovaT there are always more things that we could do; but let's stick to our plan above and call this issue DONE!

Next step for you will be the proofread of BUR dictionary Greek. I'll start a new issue with details of how to get started.

Andhrabharati commented 1 year ago

I am not sure of the meaning of this markup. ……… @Andhrabharati Any idea on this?

@funderburkjim

There appear to be multiple indications for this marking; it is something like the exercise I did in ls-orphan identification in MW recently (leading to multiple new tags)!!

Before we discuss more on this, there are 25 lines in the present CDSL md.txt having numbers without this markup; and all of them need some correction. md lines having numbers without the ¤ symbol marking.txt
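
A rough Python sketch of how such lines might be listed. It assumes that digits inside ¤...¤ spans and inside tags such as lbinfo do not count, and that meta-lines beginning with `<L>` are skipped; the function name `unmarked_numbers` is hypothetical, and the attached file may have been produced differently.

```python
import re

MARKED = re.compile('\u00a4[^\u00a4]*\u00a4')  # ¤...¤ spans
TAG = re.compile(r'<[^>]*>')                    # any <...> tag
DIGITS = re.compile(r'\d+')

def unmarked_numbers(line):
    """Return digit runs that are neither inside ¤...¤ nor inside a tag."""
    if line.startswith('<L>'):
        return []  # meta-line: its numbers are record and page identifiers
    bare = TAG.sub('', MARKED.sub('', line))
    return DIGITS.findall(bare)
```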

BTW, you seem to have removed the marking at the HW level while making the meta-lines; this markup (incl. the numbers) is missing at all those homonym number places!!

They all need to be present in the header lines as well, even if they are also in the meta-lines, just as in mw.txt.

funderburkjim commented 1 year ago

Yes, I had noticed that missing homonym problem. Thanks for reminding me of it. I'll correct that problem.

Andhrabharati commented 1 year ago

You also need to correct the 25 lines as given in my file above, @funderburkjim !

funderburkjim commented 1 year ago

What did we miss?

funderburkjim commented 1 year ago

Are you talking about the "numbers without symbol..." file?

Andhrabharati commented 1 year ago

yes.

funderburkjim commented 1 year ago

Corrections made based on md.lines.... file.

Andhrabharati commented 1 year ago

@funderburkjim Elsewhere in the file the fractions are enclosed within round brackets, so I also followed the same style.

I see that your present corrections did not consider the same. Any specific reason??

AnnaRybakovaT commented 1 year ago

I'm not sure of the significance of the asterisk in md dictionary.

Dear Jim, The asterisk is mentioned in the preface - 5

AnnaRybakovaT commented 1 year ago

These are end-of-line hyphenations. I changed the previous (or next) line also. You can examine change_2.txt for these items to see the idea.

Dear Jim, Could you kindly explain what the numbers 6 and 3 mean, for example, in:

<lbinfo n="6"/>
<lbinfo n="3"/>

{#purUcI#}¦purūc-ī, {%a. f.%} ({%of%} *puru‡a{%nk,%} extending <lbinfo n="6"/>
{%attendant on Śiva%}; {@-mathana,@} {%a.%} ({@ī@}) tormenting, <lbinfo n="3"/>

funderburkjim commented 1 year ago

@AnnaRybakovaT Thanks for that info on asterisk. When we add abbreviation markup to md.txt, we may also want to add markup to those asterisks.

funderburkjim commented 1 year ago

lbinfo in md.txt

The lbinfo tag is used in relation to hyphenation at end of lines.

The value of the 'n' attribute (in current md.txt markup) is the number of characters in the hyphenated word preceding the hyphenation '-'. Example: Suppose the printed text is

Here is an example of hyph-
enation before lbinfo markup.

And here are those two lines with lbinfo markup.

Here is an example of hyphenation <lbinfo n="4"/>
before lbinfo markup.
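
The same transformation as a Python sketch. It assumes that a trailing '-' always marks an end-of-line hyphenation (which is not true of every hyphen in md.txt) and that 'n' counts characters as Python code points; the function name is hypothetical.

```python
def join_hyphenated(first, second):
    """Rejoin a word split across two printed lines and record the break.

    Returns (new_first, new_second). The whole word moves to the first line,
    followed by an lbinfo tag whose n is the length of the part before '-'.
    """
    if not first.endswith('-'):
        return first, second
    head, _, fragment = first[:-1].rpartition(' ')
    tail, _, remainder = second.partition(' ')
    prefix = head + ' ' if head else ''
    new_first = f'{prefix}{fragment}{tail} <lbinfo n="{len(fragment)}"/>'
    return new_first, remainder

if __name__ == "__main__":
    a, b = join_hyphenated("Here is an example of hyph-",
                           "enation before lbinfo markup.")
    print(a)
    print(b)
```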

Also appearing in md.txt is markup like <lbinfo n="[]"/>. First example:

<L>4<pc>001-1<k1>aMSakalpanA<k2>aMSakalpanA
{#aMSakalpanA#}¦aṃśa-kalpanā, {%f.%} arrangement  <lbinfo n="[]"/>
of shares.
<LEND>

A similar markup convention is also seen in burnouf.

funderburkjim commented 1 year ago

In ap90, lbinfo is used in a different way.

<L>1<pc>0001-a<k1>a<k2>a
{#a#}¦ The first letter of the Nāgarī
Alphabet. {#--aH#} [{#avati, atati#} <lbinfo n="sAta#tvena"/>
{#sAtatvena tizWatIti vA; av-at vA, qa#} <ls>Tv.</ls>] {@1@} <ab>N.</ab>
of Viṣṇu, the first of the three
sounds constituting the sacred
syllable {#om#}; {#akAro vizRuruddizwa#} <lbinfo n="ukAra#stu"/>
{#ukArastu maheSvaraH . makArastu smfto brahmA praRavastu#}
{#trayAtmakaH ..#}; for more explanation of
the three syllables {#a, u, m#} see {#om#}. {@--2@}
<ab>N.</ab> of Śiva, Brahmā, Vāyu, or <lbinfo n="Vaiśvā+nara"/>
Vaiśvānara. {%--<ab>ind.</ab>%} {@1@} A prefix corresponding
to Latin {%in%}, <ab>Eng.</ab> {%in%} or {%un%}, <ab>Gr.</ab> {%a%} or
{%an%}, and joined to nouns, adjectives,
indeclinables (or even to verbs) as
a substitute for the negative <lbinfo n="parti+cle"/>
particle {#naY#}, and changed to {#an#} before
vowels except in the word {#a-fRin#}.
The senses of {#na#} usually <lbinfo n="enumerat+ed"/>
enumerated are six- ({%a%}) {#sAdfSya#} ‘likeness’ or

Markup to represent linebreaks is used only in a few of the Cologne digitizations. For example, it does not occur in the mw.txt digitization. Currently lbinfo markup is ignored in displays.

lbinfo markup is only possible for those dictionaries for which the original digitization by @maltenth preserved the original lines of the printed text.
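
A Python sketch for reading the 'n' values in both styles and for dropping the tag, as a display might. Reading '+' and '#' in the ap90 values as the break point inside the word is inferred from the examples above, not taken from documentation; the function names are hypothetical.

```python
import re

LBINFO = re.compile(r'\s*<lbinfo n="([^"]*)"/>')

def lbinfo_values(line):
    """Return the n attribute of every lbinfo tag on the line."""
    return LBINFO.findall(line)

def describe(value):
    """Classify an n value: md-style count, ap90-style split, or other."""
    if value.isdigit():
        return ('count', int(value))
    for sep in '+#':
        if sep in value:
            before, after = value.split(sep, 1)
            return ('split', before, after)
    return ('other', value)

def strip_lbinfo(line):
    """Remove lbinfo tags together with the whitespace in front of them."""
    return LBINFO.sub('', line)
```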

funderburkjim commented 1 year ago

The adherence to printed lines (with attendant line breaks) in a digitization is not without controversy. Such adherence is useful for purposes of correcting digitization mistakes, since the comparison with printed text is direct.

But such adherence is an obstacle to the objective of optimizing the user experience of a dictionary. Starting with this comment, @vvasuki points out advantages possible by removing this adherence to printing conventions.
