sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

WIL corrections to text Devanagari #321

Closed funderburkjim closed 7 years ago

funderburkjim commented 7 years ago

By a process similar to that of #318 for the AE dictionary, a list of potential errors in the spelling of Devanagari text in the Wilson Dictionary has been developed.

This issue is devoted to those.

The potential corrections are examined by an interactive program. Here is the url for the first sample batch of 5 cases [batch 00] ---- 🚧

gasyoun commented 7 years ago

Here is the url

Hidden :zipper_mouth_face:

funderburkjim commented 7 years ago

Batch 00 🔍

Note use of the deprecated Wilson scan link.

funderburkjim commented 7 years ago

Batch 01 and Batch 02 prepared.

Batches 00-02 comprise all the 2-gram correction candidates.

When they are done, will begin to examine the 3-gram candidates (approx. 350).

funderburkjim commented 7 years ago

Batches 00-02 are now examined and the corrections installed. This completes the corrections generated by 2-grams of Wilson Devanagari text.

Will prepare 3-grams.

Want to get feedback on changes in UI at MW72 (this comment) before generating UI for 3-gram batches.

funderburkjim commented 7 years ago

Although the link to 'semi-digitized Wilson' works, I have found the (jpg) images to be too small for clear reading. I've been using a local copy of the wil_bookmark.pdf (from download page). When made to 600%, it is much clearer than the Cologne scan jpgs, in my experience. And since Wilson entries are not very long, it is not too hard to find the right entry.

gasyoun commented 7 years ago

Want to get feedback on changes in UI at MW72 (this comment) before generating UI for 3-gram batches.

Sure. There is Sergey eager to help, but he said that until he sees the non-SLP writing of the word, he cannot. Now he can.

images to be too small for clear reading

Yes, unreadable.

When made to 600%, it is much clearer than the Cologne scan jpgs

Oh, that's a trick. Did not test, but proposed there must be a solution.

gasyoun commented 7 years ago

What about the green blurb. Can we have it everywhere now?

funderburkjim commented 7 years ago

Re: Can we have the Green blurb everywhere.

This is hard; it requires getting specific metrics which are peculiar to each dictionary.

I'll investigate whether this can be done for WIL - maybe the metrics of the old deprecated version can be adapted to use the PDF form.

funderburkjim commented 7 years ago

Sergey eager to help ...

What needs to be done to make it easier for Sergey to help?

gasyoun commented 7 years ago

I'll investigate whether this can be done for WIL - maybe the metrics of the old deprecated version can be adapted to use the PDF form.

Please do so. Let's start with WIL. Let's do the biggest or dirtiest first.

What needs to be done to make it easier for Sergey to help?

One is done already: transliteration in the UI as in the book. The blurb is next. I've been working with Sergey for 10 years now, so he could help, if he likes the UI. It's up to the blurb actually now.

funderburkjim commented 7 years ago

What is the 'blurb', and how does it need to be improved?

gasyoun commented 7 years ago

What is the 'blurb'?

van

funderburkjim commented 7 years ago

So, Sergey should be happy now with MW72?

gasyoun commented 7 years ago

with MW72

Yeah, so let the blurbs come to other dictionaries as well. He has checked many words in the Kochergina dictionary in the past (the one you compared) and after fuzzy 220 errors were found. So as I understand WIL is done and MW72 is not yet, right?

funderburkjim commented 7 years ago

Only the 2-gram corrections for WIL are done.

The 3-grams (603 candidates in all) remain to be done.

I've made a few changes to the UI, bringing it more into line with MW72 UI:

image

funderburkjim commented 7 years ago

I think the WIL and MW72 UI is probably ok now.

Agree?

If others agree that UI is OK, I'll go ahead and generate batches for the 600 WIL 3-gram cases and for the MW72 3-gram cases.

My thought is that the batch-size should be about 25 or 30 cases; the reason is that this is small enough to do a batch in an hour or so.

With such a batch size, there would be 20-25 batches for WIL 3grams, slightly fewer for MW72 3grams.

Any suggestion regarding batch-size?
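For reference, the batch arithmetic above can be checked with a short sketch. This is only an illustration, not the script actually used to generate the batches: `make_batches` is a hypothetical helper, and the case list is a placeholder built from the 603 WIL 3-gram candidates counted earlier in this thread.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(cases: Sequence[T], batch_size: int) -> List[List[T]]:
    """Split cases into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [list(cases[i:i + batch_size])
            for i in range(0, len(cases), batch_size)]


# Placeholder case ids (assumption): 603 WIL 3-gram candidates.
cases = list(range(603))

for size in (25, 30):
    batches = make_batches(cases, size)
    # Print batch size, number of batches, and size of the last batch.
    print(size, len(batches), len(batches[-1]))
```

With 603 cases this gives 25 batches at size 25 and 21 batches at size 30, each ending in a short batch of 3 cases.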

funderburkjim commented 7 years ago

Here is a test batch of 5 with the latest WIL UI:

gasyoun commented 7 years ago

since corrections need to use SLP1

Sergey is for HK. Can we keep SLP1 in the database, but make it look like HK at least? SLP1 is where we lose him.

few changes to the UI

Few but major, why thank you Jim!

25 or 30 cases

30 is ok.

With such a batch size, there would be 20-25 batches

If there is an index, that's no issue. Right now, for example, I do not know what URLs to check.

SergeA commented 7 years ago

Here is a test batch of 5 with the latest WIL UI

Hi! Solved 3 of 5. But I'm in trouble with the other 2. Both are about अधीङ. Wilson in his dictionary for some reason prefers to write the root इङ (without virama) instead of इङ्. Also he indicates this anubandha as ङ, not ङ्. I am not familiar with the system of anubandhas of dhatupathas, and can not say for sure if he is wrong all the way with this omission of viramas. And I do not clearly understand the task: what one is expected to do with such cases.

And the SLP is a real pain! The other link for MW72 with IAST was much more pleasing for the eyes. :)

And for that link I also have questions. There were some aorist forms there. OCR is ok. But the verb forms are a complicated matter, and I can not judge on the fly if some aorist forms in MW are correct or not. Marcis says I should recheck those forms against some reference books, but he does not know what those books are. And I don't have such books on my shelf, and don't know what to do.

funderburkjim commented 7 years ago

@SergeA Hello.

  1. Regarding SLP in WIL. Since the digitization of Wilson has SLP, it would introduce a complication to make the Correction UI in HK. Not impossible, but more complicated.

    • Since the MW72 IAST form is not problematic for you, why don't you stick to that one?
    • I'll generate the 3-gram batches soon for MW72.
  2. Regarding the MW72 aorist (or other) grammatical forms. I suggest that you do not try to resolve the accuracy of Monier's grammar in these hard cases. Rather, just assume that the text spelling is correct and focus on where there are typos (discrepancies between the digitization and the text).

  3. Regarding Wilson's lack of virama in verb forms. I don't think we should consider this a print error that we want to change, at least not for now. So, in such cases, my rule of thumb is to get the digitization to correspond to the text. From the several cases in Wilson that I have done so far, my impression is that there are many serious errors (wrong vowel or consonant) in the digitization that we want to correct (i.e. many typos in the digitization).

Do you agree with focusing on MW72 3-gram cases? If so, we can divide the labor, and I'll stick to the Wilson cases, since I am comfortable with SLP1.
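On the idea of keeping SLP1 in the data but showing HK in the UI: since SLP1 uses one character per sound, a display-only conversion can be a character-by-character lookup. The following is a minimal sketch under that assumption, not code from the Correction UI. `slp1_to_hk` is a hypothetical name, only the letters that differ between the two schemes are listed, and accent marks or other rarer symbols are passed through unchanged.

```python
# SLP1 -> Harvard-Kyoto for the letters that differ; all other
# characters (a, A, i, k, g, t, d, n, M, H, ...) are the same in both.
SLP1_TO_HK = {
    "f": "R", "F": "RR", "x": "lR", "X": "lRR",
    "E": "ai", "O": "au",
    "K": "kh", "G": "gh", "N": "G",
    "C": "ch", "J": "jh", "Y": "J",
    "w": "T", "W": "Th", "q": "D", "Q": "Dh", "R": "N",
    "T": "th", "D": "dh",
    "P": "ph", "B": "bh",
    "S": "z", "z": "S",
}


def slp1_to_hk(text: str) -> str:
    """Convert an SLP1 string to HK for display only (hypothetical helper)."""
    # Each source character is looked up once, so mappings such as
    # R -> N and N -> G cannot chain into each other.
    return "".join(SLP1_TO_HK.get(ch, ch) for ch in text)


print(slp1_to_hk("kfzRa"))  # kRSNa
print(slp1_to_hk("aDIN"))   # adhIG
```

The stored correction would still be entered in SLP1; only what the reviewer sees changes, which is where the extra complication in the UI would come from.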

gasyoun commented 7 years ago

Wilson has SLP, it would introduce a complication to make the Correction UI in HK

Anything non-SLP1 will do for @SergeA

3-gram batches soon for MW72

Eagerly waiting, Jim.

you do not try to resolve the accuracy of Monier's grammar in these hard cases

Agree, it's another task.

get the digitization to correspond to the text

Yeah, and even the most simple things are still not identical.

SergeA commented 7 years ago

my rule of thumb is to get the digitization to correspond to the text

Then our main task is to correct digitization typos. And where I don't see any typo, nor obvious print errors, I can mark the case "no change needed". Right?

SLP1 is quite unreadable for me. HK, or IAST, or Devanagari would be fine.

Do you agree with focusing on MW72 3-gram cases?

Don't know what "3-gram" is, but why not. Let's try.

funderburkjim commented 7 years ago

Then our main task ... Right?

Exactly.

3-gram

That just refers to the technique which generated the cases for MW72 (see #322). Briefly, here is how the cases were generated.
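The description that followed is not preserved in this copy of the thread. As a rough illustration only, and assuming the approach is the one suggested by the lists of 'good' 2-grams and 3-grams mentioned later in this thread, candidates could be flagged as words containing a character n-gram that never occurs in a trusted reference list. The function names and the toy SLP1 words below are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(word: str, n: int) -> List[str]:
    """Return all consecutive character n-grams of word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def good_ngrams(reference_words: Iterable[str], n: int) -> Set[str]:
    """Collect every n-gram seen in a trusted word list (assumed given)."""
    good: Set[str] = set()
    for word in reference_words:
        good.update(ngrams(word, n))
    return good


def flag_candidates(words: Iterable[str], good: Set[str],
                    n: int) -> List[Tuple[str, List[str]]]:
    """Return (word, unknown n-grams) for words with any unseen n-gram."""
    flagged = []
    for word in words:
        unknown = [g for g in ngrams(word, n) if g not in good]
        if unknown:
            flagged.append((word, unknown))
    return flagged


# Toy data (assumption): SLP1 spellings, with "dxva" standing in
# for a digitization typo of "deva".
reference = ["rAma", "deva", "nara"]
good = good_ngrams(reference, 2)
print(flag_candidates(["rAma", "dxva"], good, 2))
```

A flagged word is only a candidate: it still has to be compared with the printed text, which is what the batches are for.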

SLP1 is quite unreadable for me. HK, or IAST, or Devanagari would be fine.

I'll keep this in mind for later studies.

gasyoun commented 7 years ago

There are a few hundred of these. These are considered to be 'good' 3-grams

Please show me the list.

I'll keep this in mind for later studies.

Yeah, that's why @SergeA is two years behind. He missed the UI. Now he has it and HK, or IAST, or Devanagari is the last thing left. Now that @Shalu411 is gone, Sergey could bring new blood to the old dictionaries.

SergeA commented 7 years ago

Occasionally, you may encounter a non-Sanskrit word. This is a markup error, which you should classify as 'typo', with a comment such as 'non-Sanskrit word'.

I'm trying to do it. But when I comment and change the classification to 'typo', it says "Old and New can't be the same for a 'typo'" and doesn't save the change.

funderburkjim commented 7 years ago

Here are the lists of 'good' 2-grams and 'good' 3-grams.

These were done when generating correction candidates for Sanskrit words within the Apte English-Sanskrit dictionary; these are the ones that Sampada is slowly working through.

My statement of 'several hundred' 3-grams was wildly inaccurate: there are 15,000+ cases in 3gram.txt!

funderburkjim commented 7 years ago

Regarding how to mark up a non-Sanskrit word.

I forgot about the "Old and New can't be the same for a 'typo'" message.

So, do this instead.

I'll see the comment during installation, and make the non-standard correction at that time. So, making the Comment is the key thing to do.

gasyoun commented 7 years ago

Here are the lists of 'good' 2-grams and 'good' 3-grams.

Thanks, I guess they can be used in spell checking as well, especially the frequency data.

these are the ones that Sampada is slowly working through

Hope she finishes the hard work. What is her full name? I want to thank her in the preface of the book I'm working on.

'several hundred' 3-grams was wildly inaccurate: there are 15,000+ cases in 3gram.txt

Oh, 15k :+1:

Comment is the key thing to do.

Guess @SergeA got it.

funderburkjim commented 7 years ago

Sampada's name is Sampada Savardekar.

Her married name is Sampada SAVARDEKAR THOMAS

funderburkjim commented 7 years ago

I'm opening a new issue to deal with the 3-gram correction candidates for WIL.

This issue seems closeable.

gasyoun commented 7 years ago

Sampada SAVARDEKAR THOMAS

Thanks, so Thomas - surname?

sanskritisampada commented 7 years ago

Hello

My name is Dr. Sampada SAVARDEKAR ... THOMAS is my husband's surname which is added in a few docs only. So Dr. Sampada SAVARDEKAR is fine.

Good to be working together on this meaningful project of corrections. :-)

Regards Sampada

On 30 Nov 2016, at 12:31, Marcis Gasuns notifications@github.com wrote:

Sampada SAVARDEKAR THOMAS

Thanks, so Thomas - surname?

gasyoun commented 7 years ago

So Dr. Sampada SAVARDEKAR

Oh, great. What is your hometown? Where do you live?