Closed funderburkjim closed 7 years ago
Here is the url
Hidden :zipper_mouth_face:
Batch 00 🔍
Note use of the deprecated Wilson scan link.
Batches 00-02 are now examined and the corrections installed. This completes the corrections generated by 2-grams of Wilson Devanagari text.
Will prepare 3-grams.
Want to get feedback on changes in UI at MW72 (this comment) before generating UI for 3-gram batches.
Although the link to 'semi-digitized Wilson' works, I have found the (jpg) images to be too small for clear reading. I've been using a local copy of the wil_bookmark.pdf (from the download page). When made to 600%, it is much clearer than the Cologne scan jpgs, in my experience. And since Wilson entries are not very long, it is not too hard to find the right entry.
Want to get feedback on changes in UI at MW72 (this comment) before generating UI for 3-gram batches.
Sure. Sergey is eager to help, but he said he cannot until he sees the word written in something other than SLP. Now he can.
images to be too small for clear reading
Yes, non readable.
When made to 600%, it is much clearer than the Cologne scan jpgs
Oh, that's a trick. I did not test it, but I supposed there must be a solution.
What about the green blurb. Can we have it everywhere now?
Re: Can we have the Green blurb everywhere.
This is hard; it requires getting specific metrics which are peculiar to each dictionary.
I'll investigate whether this can be done for WIL - maybe the metrics of the old deprecated version can be adapted to use the PDF form.
Sergey eager to help ...
What needs to be done to make it easier for Sergey to help?
I'll investigate whether this can be done for WIL - maybe the metrics of the old deprecated version can be adapted to use the PDF form.
Please do so. Let's start with WIL. Let's do the biggest or dirtiest first.
What needs to be done to make it easier for Sergey to help?
One is done already: transliteration in the UI as in the book. The blurb is next. I've been working with Sergey for 10 years now, so he could help, if he likes the UI. It's up to the blurb now.
What is the 'blurb', and how does it need to be improved?
What is the 'blurb'?
So, Sergey should be happy now with MW72?
with MW72
Yeah, so let the blurbs come to other dictionaries as well. He has checked many words in the Kochergina dictionary in the past (the one you compared) and after fuzzy 220 errors were found. So as I understand WIL is done and MW72 is not yet, right?
Only the 2-gram corrections for WIL are done.
The 3-grams (603 candidates in all) remain to be done.
I've made a few changes to the UI, bringing it more into line with MW72 UI:
line # xxxx
to get the blurb.
I think the WIL and MW72 UI is probably ok now.
Agree?
If others agree that UI is OK, I'll go ahead and generate batches for the 600 WIL 3-gram cases and for the MW72 3-gram cases.
My thought is that the batch size should be about 25 or 30 cases; that is small enough to do a batch in an hour or so.
With such a batch size, there would be 20-25 batches for WIL 3grams, slightly fewer for MW72 3grams.
Any suggestion regarding batch-size?
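The batch count above can be checked with a short sketch. This is only an illustration of the arithmetic, not the script actually used to generate the batches; the function name `split_into_batches` and the placeholder case list are assumptions made for the example.

```python
def split_into_batches(cases, batch_size):
    """Split a list of cases into consecutive batches of at most batch_size."""
    return [cases[i:i + batch_size] for i in range(0, len(cases), batch_size)]


# 603 is the WIL 3-gram candidate count mentioned in this thread.
# The case values themselves are placeholders, not real candidates.
cases = list(range(603))

for size in (25, 30):
    batches = split_into_batches(cases, size)
    # Report batch size, number of batches, and size of the final (short) batch.
    print(size, len(batches), len(batches[-1]))
```

With 603 cases this gives 25 batches at size 25 and 21 batches at size 30, each ending in a short batch of 3 cases, which is consistent with the 20-25 estimate.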
Here is a test batch of 5 with the latest WIL UI:
since corrections need to use SLP1
Sergey is for HK. Can we keep SLP1 in the database, but make it look like HK at least? SLP1 is where he is gone.
few changes to the UI
Few but major, why thank you Jim!
25 or 30 cases
30 is ok.
With such a batch size, there would be 20-25 batches
If there will be an index, that's no issue. Right now, for example, I do not know what URLs to check.
Here is a test batch of 5 with the latest WIL UI
Hi! Solved 3 of 5. But I'm in trouble with the other 2. Both are about अधीङ. Wilson in his dictionary for some reason prefers to write the root इङ (without virama) instead of इङ्. He also gives this anubandha as ङ, not ङ्. I am not familiar with the system of anubandhas of the dhatupathas, and cannot say for sure whether he is wrong throughout in omitting the viramas. And I do not clearly understand the task: what am I expected to do with such cases?
And SLP is a real pain! The other link for MW72 with IAST was much more pleasing for the eyes. :)
And for that link I also have questions. There were some aorist forms there. The OCR is ok. But verb forms are a complicated matter, and I cannot judge on the fly whether some aorist forms in MW are correct or not. Marcis says I should recheck those forms against some reference books, but he does not know which books those are. And I don't have such books on my shelf, and don't know what to do.
@SergeA Hello.
Regarding SLP in WIL. Since the digitization of Wilson has SLP, it would introduce a complication to make the Correction UI in HK. Not impossible, but more complicated.
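For reference, the display side of such a conversion could look roughly like the sketch below. It is not taken from the Correction UI code; the table `SLP1_TO_HK` and the function `slp1_to_hk` are names invented for the example. It assumes plain SLP1 input without accent marks or other special symbols, and relies on SLP1 using exactly one character per sound, so a character-by-character lookup is enough.

```python
# Only the characters that differ between SLP1 and Harvard-Kyoto (HK)
# are listed; everything else (a, A, i, k, g, M, H, ...) is identical.
SLP1_TO_HK = {
    "f": "R", "F": "RR", "x": "lR", "X": "lRR",   # vocalic r and l
    "E": "ai", "O": "au",                         # diphthongs
    "K": "kh", "G": "gh", "N": "G",               # velars
    "C": "ch", "J": "jh", "Y": "J",               # palatals
    "w": "T", "W": "Th", "q": "D", "Q": "Dh", "R": "N",  # retroflexes
    "T": "th", "D": "dh",                         # dental aspirates
    "P": "ph", "B": "bh",                         # labial aspirates
    "S": "z", "z": "S",                           # sibilants
}


def slp1_to_hk(text):
    """Convert an SLP1 string to Harvard-Kyoto, one character at a time."""
    return "".join(SLP1_TO_HK.get(ch, ch) for ch in text)


print(slp1_to_hk("kfzRa"))  # kRSNa
print(slp1_to_hk("aDIN"))   # adhIG
```

The harder part, as noted above, is the reverse direction: corrections typed in HK would have to be converted back to SLP1 before installation, and HK uses multi-character sequences such as `kh` and `ai` that need longest-match parsing.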
Regarding the MW72 aorist (or other) grammatical forms. I suggest that you do not try to resolve the accuracy of Monier's grammar in these hard cases. Rather, just assume that the text spelling is correct and focus on where there are typos (discrepancies between the digitization and the text).
Regarding Wilson's lack of virama in verb forms. I don't think we should consider this a print error that we want to change, at least not for now. So, in such cases, my rule of thumb is to get the digitization to correspond to the text. From the several cases in Wilson that I have done so far, my impression is that there are many serious errors (wrong vowel or consonant) in the digitization that we want to correct (i.e. many typos in the digitization).
Do you agree with focusing on MW72 3-gram cases? If so, we can divide the labor, and I'll stick to the Wilson cases, since I am comfortable with SLP1.
Wilson has SLP, it would introduce a complication to make the Correction UI in HK
Anything non-SLP1 will do for @SergeA.
3-gram batches soon for MW72
Eagerly waiting, Jim.
you do not try to resolve the accuracy of Monier's grammar in these hard cases
Agree, it's another task.
get the digitization to correspond to the text
Yeah, and even the most simple things are still not identical.
my rule of thumb is to get the digitization to correspond to the text
Then our main task is to correct digitization typos. And where I see neither a typo nor an obvious print error, I can mark the case "no change needed". Right?
SLP1 is quite unreadable for me. HK, or IAST, or Devanagari would be fine.
Do you agree with focusing on MW72 3-gram cases?
I don't know what a "3-gram" is, but why not. Let's try.
Then our main task ... Right?
Exactly.
3-gram
That just refers to the technique which generated the cases for MW72 (see #322). Briefly, here is how the cases were generated.
SLP1 is quite unreadable for me. HK, or IAST, or Devanagari would be fine.
I'll keep this in mind for later studies.
There are a few hundred of these. These are considered to be 'good' 3-grams
Please show me the list.
I'll keep this in mind for later studies.
Yeah, that's why @SergeA is two years behind. He missed the UI. Now he has it, and HK, or IAST, or Devanagari is the last thing left. Now that @Shalu411 is gone, Sergey could bring new blood to the old dictionaries.
Occasionally, you may encounter a non-Sanskrit word. This is a markup error, which you should classify as 'typo', with a comment such as 'non-Sanskrit word'.
I'm trying to do it. But when I add a comment and change the status to 'typo', it says "Old and New can't be the same for a 'typo'" and does not save the change.
Here are the lists of 'good' 2-grams and 'good' 3-grams.
These were made when generating correction candidates for Sanskrit words within the Apte English-Sanskrit dictionary; these are the ones that Sampada is slowly working through.
My statement of 'several hundred' 3-grams was wildly inaccurate: there are 15,000+ cases in 3gram.txt !
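As a rough illustration of how a list of 'good' n-grams can be used to flag correction candidates, here is a minimal sketch. It assumes that a word becomes a candidate when it contains at least one 3-gram that is absent from the 'good' list; the actual procedure (see #322) may differ in its details. The helper names and the tiny word list are invented for the example, and the words are written in SLP1.

```python
def ngrams(word, n=3):
    """Return all contiguous character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_good_ngrams(trusted_words, n=3):
    """Collect every n-gram that occurs in a list of trusted spellings."""
    good = set()
    for word in trusted_words:
        good.update(ngrams(word, n))
    return good


def suspicious_ngrams(word, good, n=3):
    """Return the n-grams of a word that are not in the 'good' set."""
    return [g for g in ngrams(word, n) if g not in good]


# Hypothetical trusted spellings (SLP1); a real list would be far larger.
trusted = ["rAma", "rAmaH", "karma", "Darma"]
good = build_good_ngrams(trusted)

for word in ["Darma", "Darxa"]:
    # An empty list means every 3-gram of the word is known.
    print(word, suspicious_ngrams(word, good))
```

Here `Darma` yields no unknown 3-grams, while `Darxa` yields `arx` and `rxa` and would be offered as a correction candidate. A flagged word is only a candidate: it still has to be checked against the printed text.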
Regarding how to markup a non-Sanskrit word.
I forgot about the "Old and New can't be the same for a 'typo'" message.
So, do this instead.
I'll see the comment during installation, and make the non-standard correction at that time. So, making the Comment is the key thing to do.
Here are the lists of 'good' 2-grams and 'good' 3-grams.
Thanks, I guess they can be used in spell checking as well, especially the frequency data.
these are the ones that Sampada is slowly working through
Hope she finishes the hard work. What is her full name? I want to thank her in the preface of the book I'm working on.
'several hundred' 3-grams was wildly inaccurate: there are 15,000+ cases in 3gram.txt
Oh, 15k :+1:
Comment is the key thing to do.
Guess @SergeA got it.
Sampada's name is Sampada Savardekar.
Her married name is Sampada SAVARDEKAR THOMAS
I'm opening a new issue to deal with the 3-gram correction candidates for WIL.
This issue seems closeable.
Sampada SAVARDEKAR THOMAS
Thanks, so Thomas - surname?
Hello
My name is Dr. Sampada SAVARDEKAR ... THOMAS is my husband's surname which is added in a few docs only. So Dr. Sampada SAVARDEKAR is fine.
Good to be working together on this meaningful project of corrections. :-)
Regards Sampada
On 30 Nov 2016, at 12:31, Marcis Gasuns notifications@github.com wrote:
Sampada SAVARDEKAR THOMAS
Thanks, so Thomas - surname?
So Dr. Sampada SAVARDEKAR
Oh, great. What is your hometown? Where do you live?
By a process similar to that of #318 for the AE dictionary, potential errors in the spelling of Devanagari text in the Wilson Dictionary have been identified.
This issue is devoted to those.
The potential corrections are examined by an interactive program. Here is the url for the first sample batch of 5 cases [batch 00]