sanskrit-lexicon / MWS

Monier Monier-Williams, Sir; A Sanskrit-English dictionary. Oxford, 1899

bot listing #74

Open funderburkjim opened 4 years ago

funderburkjim commented 4 years ago

mw_bot.txt lists all the <bot>X</bot> tags in mw.

The listing is in alphabetical order of X, without regard to case.
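A minimal sketch of how such a listing could be produced, assuming the digitization is read line by line as UTF-8 text and that `<bot>` tags neither nest nor span lines. The function names are illustrative and are not the script actually used to make mw_bot.txt.

```python
import re
from collections import Counter

# Non-greedy match so several tags on one line are counted separately.
BOT_TAG = re.compile(r"<bot>(.*?)</bot>")

def count_bot_names(lines):
    """Count each distinct X found in <bot>X</bot> across the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(BOT_TAG.findall(line))
    return counts

def format_listing(counts):
    """Return 'name<TAB>count' rows sorted by name without regard to case."""
    # The raw name is a tie-breaker so the order is deterministic.
    rows = sorted(counts.items(), key=lambda kv: (kv[0].casefold(), kv[0]))
    return [f"{name}\t{n}" for name, n in rows]
```

Passing an open file handle for the dictionary text to `count_bot_names`, and its result to `format_listing`, would give rows in the same shape as the examples below.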

Several likely typos can be found. The first one is 'Catech' in

Acacia Catech   1
Acacia Catechu  44

I found the instance, checked the scan and found this to actually be a typo -- not yet corrected.

@gasyoun Here would be a good task for one of your students:

Make a list of all the likely typos, similar to 'Catech'.

The next example I see is 'Acacia Sirisa' in

Acacia Sirisa   1
Acacia Sirissa  36

From this list, we can make corrections.

gasyoun commented 4 years ago

good task for one of your students

Let me try to revive it.

drdhaval2785 commented 4 years ago

One possible way to weed these out would be to find names occurring only once.
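A rough sketch of that idea, assuming the counts are already available as a name-to-count mapping. Each name that occurs once is paired with a similar, much more frequent name using the standard `difflib` module. The thresholds `min_common` and `cutoff` are arbitrary illustrative values, not something agreed in this thread.

```python
import difflib

def likely_typos(counts, min_common=5, cutoff=0.9):
    """Pair each name occurring once with a similar, more frequent name.

    counts maps a name to its number of occurrences. Returns a list of
    (suspect, probable_correct_form, count_of_correct_form) tuples.
    """
    common = [name for name, n in counts.items() if n >= min_common]
    suspects = []
    for name, n in sorted(counts.items()):
        if n != 1:
            continue
        # Keep only the single best match above the similarity cutoff.
        match = difflib.get_close_matches(name, common, n=1, cutoff=cutoff)
        if match:
            suspects.append((name, match[0], counts[match[0]]))
    return suspects
```

Singletons with no close frequent neighbour are dropped here, so a genuine rare name is not flagged. The output is only a list of candidates, and each one still has to be checked against the scan.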

gasyoun commented 3 years ago

@SergeA is not a student, but he loves defects. The dictionaries he's most interested in are MW, Apte, PWG, PWK (in this order). He is ready to check different lists. @funderburkjim what kinds of lists can we give him, some perhaps even in the UI you developed for him? The issue is that he hardly understands what is happening here on GitHub and would like to see or hear about some milestones or plans. Isn't this a good time to review what was done in 2020 and what's still left, @drdhaval2785?

funderburkjim commented 3 years ago

he loves defects

There is no shortage of finding lists that may contain defects. For example, I just opened up ap90.txt (the digitization of the Apte 1890 dictionary), available here, started at the first entry <L>1, and began reading.

The first thing that caught my eye as a defect was {#aMSakaH#}¦ [{#aMa-nvul; aMSikA#} {%f.%}] {@1@} One

that aMa-nvul looks wrong.

The scan of page 2 shows: [image]

So I think aMa-nvul should be aMS-Rvul (remember this is SLP1 spelling for Devanagari text).

aMS-Rvul is a derivation (is that the right term?) of aMSaka.
In AP90 such derivations often appear within a square bracket on the first line of an entry.

There are 4770 instances with [...] occurring on the first line of an entry in ap90.txt, and it is easy to make a list of these: temp_derivation_list.txt
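One way such a list might be extracted is sketched below. It assumes that each entry is introduced by a metaline starting with `<L>` and that the line immediately after it is the first line of the entry. Both are assumptions about the file layout, and this is not necessarily how temp_derivation_list.txt was made.

```python
import re

BRACKET = re.compile(r"\[(.*?)\]")
ENTRY_ID = re.compile(r"<L>([^<]+)")

def first_line_brackets(lines):
    """Yield (entry_id, bracket_text) for entries whose first line has [...]."""
    entry_id = None
    expect_first = False
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("<L>"):
            m = ENTRY_ID.match(line)
            entry_id = m.group(1) if m else None
            expect_first = True
        elif expect_first:
            # Only the line right after the metaline is inspected.
            expect_first = False
            found = BRACKET.search(line)
            if found:
                yield entry_id, found.group(1)
```

Only the first bracket on that line is reported, and brackets on later lines of an entry are ignored.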

I think it would be useful to get all the errors out of these derivations.

This is just one example that happened to occur to me at the moment.

I don't know whether this particular example appeals to @SergeA.

Maybe he's noticed a particular kind of error in a certain dictionary, and would like to try to identify the instances of the error and correct it ? Dhaval or I can likely provide useful 'data science' skills to help someone solve such problems and thereby improve the dictionaries.

funderburkjim commented 3 years ago

Another example of potential defects was developed by @drdhaval2785 here.

These are lists of possible English word spelling errors from various dictionaries.
Only the MD English errors have been done so far.

Sampada is currently working through the BEN English errors.

funderburkjim commented 3 years ago

dictionaries with missing Greek text.

These are recognized by the empty tag <lang n="greek"></lang>.
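A small sketch for locating these placeholders in a dictionary file, assuming the tag appears exactly in this form with nothing but optional whitespace between the opening and closing tag. The function name is illustrative.

```python
import re

EMPTY_GREEK = re.compile(r'<lang n="greek">\s*</lang>')

def missing_greek(lines):
    """Return (line_number, count) for lines holding empty Greek placeholders."""
    hits = []
    for lnum, line in enumerate(lines, start=1):
        n = len(EMPTY_GREEK.findall(line))
        if n:
            hits.append((lnum, n))
    return hits
```

Tags that already contain Greek text are skipped, so the result is a to-do list of places where the Greek still has to be typed in from the scan.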

Help is needed here.

funderburkjim commented 3 years ago

Russian in MW

I have this [issue](https://github.com/sanskrit-lexicon/MWS/issues/55#issuecomment-355829330) regarding Russian text in MW.

This needs to be reviewed and any additional corrections made.

funderburkjim commented 3 years ago

That's enough for now in identifying problems that remain open and are fixable. All we need are some willing helpers.

How do we clone several more of @sanskritisampada ?

gasyoun commented 3 years ago

There is no shortage of finding lists that may contain defects.

So he wants them all.

SLP1 spelling for Devanagari text

Serge dislikes SLP1. Can we have it in IAST for him, please?

I think it would be useful to get all the errors out of these derivations.

Agree.

Maybe he's noticed a particular kind of error in a certain dictionary, and would like to try to identify the instances of the error and correct it ?

He wants your guidance where he can be of help.

How do we clone several more of @sanskritisampada ?

There is one way, but we will need approval from Sampada first ))

Now it's up to @SergeA

funderburkjim commented 3 years ago

@gasyoun You seem to be speaking for @SergeA . Before spending the time to create IAST versions of things for @SergeA, I need you, @SergeA, to confirm what you are interested in helping with; and also exactly which documents you need to be converted to IAST.
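For reference, a minimal sketch of what such a conversion could look like. It assumes Sanskrit text is marked as `{#...#}`, as in the AP90 example above, and uses the standard one-character-per-sound SLP1 table. Accent marks and other special characters are passed through unchanged, and the function names are illustrative.

```python
import re

# Only SLP1 letters whose IAST form differs are listed; the rest map to themselves.
SLP1_TO_IAST = {
    "A": "ā", "I": "ī", "U": "ū", "f": "ṛ", "F": "ṝ", "x": "ḷ", "X": "ḹ",
    "E": "ai", "O": "au", "M": "ṃ", "H": "ḥ",
    "K": "kh", "G": "gh", "N": "ṅ", "C": "ch", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "T": "th", "D": "dh", "P": "ph", "B": "bh",
    "S": "ś", "z": "ṣ",
}

def slp1_to_iast(text):
    """Transliterate a string that is entirely SLP1 into IAST."""
    return "".join(SLP1_TO_IAST.get(ch, ch) for ch in text)

def convert_marked(line):
    """Convert only the {#...#} spans of a line, leaving other text alone."""
    return re.sub(
        r"\{#(.*?)#\}",
        lambda m: "{#" + slp1_to_iast(m.group(1)) + "#}",
        line,
    )
```

Restricting the conversion to the marked spans matters because English text on the same line would otherwise be mangled. Note that IAST is not fully reversible: a hiatus such as a-i becomes indistinguishable from ai.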

Andhrabharati commented 6 months ago

@funderburkjim / @gasyoun,

Is this issue closable now?

funderburkjim commented 6 months ago

Recomputed mw_bot.txt and mw_bio.txt.

Quite a few differences.

These lists do need to be studied with an eye to corrections. For instance, the very first item indicates a (small) error (an extra space at the beginning). In addition to correcting typos, we might consider making changes in capitalization for the purpose of removing gratuitous variations - this might (or might not) involve print changes.
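A sketch of how the gratuitous variations could be surfaced, assuming a name-to-count mapping like the one behind mw_bot.txt. Names are grouped under a key that ignores case and collapses whitespace, and only groups with more than one spelling are kept. The sample names and counts in any usage are made up.

```python
from collections import defaultdict

def case_variants(counts):
    """Group names that differ only in capitalization or spacing.

    Returns {normalized_key: {original_name: count, ...}} for every key
    that has at least two distinct original spellings.
    """
    groups = defaultdict(dict)
    for name, n in counts.items():
        # split/join strips leading and trailing spaces and collapses runs.
        key = " ".join(name.split()).casefold()
        groups[key][name] = n
    return {key: names for key, names in groups.items() if len(names) > 1}
```

Within each group the counts show which spelling dominates, which may help in deciding whether a variation is a typo or a deliberate print convention.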

We have a similar task pending involving the PW dictionary (Refer).

We still need someone to undertake this review.

This issue is not to be closed.

Andhrabharati commented 6 months ago

@funderburkjim

I have noticed a couple of issues that just talk about the bot tags and their cleanup, but these have not been 'handled' completely for many years now. [@gasyoun had even brought a few people aboard exclusively for the task.]

Also, the tags currently include common names in English (mostly!), which cannot be taken as the scientific names to be <bot>-tagged. [If all such are to be included, one has to do it completely for all the plant and tree names (irrespective of language).]

Can a concrete scheme/style be adopted to conclude the matter once for all? [There is an earlier issue where Dhaval asked about the manner of bot-tagging!]

Andhrabharati commented 6 months ago

So far as the <bio>-tag is concerned, I would like to draw attention to two of my earlier posts 1 and 2.

Andhrabharati commented 6 months ago

In addition to correcting typos, we might consider making changes in capitalization for the purpose of removing gratuitous variations - this might (or might not) involve print changes.

This idea of capitalizing the small letters in the scientific names is very welcome and brings them in line with standard norms, though it amounts to slightly 'tampering' with the printed matter. It means changes at ~1500 places in my mw text (and ~1400 places in the CDSL mw text). [It is to be kept in mind that the printed dictionaries were not consistent throughout in naming these, at times using cap.s and at times using small letters.]

It is not out of context to mention that the punctuation marks were pushed outside the quote marks in the MW text (recently by Jim), though it is not as per the British English style. But we sure can 'live' with such small 'liberal' changes!!

Andhrabharati commented 6 months ago

Can a concrete scheme/style be adopted to conclude the matter once for all?

@funderburkjim

Any thoughts on this?

I spent a couple of hours today looking at the MW data and gathering information on the scientific naming of plants and animals (incl. fauna), and stumbled upon a simple and 'great' idea that allows one to close this matter in practically no time. [I guess I should be able to do it in just a day or two.]

This means, no one has 'really' put their mind on the issue for years together, but just making some passing comments!!

funderburkjim commented 6 months ago

Any thoughts on this?

I look forward to a preview of your idea

gasyoun commented 6 months ago

[It is to be kept in mind that the printed dictionaries were not consistent throughout in naming these, at times using cap.s and at times using small letters.]

@Andhrabharati I agree we can change this and need not stick to the print. As we document every change, it's not an issue at all.

no one has 'really' put their mind on the issue for years together, but just making some passing comments

We are eagerly listening.

Andhrabharati commented 6 months ago

Here is a quick summary (in counts)--

image

[On the whole, AB work has more unique entries in both the categories!!]

Andhrabharati commented 6 months ago

I would just like to mention here that I opted to use capitalisation in MW99, as was (very consistently) employed in MW72.