sanskrit-lexicon / WIL

Wilson Sanskrit-English Dictionary. Based on Cologne digitization. Work pertaining to corrections

Botanical names #11

Open drdhaval2785 opened 4 years ago

drdhaval2785 commented 4 years ago

WIL has many botanical names. They produce a lot of false positives when checking for English spelling errors. We need to give them a separate tag, as in SNP.

drdhaval2785 commented 4 years ago

\(([A-Z][^.]*)[.]\) seems to be a good regex to identify the majority of botanical names. There are only a few false positives, which are either markup errors or usages like (In the Astronomy.). They can be easily weeded out.
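
As an illustration of what this regex captures, here is a minimal Python sketch. The sample line is made up for the example and is not taken from wil.txt; the function name `find_candidates` is likewise hypothetical.

```python
import re

# Pattern quoted in the comment above: a parenthesised phrase that starts
# with a capital letter, contains no full stop, and ends with ".)".
BOTANICAL = re.compile(r"\(([A-Z][^.]*)[.]\)")

def find_candidates(line):
    """Return the captured phrases, without the parentheses or final full stop."""
    return BOTANICAL.findall(line)

# Hypothetical sample text, invented for this sketch.
sample = "A tree (Acacia Arabica.) mentioned (In the Astronomy.) and (see above)."
print(find_candidates(sample))
```

The second capture shows the kind of false positive mentioned above: "In the Astronomy" matches the pattern although it is not a botanical name.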

drdhaval2785 commented 4 years ago

wil_botany.txt

This is the extracted data. Someone needs to look into it.

drdhaval2785 commented 4 years ago

A general guide can be:

  1. Ignore items starting with 'In '.
  2. Remove items having more than 2 spaces in between.
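
The two rules above could be sketched as a small filter. This assumes "more than 2 spaces" means more than two space characters in the item; the function name `keep_candidate` and the sample items are hypothetical.

```python
def keep_candidate(item):
    """Apply the two rules of thumb to one extracted item."""
    # Rule 1: phrases such as "In the Astronomy" are not botanical names.
    if item.startswith("In "):
        return False
    # Rule 2 (assumed reading): more than two spaces suggests a longer
    # phrase rather than a scientific name.
    if item.count(" ") > 2:
        return False
    return True

# Hypothetical items, invented for this sketch.
items = ["Acacia Arabica", "In the Astronomy", "A kind of very fragrant grass"]
print([item for item in items if keep_candidate(item)])
```

Items that pass the filter would still need a manual look; the rules only reduce the pile.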

gasyoun commented 4 years ago

This is the extracted data. Someone needs to look into it.

I can ask a student of mine. Want to weed out all non-flora?

drdhaval2785 commented 4 years ago

We need to weed out non-flora. To reduce human labour, we need a list of scientific names of trees and plants. Then we can compare computationally, which will cut the manual work to a great extent.
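
A minimal sketch of such a computational comparison, assuming a reference list of scientific names is available as a plain list of strings. The reference list, the names in it, and the helper names `normalise` and `split_by_reference` are all hypothetical.

```python
def normalise(name):
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())

def split_by_reference(candidates, reference_names):
    """Split candidates into those found in the reference list and the rest."""
    known = {normalise(name) for name in reference_names}
    matched, unmatched = [], []
    for candidate in candidates:
        (matched if normalise(candidate) in known else unmatched).append(candidate)
    return matched, unmatched

# Hypothetical data, invented for this sketch.
reference = ["Abrus precatorius", "Achyranthes aspera"]
candidates = ["Abrus precatorius", "Abrus  Precatorius", "In the Astronomy"]
matched, unmatched = split_by_reference(candidates, reference)
print(matched)
print(unmatched)
```

Only the unmatched items would then need a human reviewer. An exact comparison like this will not catch spelling variants or names that the reference list does not carry.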

funderburkjim commented 4 years ago

mw_bot.txt lists all the 'bot' tags in mw. It can be compared to wil_botany.txt.

gasyoun commented 4 years ago

we need a list of scientific names of trees and plants.

We have done some preliminary work. The problem is that MW uses outdated terminology, and so does WIL.

funderburkjim commented 4 years ago

<bot> and <bio> tags now added to Wilson (wil.txt).

See:

funderburkjim commented 4 years ago

I think this completes what was requested by @drdhaval2785 in the first comment.

gasyoun commented 4 years ago

As regards "There still remain spelling variations which likely need to be corrected in the scientific names": what approach would you suggest to see them close to each other? Just read wil_bio.txt line by line?

funderburkjim commented 4 years ago

Just read wil_bio.txt line by line?

Yes.

Read wil_bot.txt and wil_bio.txt

The output could identify the lines that need to be reviewed manually for corrections. For example, a copy of wil_bio.txt could mark with an asterisk those lines which are likely candidates for correction. Here is the start of such a marked-up copy of wil_bio.txt; each line gives the name and its number of occurrences. Note that those lines without an asterisk can be ignored -- they don't need to be examined further.

Abrus precatorios    1 *
Abrus precatorious   1 *
Abrus precatorius    9 *
Acacia Arabica       2
Acacia sirisa        1 *
Acacia Sirisa        1 *
Acacia Sirisha       1 *
Acacia suma          1
Acheranthes aspera   1
Achyranthes          1

This could be done in an hour or two by a student, I think.
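
One way a marked-up list like this could be produced is sketched below. The thread does not say how the asterisks were actually assigned, so the criterion here is an assumption: a name is flagged when some other distinct name is at least 90% similar to it, ignoring case, as measured by Python's `difflib.SequenceMatcher`. The function name and the threshold are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def flag_near_duplicates(names, threshold=0.9):
    """Return (name, count, flagged) for each distinct name, sorted by name.

    A name is flagged when another distinct name has a case-insensitive
    similarity ratio of at least `threshold` (an assumed criterion).
    """
    counts = Counter(names)
    distinct = sorted(counts)
    flagged = set()
    # Pairwise comparison; quadratic, which is acceptable for short lists.
    for i, first in enumerate(distinct):
        for second in distinct[i + 1:]:
            ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
            if ratio >= threshold:
                flagged.update((first, second))
    return [(name, counts[name], name in flagged) for name in distinct]

# Names and counts taken from the sample listing above, one entry per occurrence.
names = (
    ["Abrus precatorios", "Abrus precatorious"] + ["Abrus precatorius"] * 9
    + ["Acacia Arabica"] * 2 + ["Acacia sirisa", "Acacia Sirisa", "Acacia Sirisha"]
    + ["Acacia suma", "Acheranthes aspera", "Achyranthes"]
)
for name, count, is_flagged in flag_near_duplicates(names):
    print(f"{name}\t{count}\t{'*' if is_flagged else ''}")
```

Because a correct spelling is flagged together with its variants, the reviewer still decides which form is right; the count column can help, since the most frequent form is often the intended one.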

gasyoun commented 4 years ago

Note that those lines without an asterisk can be ignored -- they don't need to be examined further

Thanks, crystal clear and will be done.

drdhaval2785 commented 3 years ago

@amygdalus There are two files: wil_bot.txt (thought to be flora) and wil_bio.txt (thought to be fauna). Can you go through them and let us know if there are any spelling errors, or if some flora ended up as fauna or vice versa? See https://github.com/sanskrit-lexicon/WIL/issues/11#issuecomment-648522990 for the files.