Open drdhaval2785 opened 4 years ago
\(([A-Z][^.]*)[.]\)
seems to be a good regex to identify the majority of botanical names.
There are only a few false positives, which are either markup errors or usages like (In the Astronomy.).
They can be easily weeded out.
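For anyone who wants to try the regex quickly, here is a minimal Python sketch. The function name `extract_candidates` and the sample sentence are made up for illustration; the real input would be the dictionary text.

```python
import re

# The regex from this issue: an opening parenthesis, a capital letter,
# any run of non-period characters, then a period and a closing parenthesis.
BOTANICAL_RE = re.compile(r"\(([A-Z][^.]*)[.]\)")

def extract_candidates(text):
    """Return the text captured inside each '(Xxx yyy.)' group."""
    return BOTANICAL_RE.findall(text)

# Hypothetical sample line, not taken from the dictionary.
sample = "A shrub (Abrus precatorius.) and a planet (In the Astronomy.)"
candidates = extract_candidates(sample)
for name in candidates:
    print(name)
```

As the sample shows, non-botanical phrases of the same shape are captured too, so the output still needs the weeding described above.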
This is the extracted data. Someone needs to look into it.
A general guide can be
> This is the extracted data. Someone needs to look into it.
I can ask a student of mine. Want to weed out all non-flora?
We need to weed out non-fauna. To reduce human labour, we need a list of the scientific names of trees and plants. Then we can compare computationally, which will reduce the human labour to a great extent.
mw_bot.txt lists all the 'bot' tags in mw. It can be compared to wil_botany.txt.
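A rough sketch of what the computational comparison could look like, assuming both sides are available as plain lists of names. The helper `split_by_reference` and the sample data are hypothetical, and the actual layout of mw_bot.txt and wil_botany.txt may need extra parsing first.

```python
def split_by_reference(candidates, reference_names):
    """Split candidates into those found in the reference list and those not.

    Matching is exact apart from case and surrounding whitespace.
    """
    reference = {name.strip().lower() for name in reference_names}
    known, unknown = [], []
    for cand in candidates:
        if cand.strip().lower() in reference:
            known.append(cand)
        else:
            unknown.append(cand)
    return known, unknown

# Hypothetical data: a tiny reference list and a few extracted candidates.
reference_names = ["Abrus precatorius", "Acacia arabica"]
candidates = ["Abrus precatorius", "Acacia Arabica", "In the Astronomy"]

known, unknown = split_by_reference(candidates, reference_names)
print("known:", known)
print("needs review:", unknown)
```

Only the "needs review" list would have to be checked by hand. Exact matching will miss spelling variants and names that differ from the reference list, so those also end up in the review pile.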
> we need list of scientific names of trees, plants.
We have done some preliminary work. The problem is that MW uses outdated terminology and so does WIL.
<bot> and <bio> tags now added to Wilson (wil.txt).
See:
I think this completes what was requested by @drdhaval2785 in the first comment.
As regards "There still remain spelling variations which likely need to be corrected in the scientific names": what approach would you suggest to bring the variants close to each other? Just read wil_bio.txt line by line?
> Just read wil_bio.txt line by line?
Yes.
Read wil_bot.txt and wil_bio.txt
The output could identify the lines that need to be reviewed manually for corrections. For example, a copy of wil_bio.txt could add an asterisk by those lines which would probably be in line for correction. Here is the start of such identification of wil_bio.txt. Note that those lines without an asterisk can be ignored -- they don't need to be examined further.
Abrus precatorios 1 *
Abrus precatorious 1 *
Abrus precatorius 9 *
Acacia Arabica 2
Acacia sirisa 1 *
Acacia Sirisa 1 *
Acacia Sirisha 1 *
Acacia suma 1
Acheranthes aspera 1
Achyranthes 1
This could be done in an hour or two by a student, I think.
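The asterisk marking could also be pre-computed as a first pass. Below is a minimal sketch using Python's standard `difflib`; the function `flag_variants`, the 0.85 threshold, and the case-insensitive comparison are assumptions for illustration, not the method used to produce the list above.

```python
from difflib import SequenceMatcher

def flag_variants(entries, threshold=0.85):
    """Return (name, count, flagged) triples.

    An entry is flagged when some other entry is at least `threshold`
    similar to it, ignoring case. The threshold is an assumed value.
    """
    names = [name for name, _ in entries]
    result = []
    for i, (name, count) in enumerate(entries):
        flagged = any(
            j != i
            and SequenceMatcher(None, name.lower(), other.lower()).ratio() >= threshold
            for j, other in enumerate(names)
        )
        result.append((name, count, flagged))
    return result

# The ten sample lines quoted above, as (name, count) pairs.
entries = [
    ("Abrus precatorios", 1), ("Abrus precatorious", 1), ("Abrus precatorius", 9),
    ("Acacia Arabica", 2), ("Acacia sirisa", 1), ("Acacia Sirisa", 1),
    ("Acacia Sirisha", 1), ("Acacia suma", 1), ("Acheranthes aspera", 1),
    ("Achyranthes", 1),
]

for name, count, flagged in flag_variants(entries):
    print(f"{name} {count}{' *' if flagged else ''}")
```

This compares every pair of names, which is fine for a short list; for a long file, comparing only neighbouring lines of the sorted list would be cheaper. The flags are only hints, and the marked lines still need a human decision on which spelling is correct.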
> Note that those lines without an asterisk can be ignored -- they don't need to be examined further
Thanks, crystal clear and will be done.
@amygdalus There are two files: wil_bot.txt (thought to be flora) and wil_bio.txt (thought to be fauna). Can you go through them and let us know if there are any spelling errors, or if some flora became fauna or vice versa? See https://github.com/sanskrit-lexicon/WIL/issues/11#issuecomment-648522990 for the files.
WIL has many botanical names. They are causing a lot of false positives when looking for English spelling errors. We need to give them a separate tag, as in SNP.