Closed gasyoun closed 3 months ago
Here is a list to get you started. It is a csv file. Records are like:
aMSumatI,72,Hedysarum Gangeticum
aMSumatPalA,73,Musa Paradisiaca
akarA,168,Emblic Myrobalan
akarA,168,Phyllanthus Emblica
...
akzamAlA,495,Eleocarpus <<< (only one word here)
key1 (SLP) , L,contents of bot tag. Records are in L-order, and there may be multiple entries for a given L (like akarA).
Made changes to mw.xml to (a) Replace the _ with space in bot tags (b) Merge consecutive bot tags.
The listing above was done after these changes.
Here is a list to get you started. It is a csv file.
Please add the code or Regex you used to get it.
Merge consecutive bot tags
Great, thanks, helps a lot. Jim, can you a link from the SLP1 name to MW Cologne URL, please? 7317 clicking is 7317 better than copypasting.
<c><bot>Guilandina</bot>_or_<bot>Hyperanthera</bot>_<bot>Moringa</bot></c> <ls>L.</ls>
<ls>L.</ls>
after botanical nomenclature is not L[exicographer], but Carl Linnaeus.
So the L. needs to kept in a separate column.
<c>the_plant_<bot>Convolvolus</bot>_<bot>Argenteus</bot>_or_<bot>Ipomoea</bot>_<bot>Pes</bot>_<bot>Caprae</bot>_<bot>Roth.</bot></c>
As you can see it's not always Linnaeus, in case above it's Roth, but not Sanskritist, it's https://en.wikipedia.org/wiki/Albrecht_Wilhelm_Roth So instead of
There are 45 cases of .</bot>
= 45 wrong markups.
<bot>Roxb.</bot>
<bot>Hex.</bot>
<bot>Gaertn.</bot>
<bot>Nees.</bot>
<bot>Schott.</bot>
<bot>Bl.</bot>
<bot>Wall.</bot>
<bot>Benth.</bot>
<bot>Spreng.</bot>
<bot>Willd.</bot>
<bot>Schott.</bot>
<bot>Wall.</bot>
and wrongly marked
<bot>Erycibe_Paniculata_Roxb.</bot>
-> <bot>Erycibe_Paniculata</bot><ls>Roxb.</ls>
<bot>Blyxa_Octandra_Rich.</bot>
as well, I guess
<bot>Roxburghii_Wall.</bot>
<bot>Arabicus._I.</bot>
-> <ls>I.</ls>
?
Quite many. Everything as a 3rd element with a dot is a surname after a plant.
<c>a_kind_of_grass_<p>Deotar_,_<bot>Andropogon</bot>_<bot>Serratus</bot></p></c>
The surrounding text from
And another column - text after. See
<ls>L.</ls> <p><c>prob.</c>~ <ab>w.r.</ab>~<c>for</c>~<s>-saha</s>
Now we have
Zanthoxylon Alatum
Zanthoxylon Hastile
Zanthoxylon Rhetsa
As one entry great. But additionaly to this column Yuliya wants to have them separated as well. So she can sort in all possible ways. Do I express myself correct, @Amygdalus?
Convolvulus Paniculatus
-> Convolvulus
unmarked [L=203591]
It seems to me, that you are right. I would like to have a list with several columns; a) translit b) name of genus c) epithet of species - should start from lowercase letter if it's possible to do by machine method d) name of author of description e) another information (like which part of a plant is mentioned - seeds, roots, etc.). f) Harvard translit j) another information needed
And, please, could you give me some instructions about my table in Excel: how should I present the data in columns to quick 'return' to MW? Let's think about return of data to MW. Copypaste with over 7000 names is awful.
Sorry for English. French is easier. :)
P.S. Didn't understand the last sentence: Convolvulus Paniculatus -> Convolvulus unmarked [L=203591] Paniculatus - is a species' epithet. It should be kept.
P.S. Didn't understand the last sentence: Convolvulus Paniculatus -> Convolvulus unmarked [L=203591] Paniculatus - is a species' epithet. It should be kept.
Everything is fine with the printed text, but there is a digital issue. Text is perfect, but sometimes markup issues occur. As the Convolvulus
is left unmarked in the orignal file, it will not be in XLS file of yours as well. So your work will help to clean the XML file as well. It's both ways.
Several corrections made based on the above comment. Then allbot output regenerated.
Have generated output in two forms:
<bot>
tag. I'll worry about lower-casing the species, and other details once the
data problems are less prominent.9418 lines written to allbot1.txt
NUMBER OF WORDS IN <BOT>: FREQUENCY
1 2290
2 7076
3 50
4 2
Casual observation of the nwords=1 cases of the allbot1 file led to the suspicion that some of the
<bot>x</bot>
instances where x is a word that starts with a lower-case letter are mis-marked.
For instance I suspect that <bot>sesamum</bot>
should not have the <bot>
tag -- rather it is
just a common name and should be treated like any other text. That's the hypothesis.
A filter of allbot1 for such cases results in 237 such instances, in 46 different words. The list is allbot1_lower_one.txt.
Maybe @Amygdalus can identify the ones of these where the <bot>
tag should be removed, and I'll do those corrections.
Maybe @Amygdalus can identify the ones of these where the
tag should be removed, and I'll do those corrections.
Yes, that would be perfect. Thanks, Jim.
Noticed an instance under headword SatacCada:
a sort of woodpecker , <bio>Picus</bio> <bot>Bengalensis</bot> <ls>L.</ls>
Changing to:
a sort of woodpecker , <bio>Picus Bengalensis</bot> <ls>L.</ls>
There are 6 more </bio> <bot>
cases to examine for mis-marking, and
26 cases of form </bot> <bio>
to examine.
Incidentally the <bio>X</bio>
tag in mw.xml is intended for scientific names of animals.
This incorrectly marked as <bot>
. 6 cases
Convolvulus Paniculatus -> Convolvulus unmarked
There are many like this. Present work identified and corrected markup for 711 of this type.
temp_bot11_log.txt shows the newly marked instances (228 different instances). My glance through this file suggests the corrections didn't generate any false positives. @Amygdalus might check that all these look like legitimate botanical names.
After correction, allbot1 files regenerated, under name allbot1a:
9379 lines written to allbot1a.txt
NUMBER OF WORDS IN <BOT>: FREQUENCY
1 1541
2 7786
3 50
4 2
Present work identified and corrected markup for 711 of this type.
Good to know it can be partly automated as well.
Dear Jim! I would like to be sure that I understand your task correctly:
"or instance I suspect that
"There are 6 more
3."temp_bot11_log.txt shows the newly marked instances (228 different instances). My glance through this file suggests the corrections didn't generate any false positives". Should I check 228 (or more?) names - to identify their botanical issue?
f. Coc. Tomentosus &c. VarBṛS. ; Bhpr. &c. [L=253014]
Coc.
unmarked.
m. Blumea lacera L. [L=84090]
lacera
unmarked.
@Amygdalus
Please forgive me for letting this slip by. I'm still not ready to continue the investigation, but have not forgotten about it.
I was reminded about it when commenting on a reorganization of the Cologne digitization of Meulenbeld's Sanskrit Name of Plants; brief comments here; and some orientation in the front matter.
Is this something you still have an interest in pursuing, when I get the time to attend to my part ?
Also, could you tell me a bit about yourself, so I'll know how to think about your interest in this subject?
Dear Jim, I remember about this work and I find a true interest in it. So, I'll wait for your possibility to continue working with it (and I use this time like a chance to finish or to ameliorate some of my own affairs). I need only an Excel file which will be simply reconvertable for you in MW after finishing the task. About my interest: I've got PhD in biology in 2009 and worked in university on ecology chair till 2017. My specialisation is plant ecology, but of course I have a possibility to recognise and to renew plant names. I already did the same type of work for Michel Angot's article, with whom I work as a translator from French into Russian from 2005. The main interest is in using professional possibilities for development of Sanskrit's studies - I look at it like at useful hobby. My husband is also connected to sanskrit. So, I hope to get the Excel file some day, but I don't hurry the participants because of my own affairs with children (7 monthes and 7 years) and finishing proofreading a translation of a book.
@Amygdalus Do you still have interest in this work?
Guess not.
Dear @drdhaval2785 ! Yes, I'm still interested in this work, but I don't work now with Marcis Gasuns at all. I'm a member of another team of sanskrit specialists. So if this work is actual, I need the same thing as in 2018: a clear instructions about the document which I should transmit to the project during the work.
In reviewing the comment history, I must admit I don't have a clear idea of the objectives.
This was first brought up by Marcis, so he and @Amygdalus should develop the requirements and objectives. Then maybe Dhaval and/or I can help.
Even more ideal would be if @Amygdalus can do programming. It's always better if the person with the primary interest does most of the work. I would prefer in this instance at this time to advise rather than to do all the coding.
@Amygdalus let me ask if you have a picture in mind of what is required and what can actually be done by you?
Dear @funderburkjim ! I can't do programming - I have another specialisation. And this project is only one of many others things I have to do - it's not my primary interest.
As far as I understand, firstly the question about the mode of appearance of this renewed names in the dictionary should be decided. But I don't depend on your decisions: I can do my work by adding a new column of new names - near already mentioned species from MW. In my Excel file. Than I can transmit this work to those who are responsible for the dictionary - Dhaval or you.
If my work is not applicable to the dictionary (or nobody knows how to include it into MW), I can put it to open sources as a free reference book. Such work is not a work of one person (I mean the botanical side of work). For example, I should address to the specialists of regional flora. So if we decide to do this work, I'll invite another people interested in it and I'll consult with botanists.
@drdhaval2785 I can't check two lists (plants and animals) only by having a look on them. I should read each line. So only during the work I can find all mistakes.
@Amygdalus Thanks for your reply.
Suggest you provide CSV form of your Excel file, as it will be easier for further use in non-excel workflows. (CSV = comma-separated-values. Sometimes, if the values have commas, it is better to use TSV = tab-separated-values. Probably excel has a way to export a table into this form )
You can upload such a text file to this issue simply by dragging it into a comment.
Also, suggest you send a very early form of the table you have in mind, with only a few rows. That way, we can understand what is involved and have something tangible to comment on.
Do you need something from us to get started?
Probably excel has a way to export a table into this form
Exactly.
@funderburkjim I need nothing. I have an Excel file with data, and after 15 january I'll organize my time with this activity in my timetable.
@Amygdalus ok. Sounds good.
Let me mention one thought that may (or may not) be something related to what you are doing.
It is likely that there are some errors in our <bot>
and <bio>
tagging of mw. This could be:
When you notice errors in mw (such as in that allbot1.txt
file mentioned in a comment above),
we would like to correct the errors; so make some kind of list of such errors as you notice them.
Again, just a suggestion.
@funderburkjim yes, Jim, of course.
@funderburkjim list of a few hundred possible errors by @artanat in digital copy of MW
By a fuzzy-search similar algorithm.
About the actualisation of names of plants: I wait now for some books from India in sanskrit for my work, because after some spade-work it became obvious that the task is more complicated than it was expected. So I need some original sources to go further. As for the list of possible errors above - there are some mistakes in correction and the list should be checked manually.
some books from India
Would love to know the list, because most books are already scanned.
more complicated than it was expected
Bet it is.
I need some original sources
Please let me know the list. I'll see what I can do, to move things forward.
some mistakes in correction and the list should be checked manually.
Sure, but it's the first thing to start with. To weed out the big dirt. @funderburkjim most of them OCR errors.
@Amygdalus Love your 'didyoumean' file!
First observation re wrightia antidysenteria wrightia antidysenterica
from didyoumean file.
Wrightia antidysenteria
occurs 4 times in mw.txt while wrightia antidysenterica
occurs 43 times.
And first instance Wrightia antidysenteria
is under hw=kaliNga
Conclusion is that antidysenteria
should be changed to antidysenterica
AND that it is a
print change. This conclusion seems likely correct in this example due to the 43/4 instance counts.
But, to be pedantic,
@Amygdalus What method did you use to determine your suggested revisions?
For example, do you assert that there are NOT 2 different plants with similar names in this case?
If we can be highly certain that the first column is wrong and the 2nd column is right, that will simplify our correction task.
Noticed that there are 74 cases like: <bot>Terminalia Alata</bot> <bot>Tomentosa</bot>
I can think of no good reason for having 2 contiguous bot elements.
i.e. Should do regex replacement </bot> *<bot>
-> ' ' (one space).
e.g. <bot>Terminalia Alata Tomentosa</bot>
Sorry, but it's not my file. Mārcis wrote:
@funderburkjim list of a few hundred possible errors by @artanat in digital copy of MW
I just wrote that all this columns should be checked manually.
@Amygdalus Sorry, misunderstood.
@gasyoun / @artanat Can you indicate your method? So we have better idea of what we need to check further before making corrections?
Can you indicate your method
Fuzzy search in Excel. But initially we parsed 3 websites with plant names and when the search did not return an entry, the website itself proposed as a result, based on fuzzy search on the website itself. Similar to the o_vs_O method. The shorter entries are not as obvious (1 word vs. 2-3 word entries), the longer are rather obvious and ready to be implemented, without checking with books other that the MW original scan.
OK. Helpful comments.
Made change to MW re </bot> <bot>
to space. See csl-orig commit above.
Marcis points out an oddity from the commit:
<bot>Areca Faufel or Catechu</bot>
With question should the or cases be united?
Agree it is odd. Not sure whether it should be changed, such as to
<bot>Areca Faufel</bot> or <bot>Catechu</bot>
Agree it is odd. Not sure whether it should be changed, such as to
It should be for our purpose not:
<bot>Areca Faufel</bot> or <bot>Catechu</bot>
but
<bot>Areca Faufel</bot> or <bot>Areca Catechu</bot>
or, not to change the print, like
<bot tag=Areca Faufel>Areca Faufel</bot> or <bot tag=Areca Catechu>Catechu</bot>
It's a long road and I would want to start walking it, because it will take years...
@gasyoun / @funderburkjim I fail to see any reason why it should take so long. [My opinion is it's a matter of just few days, if a right person with right mindset acts upon it.]
Anyway, I think this issue may be safely closed now, as the subject matter would be dealt at #74 in future.
Agree that this issue can be closed -- The content above needs to be consulted when we tackle #74.
In http://www.sanskrit-lexicon.uni-koeln.de/talkMay2008/markingMonier.html
That's the prehistory.
To parse
<bot>Hedysarum_Gangeticum</bot>
is easy. But how to parse<bot>Musa</bot>_<bot>Paradisiaca</bot>
and to both words in 1 bot tag, not 2.Some interesting cases
Jim, I would like to see the XML data in XLS or a list: Premna Spinosa, see havirmantha Premna Longifolia, see havirmantha Hibiscus Mutabilis, see sthalanīraja
@funderburkjim if you can extract the data, @Amygdalus can help us. https://github.com/sanskrit-lexicon/CORRECTIONS/issues/249 is just the beginning. Not sure what others think of how important it is. The first stage does not look too tough. It's a long road and I would want to start walking it, because it will take years...