sanskrit-lexicon / MWS

Monier Monier-Williams, Sir; A Sanskrit-English dictionary. Oxford, 1899

2008 vs. 2018 <bot> markup cleanup #51

Closed gasyoun closed 3 months ago

gasyoun commented 6 years ago

In http://www.sanskrit-lexicon.uni-koeln.de/talkMay2008/markingMonier.html

  1. transliteration of Greek
  2. botanical terms Over 15000 words are tagged as parts of scientific names of plants, and another 500 of animals. Linking these words to a currently accepted authoritative taxonomy database would be useful.
  3. literary sources Currently, one can get easy cross references from an abbreviated literary source in the text to the name of the work or author. However, it would be nice to have links to specific text from a work. This task probably requires, in general, as yet undigitized information. However, some sub-tasks, such as links to the Paninian references, might be doable.

That's the prehistory.

  1. Finished in 2017.
  2. Not "15000 words", but 15826 cases or 8408 entries that contain <bot>. There is a new woman in town, @Amygdalus, who has a PhD in biology. We can now start what was planned in 2008.
  3. Rigveda and Panini will be the starting points, https://github.com/sanskrit-lexicon/Cologne/issues/93 I hope @amordvinov will help us here.

To parse <bot>Hedysarum_Gangeticum</bot> is easy. But how to parse <bot>Musa</bot>_<bot>Paradisiaca</bot> and put both words in one bot tag, not two?

Some interesting cases

<c><bot>sesamum</bot>_grain</c>
<c><quote>_sacrificial_grain_</quote>_,_<bot>sesamum</bot></c>

<c><bot>Premna</bot>_<bot>Spinosa</bot>_or_<bot>Longifolia</bot></c>
<c>the_<bot>Kadamba</bot>_tree</c>
<c><quote>_impregnated_with_oil_</quote>_,_<bot>Pinus</bot>_<bot>Deodora</bot></c>
<c>the_flower_of_<abE>Hib</abE><bot>Hibiscus</bot>_<abE>Mut</abE><bot>Mutabilis</bot></c>

Jim, I would like to see the XML data in XLS or as a list:

Premna Spinosa, see havirmantha
Premna Longifolia, see havirmantha
Hibiscus Mutabilis, see sthalanīraja

@funderburkjim if you can extract the data, @Amygdalus can help us. https://github.com/sanskrit-lexicon/CORRECTIONS/issues/249 is just the beginning. Not sure what others think of how important it is. The first stage does not look too tough. It's a long road and I would want to start walking it, because it will take years...

funderburkjim commented 6 years ago

Here is a list to get you started. It is a csv file. Records are like:

aMSumatI,72,Hedysarum Gangeticum
aMSumatPalA,73,Musa Paradisiaca
akarA,168,Emblic Myrobalan
akarA,168,Phyllanthus Emblica
...
akzamAlA,495,Eleocarpus  <<< (only one word here)

Fields: key1 (SLP1), L, contents of bot tag. Records are in L-order, and there may be multiple records for a given L (like akarA).

allbot.txt
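A minimal sketch of how a listing in this key1,L,name form could be produced. It is not the code actually used for allbot.txt. It assumes every entry sits on one line and carries `<key1>` and `<L>` elements; the sample entry and the function name `bot_records` are made up for illustration.

```python
import re

# Assumption: one dictionary entry per line, with <key1>…</key1> holding the
# SLP1 headword and <L>…</L> holding the record number.
KEY1 = re.compile(r"<key1>(.*?)</key1>")
LNUM = re.compile(r"<L>(.*?)</L>")
BOT = re.compile(r"<bot>(.*?)</bot>")

def bot_records(lines):
    """Yield (key1, L, bot text) for every <bot> element in each entry line."""
    for line in lines:
        key = KEY1.search(line)
        lnum = LNUM.search(line)
        if not (key and lnum):
            continue  # not an entry line under the assumed layout
        for name in BOT.findall(line):
            yield key.group(1), lnum.group(1), name

# Hypothetical, simplified entry line (not copied from mw.xml).
sample = [
    "<H1><h><key1>akarA</key1></h><body><bot>Emblic Myrobalan</bot> , "
    "<bot>Phyllanthus Emblica</bot></body><tail><L>168</L></tail></H1>",
]
rows = [",".join(record) for record in bot_records(sample)]
```

Each headword with several `<bot>` elements yields several rows, which matches the repeated akarA records above.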

funderburkjim commented 6 years ago

merging

Made changes to mw.xml to (a) Replace the _ with space in bot tags (b) Merge consecutive bot tags.

The listing above was done after these changes.
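The two changes can be sketched roughly as below. This is an illustration under assumptions, not the script that was run on mw.xml: it assumes consecutive tags are separated by exactly one underscore, it merges first and replaces underscores second, and the function name `normalize_bot` is hypothetical.

```python
import re

def normalize_bot(text):
    """Merge adjacent <bot> tags, then turn '_' into spaces inside them."""
    # Merge: <bot>Musa</bot>_<bot>Paradisiaca</bot> -> <bot>Musa_Paradisiaca</bot>
    merged = re.sub(r"</bot>_<bot>", "_", text)
    # Replace underscores only inside <bot>…</bot>, leaving other text alone.
    return re.sub(
        r"<bot>(.*?)</bot>",
        lambda m: "<bot>" + m.group(1).replace("_", " ") + "</bot>",
        merged,
    )

example = "<c><bot>Premna</bot>_<bot>Spinosa</bot>_or_<bot>Longifolia</bot></c>"
result = normalize_bot(example)
```

Tags separated by anything other than a single underscore (for example the `_or_` case, or the `<abE>` abbreviations in the Hibiscus example) are left unmerged by this sketch.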

gasyoun commented 6 years ago

Here is a list to get you started. It is a csv file.

Please add the code or Regex you used to get it.

Merge consecutive bot tags

Great, thanks, helps a lot. Jim, can you add a link from the SLP1 name to the MW Cologne URL, please? Clicking 7317 times is better than copy-pasting 7317 times.

<c><bot>Guilandina</bot>_or_<bot>Hyperanthera</bot>_<bot>Moringa</bot></c> <ls>L.</ls>

<ls>L.</ls> after botanical nomenclature is not L[exicographer], but Carl Linnaeus.

So the L. needs to be kept in a separate column.

<c>the_plant_<bot>Convolvolus</bot>_<bot>Argenteus</bot>_or_<bot>Ipomoea</bot>_<bot>Pes</bot>_<bot>Caprae</bot>_<bot>Roth.</bot></c> 

As you can see, it's not always Linnaeus. In the case above it's Roth, and not the Sanskritist but the botanist (https://en.wikipedia.org/wiki/Albrecht_Wilhelm_Roth). So instead of <bot>Roth.</bot> it should be <ls>Roth.</ls>. There is no such grass as Roth. :)

There are 45 cases of .</bot> = 45 wrong markups.

<bot>Roxb.</bot>
<bot>Hex.</bot>
<bot>Gaertn.</bot>
<bot>Nees.</bot>
<bot>Schott.</bot>
<bot>Bl.</bot>
<bot>Wall.</bot>
<bot>Benth.</bot>
<bot>Spreng.</bot>
<bot>Willd.</bot>
<bot>Schott.</bot>
<bot>Wall.</bot>

and wrongly marked

<bot>Erycibe_Paniculata_Roxb.</bot> -> <bot>Erycibe_Paniculata</bot><ls>Roxb.</ls>

<bot>Blyxa_Octandra_Rich.</bot> as well, I guess

<bot>Roxburghii_Wall.</bot>

<bot>Arabicus._I.</bot> -> <ls>I.</ls> ?

Quite many. Everything as a 3rd element with a dot is a surname after a plant.
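The rule of thumb above (a final element ending in a dot is an author, not part of the plant name) could be sketched as follows. The pattern is an assumption and will also catch abbreviated genus names, so every hit would still need a manual check; `split_author` is a hypothetical name.

```python
import re

# Assumption: an author abbreviation is the last token inside <bot>…</bot>,
# starts with a capital letter and ends with a dot.
AUTHOR_TAIL = re.compile(r"<bot>(?:([^<]*?)[_ ])?([A-Z][A-Za-z]*\.)</bot>")

def split_author(text):
    """Move a trailing author abbreviation out of <bot> into its own <ls>."""
    def repl(match):
        name, author = match.group(1), match.group(2)
        plant = f"<bot>{name}</bot>" if name else ""
        return f"{plant}<ls>{author}</ls>"
    return AUTHOR_TAIL.sub(repl, text)

fixed = split_author("<bot>Erycibe_Paniculata_Roxb.</bot>")
```

A tag that holds only the abbreviation, such as `<bot>Roth.</bot>`, becomes `<ls>Roth.</ls>` with no empty `<bot>` left behind.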

<c>a_kind_of_grass_<p>Deotar_,_<bot>Andropogon</bot>_<bot>Serratus</bot></p></c>

The surrounding text from <c> should be kept as well. In this case "a_kind_of_grass" goes in a new column: text before.

And another column - text after. See

<ls>L.</ls> <p><c>prob.</c>~ <ab>w.r.</ab>~<c>for</c>~<s>-saha</s>

Now we have

Zanthoxylon Alatum
Zanthoxylon Hastile
Zanthoxylon Rhetsa

As one entry, great. But in addition to this column Yuliya wants to have them separated as well, so she can sort in all possible ways. Do I express myself correctly, @Amygdalus?

Convolvulus Paniculatus -> Convolvulus unmarked [L=203591]

Amygdalus commented 6 years ago

It seems to me that you are right. I would like to have a list with several columns:

a) translit
b) name of genus
c) epithet of species (should start with a lowercase letter, if that can be done by machine)
d) name of the author of the description
e) other information (like which part of a plant is mentioned: seeds, roots, etc.)
f) Harvard translit
g) any other information needed
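Columns b) and c) could be derived mechanically along these lines. This is a sketch that assumes the first word of a `<bot>` value is the genus and the rest is the epithet, which will be wrong for common names such as "Emblic Myrobalan"; `split_name` is a hypothetical helper.

```python
def split_name(bot_text):
    """Split 'Genus Epithet' into (genus, epithet), lower-casing the epithet."""
    parts = bot_text.split()
    genus = parts[0] if parts else ""
    # Everything after the first word is treated as the epithet.
    epithet = " ".join(parts[1:]).lower()
    return genus, epithet

genus, epithet = split_name("Hedysarum Gangeticum")
```

Single-word values such as "Eleocarpus" come back with an empty epithet, so they are easy to filter for review.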

And, please, could you give me some instructions about my table in Excel: how should I present the data in columns to quick 'return' to MW? Let's think about return of data to MW. Copypaste with over 7000 names is awful.

Sorry for English. French is easier. :)

P.S. Didn't understand the last sentence: Convolvulus Paniculatus -> Convolvulus unmarked [L=203591] Paniculatus - is a species' epithet. It should be kept.

gasyoun commented 6 years ago

P.S. Didn't understand the last sentence: Convolvulus Paniculatus -> Convolvulus unmarked [L=203591] Paniculatus - is a species' epithet. It should be kept.

Everything is fine with the printed text, but there is a digital issue. The text is perfect, but sometimes markup issues occur. As Convolvulus is left unmarked in the original file, it will not be in your XLS file either. So your work will help to clean the XML file as well. It works both ways.

funderburkjim commented 6 years ago

Corrections: 1

Several corrections made based on the above comment. Then allbot output regenerated.

Have generated output in two forms:

Statistics on number of words in bot tag

9418 lines written to allbot1.txt
NUMBER OF WORDS IN <BOT>: FREQUENCY
1 2290
2 7076
3 50
4 2
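A frequency table like the one above can be computed in a few lines. This is a minimal sketch, assuming the input is simply a list of `<bot>` values; it is not the script that wrote allbot1.txt, and the sample values are taken from the records quoted earlier in the thread.

```python
from collections import Counter

def word_count_stats(names):
    """Count how many <bot> values have 1, 2, 3, ... whitespace-separated words."""
    return Counter(len(name.split()) for name in names)

stats = word_count_stats(["Hedysarum Gangeticum", "Musa Paradisiaca", "Eleocarpus"])
for nwords in sorted(stats):
    print(nwords, stats[nwords])
```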

first study of the single instance words

Casual observation of the nwords=1 cases of the allbot1 file led to the suspicion that some of the <bot>x</bot> instances where x is a word that starts with a lower-case letter are mis-marked. For instance I suspect that <bot>sesamum</bot> should not have the <bot> tag -- rather it is just a common name and should be treated like any other text. That's the hypothesis.

A filter of allbot1 for such cases results in 237 such instances, in 46 different words. The list is allbot1_lower_one.txt.
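The filter described here could look roughly like this; a sketch assuming a plain list of `<bot>` values, with the hypothetical helper `lower_single_words` returning both the instances and the count of distinct words.

```python
from collections import Counter

def lower_single_words(names):
    """Return single-word <bot> values that start with a lower-case letter."""
    return [n for n in names if len(n.split()) == 1 and n[:1].islower()]

hits = lower_single_words(["sesamum", "Musa Paradisiaca", "Eleocarpus", "sesamum"])
distinct = Counter(hits)  # instances per distinct word
```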

Maybe @Amygdalus can identify the ones of these where the <bot> tag should be removed, and I'll do those corrections.

gasyoun commented 6 years ago

Maybe @Amygdalus can identify the ones of these where the <bot> tag should be removed, and I'll do those corrections.

Yes, that would be perfect. Thanks, Jim.

funderburkjim commented 6 years ago

Bengalensis : plant and animal

Noticed an instance under headword SatacCada:

a sort of woodpecker , <bio>Picus</bio> <bot>Bengalensis</bot> <ls>L.</ls>
Changing to:
a sort of woodpecker , <bio>Picus Bengalensis</bio> <ls>L.</ls>

There are 6 more </bio> <bot> cases to examine for mis-marking, and 26 cases of form </bot> <bio> to examine.
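Candidates of both forms can be listed with one regular expression. A minimal sketch, assuming the tags are separated only by spaces; `mixed_tag_lines` is a hypothetical name and the second sample line is made up.

```python
import re

# Matches a <bio> element directly followed by a <bot> element, or the reverse.
MIXED = re.compile(r"</bio> *<bot>|</bot> *<bio>")

def mixed_tag_lines(lines):
    """Return (line number, line) pairs where <bio> and <bot> are adjacent."""
    return [(i, line) for i, line in enumerate(lines, start=1) if MIXED.search(line)]

lines = [
    "a sort of woodpecker , <bio>Picus</bio> <bot>Bengalensis</bot> <ls>L.</ls>",
    "<bot>Musa Paradisiaca</bot>",
]
found = mixed_tag_lines(lines)
```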

Incidentally the <bio>X</bio> tag in mw.xml is intended for scientific names of animals.

Gracula Religiosa is a bird

This is incorrectly marked as <bot> (6 cases).

funderburkjim commented 6 years ago

Corrections 2

Convolvulus Paniculatus -> Convolvulus unmarked

There are many like this. Present work identified and corrected markup for 711 of this type.

temp_bot11_log.txt shows the newly marked instances (228 different instances). My glance through this file suggests the corrections didn't generate any false positives. @Amygdalus might check that all these look like legitimate botanical names.

After correction, allbot1 files regenerated, under name allbot1a:

Revised Statistics on number of words in bot tag

9379 lines written to allbot1a.txt
NUMBER OF WORDS IN <BOT>: FREQUENCY
1 1541
2 7786
3 50
4 2

gasyoun commented 6 years ago

Present work identified and corrected markup for 711 of this type.

Good to know it can be partly automated as well.

Amygdalus commented 6 years ago

Dear Jim! I would like to be sure that I understand your task correctly:

  1. "For instance I suspect that <bot>sesamum</bot> should not have the <bot> tag -- rather it is just a common name and should be treated like any other text". Should I identify the 46 words as botanical names? In this case Sesamum is the Latin name of a plant genus. As far as I understand, you need some marks showing whether it's a common name or a Latin name, don't you?

  2. "There are 6 more </bio> <bot> cases to examine for mis-marking, and 26 cases of form </bot> <bio> to examine". Sorry, could you give me a list to look at? Or tell me how to see it in the text?

  3. "temp_bot11_log.txt shows the newly marked instances (228 different instances). My glance through this file suggests the corrections didn't generate any false positives". Should I check the 228 (or more?) names to identify their botanical status?

gasyoun commented 6 years ago

f. Coc. Tomentosus &c. VarBṛS. ; Bhpr. &c. [L=253014]

Coc. unmarked.

m. Blumea lacera L. [L=84090]

lacera unmarked.

funderburkjim commented 6 years ago

@Amygdalus

Please forgive me for letting this slip by. I'm still not ready to continue the investigation, but have not forgotten about it.

I was reminded about it when commenting on a reorganization of the Cologne digitization of Meulenbeld's Sanskrit Name of Plants; brief comments here; and some orientation in the front matter.

Is this something you still have an interest in pursuing, when I get the time to attend to my part ?

Also, could you tell me a bit about yourself, so I'll know how to think about your interest in this subject?

Amygdalus commented 6 years ago

Dear Jim, I remember this work and I am truly interested in it. So I'll wait until you are able to continue with it (and I'll use this time as a chance to finish or improve some of my own affairs). I need only an Excel file that you can easily convert back into MW after the task is finished. About my interest: I got my PhD in biology in 2009 and worked at a university, in the ecology department, until 2017. My specialisation is plant ecology, but of course I am able to recognise and update plant names. I already did the same type of work for an article by Michel Angot, with whom I have worked as a translator from French into Russian since 2005. My main interest is in using my professional skills for the development of Sanskrit studies; I see it as a useful hobby. My husband is also connected to Sanskrit. So I hope to get the Excel file some day, but I am not hurrying the participants, because I have my own affairs with children (7 months and 7 years) and am finishing proofreading a translation of a book.

drdhaval2785 commented 3 years ago

@Amygdalus Do you still have interest in this work?

gasyoun commented 3 years ago

Guess not.

Amygdalus commented 3 years ago

Dear @drdhaval2785 ! Yes, I'm still interested in this work, but I don't work with Marcis Gasuns at all now. I'm a member of another team of Sanskrit specialists. So if this work is still relevant, I need the same thing as in 2018: clear instructions about the document that I should hand over to the project during the work.

funderburkjim commented 3 years ago

In reviewing the comment history, I must admit I don't have a clear idea of the objectives.

This was first brought up by Marcis, so he and @Amygdalus should develop the requirements and objectives. Then maybe Dhaval and/or I can help.

Even more ideal would be if @Amygdalus can do programming. It's always better if the person with the primary interest does most of the work. I would prefer in this instance at this time to advise rather than to do all the coding.

gasyoun commented 3 years ago

@Amygdalus let me ask if you have a picture in mind of what is required and what can actually be done by you?

Amygdalus commented 3 years ago

Dear @funderburkjim ! I can't do programming - I have another specialisation. And this project is only one of many others things I have to do - it's not my primary interest.

As far as I understand, the question of how these renewed names should appear in the dictionary should be decided first. But I don't depend on your decisions: I can do my work by adding a new column of new names next to the species already mentioned in MW, in my Excel file. Then I can hand this work over to those who are responsible for the dictionary, Dhaval or you.

If my work is not applicable to the dictionary (or nobody knows how to include it in MW), I can publish it openly as a free reference book. Such work is not the work of one person (I mean the botanical side of the work). For example, I should consult specialists in regional flora. So if we decide to do this work, I'll invite other people interested in it and I'll consult with botanists.

@drdhaval2785 I can't check two lists (plants and animals) only by having a look on them. I should read each line. So only during the work I can find all mistakes.

funderburkjim commented 3 years ago

@Amygdalus Thanks for your reply.

Suggest you provide CSV form of your Excel file, as it will be easier for further use in non-excel workflows. (CSV = comma-separated-values. Sometimes, if the values have commas, it is better to use TSV = tab-separated-values. Probably excel has a way to export a table into this form )
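For the tab-separated option, the standard `csv` module can be used on the receiving side as well. A small sketch; the column names and the sample row are only illustrative, and `to_tsv` is a hypothetical helper.

```python
import csv
import io

def to_tsv(rows):
    """Serialise rows as tab-separated text, so commas in values need no quoting."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

tsv_text = to_tsv([
    ["key1", "L", "name"],
    ["akarA", "168", "Phyllanthus Emblica"],
])
```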

You can upload such a text file to this issue simply by dragging it into a comment.

Also, suggest you send a very early form of the table you have in mind, with only a few rows. That way, we can understand what is involved and have something tangible to comment on.

Do you need something from us to get started?

gasyoun commented 3 years ago

Probably excel has a way to export a table into this form

Exactly.

Amygdalus commented 3 years ago

@funderburkjim I need nothing. I have an Excel file with data, and after 15 january I'll organize my time with this activity in my timetable.

funderburkjim commented 3 years ago

@Amygdalus ok. Sounds good.

Let me mention one thought that may (or may not) be something related to what you are doing.

It is likely that there are some errors in our <bot> and <bio> tagging of mw. This could be:

When you notice errors in mw (such as in that allbot1.txt file mentioned in a comment above), we would like to correct the errors; so make some kind of list of such errors as you notice them. Again, just a suggestion.

Amygdalus commented 3 years ago

@funderburkjim yes, Jim, of course.

gasyoun commented 3 years ago

@funderburkjim list of a few hundred possible errors by @artanat in digital copy of MW

barr

By a fuzzy-search similar algorithm.

Did you mean.xlsx

Amygdalus commented 3 years ago

About updating the names of plants: I am now waiting for some books in Sanskrit from India for my work, because after some spade-work it became obvious that the task is more complicated than expected. So I need some original sources to go further. As for the list of possible errors above: there are some mistakes in the corrections, and the list should be checked manually.

gasyoun commented 3 years ago

some books from India

Would love to know the list, because most books are already scanned.

more complicated than it was expected

Bet it is.

I need some original sources

Please let me know the list. I'll see what I can do, to move things forward.

some mistakes in correction and the list should be checked manually.

Sure, but it's the first thing to start with, to weed out the big dirt. @funderburkjim most of them are OCR errors.

funderburkjim commented 3 years ago

@Amygdalus Love your 'didyoumean' file!

First observation, re "wrightia antidysenteria" -> "wrightia antidysenterica" from the didyoumean file.

Wrightia antidysenteria occurs 4 times in mw.txt while wrightia antidysenterica occurs 43 times.

And the first instance of Wrightia antidysenteria is under hw=kaliNga.

Conclusion is that antidysenteria should be changed to antidysenterica AND that it is a print change. This conclusion seems likely correct in this example due to the 43/4 instance counts.

But, to be pedantic, @Amygdalus What method did you use to determine your suggested revisions?
For example, do you assert that there are NOT 2 different plants with similar names in this case?

If we can be highly certain that the first column is wrong and the 2nd column is right, that will simplify our correction task.
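The instance counts used as evidence above can be reproduced with a whole-phrase, case-insensitive count. A sketch only; `count_variants` is a hypothetical name and the sample text is made up rather than taken from mw.txt.

```python
import re

def count_variants(text, variants):
    """Count case-insensitive whole-phrase occurrences of each spelling variant."""
    return {
        v: len(re.findall(r"\b" + re.escape(v) + r"\b", text, flags=re.IGNORECASE))
        for v in variants
    }

# Made-up sample text for illustration.
sample_text = "Wrightia Antidysenterica ; Wrightia antidysenteria ; wrightia antidysenterica"
counts = count_variants(
    sample_text, ["wrightia antidysenteria", "wrightia antidysenterica"]
)
```

A lopsided count only suggests which spelling is the error; it cannot rule out two distinct plants with similar names, which is the question raised above.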

funderburkjim commented 3 years ago

split 'bot' markup

Noticed that there are 74 cases like: <bot>Terminalia Alata</bot> <bot>Tomentosa</bot> I can think of no good reason for having 2 contiguous bot elements. i.e. Should do regex replacement </bot> *<bot> -> ' ' (one space). e.g. <bot>Terminalia Alata Tomentosa</bot>
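The proposed replacement, written out as a short Python sketch (the function name `merge_contiguous_bot` is hypothetical):

```python
import re

def merge_contiguous_bot(text):
    """Replace '</bot> *<bot>' with one space, joining contiguous bot elements."""
    return re.sub(r"</bot> *<bot>", " ", text)

merged = merge_contiguous_bot("<bot>Terminalia Alata</bot> <bot>Tomentosa</bot>")
```

Because the pattern allows only spaces between the tags, elements separated by other text, such as "or", are left as they are.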

Amygdalus commented 3 years ago

Sorry, but it's not my file. Mārcis wrote:

@funderburkjim list of a few hundred possible errors by @artanat in digital copy of MW

I just wrote that all this columns should be checked manually.

funderburkjim commented 3 years ago

@Amygdalus Sorry, misunderstood.

@gasyoun / @artanat Can you indicate your method? So we have better idea of what we need to check further before making corrections?

gasyoun commented 3 years ago

Can you indicate your method

Fuzzy search in Excel. But initially we parsed 3 websites with plant names, and when a search did not return an entry, the website itself proposed a result, based on its own fuzzy search. Similar to the o_vs_O method. The shorter entries are not as obvious (1 word vs. 2-3 word entries); the longer ones are rather obvious and ready to be implemented, without checking against books other than the original MW scan.
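For readers without the Excel workflow, a comparable "did you mean" lookup can be sketched with Python's standard `difflib`. This is a swapped-in technique, not the method described above; the reference list, the cutoff of 0.9 and the name `suggest` are all assumptions.

```python
import difflib

def suggest(name, reference_names, cutoff=0.9):
    """Return the closest reference name above the cutoff, or None."""
    matches = difflib.get_close_matches(name, reference_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical reference list; a real one would come from a taxonomy source.
reference = ["Wrightia antidysenterica", "Areca catechu"]
suggestion = suggest("Wrightia antidysenteria", reference)
```

As noted in the thread, short one-word names give less reliable suggestions than two- or three-word names, so a high cutoff and manual review remain advisable.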

funderburkjim commented 3 years ago

OK. Helpful comments.

funderburkjim commented 3 years ago

Made change to MW re </bot> <bot> to space. See csl-orig commit above.

funderburkjim commented 3 years ago

Marcis points out an oddity from the commit:

<bot>Areca Faufel or Catechu</bot>

The question is whether the "or" cases should be united.

Agree it is odd. Not sure whether it should be changed, such as to <bot>Areca Faufel</bot> or <bot>Catechu</bot>

gasyoun commented 3 years ago

Agree it is odd. Not sure whether it should be changed, such as to

For our purpose it should not be: <bot>Areca Faufel</bot> or <bot>Catechu</bot>

but

<bot>Areca Faufel</bot> or <bot>Areca Catechu</bot>

or, not to change the print, like

<bot tag="Areca Faufel">Areca Faufel</bot> or <bot tag="Areca Catechu">Catechu</bot>
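The normalized names for such "or" cases could be derived along these lines. This is a heuristic sketch: it assumes a one-word alternative is an epithet that shares the genus of the first name, and `expand_alternatives` is a hypothetical helper.

```python
def expand_alternatives(bot_text):
    """Expand 'Genus A or B' into full names, e.g. for a normalized attribute."""
    first, *rest = bot_text.split(" or ")
    genus = first.split()[0]
    names = [first]
    for alt in rest:
        # A one-word alternative is assumed to be an epithet of the same genus.
        names.append(alt if " " in alt else f"{genus} {alt}")
    return names

full_names = expand_alternatives("Areca Faufel or Catechu")
```

The heuristic does not cover cases like "Guilandina or Hyperanthera Moringa", where the shared part is the epithet rather than the genus, so those would need a separate rule or manual handling.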

Andhrabharati commented 3 months ago

It's a long road and I would want to start walking it, because it will take years...

@gasyoun / @funderburkjim I fail to see any reason why it should take so long. [My opinion is it's a matter of just few days, if a right person with right mindset acts upon it.]

Anyway, I think this issue may be safely closed now, as the subject matter would be dealt at #74 in future.

funderburkjim commented 3 months ago

Agree that this issue can be closed -- The content above needs to be consulted when we tackle #74.