lingpy / getcl

Simple command line tool to retrieve a concept list.
MIT License

Lookup-script finds additional mappings to Concepticon that getcl does not find #3

Closed FredericBlum closed 1 year ago

FredericBlum commented 2 years ago

Currently, my script ids_lookup finds more than 100 concepts that are not found via the getcl lookup.

You suggested that this might be because entries in Concepticon are not split. Do you mean the column spanish by this? That column does indeed get split in my script.

Further linked to the same issue at pysem

FredericBlum commented 1 year ago

I dug into the code, and it seems that the major difference is that the getcl lookup is based on the Description column, while my script takes Meaning from FormTable as input, which is already created during dictionary creation. In some dictionaries this is not a major issue, but it is for the toolbox dictionaries, which come with extensive descriptions of every sense. Code in getcl: https://github.com/lingpy/getcl/blob/main/src/getcl.py#L38-L46

Proposed change: add another argument that takes a column of the sense table as input, with Description as the default. My problem right now: I am not sure how to access the Gloss column when iterating with: for sense in ds.objects("SenseTable"):

Sense has no attribute Gloss, probably because it is custom added. I received the following str() outputs: print(str(sense)): print(str(sense.cldf)) : Namespace(id='SN001623', description='remar.', entryReference='LX000224')

This also exemplifies one of many possible problems: even though remar is part of the mapping, it cannot be found because of the trailing dot. In other cases, however, the description is simply extensive, so removing a dot is not sufficient. What do you think, @LinguList?
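The trailing-dot case can be illustrated with a small pre-processing step. This is only a sketch: `strip_final_punctuation` is a hypothetical helper, not part of getcl or pysem, and it deliberately leaves brackets and inner commas alone.

```python
def strip_final_punctuation(description: str) -> str:
    """Remove whitespace and sentence-final punctuation from a description.

    Hypothetical helper for illustration; brackets and inner commas are kept,
    since they may matter to the mapping step.
    """
    return description.strip().rstrip(".;:!?").rstrip()


print(strip_final_punctuation("remar."))
print(strip_final_punctuation("mano (del dio)"))
```

As noted above, this only helps for short descriptions such as remar.; a long description stays long after the dot is gone.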

LinguList commented 1 year ago

The mapping procedure is pretty advanced, specifically for English, which is based on pysem, with a method that scores up to 20 points for matches between a gloss in a source and a gloss in Concepticon (using all observed glosses in a sorted fashion for a given Concept Set in Concepticon).

Currently, yes, it seems that the description used in the sense table is actually a problem if it has long parts with commas. But then again, I think this is due to the potentially problematic compilation of the description itself. In my opinion, a sense is a part of a meaning, and a meaning is a unique description of the meaning in a dictionary. The senses taken together make up the meaning, and we map senses to Concepticon. Due to the CLDF specification, a counterintuitive aspect of senses is that they are UNIQUE with respect to one Entry, so they do not repeat. One could add an additional repeating sense as a property, but this is why we have identifiers like hand-1 and hand-2, to keep them unique.

In Description, we have the gloss of a sense. The description is NOT the original gloss of the meaning, but the gloss of the meaning split into parts using some splitter (comma, semicolon, etc.). So I do not see why one would need an additional gloss, especially since pysem can identify partial matches of a Concept Set, organized with a threshold. But we use ANY threshold here, so again I do not understand why some matches are made and others aren't.

I think we really need an example now, where we exercise the behavior of to_concepticon and see where it fails, given a Sense.description in a concrete dataset. Could you find such a case, @Tarotis?

LinguList commented 1 year ago

And maybe please double check in pysem the function to_concepticon, which should be sufficiently described in the library.

FredericBlum commented 1 year ago

I understood the Dictionary-guidelines the other way round: Meaning corresponds to the Gloss, and Description has the full-text description for the Gloss. See the mapping of Dictionaria: https://github.com/dictionaria/pydictionaria/blob/master/docs/md-json-properties.md#sense_map

I tested this specifically for the zariquieyisconahua dataset. I will create an issue there and mark the relevant code.

Note: This only explains the large differences for the two toolbox dictionaries. I have yet to look into the differences for the other cases.

FredericBlum commented 1 year ago

As I wasn't quite sure how to turn into the package within my venv, I decided to copy the functions locally.

getcl: https://github.com/pano-tacanan-history/zariquieyisconahua/blob/main/raw/check_getcl.py
to_concepticon2: https://github.com/pano-tacanan-history/zariquieyisconahua/blob/main/raw/to_concepticon2.py

It can be run with python raw/check_getcl.py. In the current version, it outputs the Description and, if found, a possible match.

Part of the output, which exemplifies the problems:

{'río, agua que corre.': [['666', 'RIVER', 'noun', 15]]}
{'remar.': []}
{'nombre genérico para designar a los peces.': []}

LinguList commented 1 year ago

> I understood the Dictionary-guidelines the other way round: Meaning corresponds to the Gloss, and Description has the full-text description for the Gloss. See the mapping of Dictionaria: https://github.com/dictionaria/pydictionaria/blob/master/docs/md-json-properties.md#sense_map

I just checked the Sanzhi dictionary, and to my surprise, the segmentation by ; in this dictionary is not represented by two senses, but by one, so "[negation suffix]; not" (https://dictionaria.clld.org/units/sanzhi-a-) is one sense. This shows that you are right that the description has the full gloss for a meaning, and that a sense (but we should discuss or consult @xrotwang @johenglisch here) is in some sense just what I'd call a "meaning": a full description of the meaning of a word, which is not primarily subdivided into smaller units (?). OR: it is just that this possible level of detail was not employed in the Sanzhi dictionary, because it was not consistently pursued in the original dictionary.

But, and this is important: in our examples with @martino-vic, we did split original meanings (in my sense) into senses, and the CLDF specification for a Dictionary allows me to have 3 senses which all link to the same entry. So it enables us to be strict with respect to the splitting of an original meaning description into smaller parts (it does not allow us to cluster senses by descriptions, but we can add a column that does this in theory).

LinguList commented 1 year ago

> 'nombre genérico para designar a los peces': Not mapped to FISH, possibly because of the extensive description provided by the original author

This is exactly the case where it is not getcl but the pysem mapping that does NOT allow for a mapping: the pysem mapping algorithm, in the function to_concepticon, ultimately goes for identity matches, although it first tries out different variants of the gloss string. One could modify the pysem function, but I am not sure that is desirable, since in the past many colleagues then started to accept all mappings, even the completely imperfect ones, like "a kind of a plant" to plant. So I consider this dangerous and would say that it should at least be flagged when an extensive gloss without commas or brackets as separators is mapped to one concept in this way.

Consider the following cases of how to_concepticon works:

from pysem import to_concepticon

In [3]: to_concepticon([{"gloss": "mano del dio", "pos": "noun"}], language="es", pos_ref="pos")
Out[3]: {'mano del dio': []}

In [4]: to_concepticon([{"gloss": "mano (del dio)", "pos": "noun"}], language="es", pos_ref="pos")
Out[4]: {'mano (del dio)': [['1277', 'HAND', 'noun', 12]]}

So I'd argue it is a feature that was deliberately chosen in order to avoid that we are flooded with problematic mappings.

LinguList commented 1 year ago

BUT, and this is important, as it shows that the issue is again not getcl but pysem's glossing algorithms: if you check the specification of to_concepticon, you can see that there is a splitter argument that allows you to define a character or a range of characters by which you want to split your text. Nothing prevents you from adding a space there!

In [25]: to_concepticon([{"gloss": "mano del dios"}], language="es", max_matches=10, splitter=",|;| | or ")
Out[25]: 
{'mano del dios': [['1277', 'HAND', 'noun', 15],
  ['3231', 'DEITY', 'noun', 15],
  ['1944', 'GOD', 'noun', 15]]}

And this behavior, if wished for, can of course be passed on to getcl, by allowing to define the more extensive parameters of the glossing algorithms in pysem.

LinguList commented 1 year ago

In [26]: to_concepticon([{"gloss": "nombre genérico para designar a los peces."}], language="es", max_matches=10, splitter=",|;| | or ")
Out[26]: {'nombre genérico para designar a los peces.': [['1405', 'NAME', 'noun', 15]]}

Well, you see, we don't map to fish, since it is plural here. But this depends on existing mappings in Spanish to Concepticon, which simply don't have "fish" in plural.

Since we don't have ambitions to use extra stemmers for every word in a gloss string and every language in Concepticon and pysem, I'd argue that any desire to extend upon this should be pursued on a language-specific basis in its own right, as a (potentially more accurate) alternative to pysem's functions. But one should think twice whether it is worth the pain.
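To make the cost of such a language-specific alternative concrete, here is a deliberately naive sketch of a Spanish singularizer. `naive_singular` is a hypothetical function, not something pysem or getcl provides, and its three suffix rules are an assumption made for illustration only.

```python
VOWELS = "aeiouáéíóú"


def naive_singular(word: str) -> str:
    """Very rough Spanish singularization, for illustration only."""
    w = word.lower()
    if w.endswith("ces") and len(w) > 3:
        return w[:-3] + "z"  # peces -> pez
    if w.endswith("es") and len(w) > 3 and w[-3] not in VOWELS:
        return w[:-2]  # animales -> animal
    if w.endswith("s") and len(w) > 2 and w[-2] in VOWELS:
        return w[:-1]  # hojas -> hoja
    return w


for form in ["peces", "hojas", "animales", "dios", "hombres"]:
    print(form, "->", naive_singular(form))
```

The rules recover pez from peces, but they also turn dios into dio and hombres into hombr, which is the kind of error that makes a hand-rolled stemmer questionable.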

LinguList commented 1 year ago

Last point: to access columns that are not in the CLDF standard during the iteration, just use object.data, which contains them. You can easily test this in an interactive Python console like ipython, where you type object. and press tab to see all the attributes and explore the object. I did not test this, but maybe just try:

for sense in ds.objects("SenseTable"):
    print(sense.data)
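Building on this, a minimal sketch of reading a configurable column from the sense table could look as follows. It assumes that each object returned by ds.objects("SenseTable") exposes .id and a dict-like .data, as described above; `sense_values` and `FakeDataset` are hypothetical names, and the Gloss value in the stand-in row is made up so that the example runs without a dataset on disk.

```python
from types import SimpleNamespace


def sense_values(dataset, column="Description"):
    """Yield (sense id, value) pairs for one column of the SenseTable."""
    for sense in dataset.objects("SenseTable"):
        # sense.data holds the raw row, so custom columns are reachable here.
        yield sense.id, sense.data.get(column)


class FakeDataset:
    """Stand-in for a CLDF dataset; the Gloss value is invented."""

    def objects(self, table):
        row = {"ID": "SN001623", "Description": "remar.", "Gloss": "remar"}
        return [SimpleNamespace(id=row["ID"], data=row)]


print(list(sense_values(FakeDataset(), column="Gloss")))
```

With a real dataset, the same function would be called with the dataset object instead of `FakeDataset()`.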
xrotwang commented 1 year ago

> I just checked the Sanzhi dictionary, and to my surprise, the segmentation by ; in this dictionary is not represented by two senses, but by one, so "[negation suffix]; not" (https://dictionaria.clld.org/units/sanzhi-a-) is one sense. This shows that you are right that the description has the full gloss for a meaning, and that a sense (but we should discuss or consult @xrotwang @johenglisch here) is in some sense just what I'd call a "meaning": a full description of the meaning of a word, which is not primarily subdivided into smaller units (?). OR: it is just that this possible level of detail was not employed in the Sanzhi dictionary, because it was not consistently pursued in the original dictionary.

I think the idea of senses in Dictionaria was what @LinguList describes: senses are somewhat fine-grained and related to Entries in a many-to-one relation. And to some extent, that's true. E.g. the Teop dictionary has (slightly) more senses than entries: https://github.com/dictionaria/teop/tree/master/cldf#table-entriescsv. You can find these by searching the "Meaning Description" column at https://dictionaria.clld.org/contributions/teop#twords for " ;" (note the leading space). So whenever authors explicitly marked multiple senses in toolbox (or wherever), using \sn markers, this would lead to multiple senses in the Dictionaria CLDF. But more often authors seem to have crammed multiple senses into one \sn marker, using ";" (no leading space) as separator.

FredericBlum commented 1 year ago

> This is exactly the case where it is not getcl but the pysem mapping that does NOT allow for a mapping: the pysem mapping algorithm, in the function to_concepticon, ultimately goes for identity matches, although it first tries out different variants of the gloss string. One could modify the pysem function, but I am not sure that is desirable, since in the past many colleagues then started to accept all mappings, even the completely imperfect ones, like "a kind of a plant" to plant. So I consider this dangerous and would say that it should at least be flagged when an extensive gloss without commas or brackets as separators is mapped to one concept in this way. So I'd argue it is a feature that was deliberately chosen in order to avoid that we are flooded with problematic mappings.

So, one option could be to add a keyword map_to_gloss (default: False) which takes the gloss rather than the description as input. For dictionaries like the two from Roberto, which come with extensive descriptions, the difference is immense: I found more than twice as many concepts by mapping with Gloss. Of course I had to remove some erroneous mappings like the plant case you described, but then again, I have to go through the mappings individually anyway. And in other cases, the problem is that there are five or more matches, where I had to decide on the best match manually.
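The manual review described here can be supported by sorting results before inspection. The sketch below assumes the result structure shown in the outputs earlier in this thread (a dict from gloss to a list of [id, concept gloss, part of speech, score] entries); `triage` is a hypothetical helper, not part of pysem.

```python
def triage(mappings):
    """Split mapping results into unmapped, unique, and ambiguous glosses."""
    report = {"unmapped": [], "unique": {}, "ambiguous": {}}
    for gloss, matches in mappings.items():
        if not matches:
            report["unmapped"].append(gloss)
        elif len(matches) == 1:
            report["unique"][gloss] = matches[0]
        else:
            # Several candidates: these need a manual decision.
            report["ambiguous"][gloss] = matches
    return report


# Results copied from the outputs quoted in this thread.
results = {
    "río, agua que corre.": [["666", "RIVER", "noun", 15]],
    "remar.": [],
    "mano del dios": [
        ["1277", "HAND", "noun", 15],
        ["3231", "DEITY", "noun", 15],
        ["1944", "GOD", "noun", 15],
    ],
}
report = triage(results)
print(report["unmapped"])
print(sorted(report["ambiguous"]))
```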

LinguList commented 1 year ago

The easier option is to set the splitter keyword to contain a space, as I said. You can then check if that is useful or not.

LinguList commented 1 year ago

So one would just pass the splitter as an argument, with a default value of ",|;| or |/", and you can then modify this to splitter=",|;| | or |/".
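A sketch of what such an option could look like on the command line, using argparse. The `--splitter` flag and the program name are hypothetical; getcl does not necessarily define them, and the parser below is not taken from its source.

```python
import argparse

DEFAULT_SPLITTER = ",|;| or |/"


def build_parser():
    """Build a toy parser with a hypothetical --splitter option."""
    parser = argparse.ArgumentParser(prog="getcl-sketch")
    parser.add_argument(
        "--splitter",
        default=DEFAULT_SPLITTER,
        help="separators used to split a gloss before mapping",
    )
    return parser


args = build_parser().parse_args(["--splitter", ",|;| | or |/"])
# The value would then be forwarded, e.g. to_concepticon(..., splitter=args.splitter).
print(repr(args.splitter))
```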

LinguList commented 1 year ago

Do you want to try and add this to the getcl command, @Tarotis?

FredericBlum commented 1 year ago

I fail to understand how this would enable getcl to map the description nombre genérico para designar a los peces to pez.

FredericBlum commented 1 year ago

While it would be interesting to add this splitter as an additional keyword and may solve other cases, I feel like it misses my original problem.

FredericBlum commented 1 year ago

Wouldn't this also cause a long description to provide me with multiple mappings? Imagine the following description, taken out of the Isconahua dictionary:

perezoso de aproximadamente 20 centímetros de largo. Emite un sonido parecido al del mono ardilla o fraile (Saimiri sciureus) y se alimenta de hojas. Tradicionalmente, los iskonawas no comían la carne de este animal, ya que tiene alma (ñusin) y produce enfermedades (cutipa) en quienes lo comen. Los mestizos lo llaman pelejo.

If I understood your idea correctly, this description would give mappings to at least the following: perezoso, largo, sonido, mono, ardilla, fraile, hojas, carne, animal, enfermedades, alma

I feel like this inflates the produced mappings. If we map to glosses, we will have some plants that are mapped as PLANT, but I feel like their number will be way lower than the multitude of mappings involved in the other proposal.
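The inflation can be made visible by splitting only the first sentence of the description above. This sketch assumes the splitter behaves like a regular-expression alternation passed to re.split; whether pysem applies it in exactly this way is an assumption, and `candidate_parts` is a hypothetical helper.

```python
import re

SPLITTER = ",|;| | or |/"
sentence = "perezoso de aproximadamente 20 centímetros de largo."


def candidate_parts(text, splitter=SPLITTER):
    """Split a description into the parts that would each be looked up."""
    return [part for part in re.split(splitter, text) if part]


parts = candidate_parts(sentence)
print(len(parts), parts)
```

Every part is a separate candidate for a lookup, so the full description would produce many more candidates than the single gloss does.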

LinguList commented 1 year ago

Okay, how do you get your GLOSSES in the first instance? Is that done automatically, or manually?

LinguList commented 1 year ago

Because, nothing prevents you from entering GLOSS as Description and then adding a Long_Description as another column with the original Description, right?

LinguList commented 1 year ago

This would solve the problem even without doing ANYTHING to the getcl code. All it would require is that you get the -- hopefully curated -- glosses from somewhere.

LinguList commented 1 year ago

Our misunderstanding, I think, is that I did not realize you have the glosses alongside the description. But what I still do not understand is how you obtained the glosses. If they are given with the data, renaming as I suggested is easier than changing the code, right?

FredericBlum commented 1 year ago

There are two separate lines in toolbox format, one for descriptions, one for glosses. So yes, the gloss comes directly out of the data. What speaks against renaming, however, is that the Dictionaria default maps the descriptions to the Description columns (see here again: https://github.com/dictionaria/pydictionaria/blob/master/docs/md-json-properties.md#sense_map). And as you pointed out already, the gloss is not equal to a description.

So renaming would involve fiddling with the Dictionaria settings. Then again, it's not a Dictionaria publication, so we might as well do this.

LinguList commented 1 year ago

I opt for renaming: this can be done in one line in CLDF, and we have already clarified that the idea of the Description is the Sense, which is how we used it with Viktor. So you add a long description and that's it.
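On the level of a single row, the renaming amounts to the following. This is a plain-Python sketch on a dict, not the CLDF or Dictionaria configuration itself; `promote_gloss` is a hypothetical helper, the column names Gloss and Long_Description follow the wording used in this thread, and the example row is made up.

```python
def promote_gloss(row, gloss_col="Gloss", desc_col="Description",
                  long_col="Long_Description"):
    """Return a copy of the row with the gloss stored as the description."""
    new_row = dict(row)
    # Keep the original full-text description under a new name.
    new_row[long_col] = new_row.pop(desc_col, "")
    # The short gloss becomes what the lookup reads as Description.
    new_row[desc_col] = new_row.pop(gloss_col, "")
    return new_row


row = {"ID": "SN001623", "Description": "remar.", "Gloss": "remar"}
print(promote_gloss(row))
```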

LinguList commented 1 year ago

I assume @johenglisch can help with the renaming, right?

FredericBlum commented 1 year ago

After talking with Johannes yesterday, I changed the mappings from toolbox to the newly proposed names. This should also close https://github.com/lingpy/pysem/issues/9