Open LinguList opened 8 years ago
In retrospect, I would now say that there are only two basic types of cells in a wordlist (and both need to be supported): string and list of strings. A list of strings is needed for segments (tokens) and partial cognate sets, and would be produced by the function lambda x: x.split(' '). Strings are used for all the rest.
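A minimal sketch of the two cell types, assuming a row is held as a plain dict. The column names and the list_columns set are illustrative assumptions, not part of any lingpy specification:

```python
# Hypothetical row; the column names are illustrative only.
row = {"doculect": "German", "concept": "hand",
       "tokens": "h a n t", "partial_ids": "1 2"}

split_cell = lambda x: x.split(' ')  # the conversion named above

# Columns the user declares as lists of strings; all others stay strings.
list_columns = {"tokens", "partial_ids"}

converted = {k: split_cell(v) if k in list_columns else v
             for k, v in row.items()}
```

Here converted["tokens"] is a list of four strings, while converted["concept"] is left untouched.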
While we could easily fix the behaviour by column name, we need to be aware that we still need some flexibility, as people may want to work with two different kinds of partial cognates (expert vs. machine) and also different kinds of phonetic segments. So the __init__ function should allow the user to specify the content of a column.
Integers are used in the current lingpy for various operations, but one can potentially ignore them. They are, however, useful for creating cognate identifiers, as one can make sure that they are unique by just incrementing a number. This is done in a couple of functions, e.g. wordlist.renumber('column'), which play an important role at the moment. The question is whether integers should be added as a third cell data type (with a corresponding list of integers for partial cognates), or whether similar behaviour could be achieved with strings. An advantage of integer cognate IDs is that one can easily add a new one which has not yet been used by just incrementing max(cogids). If there's a workaround for similar behaviour in strings, this would spare us from adding integers as a cell data format.
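One possible workaround, sketched under the assumption that string cognate IDs always consist of digits only. Both helper names are hypothetical and do not exist in lingpy:

```python
def next_int_id(cogids):
    """Return an unused integer ID by incrementing the current maximum."""
    return max(cogids, default=0) + 1


def next_str_id(cogids):
    """Same idea for string cells, assuming every ID is a string of digits."""
    return str(max((int(c) for c in cogids), default=0) + 1)
```

The string variant converts to integers only for the moment of comparison, so the cells themselves could stay strings. It would fail on non-numeric IDs, which is the price of this shortcut.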
@LinguList this strikes me as a good case for a lightweight class with inheritance e.g.:
class BaseCell(object):
    def __init__(self, value):
        self.value = value

    def __str__(self):
        return self.value

    def __bytes__(self):
        # encode (not decode) the text representation to get bytes
        return str(self).encode('utf8')


class StringCell(BaseCell):
    pass


class ListCell(BaseCell):
    def __init__(self, value):
        # split on commas and drop empty entries
        self.value = [_.strip() for _ in value.split(",") if len(_.strip())]

    def __str__(self):
        # __str__ must return a string, so join the list here
        return ", ".join(self.value)

    __repr__ = __str__
This would then allow people to subclass and make their own cell types (and an IntegerCell is easy to define).
Adding an __lt__ etc. would make it sortable (and a check for integer-icity could happen in there if wanted). So then sorting a wordlist is just a matter of using sorted().
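A minimal sketch of the sorting idea, using a standalone hypothetical SortableCell class rather than the hierarchy above. functools.total_ordering fills in the remaining comparison methods from __eq__ and __lt__:

```python
from functools import total_ordering


@total_ordering
class SortableCell(object):
    """Hypothetical cell type for illustration; not part of lingpy."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value == other.value

    def __lt__(self, other):
        return self.value < other.value

    def __repr__(self):
        return "SortableCell(%r)" % (self.value,)


cells = [SortableCell("hand"), SortableCell("arm"), SortableCell("foot")]
ordered = [c.value for c in sorted(cells)]
```

sorted() itself only needs __lt__; the decorator is there so that comparisons like >= also work.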
Alternatively, the data could be analysed row-wise e.g.:
DEFAULT_CELLS = {
    'ID': StringCell,
    'CONCEPT': StringCell,
    'COUNTERPART': StringCell,
    'IPA': StringCell,
    'DOCULECT': StringCell,
    'COGID': StringCell,
}
NECESSARY_CELLS = ['ID', 'CONCEPT', 'COUNTERPART']


class Record(object):
    def __init__(self, **values):
        for k, v in values.items():
            # find cell type and instantiate. Default to StringCell if we don't know
            cell = DEFAULT_CELLS.get(k, StringCell)(v)
            setattr(self, k, cell)
        for k in NECESSARY_CELLS:
            assert getattr(self, k, None) is not None, 'Record needs a %s' % k
This looks very convincing with the basic cell hierarchy. I wonder whether this might slow down processing, as @xrotwang has now confirmed that defaultdicts really slow down processing compared to plain lists (which is the reason why I originally used the strange dictionary-with-lists structure for lingpy). But if this is only used for initialization and updating, it might be the approach of choice. And I think if we agree on either the two (strings + list of strings) or the four (strings + list of strings, ints + list of ints) basic types, maybe also with respect to cldf, it should already be a great improvement over lingpy's current flexibility, which allows one to define everything and nothing.
premature optimisation is the root of all evil :)
Sure, the hierarchy here could be overkill. Perhaps losing the BaseCell/StringCell/ListCell distinction and just having a single object 'Cell' with logic in it could be enough, or just using python base objects e.g. string/list?
Thinking more about it though, this would be a natural place to have transcoders and validators e.g.:
class CLPACell(StringCell):
    def __init__(self, value):
        self._is_valid(value)  # raises CLPAValidationError on invalid SAMPA input
        self.value = self.to_clpa(value)
The costly steps are at the initialisation stage (anything else expensive can be cached or memoized). In terms of speed, wordlist and all sound class converters etc. just need to know how to get the value from a cell (mycell.value).
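A minimal sketch of the caching idea, assuming functools.lru_cache as the memoizer. The converter, its toy mapping and the CachedCell class are hypothetical stand-ins, not lingpy's sound class models:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def to_sound_classes(segments):
    """Hypothetical expensive conversion; the mapping is a toy stand-in."""
    toy_model = {"h": "H", "a": "A", "n": "N", "t": "T"}
    return tuple(toy_model.get(s, "?") for s in segments.split(' '))


class CachedCell(object):
    def __init__(self, value):
        # costly validation or transcoding would happen here, once
        self.value = value

    @property
    def classes(self):
        # computed on first access, then served from the cache
        return to_sound_classes(self.value)
```

Consumers still only read mycell.value; the derived data is cached per distinct value, so repeated cells in a wordlist share the result.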
Seems convincing to me. In the beginning, I was also thinking of having a "sequence" class, which I discarded at some point, but validators are surely needed, also for the CLPA enterprise, and currently, in lingpy, validation of normal sound classes is also not carried out in an expressively transparent way, so having one loose lingpy evaluation and one more strict clpa evaluation (the latter as an external function from pyclpa) would be definitely useful.
If we would really want to follow this path, i.e. build functionality into row objects or even cell objects, I'd use the attrs library. This would give us lightweight objects with validation, conversion and representation as dicts. But I'm not sure people would want to customize wordlists on the cell level, rather than provide completely new wordlist implementations.
So, the most important aspect is the two (str + list) or four (str + list, int + list) types. These are essential; without them, it won't work. Users need to be able to specify themselves which columns belong to which type when they load a wordlist, potentially assuming standard datatypes for column names such as "segments", "doculect", "concept", "cognate_set", etc.
I'm a bit confused regarding what kind of flexibility is required here. From the point of view of LingPy I guess the type of data in a cell is specified by the way the cell is used; or stating it the other way round: If a user calls a LingPy method specifying column tokens as the one to look for segmented data, then LingPy will run value.split() to get the tokens, right?
So the way I see it, we should not attach any behaviour to assumed standard column names. I think the current practice of passing the column name for specific kinds of data into functions is good, and default values for these names, like segments='segments', are ok, too. And we should also not bloat the wordlist spec by adding type information. What we should have is a single place where the conversion functions for the four types are defined, but it should be the responsibility of each method operating on wordlists to call these conversion functions.
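A minimal sketch of this design, assuming rows are plain dicts of strings. The CONVERTERS registry, its keys and the count_segments function are hypothetical names chosen for illustration:

```python
# One place for the four conversions discussed in this thread.
CONVERTERS = {
    "str": str,
    "strs": lambda x: x.split(' '),
    "int": int,
    "ints": lambda x: [int(i) for i in x.split(' ')],
}


def count_segments(rows, segments='segments'):
    """Return the number of segments per row.

    Expects the column named by `segments` to hold space-separated
    tokens; the conversion happens here, at the point of use.
    """
    return [len(CONVERTERS["strs"](row[segments])) for row in rows]


rows = [
    {"segments": "h a n t", "tokens2": "h a n d"},
    {"segments": "f u s", "tokens2": "f u: s"},
]
```

Calling count_segments(rows) uses the default column, while count_segments(rows, segments='tokens2') points the same method at an alternative segmentation, without any type information stored in the wordlist itself.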
Okay, but we have one problem with the diversity of column names due to different datatypes here. For cognate detection, we create cognate sets, be it the traditional "cogid" or the partial identifiers. This defaults to:
But this is not the only use case, as we need to allow the column name to be specified here: we may want to (and we do, and have done) run several analyses with different thresholds, in which these "references" (keyword "ref" in Alignments and LexStat) are usually flexibly modified, to allow for n different columns filled with alternative cognate sets.
Normally, we can live by having strings for these values, but for partial cognates, we cannot use strings, similarly for segment-like data, like
I don't really know how to best handle this. One could use a wildcard: all that ends in ID is integer, all that ends in IDS is [int(x) for x in y.split(' ')]. Any other idea how to handle this in a principled way which leaves us enough flexibility, especially to test cognate detection algorithms?
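The wildcard idea could be sketched as follows. The function name is hypothetical, and matching on upper-cased column names is an assumption:

```python
def convert_by_suffix(column, value):
    """Pick a conversion from the column name's suffix (the wildcard idea)."""
    name = column.upper()
    if name.endswith("IDS"):
        return [int(x) for x in value.split(' ')]
    if name.endswith("ID"):
        return int(value)
    return value
```

Note that such a rule would also catch the plain ID column of a row, and any column that happens to end in ID without holding integers, which illustrates why a name-based convention can be fragile.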
I don't follow. From my point of view it is solely the caller's responsibility to make sure the data in the wordlist cells matches the requirements of the algorithms it is specified for. So if I want to run an algorithm that expects a cell of list type, I have to make sure to pass a suitable column name. Basically, I think this is simply a documentation issue, i.e. all methods operating on wordlists must make it explicit which type of data they expect in which cells.
Ah, I see, this makes much more sense: so this would mean that the data is, e.g., in ANY format in the wordlist, but if you call it, constructing, for example, a partial etymology dictionary, the caller will assume that this is some specific list format, right?
I never looked at it from this side, but it makes the whole thing much, much easier, I think...
The callee, i.e. the called method makes the assumption, the caller makes the promise - but yes, I think we are now on the same page.
Yes, callee, I tend to confuse these things. Makes complete sense, and will probably incredibly clean up the current mess ;-)
My earlier thoughts and reports on functionality, which have not changed in large parts, are given here:
Generally, one should be able to trigger output like this:
Based on a file like this:
But now that I'm smarter than in the past, I would not make this a class attribute, as it has led to inconsistencies in the current lingpy. It is also not consistent, insofar as language and concept return one-dimensional lists so far.