sequence types - Githubissues

LinguList commented 4 years ago

It would be desirable to also transfer lingpy's basictypes structure to linse and use the opportunity to enhance it. The types ints, strings, and lists (and potentially alignments, or even a specific type for interlinear-glossed text, which one might want to add) serve to store more complex sequences in sequential form, with a fixed datatype applying to each segment. Their __str__ method converts them to the form in which they are usually stored in our text files.

ints is a list of integers (used to represent partial cognate sets inside lingpy, and as part of this in additional operations)
floats is like ints, but with floats (used to represent prosodic weights, for example)
strings is a list of strings (like our Segments in cldf)
lists is basically a list of strings, but the list can be sliced in one more way, based on the secondary segmentation element (which is usually a + in addition to whitespace)
alignments (not really implemented) could be useful to represent alignments, since an alignment is nothing but an original sequence into which gaps have been inserted, so this would have a representation with and without gaps, sparing us the list comprehensions in many lines of code
text (or phrase) can be seen as yet another instance of a sequence, but this would be segmented on three levels: the sound segment level, the morpheme level, and the phrase level

ints, floats, strings, and lists are already essential for lingpy's algorithms and mostly applied there.
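
For illustration, here is a minimal sketch of the idea behind these types. It is not lingpy's actual basictypes code; the names TypedList, Ints and Floats are hypothetical. Each class casts its items to one fixed type, and __str__ gives the space-separated form used in text files.

```python
class TypedList(list):
    """Hypothetical base class: a list whose items are cast to one type."""

    _type = str  # the type applied to every item

    def __init__(self, iterable=()):
        # Accept either a whitespace-separated string or any iterable.
        if isinstance(iterable, str):
            iterable = iterable.split()
        super().__init__(self._type(x) for x in iterable)

    def __str__(self):
        # Serialise to the space-separated form used in text files.
        return " ".join(str(x) for x in self)


class Ints(TypedList):
    _type = int


class Floats(TypedList):
    _type = float


cogids = Ints("1 2 3")      # parsed from the text-file form
weights = Floats([0.5, 1])  # built from an ordinary list
```

With this sketch, str(cogids) gives back "1 2 3", while cogids still behaves like an ordinary list of integers.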

xrotwang commented 4 years ago

What's the difference between the basictypes and what can be achieved with array?

LinguList commented 4 years ago

Array is not suitable: a) you cannot easily write it out, so you'd need a wrapper for writing anyway, as it does not define the customised __str__; b) complex datatypes, like lists, which are the core of partial cognate detection etc., are not possible.

xrotwang commented 4 years ago

Ok, I see.

LinguList commented 4 years ago

I just saw that basictypes is badly documented.

LinguList commented 4 years ago

I thought I had done a rather thoughtful documentation of the package inside lingpy, but well. They have 100% test coverage so far, I think.

xrotwang commented 4 years ago

Ok, putting something like "ListOfFixedType" into linse makes sense, I guess. But are you aware of https://treyhunner.com/2019/04/why-you-shouldnt-inherit-from-list-and-dict-in-python/#Lists_and_sets_have_the_same_problem ?

Your implementation would suffer from the issues explained in this article. Inheriting from collections.UserList otoh would mean that isinstance(x, list) == False.
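
A small sketch of the two points raised here, using hypothetical classes CastingList and CastingUserList (neither is taken from lingpy): an append override on a list subclass is bypassed by extend, and a collections.UserList subclass is not an instance of list.

```python
from collections import UserList


class CastingList(list):
    """Hypothetical list subclass that casts items to int on append."""

    def append(self, item):
        super().append(int(item))


class CastingUserList(UserList):
    """The same idea, based on collections.UserList."""

    def append(self, item):
        super().append(int(item))


a = CastingList()
a.append("1")    # goes through the override and is stored as an int
a.extend(["2"])  # list.extend does not call the overridden append

b = CastingUserList()
b.append("1")

print(a)                    # the second item is still the string "2"
print(isinstance(a, list))  # True
print(isinstance(b, list))  # False
```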

LinguList commented 4 years ago

Yes, but from practice in working with alignments, the new algorithms, partial cognates, and the like, I can tell that a specific datatype is extremely important. The alternative is extremely tedious to program, and most checks when dealing with sequences should allow for iterables in general anyway.

xrotwang commented 4 years ago

I did a bit more reading and it seems that your implementation should be fairly safe, considering that it only interferes with the append-type operations.

xrotwang commented 4 years ago

I'd call the base class Array, though, I guess.

LinguList commented 4 years ago

Good to hear. Where I ran into problems was direct file writing, since I had to convert to a string explicitly (I assume there's an __XYZ__ method in Python that applies something like __str__ upon formatting, and which lacks an implementation).

LinguList commented 4 years ago

Completely fine with me.

xrotwang commented 4 years ago

@LinguList What do you mean with filewriting? In JSON objects? Or when writing csv files?

LinguList commented 4 years ago

Simply writing something to a file. I mean: one reason I wanted to have the "ints" is that I can then print them out nicely when debugging or working interactively, but another reason was: when opening a file and writing the data there, I don't have to write ' '.join([str(x) for x in arraywithints]). Now I can write str(arraywithints) (etc.), but even better would be to just pass arraywithints directly. Or is this its "repr" function?
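
As a sketch of how standard Python behaves here (Ints is a hypothetical stand-in, and io.StringIO stands in for an opened text file): print() and an empty format spec both fall back to __str__, repr() is a separate method, and write() itself only accepts a real str.

```python
import io


class Ints(list):
    """Hypothetical stand-in for the ints type discussed here."""

    def __str__(self):
        return " ".join(str(x) for x in self)


arraywithints = Ints([1, 2, 3])
out = io.StringIO()  # stands in for an opened text file

print(arraywithints, file=out)            # print() calls str()
out.write("{}\n".format(arraywithints))   # empty format spec falls back to str()
out.write(f"{arraywithints!r}\n")         # !r asks for repr(), the list-style form

try:
    out.write(arraywithints)              # write() only accepts str objects
except TypeError:
    out.write("write() needs an explicit str()\n")
```

So __repr__ is not what file writing uses; the object has to pass through str(), print() or string formatting before it reaches write().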

xrotwang commented 4 years ago

@LinguList and one more question: Wouldn't it be useful, if these types had a strict mode, where instead of casting we'd do

assert isinstance(x, self._type)

?
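
A possible shape for such a strict mode, as a sketch only (TypedList and its strict parameter are hypothetical names). It raises TypeError instead of using assert, since assert statements are skipped when Python runs with -O.

```python
class TypedList(list):
    """Hypothetical typed list with an optional strict mode."""

    _type = int

    def __init__(self, iterable=(), strict=False):
        self.strict = strict
        super().__init__(self._check(x) for x in iterable)

    def _check(self, x):
        if self.strict:
            # Strict mode: refuse anything not already of the right type.
            if not isinstance(x, self._type):
                raise TypeError(
                    f"expected {self._type.__name__}, got {type(x).__name__}"
                )
            return x
        # Default mode: cast the item to the target type.
        return self._type(x)

    def append(self, x):
        super().append(self._check(x))


cast = TypedList(["1", 2])             # "1" is cast to the int 1
checked = TypedList([1, 2], strict=True)
```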

LinguList commented 4 years ago

Yes, this sounds very useful.

xrotwang commented 4 years ago

Oh, and another thing I find weird is the implicit whitespace splitting. I think it's very unintuitive if a class called strings cannot contain arbitrary strings, but only strings without whitespace. It would be ok, I guess, if that class was called words.

LinguList commented 4 years ago

Yes, you are right. This is much more transparent.

xrotwang commented 4 years ago

I'd also leave the lists class in lingpy. This seems a bit too specific and somewhat opaque.

LinguList commented 4 years ago

Lists is also the class that causes the most problems in implementation, and its major purpose is to allow for partial cognate detection and the like, so one can argue that this is too specific, unless we find that there are many more similar use cases.

However, the datatype of a two-level segmentation is effectively a result of sequence manipulations: you read in a normal words instance and then infer where the major syllable peaks are, due to the prosody of the word, and receive a complexword instance (one with an additional layer of segmentation, like lists).

But we can just see where this all leads, when starting to port some of the functions.
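
The two-level segmentation mentioned above can be sketched as a plain function. This is an illustration only, not the lists class from lingpy; split_morphemes and its separator parameter are hypothetical names.

```python
def split_morphemes(segments, separator="+"):
    """Group a flat list of segments into morphemes at each separator."""
    morphemes = [[]]
    for segment in segments:
        if segment == separator:
            morphemes.append([])  # start the next morpheme
        else:
            morphemes[-1].append(segment)
    return morphemes


word = "t a k + t i k".split()   # first level: whitespace
morphemes = split_morphemes(word)  # second level: the + marker
flat = [segment for morpheme in morphemes for segment in morpheme]
```

Here morphemes holds one list per morpheme, and flat is the same word without the + markers.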

xrotwang commented 4 years ago

Yes, I started porting ipa2tokens and asjp2tokens - will push later today - and already think that this is a good way forward. It's a lot simpler to have these rather lengthy and somewhat complex functions in a separate place, where it's also a lot easier to add tests.

LinguList commented 4 years ago

Yes, I agree. The current state of "sequence.sound_classes" only proves it was time to re-think these operations, and also decide e.g., which ones are actually needed, etc...

xrotwang commented 4 years ago

Another question: To move tokens2class we'd also need (parts of) the sound class models. And I think this wouldn't be a bad idea. What I would do though is simplify the loading of the converter. I'm fairly certain that recreating it from a text file isn't much slower than reading the pickle file from disk. So for the first step, the Model class would not have scorer and matrix, and thus would not need cache, pickling, etc. We can figure out later, if a lingpy.Model will simply inherit from linse.Model, adding these, or whether that goes into linse, too.

LinguList commented 4 years ago

Yes. I was already thinking about this, and I agree. We can also add more precise checks here.

I am wondering how one could best represent the sound class converter files. I have been reluctant to add more data there, as they are quite inconvenient to edit. On the other hand, turning the structure around, like having the sounds in one column and the conversion to sound classes in another column, always seemed tedious and difficult to debug to me. Yet now I think one could even spare the checks, if we make one big table for all major sound class systems at once?

LinguList commented 4 years ago

So I mean:

sound	sca	asjp	dolgo	color
s	S	s	S	#ffffff
ts	C	ts	K	#hhhhhh

etc.
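
Reading such a table could look roughly like this. It is a sketch using only the standard csv module; load_converters is a hypothetical helper, and the sample repeats the rows above without the color column.

```python
import csv
import io

# Tab-separated sample mirroring the proposed layout.
TABLE = "sound\tsca\tasjp\tdolgo\ns\tS\ts\tS\nts\tC\tts\tK\n"


def load_converters(handle):
    """Return {model name: {sound: class}} from one tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    # Every column except "sound" is treated as one sound class model.
    models = {name: {} for name in reader.fieldnames if name != "sound"}
    for row in reader:
        for name, mapping in models.items():
            mapping[row["sound"]] = row[name]
    return models


converters = load_converters(io.StringIO(TABLE))
print(converters["sca"]["ts"])    # class of "ts" in the sca column
print(converters["dolgo"]["ts"])  # class of "ts" in the dolgo column
```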

xrotwang commented 4 years ago

We are getting weird dependencies here, though, right? Ideally, this table should come from CLTS, I guess.

LinguList commented 4 years ago

I also thought about this. BUT the clts sound classes were produced with lingpy... they are not curated in the same way, and we need far fewer here...

xrotwang commented 4 years ago

Ok, but then I'd leave at least the directory structure as is. It makes for a better API, I think, if a model is basically defined by a directory. We might change the format of the converter, though, if you find something else more suitable.

xrotwang commented 4 years ago

I also just looked at the size of scorer and matrix, and it doesn't seem super big. So if we add these, I'd actually add them to the package, and just make recreation of these a step in the release procedure.

LinguList commented 4 years ago

Yes, I agree. We can also leave the old format. We have checks that make sure that ALL of those characters in there re-appear in all major sound classes in lingpy.

The advantage is also that custom sound classes can still be used, if needed (although there's not much interest in this).

xrotwang commented 4 years ago

Just pushed the first code and I really like it; in particular the fact that function signatures are much more predictable than before, e.g.

all public functions in linse.segment take a "word" as argument and return a sequence,
all public functions in linse.annotate take a sequence of tokens as argument and return a sequence of the same length, typed via linse.listtypes.

Everything seems just a lot more orderly, and all the messy handling of multi-word input, casting between list and string, etc. can be handled by the wrapper in lingpy.
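
The signature convention described above can be illustrated with two toy functions. They are hypothetical and are not the functions in linse.segment or linse.annotate; the sketch only shows the shape: one word in and a sequence out, then a sequence in and a sequence of the same length out.

```python
from typing import List


def segment(word: str) -> List[str]:
    """Hypothetical segmenter: one word in, a sequence of tokens out."""
    return list(word)


def annotate(tokens: List[str]) -> List[str]:
    """Hypothetical annotator: one label per token, so lengths match."""
    return ["V" if token in "aeiou" else "C" for token in tokens]


tokens = segment("taki")
labels = annotate(tokens)
```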

LinguList commented 4 years ago

Perfect, also the fact that this is so lightweight, and works with clldutils as single dependency.

LinguList commented 4 years ago

@xrotwang, do you in principle agree with adding a "manipulate" module for all those cases where a sequence is converted to another sequence (of different length) etc.? In that case, I'd like to use that to propose revised methods for segmenting words into syllables based on prosody. I think they can be done more efficiently than in the past, and may come in handy for lexibank, in case we want to calculate prosodic patterns and the like.

Ah, and the code for making an initial profile out of data in a CSV file: would it also be the place to put it here? In a profile module?

xrotwang commented 4 years ago

Yes to both, I think.


LinguList commented 4 years ago

Okay, profiles are in #5, and the manipulate module is already discussed in #3 ("transform" is maybe even a better term: linse.transform.syllabify...)

lingpy / linse

sequence types #4