Closed LinguList closed 4 years ago
What's the difference between the basictypes and what can be achieved with array?
Array is not suitable. a) you cannot write easily, so you'd need a wrapper for writing anyway, as it does not define the customised `__str__`. b) complex datatypes, like `lists`, which are the core of partial cognate detection etc., are not possible.
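Both points can be checked with the standard library `array` module. This is a minimal sketch; the variable names are only illustrative.

```python
from array import array

ints = array("i", [1, 2, 3])

# a) str() returns the constructor-style representation,
#    not the space-separated form used in text files.
array_text = str(ints)
joined_text = " ".join(str(x) for x in ints)  # manual wrapper still needed

# b) Nested sequences are rejected: typecode "i" accepts integers only.
try:
    array("i", [[1, 2], [3]])
    nested_ok = True
except TypeError:
    nested_ok = False

print(array_text)   # array('i', [1, 2, 3])
print(joined_text)  # 1 2 3
print(nested_ok)    # False
```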
Ok, I see.
I just saw that the basictypes is badly documented.
I thought I had done a rather thoughtful documentation of the package inside lingpy, but well. They have 100% test coverage, so far, I think.
Ok, putting something like "ListOfFixedType" into `linse` makes sense, I guess. But are you aware of https://treyhunner.com/2019/04/why-you-shouldnt-inherit-from-list-and-dict-in-python/#Lists_and_sets_have_the_same_problem ?
Your implementation would suffer from the issues explained in this article. Inheriting from `collections.UserList` otoh would mean that `isinstance(x, list) == False`.
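To make the trade-off concrete, here is a small sketch. `CastingList` and `UserBased` are hypothetical classes written for this illustration, not the lingpy implementation.

```python
from collections import UserList


class CastingList(list):
    """Hypothetical list subclass that casts to int on append only."""

    def append(self, item):
        super().append(int(item))


class UserBased(UserList):
    """Hypothetical UserList subclass, only used for the isinstance check."""


a = CastingList()
a.append("1")    # goes through the override, stored as the int 1
a.extend(["2"])  # list.extend does not call append, so "2" stays a str

b = UserBased([1, 2])

print(a)                    # [1, '2']
print(isinstance(a, list))  # True
print(isinstance(b, list))  # False
```

A type that overrides only some of the mutating methods therefore has to decide which entry points it guarantees.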
Yes, but from practice in working with alignments, and the new algorithms, partial cognates, and the like, I can tell that a specific datatype is extremely important. The alternative is extremely tedious to program, and most checks when dealing with sequences should allow for iterables in general anyway.
I did a bit more reading and it seems that your implementation should be fairly safe, considering that it only interferes with the append-type operations.
I'd call the base class `Array`, though, I guess.
Good to hear. What I experienced problems with was direct filewriting, since I had to convert to string directly (I assume there's a `__XYZ__` method in Python that applies something like `__str__` upon formatting, which lacks implementation).
Completely fine with me.
@LinguList What do you mean with filewriting? In JSON objects? Or when writing csv files?
Simply writing something to file. I mean: one reason I wanted to have the "ints" is that I can then print them out nicely when debugging, or working interactively, but another reason was: when opening a file and writing the data there, I don't have to write `' '.join([str(x) for x in arraywithints])`. Now I can write `str(arraywithints)` (etc.), but even better would be to just pass the `arraywithints` directly. Or is this its "repr" function?
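For reference, the hook Python calls on formatting is `__format__`. Its default implementation falls back to `str()` when the format spec is empty, so a class that defines only `__str__` already works in f-strings and with `print()`. A file's `write()` method accepts only `str`, though. The sketch below uses a hypothetical `Ints` class and an in-memory buffer in place of a real file.

```python
import io


class Ints(list):
    """Hypothetical stand-in for a basictypes-style integer sequence."""

    def __str__(self):
        return " ".join(str(x) for x in self)


arraywithints = Ints([1, 2, 3])

buffer = io.StringIO()                      # stands in for an open text file
print(arraywithints, file=buffer)           # print() calls str()
buffer.write(f"{arraywithints}\n")          # f-strings call format(), which falls back to str()
buffer.write("{}\n".format(arraywithints))  # str.format() behaves the same way

try:
    buffer.write(arraywithints)             # write() itself accepts only str
except TypeError:
    direct_write_failed = True
else:
    direct_write_failed = False

print(repr(arraywithints))                  # [1, 2, 3], repr is still the list one
```

So passing the object directly works through `print(..., file=handle)`, while `handle.write()` still needs an explicit `str()`.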
@LinguList and one more question: Wouldn't it be useful, if these types had a `strict` mode, where instead of casting we'd do `assert isinstance(x, self._type)`?
Yes, this sounds very useful.
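A strict mode along these lines could look like the sketch below. `TypedList` and its `strict` flag are hypothetical names for this illustration; the sketch raises `TypeError` rather than using `assert`, since assertions are skipped when Python runs with `-O`.

```python
class TypedList(list):
    """Hypothetical fixed-type list; not the lingpy or linse implementation."""

    _type = int

    def __init__(self, iterable=(), strict=False):
        self._strict = strict
        super().__init__(self._check(x) for x in iterable)

    def _check(self, item):
        if self._strict:
            # Strict mode: refuse anything that is not already of the right type.
            if not isinstance(item, self._type):
                raise TypeError(
                    f"expected {self._type.__name__}, got {type(item).__name__}"
                )
            return item
        # Default mode: cast the item to the target type.
        return self._type(item)

    def append(self, item):
        super().append(self._check(item))


lenient = TypedList(["1", 2])            # cast to [1, 2]
strict = TypedList([1, 2], strict=True)  # accepted unchanged
try:
    strict.append("3")
except TypeError:
    rejected = True
else:
    rejected = False
```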
Oh, and another thing I find weird is the implicit whitespace splitting. I think it's very unintuitive if a class called `strings` cannot contain arbitrary strings - but only strings without whitespace. It would be ok, I guess, if that class was called `words`.
Yes, you are right. This is much more transparent.
I'd also leave the `lists` class in `lingpy`. This seems a bit too specific and somewhat opaque.
Lists is also the class that causes the most problems in implementation, and its major purpose is to allow for partial cognate detection and the like, so one can argue that this is too specific, unless we find that there are many more similar use cases.
However, the datatype of a two-level segmentation is effectively a result of sequence manipulations: you read in a normal `words` instance and then infer where the major syllable peaks are, due to the prosody of the word, and receive a `complexword` instance (one with an additional layer of segmentation, like `lists`).
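The two levels can be pictured as a list of segment lists. The sketch below is only an illustration: `Words` and `ComplexWord` are hypothetical classes, the ` + ` separator is an assumption, and the boundaries are given in the input instead of being inferred from prosody.

```python
class Words(list):
    """Hypothetical flat sequence of segments, stored whitespace-separated."""

    def __init__(self, text_or_items=()):
        if isinstance(text_or_items, str):
            text_or_items = text_or_items.split()
        super().__init__(text_or_items)

    def __str__(self):
        return " ".join(self)


class ComplexWord(list):
    """Hypothetical two-level sequence: a list of Words joined by ' + '."""

    def __init__(self, text):
        super().__init__(Words(part) for part in text.split("+"))

    def __str__(self):
        return " + ".join(str(part) for part in self)


word = ComplexWord("t a + k e")
print(len(word))  # 2
print(word[1])    # ['k', 'e']
print(str(word))  # t a + k e
```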
But we can just see where this all leads, when starting to port some of the functions.
Yes, I started porting `ipa2tokens` and `asjp2tokens` - will push later today - and already think that this is a good way forward. It's a lot simpler to have these rather lengthy and somewhat complex functions in a separate place, where it's also a lot easier to add tests.
Yes, I agree. The current state of "sequence.sound_classes" only proves it was time to re-think these operations, and also decide e.g., which ones are actually needed, etc...
Another question: To move `tokens2class` we'd also need (parts of) the sound class models. And I think this wouldn't be a bad idea. What I would do though is simplify the loading of the converter. I'm fairly certain that recreating it from a text file isn't much slower than reading the pickle file from disk. So for the first step, the `Model` class would not have scorer and matrix, and thus would not need cache, pickling, etc. We can figure out later, if a `lingpy.Model` will simply inherit from `linse.Model`, adding these, or whether that goes into `linse`, too.
Yes. I was already thinking about this, and I agree. We can also add more precise checks here.
I am wondering how one could best represent the sound class converter files. I have been reluctant to add more data there, as they are quite inconvenient to edit. On the other hand, turning the structure around, like having sound-in-data in one column, and the conversion to sound classes in another column, always seemed tedious and difficult to debug for me. Yet now I think one could even spare the checks, if we make a big table for all major sound class systems at once?
So I mean:
sound | sca | asjp | dolgo | color |
---|---|---|---|---|
s | S | s | S | #ffffff |
ts | C | ts | K | #hhhhhh |
etc.
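Reading such a table into one converter per system takes only a few lines. This is a sketch under assumptions: the tab-separated layout and the function name `load_converters` are hypothetical, the two rows are the ones from the example above, and the color column is left out.

```python
import csv
import io

# Hypothetical table contents, following the column layout proposed above.
TABLE = """sound\tsca\tasjp\tdolgo
s\tS\ts\tS
ts\tC\tts\tK
"""


def load_converters(handle):
    """Return {system: {sound: sound_class}} from a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    systems = [name for name in reader.fieldnames if name != "sound"]
    converters = {system: {} for system in systems}
    for row in reader:
        for system in systems:
            converters[system][row["sound"]] = row[system]
    return converters


converters = load_converters(io.StringIO(TABLE))
print(converters["sca"]["ts"])  # C
print(converters["dolgo"])      # {'s': 'S', 'ts': 'K'}
```

Since every row carries a value for every system, a missing cell shows up as an empty string in one place instead of requiring a cross-file check.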
We are getting weird dependencies here, though, right? Ideally, this table should come from CLTS, I guess.
I also thought about this. BUT clts sound classes were produced with lingpy... they are not curated in the same way, and we need much fewer here...
Ok, but then I'd leave at least the directory structure as is. It makes for a better API, I think, if a model is basically defined by a directory. We might change the format of the `converter`, though, if you find something else more suitable.
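A directory-defined model without scorer, matrix, or pickling could be as small as the sketch below. Everything here is an assumption made for illustration: the `Model` class shown is not the lingpy or linse one, and the `CLASS : sound, sound` line format of the `converter` file is hypothetical.

```python
import tempfile
from pathlib import Path


class Model:
    """Hypothetical directory-backed sound class model (converter only)."""

    def __init__(self, directory):
        directory = Path(directory)
        self.name = directory.name
        self.converter = {}
        text = (directory / "converter").read_text(encoding="utf-8")
        # Assumed format: one "CLASS : sound, sound, ..." entry per line.
        for line in text.splitlines():
            if not line.strip():
                continue
            sound_class, sounds = line.split(":", 1)
            for sound in sounds.split(","):
                self.converter[sound.strip()] = sound_class.strip()

    def __getitem__(self, sound):
        return self.converter[sound]


# Build a throwaway model directory to show the round trip.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "sca"
    model_dir.mkdir()
    (model_dir / "converter").write_text("S : s\nC : ts\n", encoding="utf-8")
    model = Model(model_dir)

print(model.name)   # sca
print(model["ts"])  # C
```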
I also just looked at the size of scorer and matrix, and it doesn't seem super big. So if we add these, I'd actually add them to the package, and just make recreation of these a step in the release procedure.
Yes, I agree. We can also leave the old format. We have checks that make sure that ALL of those characters in there re-appear in all major sound classes in lingpy.
The advantage is also that custom sound classes can still be used, if needed (although there's not much interest in this).
Just pushed the first code and I really like it; in particular the fact that function signatures are much more predictable than before, e.g.
- `linse.segment` take a "word" as argument and return a sequence,
- `linse.annotate` take a sequence of tokens as argument and return a sequence of the same length, typed via `linse.listtypes`.

Everything seems just a lot more orderly, and all the messy handling of multi-word input, casting between list and string, etc. can be handled by the wrapper in `lingpy`.
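The two contracts can be illustrated with toy stand-ins. `segment_word` and `annotate_tokens` are hypothetical functions, not the linse API, and the converter values are made up for the example.

```python
def segment_word(word):
    """Hypothetical segmenter: one token per character of the word."""
    return list(word)


def annotate_tokens(tokens, converter):
    """Hypothetical annotator: one label per token, '?' for unknown tokens."""
    return [converter.get(token, "?") for token in tokens]


tokens = segment_word("sat")
classes = annotate_tokens(tokens, {"s": "S", "a": "A", "t": "T"})

print(tokens)                       # ['s', 'a', 't']
print(classes)                      # ['S', 'A', 'T']
print(len(tokens) == len(classes))  # True
```

With contracts this narrow, a length check between input and output is enough to test any annotating function.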
Perfect, also the fact that this is so lightweight, and works with clldutils as single dependency.
@xrotwang, do you in principle agree with adding the "manipulate" for all those cases, where a sequence is converted to another sequence (different length) etc.? In that case, I'd like to use that to propose revised methods for segmenting words into syllables based on prosody. I think they can be done more efficiently than it was done in the past, and may come in handy for lexibank, in case we want to calculate prosodic patterns, and the like.
Ah, and the code for making an initial profile out of data in a CSV file: would it also be the place to put it here? In a profile module?
Yes to both, I think.
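For the profile part, a first draft could simply count the graphemes found in one column of the CSV file. This is a sketch under assumptions: the function name `initial_profile`, the column name `Form`, and the one-character-per-grapheme simplification are all hypothetical.

```python
import csv
import io
from collections import Counter


def initial_profile(handle, column):
    """Count single-character graphemes in one column of a CSV file."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts.update(ch for ch in row[column] if not ch.isspace())
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


DATA = "ID,Form\n1,tata\n2,at\n"
profile = initial_profile(io.StringIO(DATA), "Form")
print(profile)  # [('a', 3), ('t', 3)]
```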
Okay, profiles are in #5, and the manipulate module is already discussed in #3 ("transform" is maybe even a better term: `linse.transform.syllabify` ...)
It would be desirable to also transfer lingpy's basictypes structure to linse and use the time to also enhance it. The types ints, strings, and lists (and potentially alignments), or even a specific type for inter-linear-glossed text, which one might want to add, serve to store more complex sequences in sequential form, with a fixed datatype applying to each segment. Their `__str__` method converts them to the form in which they are usually stored in our text files.
`+` in addition to whitespace).
ints, floats, strings, and lists are already essential for lingpy's algorithms and mostly applied there.