LinguList closed this 9 months ago
I added new fixes now, specifically also to the `segment` function that segments by substrings. This function now also works for tuples and typed sequences. I consider this an advantage over the string-based approach, since we may want to use the method for morpheme segmentation as well.
The behavior for words now is that an empty word plus an empty word yields an empty word, but adding a non-empty word to an empty word yields the non-empty word, etc. We had unexpected behavior before I added these fixes.
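The intended addition semantics can be sketched with a toy class (hypothetical, not the actual linse implementation), in which the empty word behaves as an identity element for `+`:

```python
class Word:
    """Toy stand-in for a word as a list of morphemes."""

    def __init__(self, morphemes=None):
        self.morphemes = list(morphemes or [])

    def __add__(self, other):
        # Empty + empty -> empty; empty + non-empty -> the non-empty word.
        return Word(self.morphemes + other.morphemes)

    def __eq__(self, other):
        return self.morphemes == other.morphemes

    def __repr__(self):
        return "Word({!r})".format(self.morphemes)


empty = Word()
word = Word(["kat", "a"])
assert empty + empty == Word()   # empty plus empty yields an empty word
assert empty + word == word      # the empty word acts as an identity
assert word + empty == word
```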
@LinguList We should definitely go through the list of all `__*__` methods to check if we customized all relevant ones. Otherwise we might get something half-done with weird behaviour in edge cases.
> The behavior for words now is that an empty word plus an empty word yields an empty word, but adding a non-empty word to an empty word yields the non-empty word, etc. We had unexpected behavior before I added these fixes.
So the empty word is considered the "null" word? Do we also filter out null words when initializing a sequence?
Thinking about the null word feature, couldn't this be problematic? E.g. if you have things like IGT, with correspondences between words in two sequences. It would seem that distinguishing between "empty word" and `None` is important.
> Thinking about the null word feature, couldn't this be problematic? E.g. if you have things like IGT, with correspondences between words in two sequences. It would seem that distinguishing between "empty word" and `None` is important.
Or - as with IGT - we could replace (or just represent) empty words with ∅?
@xrotwang, the null-word behaviour was something I detected when using the Word class and the Morpheme class for concrete tasks. Judging from these tasks, as well as from the slight modification to the `convert` method that now allows using Morphemes (not Words, since their `extend` method differs) instead of strings, the null word is something useful.
How to initialize a null-word is another question.
I am not sure I understand your point regarding empty word and `None`: do you mean we should have a specific word that is explicitly marked as an empty word and distinguished from constructs like the following?

```python
Word('')

w = Word("a")
w[:0]
```

Because these two, specifically the slicing behavior, are important for the extended code on suffixes and conversion with conversion tables.
> @LinguList We should definitely go through the list of all `__*__` methods to check if we customized all relevant ones. Otherwise we might get something half-done with weird behaviour in edge cases.
The current fixes already deal with weird edge cases. I do not consider this final. We really must check the behaviour very carefully.
On second thought, empty words do not make sense, I guess. If we accept sequences of words separated by whitespace, there's really no place where an empty string may appear. So if a sequence of words is initialized from a list of already split words, I'd consider empty words an input error. Whether silently dropping them is the right way to deal with this, I don't know.
Well, if we take the `str.split()` function here as the basis, it is exactly what you expect to get if you split an empty string:

```python
>>> "".split()
[]
```
And programmatically it makes sense, not necessarily for the `Word` class, but for the `Morpheme` class. And here, the empty Morpheme (or whatever you call it) is very useful, since we can start from there to derive all suffixes and even do conversions on the level of segmented strings. If we raise errors for these cases, we must add custom functions to handle them.
But right now, I can do the following:

```python
m = Morpheme("a b c d e f g")
segment(m, {Morpheme("a b"), Morpheme("c d e f g")})
```

And this is very useful.
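A hypothetical sketch of what such a substring-based `segment` function might look like for plain token lists (the greedy longest-match strategy and all names here are assumptions for illustration, not the actual linse code):

```python
def segment(sequence, chunks):
    """Greedily split `sequence` (a list of tokens) into members of `chunks`.

    `chunks` is an iterable of token sequences; longer chunks are tried
    first. Returns the list of matched chunks, or None if the sequence
    cannot be fully covered. Empty chunks are ignored to avoid looping.
    """
    chunks = sorted((list(c) for c in chunks if c), key=len, reverse=True)
    out, i = [], 0
    while i < len(sequence):
        for chunk in chunks:
            if list(sequence[i:i + len(chunk)]) == chunk:
                out.append(chunk)
                i += len(chunk)
                break
        else:
            return None  # no chunk matches at position i
    return out


tokens = "a b c d e f g".split()
parts = segment(tokens, {("a", "b"), ("c", "d", "e", "f", "g")})
assert parts == [["a", "b"], ["c", "d", "e", "f", "g"]]
```

Greedy matching is of course not a complete solution for all segmentations, but it illustrates why the operation is independent of whether the input is a string, tuple, or typed sequence.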
The Word class in any case has several inconsistencies still, so this really needs some more work.
```python
>>> w = Word("a +")
>>> w.morphemes
['a +']
```

Of course, we would rather like only `['a']` to be the morpheme here, not `a +`. But the `sep=" + "` prevents matching the final `+`.
```python
>>> w = Word("a +", sep="+")
>>> w.morphemes
['a', '']
>>> list(w)
['a', '+']
>>> w = Word("a+", sep="+")
>>> w.morphemes
['a', '']
>>> list(w)
['a+']
```
This does not make sense. An empty morpheme, on the contrary, makes sense to me (as does an empty typed sequence, since we are dealing with a list of elements of a specific type, and we also want the list itself to be hashable). We can also imagine an empty morpheme in linguistics, so this is also in line with some theories.
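The trailing empty morpheme in `['a', '']` above mirrors what plain `str.split` does with a trailing separator, which a quick check confirms:

```python
# A trailing separator produces a trailing empty string.
assert "a+".split("+") == ["a", ""]

# With sep=" + ", interior morpheme boundaries are matched...
assert "a + b".split(" + ") == ["a", "b"]

# ...but a final "+" with no following space is not a match.
assert "a +".split(" + ") == ["a +"]
```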
I think we are still struggling with this dual nature "string and list". We should probably specify behaviour of only the list (of words) thing. Reading and writing lists of words from/to strings is a separate issue.
Yes. Reflecting on what I am testing at the moment, and on what we would need when trying to provide enhanced functionality for the code in lingpy and lingrex, I can say that, first, the typed sequence itself is an extremely useful thing: it is less Morpheme and Word that matter than the fact that I can specify a list that contains elements of a specific type. This list can, of course, also be empty.
Even for the `Morpheme` I am not that sure now. The only reason one cannot derive it with `functools.partial` is that we want to check whether an item in the iterable contains a whitespace character, which would lead to strange behavior.
The only thing `Word` - as typed sequence of morphemes - adds onto `TypedSequence` is the separator `+`, i.e. a way to prepare the string input for its item class. But even just for that it may be worth it. At least one would have a good place to document "word behaviour". `Word = functools.partial(TypedSequence, ...)` doesn't have a docstring :)
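The trade-off can be illustrated with a toy `TypedSequence` (all names and behaviour here are made up for the sake of the example): a `functools.partial` works but offers no place for a docstring or for preparing the string input, while a thin subclass provides both:

```python
import functools


class TypedSequence(list):
    """Toy typed sequence: a list whose items are coerced to one type."""

    def __init__(self, type_, items):
        super().__init__(type_(i) for i in items)


# Variant 1: derived via functools.partial. It works, but there is no
# place for a class docstring or for turning raw strings into items.
WordPartial = functools.partial(TypedSequence, str)
assert list(WordPartial(["a", "b"])) == ["a", "b"]


# Variant 2: a thin subclass documents "word behaviour" and prepares
# the string input by splitting on the separator.
class Word(TypedSequence):
    """A word as a typed sequence of morphemes."""

    def __init__(self, string, sep="+"):
        super().__init__(str, string.split(sep))


assert list(Word("a+b")) == ["a", "b"]
assert Word.__doc__ == "A word as a typed sequence of morphemes."
```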
I'll refactor TypedSequence in a new PR
@xrotwang, I realized that a typedsequence should return a typedsequence when slicing it, the same holds for words and morphemes. I don't know if my solution is correct here, but it works in the major tests.
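One common way to make slicing return an instance of the subclass rather than a plain `list` is to override `__getitem__` and rewrap slice results; this is only a sketch of the general technique, not necessarily the solution used in the PR:

```python
class TypedSequence(list):
    """Toy typed sequence; slicing returns the same class, not list."""

    def __getitem__(self, index):
        result = super().__getitem__(index)
        if isinstance(index, slice):
            # list.__getitem__ returns a plain list for slices; rewrap it
            # so subclasses of TypedSequence also survive slicing.
            return self.__class__(result)
        return result


s = TypedSequence(["a", "b", "c"])
assert type(s[1:]) is TypedSequence
assert list(s[1:]) == ["b", "c"]
assert s[0] == "a"
```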
I also refined the lookup for sound classes and CLTS (actually, we can unify CLTS and sound classes now): we distinguish strict and non-strict lookup, and non-strict lookup iterates over all substrings in a segment (in reverse order of length) and returns the sound class or CLTS sound for the best-matching substring. For this operation, I added a new module, `subsequence`, which provides one more function from lingpy that I wanted to transfer to linse anyway (now called `suffixes`, before `get_all_ngrams`).
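The non-strict fallback described above can be sketched as follows; for simplicity this toy version checks only suffixes in decreasing length, and the helper names, the `strict` flag, and the example mapping are assumptions for illustration, not the actual linse API:

```python
def suffixes(sequence):
    """Yield all suffixes of `sequence`, longest first."""
    for i in range(len(sequence)):
        yield sequence[i:]


def lookup(segment, mapping, strict=False):
    """Look up `segment` in `mapping`.

    Strict mode requires an exact match; non-strict mode falls back to
    the longest matching substring (here: suffix), checked in reverse
    order of length.
    """
    if strict or segment in mapping:
        return mapping.get(segment)
    for candidate in suffixes(segment):
        if candidate in mapping:
            return mapping[candidate]
    return None


sounds = {"a": "V", "ts": "C"}
assert lookup("ts", sounds) == "C"
assert lookup("ʰts", sounds) == "C"          # falls back to the suffix "ts"
assert lookup("ʰts", sounds, strict=True) is None
```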