LinguList closed this 9 months ago
I added new fixes now, specifically also to the `segment` function that segments by substrings. This function now also works for tuples and typed sequences. I consider this an advantage over the string-based approach, since we may want to use the method for morpheme segmentation as well.
The behavior for words now is that an empty word plus an empty word yields an empty word, but adding a non-empty word to an empty word yields the non-empty word, etc. We had unexpected behavior before I added these fixes.
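The intended addition semantics can be sketched with a toy class (hypothetical, not the actual linse implementation), in which the empty word behaves as an identity element for `+`:

```python
class Word:
    """Toy stand-in for a word as a list of morphemes."""

    def __init__(self, morphemes=None):
        self.morphemes = list(morphemes or [])

    def __add__(self, other):
        # Empty + empty -> empty; empty + non-empty -> the non-empty word.
        return Word(self.morphemes + other.morphemes)

    def __eq__(self, other):
        return self.morphemes == other.morphemes

    def __repr__(self):
        return "Word({!r})".format(self.morphemes)


empty = Word()
word = Word(["kat", "a"])
assert empty + empty == Word()   # empty plus empty yields an empty word
assert empty + word == word      # the empty word acts as an identity
assert word + empty == word
```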
@LinguList We should definitely go through the list of all `__*__` methods to check if we customized all relevant ones. Otherwise we might get something half-done with weird behaviour in edge cases.
> The behavior for words now is that an empty word plus an empty word yields an empty word, but adding a non-empty word to an empty word yields the non-empty word, etc. We had unexpected behavior before I added these fixes.
So the empty word is considered the "null" word? Do we also filter out null words when initializing a sequence?
Thinking about the null word feature, couldn't this be problematic? E.g. if you have things like IGT, with correspondences between words in two sequences. It would seem that distinguishing between "empty word" and `None` is important.
> Thinking about the null word feature, couldn't this be problematic? E.g. if you have things like IGT, with correspondences between words in two sequences. It would seem that distinguishing between "empty word" and `None` is important.
Or - as with IGT - we could replace (or just represent) empty words with ∅?
@xrotwang, the null-word behaviour was something I detected when using the Word class and the Morpheme class for concrete tasks. Judging from these tasks, as well as from the slight modification to the `convert` method that now allows using Morphemes (not Words, since their `extend` method differs) instead of strings, the null word is something useful.
How to initialize a null-word is another question.
I am not sure I understand your point regarding empty word and `None`: do you mean we should have a specific word that is explicitly marked as an empty word and distinguished from constructs like the following?

```python
Word('')

w = Word("a")
w[:0]
```

Because these two, specifically the slicing behavior, are important for the extended code on suffixes and conversion with conversion tables.
> @LinguList We should definitely go through the list of all `__*__` methods to check if we customized all relevant ones. Otherwise we might get something half-done with weird behaviour in edge cases.
The current fixes already deal with weird edge cases. I do not consider this final. We really must check the behaviour very carefully.
On second thought, empty words do not make sense, I guess. If we accept sequences of words separated by whitespace, there's really no place where an empty string may appear. So if a sequence of words is initialized from a list of already split words, I'd consider empty words an input error. Whether silently dropping them is the right way to deal with this, I don't know.
Well, if we take the `str.split()` function here as the basis, it is exactly what you expect to get if you split an empty string:

```python
>>> "".split()
[]
```
And programmatically it makes sense, not necessarily for the `Word` class, but for the `Morpheme` class. And here, the empty Morpheme (or whatever you call it) is very useful, since we can start from there to derive all suffixes and even do conversions on the level of segmented strings. If we raise errors for these cases, we must add custom functions to handle them.
But right now, I can do the following:

```python
m = Morpheme("a b c d e f g")
segment(m, {Morpheme("a b"), Morpheme("c d e f g")})
```

And this is very useful.
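A hypothetical sketch of what such a substring-based `segment` function might look like for plain token lists (the greedy longest-match strategy and all names here are assumptions for illustration, not the actual linse code):

```python
def segment(sequence, chunks):
    """Greedily split `sequence` (a list of tokens) into members of `chunks`.

    `chunks` is an iterable of token sequences; longer chunks are tried
    first. Returns the list of matched chunks, or None if the sequence
    cannot be fully covered. Empty chunks are ignored to avoid looping.
    """
    chunks = sorted((list(c) for c in chunks if c), key=len, reverse=True)
    out, i = [], 0
    while i < len(sequence):
        for chunk in chunks:
            if list(sequence[i:i + len(chunk)]) == chunk:
                out.append(chunk)
                i += len(chunk)
                break
        else:
            return None  # no chunk matches at position i
    return out


tokens = "a b c d e f g".split()
parts = segment(tokens, {("a", "b"), ("c", "d", "e", "f", "g")})
assert parts == [["a", "b"], ["c", "d", "e", "f", "g"]]
```

Greedy matching is of course not a complete solution for all segmentations, but it illustrates why the operation is independent of whether the input is a string, tuple, or typed sequence.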
The Word class in any case has several inconsistencies still, so this really needs some more work.
```python
>>> w = Word("a +")
>>> w.morphemes
['a +']
```

Of course, we would rather like only `['a']` to be the morpheme here, not `a +`. But the `sep=" + "` prevents matching the final `+`.
```python
>>> w = Word("a +", sep="+")
>>> w.morphemes
['a', '']
>>> list(w)
['a', '+']
>>> w = Word("a+", sep="+")
>>> w.morphemes
['a', '']
>>> list(w)
['a+']
```
This does not make sense. An empty morpheme, on the contrary, makes sense to me (as does an empty typed sequence, since we are dealing with a list of elements of a specific type, and we also want the list itself to be hashable). We can also imagine an empty morpheme in linguistics, so this is also in line with some theories.
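The trailing empty morpheme in `['a', '']` above mirrors what plain `str.split` does with a trailing separator, which a quick check confirms:

```python
# A trailing separator produces a trailing empty string.
assert "a+".split("+") == ["a", ""]

# With sep=" + ", interior morpheme boundaries are matched...
assert "a + b".split(" + ") == ["a", "b"]

# ...but a final "+" with no following space is not a match.
assert "a +".split(" + ") == ["a +"]
```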
I think we are still struggling with this dual nature "string and list". We should probably specify behaviour of only the list (of words) thing. Reading and writing lists of words from/to strings is a separate issue.
Yes. Reflecting on what I am testing at the moment, and on what we would need when trying to provide enhanced functionality for the code in lingpy and lingrex, I can say that, first, the typed sequence itself is an extremely useful thing: it is less Morpheme and Word that matter than the fact that I can specify a list that contains elements of a specific type. This list can, of course, also be empty.
Even for the `Morpheme` I am not that sure now. The only reason one cannot derive it with `functools.partial` is that we want to check whether an item in the iterable contains a whitespace character, which would lead to strange behavior.
The only thing `Word` - as typed sequence of morphemes - adds onto `TypedSequence` is the separator `+`, i.e. a way to prepare the string input for its item class. But even just for that it may be worth it. At least one would have a good place to document "word behaviour". `Word = functools.partial(TypedSequence, ...)` doesn't have a docstring :)
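The trade-off can be illustrated with a toy `TypedSequence` (all names and behaviour here are made up for the sake of the example): a `functools.partial` works but offers no place for a docstring or for preparing the string input, while a thin subclass provides both:

```python
import functools


class TypedSequence(list):
    """Toy typed sequence: a list whose items are coerced to one type."""

    def __init__(self, type_, items):
        super().__init__(type_(i) for i in items)


# Variant 1: derived via functools.partial. It works, but there is no
# place for a class docstring or for turning raw strings into items.
WordPartial = functools.partial(TypedSequence, str)
assert list(WordPartial(["a", "b"])) == ["a", "b"]


# Variant 2: a thin subclass documents "word behaviour" and prepares
# the string input by splitting on the separator.
class Word(TypedSequence):
    """A word as a typed sequence of morphemes."""

    def __init__(self, string, sep="+"):
        super().__init__(str, string.split(sep))


assert list(Word("a+b")) == ["a", "b"]
assert Word.__doc__ == "A word as a typed sequence of morphemes."
```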
I'll refactor TypedSequence in a new PR
@xrotwang, I realized that a typedsequence should return a typedsequence when slicing it, the same holds for words and morphemes. I don't know if my solution is correct here, but it works in the major tests.
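One common way to make slicing return an instance of the subclass rather than a plain `list` is to override `__getitem__` and rewrap slice results; this is only a sketch of the general technique, not necessarily the solution used in the PR:

```python
class TypedSequence(list):
    """Toy typed sequence; slicing returns the same class, not list."""

    def __getitem__(self, index):
        result = super().__getitem__(index)
        if isinstance(index, slice):
            # list.__getitem__ returns a plain list for slices; rewrap it
            # so subclasses of TypedSequence also survive slicing.
            return self.__class__(result)
        return result


s = TypedSequence(["a", "b", "c"])
assert type(s[1:]) is TypedSequence
assert list(s[1:]) == ["b", "c"]
assert s[0] == "a"
```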
I also refined the lookup for sound classes and CLTS (actually, we can unify CLTS and sound classes now): we distinguish strict and non-strict lookup, and non-strict lookup iterates over all substrings in a segment (in reverse order of length) and returns the sound class or CLTS sound for the best-matching substring. For this operation, I added a new module, `subsequence`, which provides one more function from lingpy that I wanted to transfer to linse anyway (now called `suffixes`, before `get_all_ngrams`).
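The non-strict fallback described above can be sketched as follows; for simplicity this toy version checks only suffixes in decreasing length, and the helper names, the `strict` flag, and the example mapping are assumptions for illustration, not the actual linse API:

```python
def suffixes(sequence):
    """Yield all suffixes of `sequence`, longest first."""
    for i in range(len(sequence)):
        yield sequence[i:]


def lookup(segment, mapping, strict=False):
    """Look up `segment` in `mapping`.

    Strict mode requires an exact match; non-strict mode falls back to
    the longest matching substring (here: suffix), checked in reverse
    order of length.
    """
    if strict or segment in mapping:
        return mapping.get(segment)
    for candidate in suffixes(segment):
        if candidate in mapping:
            return mapping[candidate]
    return None


sounds = {"a": "V", "ts": "C"}
assert lookup("ts", sounds) == "C"
assert lookup("ʰts", sounds) == "C"          # falls back to the suffix "ts"
assert lookup("ʰts", sounds, strict=True) is None
```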