Closed LinguList closed 4 years ago
Looks good. While this would add complexity, it may be useful to add a segments property to Form to make it possible to use custom segmented data with the function. It feels a bit weird to always call segment.ipa in the function, when there are potentially many more segmenters out there. Or maybe the function should already expect segmented data - considering that most of its arguments are simply passed through to ipa?
Yes, good idea. I'll do this right away. And we can test by calling "list" as segments function.
BTW: I also modified the behavior of linse.segments.ipa to raise a ValueError when given an empty string or a string with whitespace at the beginning or the end.
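As a rough illustration of the validation described here, the following sketch wraps an arbitrary segmenter and rejects empty strings and strings with surrounding whitespace. The name strict_segment is hypothetical and not part of linse; the built-in list stands in as the segmenter, as suggested above for testing.

```python
def strict_segment(text, segmenter=list):
    """Hypothetical wrapper, not the linse implementation."""
    # Reject input that a segmenter should never see.
    if not text:
        raise ValueError("cannot segment an empty string")
    if text != text.strip():
        raise ValueError("leading or trailing whitespace in {!r}".format(text))
    return segmenter(text)


print(strict_segment("abc"))
```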
Ah, just saw I misunderstood this. I thought of passing a function argument to get_profile, so segmentation could be done with that function. Or is this in any way problematic?
so the call would be get_profile(*forms, segment=ipa)
Note that we also have specific error handling, which should be included in the output of get_profile, namely those ValueError cases, which we now have exclusively in the segment.ipa function. In ipa2tokens, we still have other errors, such as IndexError, e.g., when calling ipa2tokens('').
So I think, if we think of the input as a list of forms from a csv file (like forms.csv), it may be better to do the error collection within this round, and we could even store the errors, to make it more explicit? (So far, we have only 1 for okay and 0 for IndexError.)
And even if this is less explicit, we may want to add **kw to get_profile, and also to the segment function internally, to allow for different arguments when the segment function comes from another provider?
But if we pass the segmentation function we also need to pass all arguments, which seems not very transparent. I'd rather pass in segmented data, so maybe have a class Sequence instead of Form. Actually, considering the scope of linse, having a class Sequence which allows adding metadata to a list would be a good idea anyway?
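A minimal sketch of what such a class could look like, assuming it is simply a list subclass that carries a metadata dictionary. The class body and the metadata attribute name are illustrative assumptions, not existing linse code.

```python
class Sequence(list):
    """Hypothetical: a list of segments with free-form metadata attached."""

    def __init__(self, segments, **metadata):
        super().__init__(segments)
        # Metadata such as language or ID travels with the segments.
        self.metadata = metadata


seq = Sequence(["t", "a", "k"], language="German", ID="1")
print(seq, seq.metadata)
```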
The problem is only the error handling, as this was the most tedious thing, and one should not ignore errors from the profiles. So if I segment my data beforehand, using a class Sequence, and then some forms fail, how do I get this information into my profile? The desired behavior for me would be to have these failing sequences placed as is into the profile and treated as a single segment. In this way, they can later also be added to "lexemes.tsv", etc. So would Sequence then take the segmenter function as argument?
Thinking about it this way, a profile isn't much more than an analysis of a list of lists - and could even be conceived as an analysis of a single - concatenated - list of segments.
Adding unsegmentable forms as 1-element sequences, with an error message in the metadata, should do the trick, no?
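One way to picture this, as a sketch only: a helper that tries the segmenter and, on failure, falls back to a one-element sequence whose metadata records the error. The names Segmented, segment_or_wrap and picky are hypothetical, and the choice of caught exceptions (ValueError, IndexError) follows the error types mentioned earlier in this thread.

```python
from collections import namedtuple

# Hypothetical container: the segments plus a metadata dictionary.
Segmented = namedtuple("Segmented", ["segments", "metadata"])


def segment_or_wrap(form, segmenter):
    """Segment a form, or keep it whole and record why segmentation failed."""
    try:
        return Segmented(list(segmenter(form)), {})
    except (ValueError, IndexError) as err:
        message = "{}: {}".format(type(err).__name__, err)
        return Segmented([form], {"error": message})


def picky(text):
    """Toy segmenter that refuses forms containing whitespace."""
    if " " in text:
        raise ValueError("whitespace in form")
    return list(text)


print(segment_or_wrap("ab", picky))
print(segment_or_wrap("a b", picky))
```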
Yes, we can do it that way, I just thought it would be easier to have it done by one function, and not be forced to do it before. But one big list of segments is not possible when context comes into play.
If it is one big iterable of Sequence objects, it could work, because context could be inferred from the metadata.
A profile would then be not much more than a glorified Counter. But I'd actually like this - it would be a counter with particular semantics.
Much like a Sequence is a glorified list.
Yes, in the end it is a counter. Tracing errors can be done before or during the profile creation. So would we just call it Profile and make a class? Should I just let you do a pass on it, and we drop this PR? The test cases may be useful, and I can add more later.
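The "profile as a counter" idea can be sketched with collections.Counter from the standard library. The function name count_segments is hypothetical; it assumes the input is an iterable of already segmented sequences.

```python
from collections import Counter


def count_segments(sequences):
    """Count how often each segment occurs across all sequences."""
    profile = Counter()
    for sequence in sequences:
        profile.update(sequence)
    return profile


profile = count_segments([["t", "a"], ["a", "k"], ["a"]])
print(profile.most_common())
```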
Just getting back to this, if a Sequence is a segmented string (our TypedSequence) with metadata (language, etc.), we could add a from_text method that takes a segmentation function as argument, defaulting to segment.ipa. If this fails, the Sequence would still be a sequence, but similar to pyclts, we could assign it the type "unsegmentable"?
If one then passes the Sequences to a Profile (or a DraftProfile?), the profile does the counting work.
Ah, and the profile has one more task: it can also do the counting on annotated sequences, annotated by sound classes, if they correspond to bipa, and the like. This also needs to happen at some point. While the conversion to sound classes can be done on a per-segment basis, the annotation for prosodic structure needs the sequence in its entirety. So I wonder: should a Sequence be able to store multiple versions of a text (segmented, bipa, sound classes) from the beginning? This might come in handy for sequence comparison, where this is done implicitly so far in lingpy and lots of the energy is devoted to restoring alignments at several levels, calling functions like class2tokens, etc.
But from_text does not make sense for other TypedSequence subclasses, e.g. ints. I think it really is more explicit if segmentation is done by the caller. This "inversion of control" (by passing a function to another function and having it called there) does not gain us anything AFAICS.
Let me do a bit more thinking. Right now, I'm leaning towards adding a properties attribute to TypedSequence which can be used to aggregate metadata during various processing steps. But potentially, this opens up another somewhat hidden box where complexity and unspecified interfaces may aggregate.
So, what information do we need in the profile about errors or unsegmentable data? Wouldn't it be enough if we pass unsegmentable forms as one-element sequences? I.e. say "abcdefg" is unsegmentable, then we pass it to profile as TypedSequence(str, ["abcdefg"]).
Okay, but then I don't see why to make the whole thing about a Sequence anyway. If we want to feed lists and make a Counter, we can just do that and leave it to the users to decide how they segment.
But the general use cases for creating custom profiles are the following (and they are frequent, specifically when hacking on a little bit of code):
Desired behavior:
forms.csv, the Value is crucial to make the lexemes.tsv, so this information is even more important when facing errors)
And then, there are the specific cases of higher complexity, which are important for datasets where we can reach a higher level of accuracy, like:
I see a major service of the function to try and segment what it can segment, to make educated guesses with additional keywords that would usually not be used when doing a simple call of ipa2tokens. So there is an educated guessing on some random test going on that yields a first draft segmentation. Saying that users should segment the data themselves in some way, using some function, does not really help here, as this is the major service that the profile creation process is supposed to provide.
If this is supposed to be kept flexible, allowing for other segmentation possibilities, there will be the point where a function needs to be passed that decides which educated guessing takes place. And it would be good to have that inside, not only exposed via a call to commandline, to allow for a quick checking of wordlist data, when doing cognate detection analysis and other things.
So, what information do we need in the profile about errors or unsegmentable data? Wouldn't it be enough if we pass unsegmentable forms as one-element sequences? I.e. say "abcdefg" is unsegmentable, then we pass it to profile as TypedSequence(str, ["abcdefg"]).
Essentially, we only need to have unsegmentable things exposed. When trying to create a profile from a list of strings (which is the normal use case), we'd have all failures exposed, such as empty strings, strings with spaces (which would need pre-processing), etc. If my strings are already sequences and readily tokenized, there's no need to call the draft-profile function.
Ok, that use case description was useful. So it seems
dicts returned in Python.
So then, I guess, passing in a segmenter function defaulting to segment.ipa and maybe segmenter_kw would make sense. And then just return a pathlib.Path to the written profile?
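A sketch of how that signature might look, under several assumptions: the function name write_draft_profile, the default segmenter list (standing in for segment.ipa), and the column names Grapheme and Frequency are all illustrative choices and not taken from linse. Only the standard library is used.

```python
import csv
from collections import Counter
from pathlib import Path


def write_draft_profile(forms, path, segmenter=list, segmenter_kw=None):
    """Segment forms, count segments, write a TSV, return its Path."""
    counts = Counter()
    for form in forms:
        # Extra keyword arguments are forwarded to the segmenter unchanged.
        counts.update(segmenter(form, **(segmenter_kw or {})))
    path = Path(path)
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["Grapheme", "Frequency"])  # assumed column names
        for grapheme, frequency in counts.most_common():
            writer.writerow([grapheme, frequency])
    return path
```

Calling write_draft_profile(["ab", "a"], "profile.tsv") would write one row per segment, most frequent first, and hand back the path so the caller can open or move the file.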
Yes. This sounds good to me.
Two complications (which may demand an intermediate function, i.e., the counter, as I proposed it, but I am open to alternatives as I am not sure about all this now) arise:
if kw['language'].
So we'd have: get_profiles(*strings, segmenter=segments.ipa), returning a dict of lists, that could be filtered for writing, and a writer function that could take a couple of filters, e.g., by language, and also the file name?
Context should be supported, I think, but no language selection. I.e. the profile creation shouldn't rely on any particular metadata.
Okay, so language selection is what we do before.
Okay, I updated this now, and I hope it will be straightforward to add the functions that write a profile from this basis.
@xrotwang, I was just thinking that it would still probably be best to make the profile a class. In this way, one could instantiate a draft profile with profile = DraftProfile(segmenter=linse.ipa, preceding='^', following='$'), but could later add text to it: profile.add(new_text_or_forms). The advantage would be that one could read in a csv file and add text while doing so. And a DraftProfile.write() function could then handle the writing.
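A minimal sketch of such a class, without the write() method. Everything beyond the constructor arguments and add() named above is an assumption: the errors list, the fallback to a single segment on failure, and the way preceding and following are attached to the first and last segment are one possible reading of this discussion, not the linse implementation. The built-in list replaces linse.ipa as the default segmenter to keep the example self-contained.

```python
from collections import Counter


class DraftProfile:
    """Hypothetical draft profile that counts segments incrementally."""

    def __init__(self, segmenter=list, preceding="", following=""):
        self.segmenter = segmenter
        self.preceding = preceding
        self.following = following
        self.counts = Counter()
        self.errors = []  # forms the segmenter could not handle

    def add(self, *forms):
        for form in forms:
            try:
                segments = list(self.segmenter(form))
            except (ValueError, IndexError):
                # Keep the failing form as a single segment.
                self.errors.append(form)
                segments = [form]
            if segments:
                # Mark word-initial and word-final context.
                segments[0] = self.preceding + segments[0]
                segments[-1] = segments[-1] + self.following
            self.counts.update(segments)


profile = DraftProfile(preceding="^", following="$")
profile.add("ab", "a")
print(profile.counts)
```

Because add() can be called repeatedly, forms can be fed in row by row while reading a csv file.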
I agree. DraftProfile even makes the intent clearer. And since this is end-user convenience functionality anyway, I wouldn't insist on the clean functional architecture approach as for the other modules.
Okay, I'll see that I find time to modify the current version then and point you once I have this done (maybe not today). I could maybe just start without the writing function.
The new function is extremely simple, maybe it can even be made simpler, but the major point is:
Form class (like language, ID, etc.) or they will be represented as the number of occurrence in the list of forms given to the method get_profile.
I predefined three desired target functions (discussable):
simple and complex should work with the get_profile code, for structured, I am not yet sure.
Keywords, like selection of concepts, languages, etc., are now all handled before passing the forms to the profile, as discussed before.