first attempt on profile functions

LinguList commented 4 years ago

The new function is extremely simple, maybe it can even be made more simple, but the major point is:

we represent each processed segment as a tuple of the segment and a flag: 1 if it was parsed okay, 0 if not (no idea how to make this more practical)
the tuple links to a list of dictionaries; their keys can be the keywords of the Form class (like language, ID, etc.), or they will be represented as the number of the occurrence in the list of forms given to the method get_profile.
keywords preceding and following are used to anticipate the context profiles: they are not part of the segmentation process, but need to be attached to word forms later, and since we lose context information after calling get_profile, we need to store this information already here
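
To make the structure above concrete, here is a minimal sketch. It is not the code in this PR: the name draft_profile, the dict layout of a form, the "index" key, and the use of list as a stand-in segmenter are all assumptions for illustration.

```python
from collections import defaultdict


def draft_profile(forms, segmenter=list, preceding="^", following="$"):
    """Map (segment, parsed_ok) tuples to a list of occurrence records.

    `forms` is an iterable of dicts with a "form" key (hypothetical layout);
    all other keys are copied into each occurrence record.
    """
    profile = defaultdict(list)
    for index, form in enumerate(forms):
        meta = {k: v for k, v in form.items() if k != "form"}
        meta["index"] = index  # position in the list of forms (assumption)
        try:
            segments = list(segmenter(form["form"]))
            if not segments:
                raise ValueError("empty segmentation")
            ok = 1
        except (ValueError, IndexError):
            # Keep the unparsed form as a single segment and flag it with 0.
            segments, ok = [form["form"]], 0
        if ok:
            # Attach context markers here, since word boundaries are lost later.
            segments[0] = preceding + segments[0]
            segments[-1] = segments[-1] + following
        for segment in segments:
            profile[(segment, ok)].append(meta)
    return dict(profile)


profile = draft_profile([
    {"form": "ab", "language": "x"},
    {"form": "", "language": "y"},
])
```

With the default markers, a form consisting of one segment ends up carrying both the preceding and the following marker.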

I predefined three desired target functions (discussable):

simple (forms are represented without context)
complex (forms are represented with their context)
structured (forms are yielded along with their prosodic structure, the profile will require post-processing on each form here).

simple and complex should work with the get_profile code, for structured, I am not yet sure.

Keywords, like selection of concepts, languages, etc., are now all handled before passing the forms to the profile, as discussed before.

xrotwang commented 4 years ago

Looks good. While this would add complexity, it may be useful to add a segments property to Form to make it possible to use custom segmented data with the function. It feels a bit weird to always call segment.ipa in the function, when there are potentially many more segmenters out there. Or maybe the function should already expect segmented data - considering that most of its arguments are simply passed through to ipa?

LinguList commented 4 years ago

Yes, good idea. I'll do this right away. And we can test by calling "list" as segments function.

BTW: I also modified the behavior of linse.segments.ipa to raise a ValueError when being given an empty string or strings with whitespace at the end or the beginning.
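
The checks described here could look roughly like the sketch below. check_input is a hypothetical helper written for illustration; it is not the actual code in linse.segments.ipa.

```python
def check_input(text):
    """Raise ValueError for input that should not reach the segmenter."""
    if not text:
        raise ValueError("empty string")
    if text != text.strip():
        # Leading or trailing whitespace needs pre-processing by the caller.
        raise ValueError("leading or trailing whitespace: {!r}".format(text))
    return text
```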

LinguList commented 4 years ago

Ah, just saw I misunderstood this. I thought of passing a function argument to get_profile, so segmentation could be done with that function. Or is this in any way problematic?

LinguList commented 4 years ago

so the call would be get_profile(*forms, segment=ipa)

Note that we also have specific error handling, which should be included in the output of get_profile, namely those ValueError cases that we now have exclusively in the segment.ipa function. In ipa2tokens, we still have other errors, such as IndexError, e.g., when calling ipa2tokens('').

So I think, if we think of the input as a list of forms from a csv file (like forms.csv), it may be better to do the error collection within this round, and we could even store the errors, to make it more explicit? (So far, we have only 1 for okay and 0 for IndexError.)

LinguList commented 4 years ago

And even if this is less explicit, we may want to add **kw to get_profile, and also to the segment function internally, to allow for different arguments when the segment function comes from another provider?

xrotwang commented 4 years ago

But if we pass the segmentation function, we also need to pass all its arguments, which does not seem very transparent. I'd rather pass in segmented data, so maybe have a class Sequence rather than Form. Actually, considering the scope of linse, having a class Sequence which allows adding metadata to a list would be a good idea anyway?
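
As a rough illustration of a list that carries metadata, a sketch could look like this. The class body, the metadata attribute, and the keyword names in the example are assumptions, not an existing linse API.

```python
class Sequence(list):
    """A list of segments with arbitrary metadata attached (hypothetical)."""

    def __init__(self, segments=(), **metadata):
        super().__init__(segments)
        self.metadata = metadata


seq = Sequence(["t", "a", "k"], language="German", ID="1")
```

Because it subclasses list, such an object can be passed to any code that expects plain segmented data.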

LinguList commented 4 years ago

The problem is only the error handling, as this was the most tedious thing, and one should not ignore errors from the profiles. So if I now segment my data beforehand, using a class Sequence, and some forms fail, how do I get this information into my profile? The desired behavior for me would be to have these failing sequences placed as is into the profile and treated as a single segment. In this way, they can later also be added to "lexemes.tsv", etc. So would Sequence then take the segmenter function as argument?

xrotwang commented 4 years ago

Thinking about it this way, a profile isn't much more than an analysis of a list of lists - and could even be conceived as an analysis of a single - concatenated - list of segments.

xrotwang commented 4 years ago

Adding unsegmentable forms as 1-element sequences, with an error message in the metadata, should do the trick, no?
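
A minimal sketch of that idea, with the caller doing the segmentation. The function to_sequence, the "error" key, and the stand-in segmenter strict are hypothetical names used only for illustration.

```python
def to_sequence(text, segmenter, **metadata):
    """Return (segments, metadata); failures become one-element sequences."""
    try:
        return list(segmenter(text)), dict(metadata)
    except (ValueError, IndexError) as exc:
        # Keep the raw text as a single segment and record why it failed.
        return [text], dict(metadata, error=str(exc))


def strict(text):
    # Stand-in segmenter: refuses empty input, otherwise splits into characters.
    if not text:
        raise ValueError("empty string")
    return list(text)


good = to_sequence("ab", strict, language="x")
bad = to_sequence("", strict, language="x")
```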

LinguList commented 4 years ago

Yes, we can do it that way, I just thought it would be easier to have it done by one function, and not be forced to do it before. But one big list of segments is not possible when context comes into play.

xrotwang commented 4 years ago

If it is one big iterable of Sequence objects, it could work, because context could be inferred from the metadata.

xrotwang commented 4 years ago

A profile would then be not much more than a glorified Counter. But I'd actually like this - it would be a counter with particular semantics.
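
To illustrate the "glorified Counter" view: counting segments over already segmented input needs little more than collections.Counter. The function name count_segments is made up for this sketch.

```python
from collections import Counter


def count_segments(sequences):
    """Count every segment across an iterable of segmented sequences."""
    counts = Counter()
    for sequence in sequences:
        counts.update(sequence)
    return counts


profile = count_segments([["t", "a", "k"], ["a", "k", "a"]])
```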

xrotwang commented 4 years ago

Much like a Sequence is a glorified list.

LinguList commented 4 years ago

Yes, in the end it is a counter. Tracing errors can be done before or during the profile creation. So would we call it just Profile and make a class? Should I just let you do a pass on it, and we drop this PR? The test cases may be useful, and I can add more later.

LinguList commented 4 years ago

Just getting back to this, if a Sequence is a segmented string (our TypedSequence) with metadata (language, etc.), we could add a from_text method, that takes a segmentation function as argument, defaulting to segment.ipa. If this fails, the Sequence would still be a sequence, but similar to pyclts, we could assign it the type "unsegmentable"?

If one then passes the Sequences to a Profile (or a DraftProfile?), the profile does the counting work.

Ah, and the profile has one more task: it can also do the counting on annotated sequences, annotated by sound classes, if they correspond to bipa, and the like. This also needs to happen at some point. While the conversion to sound classes can be done on a per-segment basis, the annotation for prosodic-structure needs the sequence in its entirety. So I wonder: should a Sequence be able to store multiple versions of a text (segmented, bipa, sound classes) from the beginning? This might come in handy for sequence comparison, where this is done implicitly so far in lingpy and lots of the energy is devoted to restoring alignments at several levels, calling functions like class2tokens, etc.

xrotwang commented 4 years ago

But from_text does not make sense for other TypedSequence subclasses, e.g. ints. I think it really is more explicit if segmentation is done by the caller. This "inversion of control" (by passing a function to another function and have it called there) does not gain us anything AFAICS.

xrotwang commented 4 years ago

Let me do a bit more thinking. Right now, I'm leaning towards adding a properties attribute to TypedSequence which can be used to aggregate metadata during various processing steps. But potentially, this opens up another somewhat hidden box where complexity and unspecified interfaces may aggregate.

xrotwang commented 4 years ago

So, what information do we need in the profile about errors or unsegmentable data? Wouldn't it be enough if we pass unsegmentable forms as one-element sequences? I.e. say "abcdefg" is unsegmentable, then we pass it to profile as TypedSequence(str, ["abcdefg"]).

LinguList commented 4 years ago

Okay, but then I don't see why to make the whole thing about a Sequence anyway. If we want to feed lists and make a Counter, we can just do that and leave it to the users to decide how they segment.

But the general use cases for creating custom profiles are the following (and they are frequent, specifically when hacking on a little bit of code):

inside Python, I access a list of unsegmented text and want to have an initial profile, and write that to text.
inside CLDF/lexibank, I want to access a forms.csv, and here, I want to access the Form, and from this Form, I want to make an initial profile

Desired behavior:

the profile provides counts, to help me debug the data, plus some additional segmental data, so a dictionary seems like a good solution
the profile shows explicitly what could not be parsed, so that I can work on these elements and try debugging them (in fact: if we have the Value as well, from forms.csv, the Value is crucial to make the lexemes.tsv, so this information is even more important when facing errors)

And then, there are the specific cases of higher complexity, which are important for datasets where we can reach a higher level of accuracy, like:

some code tries to guess the prosodic structure when segmenting the data
this information is included into the profile, which will be much longer, but have another conversion target column, that will essentially provide the prosodic structure for each IPA sequence

I see a major service of the function to try and segment what it can segment, to make educated guesses with additional keywords that would usually not be used when doing a simple call of ipa2tokens. So there is an educated guessing on some random test going on that yields a first draft segmentation. Saying that users should segment the data themselves in some way, using some function does not really help here, as this is the major service, that the profile creation process is supposed to provide.

If this is supposed to be kept flexible, allowing for other segmentation possibilities, there will be the point where a function needs to be passed that decides which educated guessing takes place. And it would be good to have that inside, not only exposed via a call to commandline, to allow for a quick checking of wordlist data, when doing cognate detection analysis and other things.

LinguList commented 4 years ago

So, what information do we need in the profile about errors or unsegmentable data? Wouldn't it be enough if we pass unsegmentable forms as one-element sequences? I.e. say "abcdefg" is unsegmentable, then we pass it to profile as TypedSequence(str, ["abcdefg"]).

Essentially, we only need to have unsegmentable things exposed. When trying to create a profile from a list of strings (which is the normal use case), we'd have all failures exposed, such as empty strings, strings with spaces (which would need pre-processing), etc. If my strings are already sequences and readily tokenized, there's no need to call the draft-profile function.

xrotwang commented 4 years ago

Ok, that use case description was useful. So it seems

this is a convenience functionality (so we shouldn't obsess about architectural purity)
output - also to a file - is essential to the functionality, because this is the output that will be inspected - rather than a list of dicts returned in Python.

So then, I guess, passing in a segmenter function defaulting to segment.ipa and maybe segmenter_kw would make sense. And then just return a pathlib.Path to the written profile?
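
A sketch of what such a function could look like. Everything here is an assumption for illustration: the name write_draft_profile, the column headers, the TSV layout, and list as a placeholder default instead of segment.ipa.

```python
import csv
import pathlib
from collections import Counter


def write_draft_profile(texts, path, segmenter=list, segmenter_kw=None):
    """Segment `texts`, count the segments, write a TSV, return its path."""
    segmenter_kw = segmenter_kw or {}
    counts = Counter()
    for text in texts:
        try:
            counts.update(segmenter(text, **segmenter_kw))
        except (ValueError, IndexError):
            counts[text] += 1  # keep unsegmentable input as a single entry
    path = pathlib.Path(path)
    with path.open("w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["Grapheme", "Frequency"])  # hypothetical headers
        for segment, frequency in counts.most_common():
            writer.writerow([segment, frequency])
    return path
```

Returning the path rather than the counts fits the point above that the written file is what will be inspected.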

LinguList commented 4 years ago

Yes. This sounds good to me.

Two complications (which may demand an intermediate function, i.e., the counter, as I proposed it, but I am open for alternatives as I am not sure about all this now) arise:

we want to select by language, so we have per-language-profiles (we didn't expose this much, but it is useful). This can be most easily done by pre-selecting what you pass to the profile from the beginning, or by picking entries that have language specified in the metadata per sequence afterwards, but we may not want to have this inside a big profile function with if kw['language'].
the context aspect, i.e., placing context markers in the beginning and end is also a crucial aspect, where we could follow the strategy I proposed, but I again would be open for alternatives

So we'd have: get_profiles(*strings, segmenter=segments.ipa), returning a dict of lists, that could be filtered for writing, and a writer function that could take a couple of filters, e.g., by language, and also the file name?

xrotwang commented 4 years ago

Context should be supported, I think, but no language selection. I.e. the profile creation shouldn't rely on any particular metadata.

LinguList commented 4 years ago

Okay, so language selection is what we do before.
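
Selecting by language beforehand could be as simple as the following sketch. The helper group_by_language and the "language" and "form" keys are hypothetical and only show the idea of grouping first and profiling each group afterwards.

```python
from collections import defaultdict


def group_by_language(forms):
    """Group form strings by their "language" value before profiling."""
    groups = defaultdict(list)
    for form in forms:
        groups[form["language"]].append(form["form"])
    return dict(groups)


groups = group_by_language([
    {"language": "German", "form": "tak"},
    {"language": "English", "form": "dei"},
    {"language": "German", "form": "hant"},
])
```

Each list in the result could then be passed to the profile creation on its own, so that the profile code itself never looks at metadata.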

LinguList commented 4 years ago

Okay, I updated this now, and I hope it will be straightforward to add the functions that write a profile from this basis.

LinguList commented 4 years ago

@xrotwang, I was just thinking, that it would still probably be best to make the profile a class. In this way, one could instantiate a draft profile profile = DraftProfile(segmenter=linse.ipa, preceding='^', following='$'), but could later add text to it: profile.add(new_text_or_forms). The advantage would be that one could read in a csv file and add text while doing so. And a DraftProfile.write() function could then handle the writing.
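
A rough sketch of how such a class might behave, leaving out write(). This is not an agreed design: the attribute names counts and errors, the handling of failures, and list as the default segmenter are assumptions made for illustration.

```python
from collections import Counter


class DraftProfile:
    """Hypothetical sketch of a draft profile that is filled incrementally."""

    def __init__(self, segmenter=list, preceding="^", following="$"):
        self.segmenter = segmenter
        self.preceding = preceding
        self.following = following
        self.counts = Counter()
        self.errors = []

    def add(self, *texts):
        for text in texts:
            try:
                segments = list(self.segmenter(text))
            except (ValueError, IndexError):
                segments = []
            if not segments:
                # Unsegmentable input is counted as is and remembered.
                self.counts[text] += 1
                self.errors.append(text)
                continue
            # Mark word-initial and word-final context on the segments.
            segments[0] = self.preceding + segments[0]
            segments[-1] = segments[-1] + self.following
            self.counts.update(segments)


profile = DraftProfile()
profile.add("ab", "a")
profile.add("")
```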

xrotwang commented 4 years ago

I agree. DraftProfile even makes the intent more clear. And since this is end-user convenience functionality anyway, I wouldn't insist on the clean functional architecture approach as for the other modules.

LinguList commented 4 years ago

Okay, I'll see that I find time to modify the current version then and point you once I have this done (maybe not today). I could maybe just start without the writing function.

lingpy / linse

first attempt on profile functions #8