Corpus.from_elan fails with default argument for label_segmenter

shuttle1987 commented 6 years ago

Corpus.from_elan has a parameter that will allow you to specify a label segmenter which is of type LabelSegmenter

Usages:

        utterances = [label_segmenter.segment_labels(utter) for utter in utterances]

and

        corpus = cls(feat_type, label_type, tgt_dir,
                     label_segmenter.labels, speakers=speakers)

This will fail with the default parameter since None has no member segment_labels.

This is a situation where a unit test case would be particularly valuable (this relates to #150). Perhaps a needs test label in GitHub has some value?

shuttle1987 commented 6 years ago

That call to construct the class is particularly problematic, label_segmenter simply cannot be None here and hence the default parameter seems broken.

shuttle1987 commented 6 years ago

What I did in #147 was to throw an exception if the label_segmenter is None. While it makes the failure more obvious this is not a fix, what should the default behavior be if label_segmenter is not provided?

We could make some sort of default label segmenter function here but what would that be?

shuttle1987 commented 6 years ago

@oadams is default to segmenting on characters is the best default strategy?

oadams commented 6 years ago

Yes, definitely.

oadams commented 5 years ago

In the interest of getting Corpus.from_elan() support integrated into downstream UI, the LabelSegmenter is something that needs to be dealt with. Here are some thoughts:

Character segmentation via segment_into_chars is 100% the sensible default.
The need for a LabelSegmenter object which I initially created is overboard. Probably better to just pass a function to the constructor, which defaults to segment_into_chars, but could also be segment_into_tokens with an associated label set.
There was a reason for having LabelSegmenter (which contains a label set and the segmentation function): Corpus.from_elan() could use label_segmenter.labels as the argument to Corpus.__init__(). Without it the labels need to be separately inferred from the segmented utterances when Corpus.__init__() gets called from Corpus.from_elan(). This is slightly less efficient but who cares, it's simpler.

shuttle1987 commented 5 years ago

So I'm just looking at the interface now and there's the following mismatch:

    @classmethod
    def from_elan(cls: Type[CorpusT], org_dir: Path, tgt_dir: Path,
                  feat_type: str = "fbank", label_type: str = "phonemes",
                  utterance_filter: Callable[[Utterance], bool] = None,
                  label_segmenter: Optional[LabelSegmenter] = None,
                  speakers: List[str] = None, lazy: bool = True,
                  tier_prefixes: Tuple[str, ...] = ("xv", "rf")) -> CorpusT:

Then in the implementation of this we have:

        # This currently bails out if label_segmenter is not provided
        if not label_segmenter:
            raise ValueError("A label segmenter must be provided via label_segmenter")

I remember adding this in as a temporary workaround but it's been a long time and it's still there. This line of code directly contradicts the type signature that says that the label_segmenter is optional.

The ergonomics of using the LabelSegmenter over a bare function call seems reasonable if we need to ensure the context (in the form of providing a label set) is available for segmenting by the segmentation function being called by making everything be passed as a single parameter.

Alternatively we can deal with this in a plain function call but we will then have to deal with different functions having different numbers of parameters. For example we may wish to change the signature of segement_into_chars to take kwargs so that the call sites don't get more complex, or alternatively we would have to have a completely separate path to handle this case for functions that don't receive a label set.

The connection between the segmenting function and the label set is a case of essential complexity though, we will have to handle it somewhere*.

As for the default behavior what do you think of the following handling in the case where label_segmenter is not passed in:

If a label set is provided segment on tokens with segment_into_tokens using the label set
If no label set is provided use segment_into_chars

(*As a sidenote I liked the overall idea with LabelSegmenter because it groups the data with the function, I would have used a class instead of a NamedTuple for this implementation, using the Python data model with a class and possibly overriding __call__ could make the call sites simplest.)

oadams commented 5 years ago

As for the default behavior what do you think of the following handling in the case where label_segmenter is not passed in:
If a label set is provided segment on tokens with segment_into_tokens using the label set
If no label set is provided use segment_into_chars

Agreed that this is sensible default behaviour. However, this would require a labels optional argument for the label set, in addition to the label_segmenter, which may be a bit overwhelming. If we were to do it, perhaps the best thing to do is remove the label_segmenter argument and just force the option of either segmenting on chars or greedy segmentation with the supplied token set based on the user supplying labels or not. This precludes users from supplying weird and wonderful LabelSegmenters but that's going to be an uncommon case. Or we could leave both arguments and just accept that the classmethod kwargs are getting clunky.

A concern that pops into mind is that in following this default behaviour, an inconsistency arises between Corpus.__init__() and Corpus.from_elan(). Corpus.__init__() expects that the text was already segmented with spaces and just complains if there is an inconsistency when labels is passed in. In contrast, Corpus.from_elan() would actually perform the segmentation on the basis of the supplied labels. This makes behaviour more confusing since they use labels differently. Also, given that there may be a variety of Corpus.from_* classmethods dealing with different input formats, perhaps Corpus.__init__() should actually handle this logic in a single place, rather than it being replicated or handled differently between classmethods and __init__(). But then Corpus.__init__() would need to behave differently since it currently splits on whitespace.

If you agree I'd say the best thing to do for now is remove the label_segmenter argument in Corpus.from_elan() and add a labels argument with your proposed logic. Then we just accept (a) users can't use LabelSegmenters and (b) that the role of labels is different between __init__() and the from_elan() classmethod until we're sure we want to break backwards compatibility (in which case we'd be changing how __init__() uses the labels argument as its pretty superfluous for now).

shuttle1987 commented 5 years ago

I've been thinking about this a bit more and I think there's a really nice way to keep the calling interface the same for label_segmenter and keep backwards compatibility.

The Python data model essentially means that functions and objects have the same interface via __call__ this means that we can create a Callable via either a plain function or a class that has the __call__ dunder defined.

So a plain function that has no context can just be passed in for segmentation. Other functions that require context can be provided in a class that invokes the needed functionality with the context via the __call__ then at all the calling sites it looks as though there's just the same type of call. It might be a little clunky if inside places like from_elan we had to set the labels there but as far as I know this isn't the use case we actually have (we could check for the presence of a labels attribute with something like getattr in the case that we had to do something conditional upon that .labels existing).

persephone-tools / persephone

Corpus.from_elan fails with default argument for label_segmenter #151