Open shuttle1987 opened 6 years ago
That call to construct the class is particularly problematic, label_segmenter
simply cannot be None
here and hence the default parameter seems broken.
What I did in #147 was to throw an exception if the label_segmenter
is None. While it makes the failure more obvious this is not a fix, what should the default behavior be if label_segmenter
is not provided?
We could make some sort of default label segmenter function here but what would that be?
@oadams is default to segmenting on characters is the best default strategy?
Yes, definitely.
In the interest of getting Corpus.from_elan()
support integrated into downstream UI, the LabelSegmenter
is something that needs to be dealt with. Here are some thoughts:
segment_into_chars
is 100% the sensible default.LabelSegmenter
object which I initially created is overboard. Probably better to just pass a function to the constructor, which defaults to segment_into_chars
, but could also be segment_into_tokens
with an associated label set.LabelSegmenter
(which contains a label set and the segmentation function): Corpus.from_elan()
could use label_segmenter.labels
as the argument to Corpus.__init__()
. Without it the labels need to be separately inferred from the segmented utterances when Corpus.__init__()
gets called from Corpus.from_elan()
. This is slightly less efficient but who cares, it's simpler.So I'm just looking at the interface now and there's the following mismatch:
@classmethod
def from_elan(cls: Type[CorpusT], org_dir: Path, tgt_dir: Path,
feat_type: str = "fbank", label_type: str = "phonemes",
utterance_filter: Callable[[Utterance], bool] = None,
label_segmenter: Optional[LabelSegmenter] = None,
speakers: List[str] = None, lazy: bool = True,
tier_prefixes: Tuple[str, ...] = ("xv", "rf")) -> CorpusT:
Then in the implementation of this we have:
# This currently bails out if label_segmenter is not provided
if not label_segmenter:
raise ValueError("A label segmenter must be provided via label_segmenter")
I remember adding this in as a temporary workaround but it's been a long time and it's still there. This line of code directly contradicts the type signature that says that the label_segmenter
is optional.
The ergonomics of using the LabelSegmenter
over a bare function call seems reasonable if we need to ensure the context (in the form of providing a label set) is available for segmenting by the segmentation function being called by making everything be passed as a single parameter.
Alternatively we can deal with this in a plain function call but we will then have to deal with different functions having different numbers of parameters. For example we may wish to change the signature of segement_into_chars
to take kwargs
so that the call sites don't get more complex, or alternatively we would have to have a completely separate path to handle this case for functions that don't receive a label set.
The connection between the segmenting function and the label set is a case of essential complexity though, we will have to handle it somewhere*.
As for the default behavior what do you think of the following handling in the case where label_segmenter
is not passed in:
segment_into_tokens
using the label setsegment_into_chars
(*As a sidenote I liked the overall idea with LabelSegmenter
because it groups the data with the function, I would have used a class instead of a NamedTuple
for this implementation, using the Python data model with a class and possibly overriding __call__
could make the call sites simplest.)
As for the default behavior what do you think of the following handling in the case where label_segmenter is not passed in:
If a label set is provided segment on tokens with segment_into_tokens using the label set If no label set is provided use segment_into_chars
Agreed that this is sensible default behaviour. However, this would require a labels
optional argument for the label set, in addition to the label_segmenter
, which may be a bit overwhelming. If we were to do it, perhaps the best thing to do is remove the label_segmenter
argument and just force the option of either segmenting on chars or greedy segmentation with the supplied token set based on the user supplying labels
or not. This precludes users from supplying weird and wonderful LabelSegmenter
s but that's going to be an uncommon case. Or we could leave both arguments and just accept that the classmethod kwargs are getting clunky.
A concern that pops into mind is that in following this default behaviour, an inconsistency arises between Corpus.__init__()
and Corpus.from_elan()
. Corpus.__init__()
expects that the text was already segmented with spaces and just complains if there is an inconsistency when labels
is passed in. In contrast, Corpus.from_elan()
would actually perform the segmentation on the basis of the supplied labels
. This makes behaviour more confusing since they use labels
differently. Also, given that there may be a variety of Corpus.from_*
classmethods dealing with different input formats, perhaps Corpus.__init__()
should actually handle this logic in a single place, rather than it being replicated or handled differently between classmethods and __init__()
. But then Corpus.__init__()
would need to behave differently since it currently splits on whitespace.
If you agree I'd say the best thing to do for now is remove the label_segmenter
argument in Corpus.from_elan()
and add a labels
argument with your proposed logic. Then we just accept (a) users can't use LabelSegmenters
and (b) that the role of labels
is different between __init__()
and the from_elan()
classmethod until we're sure we want to break backwards compatibility (in which case we'd be changing how __init__()
uses the labels
argument as its pretty superfluous for now).
I've been thinking about this a bit more and I think there's a really nice way to keep the calling interface the same for label_segmenter
and keep backwards compatibility.
The Python data model essentially means that functions and objects have the same interface via __call__
this means that we can create a Callable
via either a plain function or a class that has the __call__
dunder defined.
So a plain function that has no context can just be passed in for segmentation. Other functions that require context can be provided in a class that invokes the needed functionality with the context via the __call__
then at all the calling sites it looks as though there's just the same type of call. It might be a little clunky if inside places like from_elan
we had to set the labels there but as far as I know this isn't the use case we actually have (we could check for the presence of a labels attribute with something like getattr in the case that we had to do something conditional upon that .labels
existing).
Corpus.from_elan
has a parameter that will allow you to specify a label segmenter which is of typeLabelSegmenter
Usages:
and
This will fail with the default parameter since
None
has no membersegment_labels
.This is a situation where a unit test case would be particularly valuable (this relates to #150). Perhaps a needs test label in GitHub has some value?