lexibank / pylexibank

The python curation library for lexibank
Apache License 2.0

Support grouping of sounds #271

Open FredericBlum opened 10 months ago

FredericBlum commented 10 months ago

As referenced in an example case, we should modify pylexibank in a way that allows us to support grouped sounds (e.g. a.ʔ) instead of reporting a transcription error for those cases.

LinguList commented 10 months ago

We'd need this ideally for Lexibank 2.0.

FredericBlum commented 4 months ago

One option would be to add the ungroup() function that we have used so far to pylexibank. Would that be a reasonable step? Or should we modify the check-up in a way that grouped sounds do not throw an error if all the individual segments are valid? I could try to implement either solution.
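The second option could be sketched roughly as follows. This is only an illustration of the idea, not how pylexibank performs its check: `VALID_SOUNDS` is a hypothetical stand-in for the real sound lookup (e.g. against CLTS), and `invalid_parts` and `check_segments` are made-up helper names.

```python
# Hypothetical stand-in for the real sound inventory lookup (e.g. CLTS).
VALID_SOUNDS = {"a", "ʔ", "t", "k", "i"}


def invalid_parts(segment, valid=VALID_SOUNDS, separator="."):
    """Return the parts of a (possibly grouped) segment that are not valid sounds."""
    return [part for part in segment.split(separator) if part not in valid]


def check_segments(segments):
    """Map each offending segment to its invalid parts; an empty dict means no errors."""
    errors = {}
    for segment in segments:
        bad = invalid_parts(segment)
        if bad:
            errors[segment] = bad
    return errors


print(check_segments(["a.ʔ", "t", "i"]))  # {} (the grouped sound passes)
print(check_segments(["a.X", "t"]))       # {'a.X': ['X']} (only the unknown part is reported)
```

With a check like this, a grouped sound only raises an error when at least one of its components is itself unknown.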

FredericBlum commented 2 months ago

@LinguList Tagging you on this again to see how we proceed with this.

LinguList commented 2 months ago

The current workaround that would also guarantee backwards compat is to have a specific Lexeme class.

import attr
from pylexibank import Lexeme

@attr.s
class CustomLexeme(Lexeme):
    Grouped_Segments = attr.ib(default=None, metadata={"datatype": "string", "separator": " "})

And then you add a function ungroup to your dataset (if you have a default profile that groups).

def ungroup(segments):
    # Split each grouped segment on "." and flatten the result into one list.
    return [segment for group in segments for segment in group.split(".")]

Then, in the call to args.writer.add_form, you pass Segments=ungroup(segments) and Grouped_Segments=segments.
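To make the last step concrete, here is a small sketch of the values that would be passed. It does not call pylexibank: `form_kwargs` is a hypothetical dict standing in for the keyword arguments to args.writer.add_form, and the example segments are invented.

```python
def ungroup(segments):
    """Split grouped segments such as "a.ʔ" into their individual sounds."""
    return [sound for segment in segments for sound in segment.split(".")]


# Invented example: output of a profile that groups "a" and "ʔ".
segments = ["t", "a.ʔ", "k", "i"]

# Hypothetical keyword arguments for args.writer.add_form(...);
# the other fields a form needs (IDs, Value, ...) are omitted here.
form_kwargs = {
    "Segments": ungroup(segments),  # flat list, checked as usual
    "Grouped_Segments": segments,   # grouping kept in the custom column
}

print(form_kwargs["Segments"])          # ['t', 'a', 'ʔ', 'k', 'i']
print(form_kwargs["Grouped_Segments"])  # ['t', 'a.ʔ', 'k', 'i']
```

Segments stays a plain list of individual sounds, so existing tooling is unaffected, while the grouping survives in the extra column.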

FredericBlum commented 2 months ago

Yes, that's what I am doing right now in my repositories. I just thought we could add this function to pylexibank and modify the Lexeme class to make this workaround unnecessary.

LinguList commented 2 months ago

I think this would be too much by now, since it is not part of the CLDF specification.

LinguList commented 2 months ago

I also wonder how we want to handle this in the future. If we say that Segments can potentially be grouped, we end up in a situation where we may have clashes. So my idea would be to propose Grouped_Segments to CLDF as another representation of segments, but I suggest we wait for the reviews of the grouping sounds paper to see how we react here. With the paper, we have the reference to add this to lexibank.

xrotwang commented 2 months ago

Yes, I think both are needed: a reference and more experience, including actively searching for cases with conflicting options to group. I think in the worst case, grouping of segments would introduce a degree of freedom which invites abuse, where fine-tuned grouping together with fine-tuned analysis algorithms creates opaque results.

I could also imagine that grouping and trimming used together could have funny effects.

LinguList commented 2 months ago

We already have examples of this kind. The freedom that this introduces may at times be so great that one can have two different outcomes of the same analysis due to grouping alone. The current solution leaves everything open but makes clear that this is a currently tested candidate for inclusion into CLDF in a later version. We can discuss already now -- also with respect to the modification of the paper -- if we want to propose a little plugin that could be used to make the conversion easier in lexibank (but would need to be installed on top of it and could be added there later).

FredericBlum commented 2 months ago

I agree with waiting until the reviews to see how we do this. With respect to the degrees of freedom: We would not touch the Segments of a Lexeme, but add a new category. Maybe it is unproblematic that we have freedom there? Through ungrouping and checking the transcription in Segments, we will still have the verification that the individual segments conform with CLTS.
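A sketch of the check this implies: the grouped column is accepted only if ungrouping it reproduces Segments exactly, so grouping cannot introduce sounds that bypass verification. The name `grouping_is_consistent` is hypothetical and the example data is invented.

```python
def ungroup(grouped_segments, separator="."):
    """Flatten grouped segments into a list of individual sounds."""
    return [sound for group in grouped_segments for sound in group.split(separator)]


def grouping_is_consistent(segments, grouped_segments):
    """True if Grouped_Segments is just a re-bracketing of Segments."""
    return ungroup(grouped_segments) == list(segments)


segments = ["t", "a", "ʔ", "k", "i"]
print(grouping_is_consistent(segments, ["t", "a.ʔ", "k", "i"]))  # True
print(grouping_is_consistent(segments, ["t", "a.ʔ", "k.i"]))     # True (a different grouping)
print(grouping_is_consistent(segments, ["t", "a.h", "k", "i"]))  # False
```

The second call also shows the degree of freedom discussed above: several groupings are consistent with the same Segments.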

LinguList commented 2 months ago

That is also an open question for now: do we add Grouped_Segments or do we not add it? The good thing is: the current solution is like an independent library in Python: if you conform to it and also make sure to define the metadata properly, you can already work with the feature in lexibank / CLDF datasets, but you must make sure to select the datasets by hand. So the difference between being officially mentioned in CLDF and not being mentioned is not that big, right?