FredericBlum opened 1 year ago
We'd need this ideally for Lexibank 2.0. One option would be to add the `ungroup()` function that we have used so far to pylexibank. Would that be a reasonable step? Or should we modify the check so that grouped sounds do not throw an error as long as all the individual segments are valid? I could try to implement either solution.
@LinguList Tagging you again to see how we want to proceed with this.
The current workaround, which would also guarantee backwards compatibility, is to define a custom `Lexeme` class:

```python
import attr
from pylexibank import Lexeme

@attr.s
class CustomLexeme(Lexeme):
    # Space-separated column for the grouped representation of Segments.
    Grouped_Segments = attr.ib(
        default=None, metadata={"datatype": "string", "separator": " "})
```
Then you add an `ungroup` function to your dataset code (if you have a default profile that groups):

```python
def ungroup(segments):
    # Split grouped segments like "a.ʔ" at the dots and flatten the result.
    return [segment for segment_group in [s.split(".") for s in segments]
            for segment in segment_group]
```
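For illustration, applying it to a hypothetical grouped segment list:

```python
>>> ungroup(["t.a", "b", "a.ʔ"])
['t', 'a', 'b', 'a', 'ʔ']
```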
Then, in `args.writer.add_form`, you pass `Segments=ungroup(segments)` and `Grouped_Segments=segments`.
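Put together, the relevant part of a dataset's `cmd_makecldf` could look roughly like the sketch below; the IDs and the `value`, `form`, and `segments` variables are placeholders for illustration, not part of the discussion above.

```python
def cmd_makecldf(self, args):
    # `segments` holds the grouped output of the orthography profile,
    # e.g. ["a.ʔ", "t", "a"]; all IDs here are hypothetical.
    args.writer.add_form(
        Language_ID="lang1",
        Parameter_ID="concept1",
        Value=value,
        Form=form,
        Segments=ungroup(segments),   # ungrouped, so CLTS checks still apply
        Grouped_Segments=segments,    # grouped representation
    )
```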
Yes, that's what I am doing right now in my repositories. I just thought we could add this function to pylexibank and modify the `Lexeme` class to make this workaround unnecessary.
I think this would be too much for now, since it is not part of the CLDF specification.
I also wonder how we want to handle this in the future. If we say that `Segments` can potentially be grouped, we may run into clashes, so my idea would be to propose `Grouped_Segments` as another representation of segments to CLDF. But I suggest we wait for the reviews of the grouping-sounds paper to see how we react here; with the paper, we would have the reference needed to add this to lexibank.
Yes, I think both are needed: a reference and more experience, including actively searching for cases with conflicting options to group. In the worst case, grouping of segments would introduce a degree of freedom that invites abuse, where fine-tuned grouping together with fine-tuned analysis algorithms creates non-transparent results.
I could also imagine that grouping and trimming used together could have funny effects.
We already have examples of this kind. The freedom this introduces can be so great that one gets two different outcomes of the same analysis due to grouping alone. The current solution leaves everything open but makes clear that this is a candidate currently under testing for inclusion in a later version of CLDF. We can already discuss now, also with respect to the modification of the paper, whether we want to propose a small plugin that makes the conversion easier in lexibank (it would have to be installed on top of pylexibank and could be added there later).
I agree with waiting until the reviews to see how we do this. With respect to the degrees of freedom: we would not touch the `Segments` of a `Lexeme`, but add a new category. Maybe it is unproblematic that we have freedom there? Through ungrouping and checking the transcription in `Segments`, we will still have the verification that the individual segments conform with CLTS.
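For reference, such a verification could be sketched with pyclts along the following lines (assuming a local clone of the CLTS data; the path is a placeholder):

```python
from pyclts import CLTS

# Placeholder path to a local clone of the CLTS reference catalogue.
bipa = CLTS("path/to/clts-data").bipa

def all_segments_valid(grouped_segments):
    # Ungroup first, then check each individual segment against BIPA;
    # pyclts resolves unrecognised input to sounds of type "unknownsound".
    return all(bipa[segment].type != "unknownsound"
               for segment in ungroup(grouped_segments))
```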
That is also an open question for now: do we add `Grouped_Segments` or not? The good thing is that the current solution behaves like an independent library in Python: if you conform to it and also make sure to define the metadata properly, you can already work with the feature in lexibank / CLDF datasets, but you must make sure to select the datasets by hand. So the difference between being officially mentioned in CLDF and not being mentioned is not that big, right?
As referenced in an example case, we should modify pylexibank so that it supports grouped sounds (e.g. `a.ʔ`) instead of reporting a transcription error for those cases.
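A minimal sketch of such a relaxed check (the function name and its integration point in pylexibank are hypothetical; only the idea of ungrouping before validation comes from this thread):

```python
def segments_are_valid(segments, bipa):
    # Accept a grouped segment string if every individual part of every
    # group is a valid BIPA sound, e.g. ["a.ʔ", "t", "a"].
    return all(bipa[part].type != "unknownsound"
               for segment in segments for part in segment.split("."))
```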