lexibank / pylexibank

The python curation library for lexibank
Apache License 2.0

Mark in FormTable which profile a segmentation was based on #207

Closed xrotwang closed 4 years ago

xrotwang commented 4 years ago

Since we support tokenization based on multiple profiles per dataset, it would be good to be able to trace back which profile was used for which form - e.g. to compute frequencies in the profile linter.

This could be done by adding a Profile column to FormTable. But it isn't fully clear what values to use for this column. One candidate would be the keys of the tokenizers dict, i.e. the stems of filenames in etc/orthography. Full paths might be more transparent, but would also blow up the forms table. So I'd propose:
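As a sketch of where those keys would come from (a hypothetical helper, assuming profiles are stored as `*.tsv` files under `etc/orthography`, per the convention mentioned above):

```python
from pathlib import Path

def profile_keys(dataset_dir):
    """Map filename stems to profile paths under etc/orthography.

    Hypothetical helper: the stems would be the candidate values for the
    proposed Profile column in FormTable.
    """
    return {p.stem: p for p in Path(dataset_dir, 'etc', 'orthography').glob('*.tsv')}
```

A form segmented with `etc/orthography/german.tsv` would then simply carry `Profile=german`.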

xrotwang commented 4 years ago

@tresoldi AFAICS, this should be sufficient to re-implement the language discriminator in your profile tools, right?

tresoldi commented 4 years ago

@xrotwang yes. And I fully support the proposal. :smiley:

LinguList commented 4 years ago

What about adding a Graphemes column to the output of the profile creator?

This would render:

Thochter

with profile

Graphemes   IPA
^Th t
o   ɔ
ch  x
t   t
er$ ɐ

as

^Th o ch t er$

and

t ɔ x t ɐ

respectively, keeping an exact record of the conversions applied to a given sequence?

Also useful for debugging, in fact (!).
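The idea can be sketched with a toy longest-match tokenizer. This is NOT the segments library that pylexibank actually uses, just a self-contained illustration; note that the word-initial grapheme covers the full Th digraph so that "Thochter" tokenizes completely. One pass over the profile yields both the matched graphemes and their IPA values, so the Graphemes column comes essentially for free:

```python
# Toy profile: graphemes with ^/$ marking word-initial/word-final context,
# as in the Thochter example above.
PROFILE = {'^Th': 't', 'o': 'ɔ', 'ch': 'x', 't': 't', 'er$': 'ɐ'}

def tokenize(form, profile):
    """Greedy longest-match segmentation returning (graphemes, ipa)."""
    graphemes, i = [], 0
    while i < len(form):
        # prefer the longest profile grapheme matching at position i
        match = max((g for g in profile if form.startswith(g, i)),
                    key=len, default=None)
        if match is None:
            raise ValueError(f'unmatched character at {i}: {form[i]!r}')
        graphemes.append(match)
        i += len(match)
    return ' '.join(graphemes), ' '.join(profile[g] for g in graphemes)

graphemes, ipa = tokenize('^Thochter$', PROFILE)
# graphemes == '^Th o ch t er$', ipa == 't ɔ x t ɐ'
```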

LinguList commented 4 years ago

And it won't cost anything to produce it: whenever tokens are added from a profile, the Graphemes can be created automatically by repeating the segmentation with the Grapheme column instead of the IPA column.

xrotwang commented 4 years ago

@LinguList I don't follow. What exactly would you want to add to which table?

tresoldi commented 4 years ago

Would it be similar to the output of my tool when setting --debug?

xrotwang commented 4 years ago

@tresoldi I'd change my proposal to

realizing that a column mixing booleans and strings is weird.

tresoldi commented 4 years ago

:+1:

LinguList commented 4 years ago

> @LinguList I don't follow. What exactly would you want to add to which table?

So forms.csv should have an extra field "Graphemes" that stores the segmented graphemes, not the segmented IPA. That gives a direct conversion scheme from the original Form, right?

And this is information we currently lose, specifically when using a profile, manually changing it, and so on.
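Concretely, forms.csv would gain one column. This is an illustrative row only, not actual dataset content; the columns other than the proposed Graphemes follow the usual CLDF FormTable layout:

```csv
ID,Language_ID,Parameter_ID,Form,Segments,Graphemes
1,stdgerman,daughter,Thochter,t ɔ x t ɐ,^Th o ch t er$
```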

LinguList commented 4 years ago

But I now see that you weren't talking about forms.csv. Anyway, I think storing the Grapheme version would not hurt.

tresoldi commented 4 years ago

I agree, and it can help a lot when writing profiles. It is a good solution for cases where you expect axb to be matched, but the tokenizer actually matches something like ^ax + b. My --debug flag gives a different {}/{} notation, but only because of the way I usually grep things -- having a separate column is better and aggregates useful information.

xrotwang commented 4 years ago

@LinguList We did talk about forms.csv, and now I see what you mean. Other than potentially blowing up the size of forms.csv, I don't see much of a downside to adding Graphemes.

xrotwang commented 4 years ago

Size might actually become an issue, though. IDS forms.csv is at 46MB now. But then, we may split IDS into multiple smaller sets anyway.

LinguList commented 4 years ago

Well, you know, @xrotwang, I think the size question is also a matter of design. We can already add Graphemes to forms.csv now, but we would have to do it manually. If we provide a simple flag for adding Graphemes, lexibankers can decide whether to include it (and if Segments are hand-made, they don't need to do anything anyway); I think that would mostly be fine. And when we reach big file sizes, we should probably split anyway, since these are aggregated databases -- as is IDS, where we already decided that we cannot segment it sensibly with a single profile...