@tresoldi AFAICS, this should be sufficient to re-implement the language discriminator in your profile tools, right?
@xrotwang yes. And as for the proposal, I fully support it. :smiley:
What about adding a `Graphemes` column to the profile creator?
This would render `Thochter`, with the profile

| Graphemes | IPA |
| --- | --- |
| ^T | t |
| o | ɔ |
| ch | x |
| t | t |
| er$ | ɐ |

as `^T o ch t er$` and `t ɔ x t ɐ`, respectively, and would keep track of the exact conversions used for a given sequence.
Also useful for debugging, in fact! And it won't cost anything to produce: whenever tokens are added from a profile, the graphemes can be created automatically by repeating the segmentation with the Graphemes column instead of the IPA column.
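A minimal sketch of this double segmentation with the `segments` library, inlining the profile from above (note that `segments` expects the grapheme column to be named `Grapheme`):

```python
from segments import Profile, Tokenizer

# The profile from the example above, inlined instead of being read
# from etc/orthography.tsv.
profile = Profile(
    {'Grapheme': '^T', 'IPA': 't'},
    {'Grapheme': 'o', 'IPA': 'ɔ'},
    {'Grapheme': 'ch', 'IPA': 'x'},
    {'Grapheme': 't', 'IPA': 't'},
    {'Grapheme': 'er$', 'IPA': 'ɐ'},
)
tokenizer = Tokenizer(profile=profile)

# The default output column is the profile's Grapheme column, i.e. the
# rows that actually matched ...
print(tokenizer('Thochter'))                # expected: '^T o ch t er$'
# ... while column='IPA' repeats the same segmentation, but outputs the
# IPA mapping of each matched row.
print(tokenizer('Thochter', column='IPA'))  # expected: 't ɔ x t ɐ'
```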
@LinguList I don't follow. What exactly would you want to add to which table?
Would it be similar to the output of my tool when setting `--debug`?
@tresoldi I'd change my proposal to

- `None`: no profile
- `"default"`: `etc/orthography.tsv`
- `<key>`: `etc/<key>.tsv`

realizing that a column mixing booleans and strings is weird.
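In code, the revised scheme might look like this minimal sketch (the helper and its `profile_key` argument are hypothetical, not pylexibank's actual API):

```python
def profile_column_value(profile_key):
    """Hypothetical value for the proposed Profile column, revised scheme."""
    if profile_key is None:
        return None         # no profile used
    if profile_key == 'orthography':
        return 'default'    # segmented via etc/orthography.tsv
    return profile_key      # segmented via etc/<key>.tsv
```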
:+1:
> @LinguList I don't follow. What exactly would you want to add to which table?
So `forms.csv` should have an extra field `Graphemes` that stores the segmented graphemes, not the segmented IPA. This is a direct conversion schema for the original `Form`, right?

And this is information we currently lose, specifically when using a profile, manually changing it, and so on.

But I now see that you weren't discussing the `forms.csv`. Anyway, I think storing the grapheme version would not hurt.
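For illustration, a hypothetical `forms.csv` row (all ID values made up) would then carry both segmentations side by side:

```python
# Hypothetical forms.csv row, using the Thochter example from above;
# ID, Language_ID and Parameter_ID are illustrative placeholders.
row = {
    'ID': '1',
    'Language_ID': 'somelang',
    'Parameter_ID': 'daughter',
    'Form': 'Thochter',
    'Segments': 't ɔ x t ɐ',        # the segmented IPA, as stored today
    'Graphemes': '^T o ch t er$',   # the proposed extra column
}
```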
I agree, and it can help a lot when writing profiles. It is a good solution for cases where you expect `axb` to be matched, but the tokenizer actually matches something like `^ax + b`. My `--debug` flag gives a different `{}/{}` notation, but only because of the way I usually grep things; having a separate column is better and aggregates useful information.
@LinguList We did talk about `forms.csv`, and now I see what you mean. Other than potentially blowing up the size of `forms.csv`, I don't see much of a downside to adding `Graphemes`.

Size might actually become an issue, though: the IDS `forms.csv` is at 46MB now. But then, we may split IDS into multiple smaller sets anyway.
Well, you know, @xrotwang, I think the size question is also part of the design. We can already add `Graphemes` to `forms.csv` now, but we would have to do it manually. If we allow for a simple flag that adds `Graphemes` as well, lexibankers can decide whether to include it or not (and if the `Segments` are hand-made, they won't need to do anything anyway), so I think it should mostly be fine. And when we reach big file sizes, we should probably split anyway, since these are aggregated databases, as is IDS, where we already decided that we cannot segment it sensibly with one profile alone...
Since we support tokenization based on multiple profiles per dataset, it would be good to be able to trace back which profile was used for which form, e.g. to compute frequencies in the profile linter.

This could be done by adding a `Profile` column to `FormTable`. But it isn't fully clear what values to use for this column. One candidate would be the keys of the tokenizers `dict`, i.e. the stems of the filenames in `etc/orthography`. Full paths might be more transparent, but would also blow up the forms table. So I'd propose:

- `False`: no profile used. Segments have been provided in another way.
- `True`: segmented via `etc/orthography.tsv`
- `<stem>`: segmented via `etc/<stem>.tsv`
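A minimal sketch of that setup, assuming the per-profile files live under `etc/orthography/` (the names `tokenizers` and `profile_column_value` are illustrative, not pylexibank's actual API):

```python
from pathlib import Path
from segments import Tokenizer

# Tokenizers keyed by profile filename stem, as proposed above.
tokenizers = {
    path.stem: Tokenizer(str(path))
    for path in Path('etc/orthography').glob('*.tsv')
}

def profile_column_value(key):
    """Hypothetical value for the proposed Profile column, original scheme."""
    if key is None:
        return False   # no profile used; segments provided in another way
    if key == 'orthography':
        return True    # segmented via etc/orthography.tsv
    return key         # segmented via etc/<stem>.tsv
```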