Closed: xrotwang closed this issue 5 years ago.
What is this based on? The ortho-profile? I think it would be better to use lingpy's profile facility. As far as I can see, that isn't what this does?
It only uses the grapheme cluster pattern from `segments`. And no, it doesn't use lingpy. Possibly different orthography-profile creation algorithms could be made available from the `lexibank orthography` command?
Yes, I would strongly support this, specifically given that the lingpy method is superior at lumping (and if something's not lumped, you can't split it later, so graphemes like "th", which aren't detected by `segments`, will be masked).
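To illustrate the lumping point, here is a minimal, hypothetical sketch of profile-based tokenization (this is not the actual `segments` or lingpy implementation): a profile maps graphemes to segments, and the tokenizer greedily consumes the longest matching grapheme, so a digraph like "th" is only kept together if the profile contains it.

```python
def tokenize(form, profile):
    """Greedy longest-match tokenization against a grapheme profile (sketch)."""
    longest = max(len(g) for g in profile)
    tokens, i = [], 0
    while i < len(form):
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                tokens.append(profile[chunk])
                i += size
                break
        else:
            tokens.append('<%s>' % form[i])  # mark unmatched graphemes
            i += 1
    return tokens

# Without "th" in the profile, the cluster is split and masked;
# with it, the digraph is lumped into a single segment.
simple = {'t': 't', 'h': 'h', 'i': 'i', 'n': 'n'}
lumping = dict(simple, th='tʰ')
print(tokenize('thin', simple))   # ['t', 'h', 'i', 'n']
print(tokenize('thin', lumping))  # ['tʰ', 'i', 'n']
```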
The recommended way by now, after a first CLDF dataset has been created, is (in my opinion):

```
lingpy --cldf --context --clts --column=form -i cldf/cldf-metadata.json -o etc/orthography.tsv
```
Can we somehow pass this command through lexibank in a straightforward way?
Nah, we shouldn't go through the shell here, and we may not even have CLDF yet (hence the `iter_raw_lexemes` method). I'll see how to integrate lingpy's algorithm in the lexibank command.
But be aware of one crucial point here: since a good orthography profile is only possible when the forms are actually ready (i.e., after splitting of cells etc. has been applied), the only way to get the orthography profile may usually be to first create a CLDF dataset with forms but without orthography! This is a particularly difficult aspect of the workflow, also difficult to understand for newbies, but I do not really see a workaround (unless one curates the data completely independently of lexibank).
If you look at it as a circular dependency, then yes, it's difficult. But if we make it clear(er) that the workflow is supposed to be run in multiple iterations, it should be reasonably easy to understand.
But yes, where exactly to fiddle with things in between iterations is still somewhat complicated.
So, maybe the `lexibank orthography` command should be able to work either from `iter_raw_lexemes` or from (early iterations of) `cldf/forms.csv`?
Hm, after a bit more thinking, I guess we should just throw away the `lexibank orthography` command, and instead document the lingpy method appropriately.
Yes, I think that may also just be the easiest way: `iter_raw_lexemes` is also handled by lingpy, and we can also enhance lingpy's function a bit more. @tresoldi was using some tools for data exploration as well, and I have this JS website: http://digling.org/profile/
So what I mean: general documentation on the issue, with some tool-chain and best practices for actually getting a good profile, etc., may just be what is needed. Maybe an idea for a blog post?
I'm wondering if we can create a sensible default orthography profile somehow as part of the `makecldf` process (rather than as a separate step)? And then the tools can be used to refine and fix the initial profile?
You could run the CLDF process and use the form (as this is what we need) and the same method used by lingpy (which is trivial). But there's a danger of overwriting an orthography profile that already exists. One could, maybe with a flag, create an orthography profile whenever the system creates forms, or when the user specifies this?
Well, if a profile is there, you could just generate an alternative one with a `.lingpy` filename suffix or similar.
Ok, I actually like @SimonGreenhill's idea of wiring default profile creation into the `makecldf` command. So here's my proposal:
- `makecldf` triggers creating an orthography profile from the `Form` column in `cldf/forms.csv`, using the lingpy method.
- If a profile already exists, the new one is written to `etc/orthography.tsv.lingpy`, otherwise to `etc/orthography.tsv`.
- Since dataset curators will check the output of `makecldf` anyway, using `git status` etc., the implicit creation of the profile is not going to cause confusion, I think.
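A minimal sketch of what such default profile creation might look like (hypothetical, not the lingpy or pylexibank implementation): collect candidate graphemes from the `Form` column, here simply base characters together with any attached combining marks, and emit a seed profile with `Grapheme` and `Frequency` columns.

```python
# Hypothetical sketch of default profile seeding from the Form column.
import unicodedata
from collections import Counter

def iter_graphemes(form):
    """Yield base characters together with their combining marks."""
    buf = ''
    for ch in unicodedata.normalize('NFD', form):
        if unicodedata.combining(ch) and buf:
            buf += ch  # attach combining mark to the preceding base
        else:
            if buf:
                yield buf
            buf = ch
    if buf:
        yield buf

def seed_profile(forms):
    """Return a TSV seed profile with grapheme frequencies."""
    counts = Counter(g for form in forms for g in iter_graphemes(form))
    rows = ['Grapheme\tFrequency']
    for grapheme, freq in counts.most_common():
        rows.append('%s\t%d' % (grapheme, freq))
    return '\n'.join(rows)

print(seed_profile(['tam', 'tat']))
```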
Note to self: Incorporate a streamlined version of `lingpy.cli.profile` as replacement for the (then deprecated) `lexibank orthography` command in `Dataset._install`.
I strongly support separating the "form preparation" from the "tokenization+ipa conversion", which I believe is what you are suggesting.
I have probably been discussing the profiles too much, and this circularity is one of my quibbles. At least for me, it makes profile debugging difficult, especially for profiles automatically generated by lingpy, where I don't have a full mental model in mind at the beginning, and it is not immediately clear which longer substring is available and will be consumed. Things like the `^` and `$` being added in code, or the form received by the tokenizer already being a clean one (brackets, spaces, and so on), are the most common problems I run into. This is why I wrote some tools for exploring this, as @LinguList pointed out, such as in the screenshot below.
From a more theoretical point of view, my issue is that we are not always using orthography profiles to extract information from orthography. In many datasets, and I am also guilty here, the profiles end up being used also to perform what is really data cleaning and normalization. This is not how lexibank is intended to work, but it is what happens in the end: if we know that in our orthography Å means a long /o/, but it is found both as a precomposed and as a decomposable character, our profiles tend to add a mapping for each variant. Similar things happened with the naganorgyalrongic dataset, where tones are annotated in at least four different ways.
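The "Å" problem above can be shown concretely with the standard library: the same visible grapheme can arrive either as a single precomposed code point or as a base letter plus a combining mark, and a profile would treat the two as distinct graphemes. Normalizing forms first (e.g. to NFC) removes the need for duplicate profile rows.

```python
# Precomposed vs. decomposed spellings of the same visible grapheme.
import unicodedata

precomposed = '\u00C5'   # Å as a single code point (LATIN CAPITAL LETTER A WITH RING ABOVE)
decomposed = 'A\u030A'   # A followed by COMBINING RING ABOVE

print(precomposed == decomposed)                                # False
print(unicodedata.normalize('NFC', decomposed) == precomposed)  # True
print(unicodedata.normalize('NFD', precomposed) == decomposed)  # True
```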
Having some per-dataset form preparation function (which would include stripping brackets etc.) that would also take care of normalizing stuff as much as possible (with per-doculect settings, if necessary, and also relying on `lexemes.csv` when necessary) would be very good, and would make the profiles themselves shorter and simpler.
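Such a form preparation step might look like the following sketch (the function name and rules are illustrative, not the pylexibank API): drop bracketed comments, normalize Unicode, and collapse whitespace before tokenization, so the profile no longer has to encode cleaning rules.

```python
# Hypothetical per-dataset form preparation (illustrative, not pylexibank).
import re
import unicodedata

def prepare_form(value):
    """Strip (...) and [...] comments, normalize to NFC, collapse whitespace."""
    form = re.sub(r'\([^)]*\)|\[[^\]]*\]', '', value)
    form = unicodedata.normalize('NFC', form)
    return ' '.join(form.split())

print(prepare_form('anda (archaic)'))     # 'anda'
print(prepare_form('  ba   ta [cf. 12]')) # 'ba ta'
```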
@tresoldi but we do have "per-dataset form preparation": by overriding `clean_form`, and/or providing wholesale replacements for problematic forms in `etc/lexemes.csv`. But I guess this machinery may be somewhat difficult to evolve together in an iterative (not circular, I'd hope) process. And yes, the profile is probably the most difficult piece to develop in an iterative way; e.g. once you push a lexeme to `etc/lexemes.csv`, it is not immediately clear which lines of the profile are no longer needed.
What helped me most in similar situations in software development was a good set of (unit) tests. So maybe we should try to figure out how to add this for lexibank? We already have integration tests, but maybe we should have unit tests, too: e.g. a way to specify input and expected output of any of the customizable things in a dataset, like `clean_form`, `split_forms`, etc.?
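A sketch of what such dataset-level unit tests could look like, as input/expected-output pairs per customizable step (`clean_form` and `split_forms` here are simplified stand-ins, not the real pylexibank methods):

```python
# Hypothetical dataset-level unit tests: input/expected pairs per step.

def clean_form(value):
    # stand-in cleaning rule: drop hyphens, trim whitespace
    return value.replace('-', '').strip()

def split_forms(value):
    # stand-in splitting rule: split on "~"
    return [v.strip() for v in value.split('~')]

CLEAN_FORM_CASES = [('ba-ta ', 'bata'), (' na', 'na')]
SPLIT_FORMS_CASES = [('ba ~ bata', ['ba', 'bata'])]

def run_checks():
    for value, expected in CLEAN_FORM_CASES:
        assert clean_form(value) == expected, (value, expected)
    for value, expected in SPLIT_FORMS_CASES:
        assert split_forms(value) == expected, (value, expected)
    return True

print(run_checks())  # True
```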
@xrotwang overriding `clean_form` was not clear to me at first, it was one of the problems of my setups, but the circularity problem would persist if everything is under `makecldf`.
My suggestion would be to have a new command (something like `develcldf`), which would use the same overridden methods (`clean_form()`, `split_forms`, `lexemes`, etc.) and output a single CSV file with a column containing the forms exactly as they would be fed to the tokenizer (including additional stuff like the `^` and `$` boundary markers and so on). Any method for writing a profile (be it manual, using lingpy, etc.) would rely on this data for training and debugging.
What I've been doing in my TNG dataset is to have a cleaned and non-cleaned form. Maybe we could just include the raw form in forms.csv (e.g. "Value") which may or may not be different to the processed form in "Form". Then we maintain transparency (comparing/diff'ing Value vs Form is easy) as well as being able to track down profile glitches more easily.
(and I'm all for more tests, I'm not smart enough to check these things manually all the time, so lets offload it).
One thing that might be useful is that we count how many times a particular profile rule is applied (is this easy? I don't know). Anything that's never applied is logged as a warning, which will help profiles stay tidy and free of cruft.
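Counting rule applications is easy if the tokenizer records which profile entry fired at each match. A hypothetical sketch (not an existing `segments` or lingpy feature) on top of greedy longest-match tokenization:

```python
# Hypothetical sketch: count how often each profile rule fires, so that
# never-applied rules can be reported as cruft.
from collections import Counter

def tokenize_counting(form, profile, usage):
    """Greedy longest-match tokenization that records rule usage."""
    longest = max(len(g) for g in profile)
    tokens, i = [], 0
    while i < len(form):
        for size in range(min(longest, len(form) - i), 0, -1):
            chunk = form[i:i + size]
            if chunk in profile:
                tokens.append(profile[chunk])
                usage[chunk] += 1
                i += size
                break
        else:
            i += 1  # skip unmatched characters
    return tokens

def unused_rules(forms, profile):
    """Return profile entries that never fire on the given forms."""
    usage = Counter()
    for form in forms:
        tokenize_counting(form, profile, usage)
    return sorted(g for g in profile if usage[g] == 0)

profile = {'t': 't', 'a': 'a', 'th': 'tʰ', 'x': 'x'}
print(unused_rules(['tat', 'ta'], profile))  # ['th', 'x'] never fire
```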
@SimonGreenhill actually, I realize this is pretty much what we have right now: put the cleaned form (clean_form + split_forms + lexemes...) in `Form` and have any method (like lingpy) trained on that (i.e., on the `cldf/forms.csv` file).
As for how many times any given rule is used, this is something I also needed some time ago: that debugging tool I wrote can provide this information. @LinguList told me to clean it up and make it available at least internally; I should get back to it...
We have reached a situation where each of us is doing their own thing, also because we are talented enough in this kind of messy coding, but we should think about how we can combine efforts to make this easier. The problem is that the two-step from value to form is definitely ALWAYS needed, and the step from form to segments as well. So we'll have an iterative procedure in any case, but we'd profit from general explorers for datasets, such as the one I wrote in JS, which simply lists bi- and tri-grams in the data. LingPy is one way to create an initial profile, and it is more consistent than segments itself. The online JS implementation of segments at http://calc.digling.org/profile is yet another example of an application that can be used or further developed, but again: the "form" as the first step of normalization is needed.
So to just summarize quickly: our combined methods / best practices need to account for the two-stage process of going from value to form, and from form to segments, and ideally they should offer some explorative interface, e.g. also allow testing what happens, like: "this is a splitter, search for other potential splitters", "this is a bracket, see how it is stripped off, search for non-closing brackets", etc.
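The n-gram listing mentioned above is straightforward to sketch: count the most frequent bi- and trigrams in the forms as candidate multi-character graphemes for the profile (a toy version of the JS explorer, not its actual code).

```python
# Hypothetical n-gram explorer: list frequent bi-/trigrams as candidate
# multi-character graphemes.
from collections import Counter

def ngram_candidates(forms, sizes=(2, 3), top=10):
    """Return the `top` most frequent character n-grams across forms."""
    counts = Counter()
    for form in forms:
        for n in sizes:
            for i in range(len(form) - n + 1):
                counts[form[i:i + n]] += 1
    return counts.most_common(top)

print(ngram_candidates(['thin', 'than', 'path']))
```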
There's also a proposal here to improve profiles: https://github.com/lexibank/lexibank/issues/105
Superseded by `cldfbench lexibank.init_profile`.
The command to seed an orthography profile uses an invalid column name: `graphemes` here should be `Grapheme`. The profile should also be written to `etc/` right away, saving the step of copying the file to the correct location.