lexibank / pylexibank

The Python curation library for lexibank
Apache License 2.0

Wrong column name in default orthography profile #87

Closed xrotwang closed 5 years ago

xrotwang commented 5 years ago

The command to seed an orthography profile uses an invalid column name: graphemes should be Grapheme.

The profile should also be written to etc/ right away - saving the step of copying the file to the correct location.

LinguList commented 5 years ago

What is this based on? The ortho-profile? I think it would be better to use lingpy's profile facility. As far as I can see, that is not what this does?

xrotwang commented 5 years ago

It only uses the grapheme cluster pattern from segments. And no, it doesn't use lingpy. Possibly different orthography profile creation algorithms could be available from the lexibank orthography command?
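For illustration, a minimal sketch of what such seeding could look like, using the \X grapheme cluster pattern of the regex library (which, as far as I know, is what segments builds on); the IPA and Frequency columns and the target path are just placeholders here:

    import collections
    import regex  # third-party "regex" module; provides the \X grapheme cluster pattern

    def seed_profile(forms, path='etc/orthography.tsv'):
        # Count Unicode grapheme clusters over all raw forms.
        counts = collections.Counter(
            cluster for form in forms for cluster in regex.findall(r'\X', form))
        with open(path, 'w', encoding='utf8') as fp:
            # "Grapheme" (not "graphemes") is the column name expected downstream.
            fp.write('Grapheme\tIPA\tFrequency\n')
            for grapheme, frequency in counts.most_common():
                fp.write('{0}\t{0}\t{1}\n'.format(grapheme, frequency))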

LinguList commented 5 years ago

Yes, I would strongly support this, specifically given that lingpy's approach is superior at lumping (and if something is not lumped, you cannot split it later, so graphemes like "th", which are not detected by segments, will be masked).

The recommended way by now, once a first CLDF dataset has been created, is (in my opinion):

lingpy --cldf --context --clts --column=form -i cldf/cldf-metadata.json -o etc/orthography.tsv

Can we somehow pass this command through lexibank in a straightforward way?

xrotwang commented 5 years ago

Nah, we shouldn't go through the shell here - and we may not even have CLDF data yet (hence the iter_raw_lexemes method). I'll see how to integrate lingpy's algorithm into the lexibank command.

LinguList commented 5 years ago

But be aware of one crucial point here: since a good orthography profile is only possible when the forms are already in a ready state (i.e., after splitting cells, etc.), usually the only way to get the orthography profile is to first create a CLDF dataset with forms but without orthography! This is a particularly difficult aspect of the workflow, also difficult to understand for newbies, but I do not really see a workaround (unless one curates things completely independently of lexibank).

xrotwang commented 5 years ago

If you look at it as a circular dependency, then yes, it's difficult. But if we make it clear(er) that the workflow is supposed to be run in multiple iterations, it should be reasonably easy to understand.

xrotwang commented 5 years ago

But yes, where exactly to fiddle with things in between iterations is still somewhat complicated.

xrotwang commented 5 years ago

So, maybe the lexibank orthography command should be able to work either from iter_raw_lexemes or from (early iterations of) cldf/forms.csv?

xrotwang commented 5 years ago

Hm, after a bit more thinking, I guess we should just throw away the lexibank orthography command, and instead document the lingpy method appropriately.

LinguList commented 5 years ago

Yes, I think that may also just be the easiest way: iter_raw_lexemes is also handled by lingpy, and we can enhance lingpy's function a bit more. @tresoldi was using some tools for data exploration as well, and I have this JS website: http://digling.org/profile/

So what I mean is: general documentation on the issue, with a tool chain and best practices for actually getting a good profile, may be just what is needed. Maybe an idea for a blog post?

SimonGreenhill commented 5 years ago

I'm wondering if we can do a sensible default orthography profile somehow as part of the makecldf process (rather than as a separate step)? And then the tools can be used to refine and fix the initial profile?

LinguList commented 5 years ago

You could run the CLDF process and use the form (as this is what we need) and the same method used by lingpy (which is trivial). But there is a danger of overwriting an orthography profile that already exists. One could, maybe behind a flag, create an orthography profile whenever the system creates forms, or when the user requests it?

xrotwang commented 5 years ago

Well, if a profile is there, you could just generate an alternative one with a .lingpy filename suffix or similar.

xrotwang commented 5 years ago

Ok, I actually like @SimonGreenhill 's idea of wiring default profile creation into the makecldf command. So here's my proposal:

  1. Running makecldf triggers creating an orthography profile from the Form column in cldf/forms.csv using the lingpy method.
  2. If an orthography profile is already present, the new one will be written to etc/orthography.tsv.lingpy, otherwise to etc/orthography.tsv.

Since dataset curators will check the output of makecldf anyway, using git status etc., the implicit creation of the profile is not going to cause confusion, I think.
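As a minimal sketch of the fallback logic in point 2 above (path handling only; the actual profile creation and integration into makecldf would look different):

    import pathlib

    def profile_target(etc_dir='etc'):
        # Write to etc/orthography.tsv unless a profile is already there;
        # in that case fall back to etc/orthography.tsv.lingpy.
        default = pathlib.Path(etc_dir) / 'orthography.tsv'
        if default.exists():
            return default.parent / (default.name + '.lingpy')
        return default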

xrotwang commented 5 years ago

Note to self: Incorporate streamlined version of lingpy.cli.profile as replacement for the (then deprecated) lexibank orthography command in Dataset._install.

tresoldi commented 5 years ago

I strongly support separating the "form preparation" from the "tokenization+ipa conversion", which I believe is what you are suggesting.

I have probably been discussing the profiles too much, and this circularity is one of my quibbles. At least for me, it makes profile debugging difficult, especially for profiles automatically generated by lingpy, where I don't have a full mental model in mind at the beginning and it is not immediately clear which longer substring is available and will be consumed. Things like the ^ and $ being added in code, or the form received by the tokenizer already being a cleaned one (brackets, spaces, and so on), are the most common problems I run into. This is why I wrote some tools for exploring that, as @LinguList pointed out, such as the one in the screenshot below.

From a more theoretical point of view, my issue is that we are not always using orthography profiles to extract information from orthography. In many datasets, and I am also guilty here, the profiles end up being used also to perform what is really data cleaning and normalization. This is not how lexibank is intended to work, but it is what happens in the end: if we know that in our orthography Å means a long /o/, but it is found both as a precomposed character and as a decomposable sequence, our profiles tend to add a mapping for each variant. Similar things happened with the naganorgyalrongic dataset, where tones are annotated in at least four different ways.

Having some per-dataset form preparation function (which would include stripping brackets etc.) that would also take care of normalizing as much as possible (with per-doculect settings, if necessary, and also relying on lexemes.csv when necessary) would be very good and would make the profiles themselves shorter and simpler.

[screenshot of the profile exploration tool]

xrotwang commented 5 years ago

@tresoldi but we do have "per dataset form preparation" - by overriding clean_form, and/or providing wholesale replacements for problematic forms in etc/lexemes.csv. But I guess this machinery may be somewhat difficult to evolve together in an iterative (not circular, I'd hope) process. And yes, the profile is probably the most difficult piece to develop in an iterative way - e.g. once you push a lexeme to etc/lexemes.csv, it is not immediately clear which lines of the profile are no longer needed.
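For readers less familiar with these hooks, a rough sketch of such an override; the method name clean_form is the one mentioned above, but the import path, the id attribute and the exact signature are assumptions that should be checked against the installed pylexibank version:

    from pylexibank.dataset import Dataset as BaseDataset

    class Dataset(BaseDataset):
        id = 'mydataset'  # hypothetical dataset id

        def clean_form(self, item, form):
            # Per-dataset normalization before tokenization, e.g. dropping a
            # bracketed comment and surrounding whitespace. Wholesale replacements
            # for individual problematic values go into etc/lexemes.csv instead.
            return form.split('(')[0].strip()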

What helped me most in similar situations in software development was a good set of (unit) tests. So maybe we should try to figure out how to add this for lexibank? We already have integration tests, but maybe we should have unit tests, too: e.g. a way to specify input and expected output of any of the customizable things in a dataset, like clean_form, split_forms, etc.?
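Such a unit test could be as simple as the following sketch (pytest style; the dataset fixture and the input/expected pairs are made up for illustration and assume the clean_form behaviour sketched above):

    import pytest

    # Hypothetical input/expected pairs for the dataset's clean_form hook.
    @pytest.mark.parametrize('raw, expected', [
        ('kamba (pl.)', 'kamba'),
        ('  foo  ', 'foo'),
    ])
    def test_clean_form(dataset, raw, expected):
        # `dataset` would be a fixture providing the Dataset instance under test.
        assert dataset.clean_form({}, raw) == expected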

tresoldi commented 5 years ago

@xrotwang overriding clean_form was not clear to me at first; it was one of the problems with my setups. But the circularity problem would persist if everything is under makecldf.

My suggestion would be to have a new command (something like develcldf) which would use the same overridden methods (clean_form, split_forms, lexemes, etc.) and output a single CSV file with a column containing the forms exactly as they would be fed to the tokenizer (including additional stuff like the ^ and $ boundary markers, and so on). Any method for writing a profile (be it manual, using lingpy, etc.) would then rely on this data for its training and debugging.
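A sketch of what producing that intermediate file could look like (the file name and column headers are placeholders; the boundary markers are added explicitly just to mirror what the tokenizer sees):

    import csv

    def write_tokenizer_input(forms, path='forms-for-profile.csv'):
        # Dump each form exactly as it would be fed to the tokenizer, with the
        # ^ and $ boundary markers made explicit, for profile debugging.
        with open(path, 'w', encoding='utf8', newline='') as fp:
            writer = csv.writer(fp)
            writer.writerow(['Form', 'Tokenizer_Input'])
            for form in forms:
                writer.writerow([form, '^' + form + '$'])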

SimonGreenhill commented 5 years ago

What I've been doing in my TNG dataset is to have a cleaned and a non-cleaned form. Maybe we could just include the raw form in forms.csv (e.g. "Value"), which may or may not be different from the processed form in "Form". Then we maintain transparency (comparing/diffing Value vs. Form is easy) and can track down profile glitches more easily.
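Comparing the two columns is then trivial, e.g. with a few lines like these (assuming the standard CLDF FormTable columns ID, Value and Form):

    import csv

    # Print all rows of cldf/forms.csv where the processed Form differs
    # from the raw Value, to make cleaning/profile glitches easy to spot.
    with open('cldf/forms.csv', encoding='utf8') as fp:
        for row in csv.DictReader(fp):
            if row['Value'] != row['Form']:
                print(row['ID'], repr(row['Value']), '->', repr(row['Form']))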

SimonGreenhill commented 5 years ago

(and I'm all for more tests, I'm not smart enough to check these things manually all the time, so let's offload it).

One thing that might be useful is counting how many times a particular profile rule is applied (is this easy? I don't know). Anything that is never applied would be logged as a warning, which would help profiles stay tidy and free of cruft.
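As an illustration of what such counting could look like, here is a rough approximation via greedy longest-match segmentation; this is not how segments actually applies a profile, just a sketch of the idea:

    import collections

    def profile_usage(graphemes, forms):
        # Greedy longest-match segmentation, roughly what applying an orthography
        # profile does; counts how often each profile grapheme is actually used.
        counts = collections.Counter({g: 0 for g in graphemes})
        by_length = sorted(graphemes, key=len, reverse=True)
        for form in forms:
            i = 0
            while i < len(form):
                match = next((g for g in by_length if form.startswith(g, i)), None)
                if match:
                    counts[match] += 1
                    i += len(match)
                else:
                    i += 1  # character not covered by any profile rule
        return counts

    # Graphemes that are never applied could then be logged as warnings:
    # unused = [g for g, n in profile_usage(graphemes, forms).items() if n == 0]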

tresoldi commented 5 years ago

@SimonGreenhill actually, I realize this is pretty much what we have right now: put the cleaned form (clean_form + split_forms + lexemes...) in Form and have any method (like lingpy) trained on that (i.e., in the cldf/forms.csv file).

As for how many times any given rule is used, this is something I also needed some time ago: the debugging tool I wrote can provide this information. @LinguList told me to clean it up and make it available at least internally; I should get back to it...

LinguList commented 5 years ago

We have reached a situation where each of us is doing their own thing, also because we are talented enough in this kind of messy coding, but we should think about how we can combine efforts to make this easier. The problem is that the two-step from value to form is definitely ALWAYS needed, and the step from form to segments as well. So we'll have an iterative procedure in any case, but we'd profit from general explorers for datasets, such as the one I wrote in JS, which simply lists bi- and trigrams in the data. LingPy is one way to create an initial profile, and it is more consistent than segments itself. The online JS implementation of segments at http://calc.digling.org/profile is yet another example of an application that can be used or further developed, but again: the "form" as the first step of normalization is needed.

So just to summarize quickly: our combined methods / best practices need to account for the two-stage process of going from value to form, and from form to segments, and ideally they should offer some explorative interface, e.g. one that also allows testing what happens, like: "this is a splitter, search for other potential splitters", "this is a bracket, see how it is stripped off, search for non-closing brackets", etc.
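The n-gram listing part is simple enough to replicate anywhere, e.g. with a sketch like this:

    import collections

    def ngrams(forms, n=2):
        # Count all character n-grams across the forms, a quick way to spot
        # candidate multi-character graphemes such as "th".
        counter = collections.Counter()
        for form in forms:
            counter.update(form[i:i + n] for i in range(len(form) - n + 1))
        return counter.most_common()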

SimonGreenhill commented 5 years ago

There's also a proposal here to improve profiles: https://github.com/lexibank/lexibank/issues/105

xrotwang commented 5 years ago

Superseded by cldfbench lexibank.init_profile.