Please split data set and cli tooling into separate projects/repositories

alerque commented 3 years ago

Migrated from #14 where it is slightly off topic:

I really hope the web interface and ability to use it via CLI and package are beneficial here.

Sure they are! But that isn't the question. The question is about the data. In fact this highlights something I'm convinced is a major mistake in this project: the data set should be one project and the CLI tooling should be another. Having these coupled in one repository will be a serious limitation down the road. The data may well prove useful in other contexts where the tooling would be clutter, and the tooling should be decoupled from the data versioning so people can potentially use the tooling with different data (older versions, a fork for contested data, substituting CLDR data, etc.). Perhaps some day you want to rework the tooling a bit so the CLI works differently. With everything in one repository, that would break a lot of old projects that would otherwise just refresh their data, because the tooling change would now block them from updating.
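To make the decoupling concrete, here is a minimal sketch of tooling that does not assume where its data lives. Everything in it is hypothetical: the `LANGDATA_DIR` variable, the `data` default and the `resolve_data_dir` helper are illustration only and not part of this project.

```python
import os
from pathlib import Path

# Hypothetical default: the data set the tooling would ship with or download.
DEFAULT_DATA_DIR = Path("data")


def resolve_data_dir(cli_value=None, env=None):
    """Pick the data directory: explicit argument first, then environment, then default."""
    env = os.environ if env is None else env
    if cli_value:
        # An explicit choice (e.g. an older release or a fork of the data) wins.
        return Path(cli_value)
    if env.get("LANGDATA_DIR"):  # hypothetical variable name
        return Path(env["LANGDATA_DIR"])
    return DEFAULT_DATA_DIR
```

With something like this in place, a consumer could pin an older data release or point at a fork without touching the tooling version.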

MrBrezina commented 3 years ago

That’s a fair point. We merged it purely for practical reasons (it was practical for us). Let’s wait until we resolve some early issues with the data structure we still have.

alerque commented 3 years ago

Sure, for early prototyping it's handy to mess with the data structures and tooling at the same time without separate procedures for collating them. Just don't wait too long to get them split up. Every person that starts using this for anything will have to refactor as soon as you do, so the balance between "easier for us developers doing early prototyping" and "easier for consumers" will tip sooner than developers tend to notice. In particular, don't save it for "the big 1.0.0": you want to hash out the way the projects correlate and the release process before you call it good, not at the same time.

kontur commented 3 years ago

One thing to note here is that reading the plain yaml dataset with the Python library actually augments it (orthography inheritance, macrolanguages, glyph decomposition for checking, etc.). So I think there are three components:

Raw data
Python wrapper around the data
CLI tools
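As a rough illustration of the difference between the raw data and the wrapper, the sketch below resolves orthography inheritance on plain dictionaries. The record layout (`orthographies`, `inherit`, `base`) and the language codes are assumptions made up for this example, not the actual schema or the library's implementation.

```python
import copy

# Hypothetical raw records, standing in for entries loaded from the YAML files.
RAW = {
    "aaa": {"name": "Parent", "orthographies": [{"base": "A B C"}]},
    "bbb": {"name": "Child", "orthographies": [{"inherit": "aaa"}]},
}


def resolve_orthographies(raw, code, _seen=None):
    """Return the orthographies of `code` with 'inherit' references expanded."""
    seen = _seen or set()
    if code in seen:
        raise ValueError(f"circular inheritance at {code!r}")
    seen = seen | {code}
    resolved = []
    for ortho in raw[code].get("orthographies", []):
        if "inherit" in ortho:
            # Replace the reference with the parent's (resolved) orthographies.
            resolved.extend(resolve_orthographies(raw, ortho["inherit"], seen))
        else:
            resolved.append(copy.deepcopy(ortho))
    return resolved


def augment(raw):
    """Build an augmented copy of the data; the raw input is left untouched."""
    augmented = {}
    for code, record in raw.items():
        entry = copy.deepcopy(record)
        entry["orthographies"] = resolve_orthographies(raw, code)
        augmented[code] = entry
    return augmented
```

Here `augment(RAW)["bbb"]["orthographies"]` contains the parent's orthography, while `RAW` itself still holds only the `inherit` reference. That gap is what someone reading the raw files directly would miss.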

In terms of maintaining data integrity in the database yaml we are using a bunch of scripts for validating and saving as well. These could be separated as "tests" for the data, but I think it might also be valuable to emphasize the pythonic way of accessing the data, as opposed to using the yaml directly.

E.g. the database use:

from hyperglot.languages import Languages
langs = Languages()
print(langs['eng'])
print(langs.get_support_from_chars(["A", "B", ...]))

It's not something we documented well so far, but imo the yaml is "only" data input. One thing I had considered also is generating and using a pickled cache object from the yaml for accessing the language data programmatically.
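The pickled cache idea could look roughly like the sketch below. It is only an outline under stated assumptions: `load_with_cache` is a hypothetical helper, and `json.loads` stands in as the default parser so the example needs nothing beyond the standard library (a YAML parser could be passed as `parse` instead).

```python
import json
import pickle
from pathlib import Path


def load_with_cache(source: Path, cache: Path, parse=json.loads):
    """Return parsed data, reusing `cache` when it is at least as new as `source`."""
    if cache.exists() and cache.stat().st_mtime >= source.stat().st_mtime:
        # Only unpickle cache files you created yourself; pickle is not safe
        # for untrusted input.
        with cache.open("rb") as fh:
            return pickle.load(fh)
    # Cache is missing or stale: parse the source and rebuild the cache.
    data = parse(source.read_text(encoding="utf-8"))
    with cache.open("wb") as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return data
```

A modification-time check is the simplest invalidation rule. A content hash would be more robust if the data files are copied around or checked out fresh from version control.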

All that said, I think the point made is a good one. Now that we have publicized the project, we are seeing the same conflict of concerns, with issues and PRs being split between CLI and data.

rosettatype / hyperglot

Please split data set and cli tooling into separate projects/repositories #29