polm / cutlet

Japanese to romaji converter in Python
https://polm.github.io/cutlet/
MIT License

v1.0 API Proposal #54

Open polm opened 3 months ago

polm commented 3 months ago

Cutlet has been out for a few years now, and while I consider it basically functionally complete, the API is a little awkward because it has evolved over time. Since the library is stable, I'd also like to release a v1.0 to signal that the API can be relied on going forward. This issue lays out my proposal and is also a place to solicit feedback.

This is not a full API proposal - most of the evolution will be iterative and minor, like cleaning up which functions are public vs private. The main thing I want to do is make the treatment of the different output options clearer. To that end I propose that the Cutlet object has the following main public methods of interest:

A CutletDoc is inspired by a spaCy Doc object and contains:

The CutletDoc object has a few advantages. One is that if you need two of the above output formats, it allows you to avoid duplicate computation (MeCab calls) without having to manage state yourself. The other is that it can codify linking MeCab tokens to romaji tokens. The linking is very simple, but it's a commonly requested feature (#34, #37, #40, etc.), and (partly due to lack of examples on my part) users often find it confusing, so it would be good to provide a canonical process.
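To make the two advantages concrete, here is a minimal sketch of the idea, not the proposed API. Every name in it (`SketchDoc`, `SketchToken`, `toy_analyze`, `TOY_READINGS`) is hypothetical, and a toy lookup table stands in for the MeCab call. It only illustrates analyzing once, deriving several outputs from the cached result, and keeping each source token paired with its romaji.

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass
class SketchToken:
    """Hypothetical pairing of one source token with its romaji."""
    surface: str  # original text of the token
    romaji: str   # romanized form of the same token


# Toy table standing in for MeCab analysis plus romanization (assumption).
TOY_READINGS = {"猫": "neko", "が": "ga", "好き": "suki"}


def toy_analyze(words):
    """Stand-in for the expensive analysis step; unknown words pass through."""
    return [SketchToken(w, TOY_READINGS.get(w, w)) for w in words]


class SketchDoc:
    """Hypothetical doc object: analyzes once, serves several output formats."""

    def __init__(self, words):
        self.words = list(words)
        self.analysis_passes = 0  # counts how often the analysis actually ran

    @cached_property
    def tokens(self):
        # Runs on first access only; later accesses reuse the stored list.
        self.analysis_passes += 1
        return toy_analyze(self.words)

    @property
    def romaji(self):
        return " ".join(t.romaji for t in self.tokens)

    @property
    def slug(self):
        return "-".join(t.romaji for t in self.tokens)


doc = SketchDoc(["猫", "が", "好き"])
print(doc.romaji)
print(doc.slug)
# Each token keeps the link between source text and romaji.
pairs = [(t.surface, t.romaji) for t in doc.tokens]
```

Requesting both `romaji` and `slug` triggers only one analysis pass, and the caller never has to hold on to intermediate state.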

Separately, I will try making RomajiTokens proxy classes for MeCab tokens. I think this will work without issue, but it's possible that MeCab Nodes being Cython objects will be a problem.
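A rough sketch of what such a proxy could look like, using composition and `__getattr__` rather than subclassing. `FakeNode` and `RomajiTokenSketch` are hypothetical names, and `FakeNode` is a plain Python stand-in, so this does not show whether a real Cython-backed MeCab node behaves the same way.

```python
class FakeNode:
    """Plain-Python stand-in for a MeCab node (assumption, not the real type)."""

    def __init__(self, surface, pos):
        self.surface = surface
        self.pos = pos


class RomajiTokenSketch:
    """Hypothetical proxy: adds `romaji`, forwards everything else to the node."""

    def __init__(self, node, romaji):
        self._node = node
        self.romaji = romaji

    def __getattr__(self, name):
        # __getattr__ is only called when normal lookup fails, so `romaji`
        # and `_node` are served directly and never reach this method.
        if name == "_node":
            # Guard against infinite recursion if `_node` was never set.
            raise AttributeError(name)
        return getattr(self._node, name)


token = RomajiTokenSketch(FakeNode("猫", "名詞"), "neko")
print(token.romaji, token.surface, token.pos)
```

Because the proxy wraps the node instead of inheriting from it, it does not need the node type to be subclassable from Python.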

While the API will change, the internal code will not change very much as part of this process. This will take a few months at the fastest, and a new version with DeprecationWarnings will be released. If you have a stable application and are happy with the current API, please be sure to use version guards.
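The usual version guard is a requirement specifier such as `cutlet<1.0` in your dependency file. For applications that also want a runtime check, a minimal sketch follows. The helper names are hypothetical, and the parsing is deliberately naive: it assumes a plain `MAJOR.MINOR...` version string.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '0.4.0'."""
    return int(version_string.split(".")[0])


def is_pre_v1(version_string):
    """True if the version string is below 1.0."""
    return major_version(version_string) < 1


def installed_cutlet_is_pre_v1():
    """Check the installed cutlet version; None if cutlet is not installed."""
    try:
        return is_pre_v1(version("cutlet"))
    except PackageNotFoundError:
        return None


if installed_cutlet_is_pre_v1() is False:
    print("cutlet >= 1.0 is installed; the pre-1.0 API may have changed")
```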