kellpossible / cargo-i18n

A Rust Cargo sub-command and libraries to extract and build localization resources to embed in your application/library
MIT License

Localization change tracking for fluent #89

Open · kellpossible opened this issue 2 years ago

kellpossible commented 2 years ago

It would be nice to have some kind of localization change tracking tool/library for fluent that can indicate when messages are out of date.

Related to #31
Related to #83

kellpossible commented 11 months ago

Okay I have a proposal about how this change tracking could work.

Requirements

Design

I would propose that we have an associated JSON file per translated file, or perhaps a single JSON file per language with a subsection per file. This file contains an entry per message in the translated files, with the following data:

  • The message key.
  • The git commit hash for a version of the associated primary language file.
  • A SHA256 hash of the primary file.
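
For concreteness, one hypothetical shape for such a tracking file (the field names and layout here are illustrative only, nothing is settled):

```json
{
  "hello-user": {
    "git-hash": "4344ad1c0d2bce21a51a2a39e96e475e25ffd065",
    "file-hash": "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08"
  }
}
```

Each key is a message key from the translated file; the values record which version of the primary language file the translation was last synchronized against.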

Terminology:

To detect whether a message is currently in sync, the following steps are taken:


```mermaid
flowchart TD
    IS["In Sync"]
    OSC["Out of Sync (Changed)"]
    OSR["Out of Sync (Renamed)"]
    S[Start] --> E{Does the hash of primary\n match the file hash?}
    E --> |Yes| IS
    E --> |No| F{Does the message in primary match\n primary checked out at git hash?}
    F --> |Yes| IS
    F --> |No| B{"Does the git hash match LCP?"}
    B -->|Yes| IS
    B -->|No| C{Does the message in the\n primary file checked out\n at git hash match the\n same message at LCP?}
    C --> |Yes| IS
    C --> |No| D{In the commit after git hash\n where the message was deleted from\n primary file, are there any new messages\n matching the message in git hash?}
    D --> |Yes| OSR
    D --> |No| OSC
```

To mark an item as in-sync, all that needs to happen is to set the file hash to the hash of the primary file, and the git hash to the LCP.
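
To make the flowchart concrete, here is a minimal Rust sketch of that decision chain. It assumes "LCP" means the latest commit that touched the primary file, and hides all the git lookups behind a hypothetical `History` trait; none of these names exist in cargo-i18n yet:

```rust
/// Hypothetical abstraction over the lookups the flowchart needs.
trait History {
    /// Content hash of the primary file as it is on disk now.
    fn primary_file_hash(&self) -> String;
    /// Latest commit that touched the primary file (the "LCP").
    fn latest_commit_of_primary(&self) -> String;
    /// The message text in the current primary file, if present.
    fn message_now(&self, key: &str) -> Option<String>;
    /// The message text in the primary file checked out at `commit`.
    fn message_at(&self, commit: &str, key: &str) -> Option<String>;
    /// Messages added in the commit (after `commit`) that deleted `key`.
    fn messages_added_when_deleted(&self, commit: &str, key: &str) -> Vec<String>;
}

enum SyncStatus {
    InSync,
    OutOfSyncChanged,
    OutOfSyncRenamed,
}

fn sync_status(h: &dyn History, key: &str, git_hash: &str, file_hash: &str) -> SyncStatus {
    // 1. The whole primary file is unchanged.
    if h.primary_file_hash() == file_hash {
        return SyncStatus::InSync;
    }
    let recorded = h.message_at(git_hash, key);
    // 2. The message itself is unchanged, even though the file changed.
    if recorded.is_some() && recorded == h.message_now(key) {
        return SyncStatus::InSync;
    }
    // 3. The recorded commit is already the latest one touching primary.
    if git_hash == h.latest_commit_of_primary() {
        return SyncStatus::InSync;
    }
    // 4. The message is identical at the recorded commit and at the LCP.
    if recorded.is_some() && recorded == h.message_at(&h.latest_commit_of_primary(), key) {
        return SyncStatus::InSync;
    }
    // 5. The message was deleted; a matching new message means a rename.
    match recorded {
        Some(old) if h.messages_added_when_deleted(git_hash, key).contains(&old) => {
            SyncStatus::OutOfSyncRenamed
        }
        _ => SyncStatus::OutOfSyncChanged,
    }
}
```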

kellpossible commented 11 months ago

We will need a way for translators to interact with this system. The long term goal is to create a GUI for #83 which could be used to do this, however sending the source code and the git repository to translators may not be appropriate. There should also be a way to pre-build this information into a tree structure and export it to a JSON file that can be "edited", with a single change overlaid, to be re-integrated.

Without a GUI tool to perform this, perhaps a viable alternative could be to generate an Excel spreadsheet which can be sent to the translator, containing only the required changes and associated comments, with a way to re-integrate it using the cargo i18n CLI?

kellpossible commented 11 months ago

Perhaps the design could be simplified by not using git hashes, but rather by searching back through the history to find the file that matches the file hash.

alerque commented 11 months ago

Not duplicating data like checksums that could be deduced through other tooling would be a huge advantage. I do suggest looking into that.

Otherwise almost the entire thing I read makes a lot of sense and sounds great. I can't wait for tooling (both CLI and GUI) to support this.

kellpossible commented 11 months ago

@alerque

> Not duplicating data like checksums that could be deduced through other tooling would be a huge advantage. I do suggest looking into that.

I'm not sure I understand, would you be able to elaborate on that? Do you mean, if available, try to re-use the checksums for the file history if they are already available in your version control system? Perhaps we could make the source of checksums pluggable somehow?

alerque commented 11 months ago

> I'm not sure I understand, would you be able to elaborate on that?

Sure, happy to try to explain and maybe even help work on this.

> Do you mean, if available, try to re-use the checksums for the file history if they are already available in your version control system?

Not quite, I was actually thinking about preempting that so that it is always available ahead of time. Back to this in a second.

> Perhaps we could make the source of checksums pluggable somehow?

Yes, that would actually be ideal. Like most developers these days I'm pretty heavily invested in Git so that was the use case I had in mind, but making the entire checksum system pluggable would make it possible to implement within any VCS (or none). The default provider could be the Git one talked about here while leaving room for some other way of drumming up checksums that serve a specific purpose.
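
As a rough illustration of that seam, the provider could be a trait along these lines (all names here are hypothetical):

```rust
use std::path::Path;

/// Hypothetical pluggable source of checksums.
trait ChecksumProvider {
    /// Scheme identifier stored alongside each hash, e.g. "git-object".
    fn scheme(&self) -> &'static str;
    /// Hash the current contents of `path`.
    fn hash_file(&self, path: &Path) -> std::io::Result<String>;
    /// Fetch the historical file contents a stored hash refers to,
    /// if this provider can (a VCS can; a bare hasher cannot).
    fn retrieve(&self, hash: &str) -> Option<Vec<u8>>;
}
```

A git-backed provider would implement `retrieve` via the object database, while a plain SHA-256 provider would just hash bytes and return `None` from `retrieve`.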

So back to Git and file checksums.

Your original outline included storing two checksums: one commit SHA and one file-content SHA. I understand both why the last commit is useful and why the commit hash is not available before committing (which is when you'd need to store an updated value, absent a two-commit system):

> • Used to cover the use case for editing both translation files and the status of them in the same commit before commit hash becomes available.

But why not have your cake and eat it too? Instead of storing either of those values, I suggest storing an object hash generated by Git. You can generate such a hash for any arbitrary file (tracked or untracked) using git hash-object <filename>. This is effectively a checksum of the current state of the file mixed in with some special Git sauce. The important thing is that you can get this hash prior to committing (or even tracking) a file, that it is stable for a given set of file contents, and that this same hash is used to track the object after it is committed. You can later retrieve the object itself by the object hash (e.g. for diffing purposes) and also look up which commits contain the object, and hence derive where it was first / last committed. There are plumbing commands for this of course, but a porcelain one for demonstration:

```console
$ date > myfile
$ git hash-object myfile
119142f1bf27dcb9e059495206c64c404db90af4
$ git add myfile
$ git commit -m "Track and commit"
[master 4344ad1] Track and commit
 1 file changed, 1 insertion(+)
 create mode 100644 myfile
$ git log --raw --all --find-object=119142f1bf27dcb9e059495206c64c404db90af4
commit 4344ad1 (HEAD -> master)
Author: Caleb Maclennan <caleb@alerque.com>
Date:   Wed Nov 22 09:47:14 2023 +0300

    Track and commit

:000000 100644 0000000 119142f A        myfile
```

Hence my suggestion was to only store one checksum. In the case of Git that one value can be the object hash and used to identify (and retrieve) the exact file contents before or after committing and also to look up the commit history for it.
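
For what it's worth, the same object hash can also be computed from Rust without shelling out, for example with the git2 crate (a sketch; cargo-i18n does not currently depend on git2):

```rust
use git2::{ObjectType, Oid};

fn main() -> Result<(), git2::Error> {
    // Git hashes "blob <len>\0<contents>"; git2 exposes this directly,
    // and no repository is required, matching `git hash-object`.
    let contents = std::fs::read("myfile").expect("read myfile");
    let oid = Oid::hash_object(ObjectType::Blob, &contents)?;
    println!("{oid}"); // same value `git hash-object myfile` prints
    Ok(())
}
```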

kellpossible commented 11 months ago

> Sure, happy to try to explain and maybe even help work on this.

I would definitely love to collaborate with more people on this project if you have some time! We can create a diagram soon so we can agree on exactly how it should work, and it could be used in the documentation for the system later.

> But why not have your cake and eat it too? Instead of storing either of those values, I suggest storing an object hash generated by Git. You can generate such a hash for any arbitrary file (tracked or untracked) using git hash-object <filename>. This is effectively a checksum of the current state of the file mixed in with some special Git sauce. The important thing is that you can get this hash prior to committing (or even tracking) a file, that it is stable for a given set of file contents, and that this same hash is used to track the object after it is committed. You can later retrieve the object itself by the object hash (e.g. for diffing purposes) and also look up which commits contain the object, and hence derive where it was first / last committed. There are plumbing commands for this of course, but a porcelain one for demonstration:

Ah yes, I came across git hash-object today after reading your first comment!

In my other comment ("Perhaps the design could be simplified by not using git hashes, but rather by searching back through the history to find the file that matches the file hash") I was thinking perhaps to do away with any reliance on git hashing entirely and hash the file ourselves on demand, relying on git (or any other VCS) only to produce the previous versions of the file, which we can then hash with whatever hashing algorithm we want. This would be more computationally expensive, but probably not noticeably so in practice.

But with your suggestion we could instead re-use git's hashing mechanism. I don't know enough about Git, but I presume the hashes of previous versions of these files are also stored somewhere in .git, so we wouldn't need to pay such a high price to reproduce them, compared with hashing them ourselves?
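
If so, a lookup like the following would need no re-hashing at all, since each commit's tree already records the blob object id for every file (a sketch using the git2 crate; this helper is hypothetical, not existing cargo-i18n code):

```rust
use git2::{Oid, Repository};
use std::path::Path;

/// Find the most recent commit in which `path` had the blob id `oid`.
/// Blob ids are already stored in each commit's tree, so nothing
/// needs to be re-hashed while walking the history.
fn find_commit_for_blob(repo: &Repository, path: &Path, oid: Oid) -> Option<Oid> {
    let mut walk = repo.revwalk().ok()?;
    walk.push_head().ok()?;
    for commit_id in walk.flatten() {
        let tree = repo.find_commit(commit_id).ok()?.tree().ok()?;
        if let Ok(entry) = tree.get_path(path) {
            if entry.id() == oid {
                return Some(commit_id);
            }
        }
    }
    None
}
```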

> Yes, that would actually be ideal. Like most developers these days I'm pretty heavily invested in Git so that was the use case I had in mind, but making the entire checksum system pluggable would make it possible to implement within any VCS (or none). The default provider could be the Git one talked about here while leaving room for some other way of drumming up checksums that serve a specific purpose.

That sounds very good to me!

> Hence my suggestion was to only store one checksum. In the case of Git that one value can be the object hash and used to identify (and retrieve) the exact file contents before or after committing and also to look up the commit history for it.

Brilliant 🙂


alerque commented 11 months ago

As far as making this pluggable goes, we probably want to store what the hash scheme is along with the hash. That way tooling will know what VCS/hash system to use:

{ "primary-version": { "scheme": "git-object", "hash": "119142f1bf27dcb9e059495206c64c404db90af4" }}

Or, as passwd databases, htpasswd tables, and others do, define a scheme for the value that includes the hash type along with the hash:

{ "primary-version": "git-object#119142f1bf27dcb9e059495206c64c404db90af4" }

I'd go for the former myself, but then I'd also avoid JSON like the plague if I had my druthers. Either way it's the same information with different parsing trade-offs. It's still better than storing two hashes, one of them potentially one step out of date and the other hard to retrieve for old versions.
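
If the tracking file were read with serde, the former maps naturally onto an internally tagged enum, which also keeps the scheme set extensible (a sketch, assuming serde and serde_json as dependencies):

```rust
use serde::{Deserialize, Serialize};

/// The "scheme" tag selects the hash provider; a new scheme is a new variant.
#[derive(Serialize, Deserialize, Debug)]
#[serde(tag = "scheme", rename_all = "kebab-case")]
enum VersionRef {
    GitObject { hash: String },
    Sha256 { hash: String },
}

fn main() -> serde_json::Result<()> {
    let v: VersionRef = serde_json::from_str(
        r#"{ "scheme": "git-object", "hash": "119142f1bf27dcb9e059495206c64c404db90af4" }"#,
    )?;
    println!("{v:?}");
    Ok(())
}
```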

alerque commented 11 months ago

> I would propose that we have an associated JSON file per translated file, or perhaps a single JSON file per language with a subsection per file. This file contains an entry per message in the translated files, with the following data:
>
>   • The message key.
>   • The git commit hash for a version of the associated primary language file.
>   • A SHA256 hash of the primary file.

What about forgoing the external JSON data and attaching this meta information directly to translations with some predefined format in a code comment in the translation file itself? This would be a trade-off in pain points of course, but it would mean you didn't have to store the message key separately at all, didn't have to track it separately, and it could be more human readable than a separate file containing just meta information, etc.

I suppose the major trade-off is that it would make it harder to mix and match with other tooling that is not aware of the meaning in the comments and might blow away the comments altogether. I don't know how common that is. For Fluent, the language spec defines a type of comment that stays attached to messages, so tooling should support such a use case already. For gettext things are a bit more ad-hoc (although personally I don't care, Fluent being the only current sane localization system in the space right now, with MF2 being a potential future peer).
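
For illustration, with Fluent's attached comments the metadata could ride along like this (the i18n-track: marker is entirely hypothetical):

```fluent
# i18n-track: { "scheme": "git-object", "hash": "119142f1bf27dcb9e059495206c64c404db90af4" }
hello-user = Hello, { $user }!
```

Tooling that understands Fluent comments would keep the annotation attached to its message as the file is edited.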

kellpossible commented 11 months ago

> I'd go for the former myself, but then I'd also avoid JSON like the plague if I had my druthers. Either way it's the same information with different parsing trade-offs. It's still better than storing two hashes, one of them potentially one step out of date and the other hard to retrieve for old versions.

I agree, the first example you listed looks good; I prefer the denormalized format too. I also agree JSON definitely has its problems, but at least the universality of its tooling means people are less likely to worry about being locked into our system, and they can easily manipulate it in automated workflows we haven't yet conceived of.

> What about forgoing the external JSON data and attaching this meta information directly to translations with some predefined format in a code comment in the translation file itself? This would be a trade-off in pain points of course, but it would mean you didn't have to store the message key separately at all, didn't have to track it separately, and it could be more human readable than a separate file containing just meta information, etc.

I did consider this, and it's a great idea, but I am assuming many parsing libraries for formats like these don't really support the editing workflow very well (I could be wrong, and it would be worth validating this assumption), especially not without reformatting the entire file, which may mess with version control and with this change tracking system itself. I suppose this is something we will have to grapple with later if we want a GUI for editing translations, but I wouldn't want to limit change tracking to formats that have a suitable parsing library. Also, if we go with an external format, then hopefully our implementation across different formats will be more consistent.

For the embedded use case, where I plan to support formats like JSON/YAML key/value (which could get transformed into some slim binary format), we perhaps don't have the capability to attach this metadata to the messages and would have to rely on an external file anyway.

What do you think? I'm appreciating having someone to bounce these ideas with!

kellpossible commented 10 months ago

To add to this proposal, I would like to try to make the change tracking generic over the source of messages, and also to have a pluggable storage backend. I don't currently have a personal use case motivating this work, because https://github.com/kellpossible/avalanche-report has been granted an open source license for Crowdin to translate its Fluent messages. However, we will soon need a way to translate user-generated long-form content, so ideally the tools to track translations of this content can be made into a library usable in a web server; if I can combine the two somehow, it may be easier for me to stay motivated to finish it.