snomos commented 2 years ago

Background

giella-shared today contains a mixture of data for many different languages:

giella-shared/
├── all_langs
│   └── src
│       ├── filters ⇒ obligatory, move to giella-core?
│       └── fst ⇒ url, punctuation, symbols
├── eng ⇒ names for languages in English majority countries
├── smi ⇒ names, cg functions and dependency graphs mainly for Sámi languages
└── urj-Cyrl ⇒ names for Uralic languages written in Cyrillic

Core idea

Ideally we would only have giella-core as a required dependency (thus needing to move the filters there), with everything else in separate repositories that can be subscribed to as needed or wanted.

By generalising sharing resources, it would also be straightforward to share content across language repositories, like including sma and sme proper nouns in smj (with some filtering and restrictions). Technically there would be no difference between getting content from lang-sme and shared-smi.

Naming

- using a prefix shared-, parallel to lang-, keyboard- etc. It does not have to be what is suggested here, other suggestions are welcome.
- followed by a BCP 47 like locale tag, but also allowing language family tags such as smi and urj

Concrete example

The present giella-shared would after a split become (with check marks for the actual split):

[x] shared-smi: the present shared Sámi resources
[x] shared-mul: the present shared symbols, URLs and punctuation lexicons (mul = multiple languages)
[x] shared-eng: present shared English resources (like names)
[x] shared-urj-Cyrl: shared resources for Uralic languages written in Cyrillic
[x] giella-core/fst-filters/: fst filters moved here, since they are a prerequisite for compiling FSTs

Another example:

using lang-sme as a source for North Sámi names, such as place names, when used in another Sámi language. Non-Sámi names in lang-sme would be filtered out, and generic last elements could be (automatically) adapted to Lule Sámi spelling and inflection as needed. This is relevant both for text analysis and parsing in general, but especially for TTS, where there is a need to get the best possible transcription and pronunciation of whatever is thrown at the system. Place names from related neighbouring languages will certainly be a pain point for many minority languages in such a context.

By treating all repos the same as a potential source for lexical and other resources, we get a more flexible and powerful infrastructure.

Restrictions

Ideally the shared resources should never be required — without access to them the result should only be a smaller analyser with worse coverage. This will make giella-core the only required external dependency.

As far as possible, the resources in each repo should be independently compilable and testable, kind of like independent code libraries.

Benefits

- more flexibility
- only use what is needed for a language, and start small and simple
- still access to all sorts of premade resources for various purposes
- easier version tagging of each shared resource
- with each repo containing a more clearly defined and limited set of data, it is easier to document, specify and reuse

Considerations

versioning

- should one always assume the latest code,
- or should it be possible to peg the inclusion to a specific version?

dependency management

We need a straightforward and simple system to declare dependency on a list of other repositories, kind of like Rust cargo lists. But as noted above, the system should be robust enough to not break if a resource is not available, only give a warning.

CI

Dependency management needs to be automatic, at least for CI systems. We need at least:

[x] simple dependency specification (covered by what is specified in configure.ac, at least for now)
[x] add routines to make sure that the dependencies are available in the following cases:
- [x] when starting work (i.e. by running ./autogen.sh in a directory, using the same cloning scheme as the depending repo: svn, git-ssh or git-https)
- [x] during CI (Taskcluster, GH Actions, Tino's build machine)

Cleanup

[x] Remove giella-shared when everything has stabilized (incl. ensuring the full history of the content of the new repos is retained, cf this task)

Comments welcome!

@flammie and I discussed this today, the notes above are based on that. We would very much like feedback on these ideas from anyone, but especially from @TinoDidriksen @bbqsrc @Eijebong @Trondtr @aarppe

TinoDidriksen commented 2 years ago

Sounds fine to me. Packaging-wise, dependencies are never optional, so the optional part is for y'all to figure out. And splitting the package is simple enough, especially if the repo is also split.

snomos commented 2 years ago

For dependency handling, there are at least the following alternatives:

- using gut: it is cross-platform, developed for the GiellaLT infra, and already in use. It has the benefit of an existing config file under version control (in .gut/delta.toml), and it would be easy to extend that with a new config file in the same dir, for specifying dependencies. gut is also made for handling git repos, so support for cloning missing deps should be very easy to add
- using the páhkat CLI: it would need some more development, but that is needed anyway, and it is built to handle and install packages and dependencies

There is of course the possibility to do some shell scripting, but that will tie us even more to Unix, while we should strive to become more platform agnostic.

TinoDidriksen commented 2 years ago

> There is of course the possibility to do some shell scripting, but that will tie us even more to Unix, while we should strive to become more platform agnostic.

A bit of a meta question, but why still bother? WSL exists and is great, and they're working on WSLg. MSYS is also fine. Even Microsoft's own packaging system vcpkg will download MSYS for several uses.

I see no benefit in catering to non-Unix these days. Windows 7, 8, and 8.1 are EOL, so all supported platforms either come with or can install a full POSIX/Unix environment with ease.

flammie commented 2 years ago

the shared repos are now there and some use cases in langs-smj and myv (for urj-Cyrl).

Eijebong commented 2 years ago

Guessing some CI stuff should be adapted, as seen here, shared repos are missing. https://divvun-tc.thetc.se/tasks/PqgV9La7Toi1onN0N9OPKw/runs/0/logs/public/logs/live.log#L6809

How do we handle that? I would rather have a generic way that wouldn't require per-language dependency lists in the CI config. Of course, the easy solution for now is to clone all of the shared repos, but it'd be nice to have a way to avoid that in the future.

flammie commented 2 years ago

yeah we were thinking of some lightweight format for dependencies in style of requirements.txt or Cargo.toml somewhere, not sure if that will be easier for CI than what can be fetched from configure.ac at the moment though, I'm open for ideas at this point

TinoDidriksen commented 2 years ago

It has to be in configure.ac at minimum, ideally with pkg-config figuring out location, with a fallback for those who have it in a parallel folder and an override for those who have it a 3rd place.

Edit: And I see that's almost what it is. Just missing a --with-shared=location override.
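
The lookup order described above (override, then pkg-config, then a parallel folder) could be sketched roughly as follows. This is only an illustration, not the actual giella-core macro: the function name `find_shared_dir`, its argument order, and the pkg-config variable name `dir` are all assumptions.

```shell
# Hypothetical helper, not part of giella-core: resolve the data directory of a
# shared resource. Lookup order: explicit override, pkg-config, parallel folder.
find_shared_dir() {
    local pkg="$1" override="$2" sibling="$3" dir

    # 1. An explicit override (what a --with-shared=location option would supply).
    if [ -n "$override" ]; then
        printf '%s\n' "$override"
        return 0
    fi

    # 2. An installed package. The variable name "dir" is an assumption;
    #    use whatever variable the .pc file really defines.
    if dir=$(pkg-config --variable=dir "$pkg" 2>/dev/null) && [ -n "$dir" ]; then
        printf '%s\n' "$dir"
        return 0
    fi

    # 3. A checkout in a parallel folder.
    if [ -d "$sibling" ]; then
        printf '%s\n' "$sibling"
        return 0
    fi

    # A missing resource only produces a warning; the caller decides what to do.
    echo "WARNING: could not find $pkg" >&2
    return 1
}
```

A caller might use it as `find_shared_dir giella-shared-smi "$override" ../shared-smi`, treating a non-zero return as "build without this resource" rather than as a fatal error.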

Eijebong commented 2 years ago

For now I've implemented the "clone everything every time" solution for CI.

TinoDidriksen commented 2 years ago

The builders get packages from my repo, so when I package the shared deps then that's a way to get them. Which I have to do anyway, so I'll get on that right away.

flammie commented 2 years ago

> Edit: And I see that's almost what it is. Just missing a --with-shared=location override.

I added the with options, it seems to work nicely with dynamic variables but isn't very extensively tested

snomos commented 2 years ago

Part of the idea was to generalize shared resources to also include an arbitrary list of lang-xxx repos. For this to work without having to clone or download every shared-xxx and lang-xxx repository we need some sort of (a simple) dependency management, as mentioned by @flammie above.

The details and implementation are not important, but it needs to meet the following requirements:

- easily runnable from a shell script like autogen.sh (so that new users can get all required deps by running the script)
- easily callable from various CI systems

I have played around with something using toml files, but it hasn't been as easy and simple as I would like. Suggestions very welcome.

TinoDidriksen commented 2 years ago

I've packaged the shared repos and tried it out with giella-smj and that works (builds, but fails tests), but the giella-sme and giella-sma packages do not install the needed source files - they only install the compiled binaries.

checking whether we can use shared smi... /usr/share/shared-smi
checking whether we can use shared sme... false
configure: WARNING: Could not find lang-sme data dir to set sme
checking whether we can use shared sma... false
configure: WARNING: Could not find lang-sma data dir to set sma

New apt-get packages:

giella-shared-eng
giella-shared-mul
giella-shared-smi
giella-shared-urj-cyrl

flammie commented 2 years ago

https://github.com/giellalt/giella-core/blob/master/scripts/giellalt-get.bash here's like a rough sketchy example of how one could automatically detect and fetch dependencies...

yeah the installation of sharables in langs is still missing I'll do that next, basically should work as copypasta from shared makefile.ams. I wonder if pkg-config is also missing since the configure fails so early, the missing installation should only bounce at build time.

flammie commented 2 years ago

indeed the pkg-config name of single langs is giella- instead of lang-

TinoDidriksen commented 2 years ago

Aha. Well it can't be lang-, because pkg-config is meant for installed use where the package should be named something meaningful in a void.

I am also not thrilled with the shared packages installing to /usr/share/shared-smi - missing a giella/ imo, but at least the -smi prevents conflicts.

flammie commented 2 years ago

mm fair point. It should be trivial to have a macro to check separate pkg config and directory names but I'll hold it off for a while if we can have a consensus on all the naming questions first since the template operations on all repos are quite heavy to run

snomos commented 2 years ago

> I am also not thrilled with the shared packages installing to /usr/share/shared-smi - missing a giella/ imo ...

I agree.

On the other hand, should these packages be installed at all? Most/all of them are more like code fragments to be included early in the build phase, not precompiled binaries or libraries to be "linked" to at runtime. Just my humble five cents 🙂

snomos commented 2 years ago

> Part of the idea was to generalize shared resources to also include an arbitrary list of lang-xxx repos. For this to work without having to clone or download every shared-xxx and lang-xxx repository we need some sort of (a simple) dependency management, as mentioned by @flammie above.
>
> The details and implementation are not important, but it needs to meet the following requirements:
>
> - easily runnable from a shell script like autogen.sh (so that new users can get all required deps by running the script)
> - easily callable from various CI systems
>
> I have played around with something using toml files, but it hasn't been as easy and simple as I would like. Suggestions very welcome.

What about making the deps list very simple, like a TSV file of the following format:

# Comment
# name  org/repo    revision/tag
shared-smi  giellalt/shared-smi 0.3

where the tag/revision field is optional, and if left out, it means HEAD? The intention is to make it as easy as possible to parse the list in whatever context, and make sure the listed deps are available. Everything stored in a file named e.g. Dependencies.tsv in the root dir of the project. WDYT?
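
To illustrate how little parsing such a list would need, here is a minimal sketch in shell. It assumes the format proposed above (whitespace- or tab-separated fields, `#` comment lines, an optional third field defaulting to HEAD); the function name `list_deps` is hypothetical.

```shell
# Hypothetical helper: read a Dependencies.tsv-style list and print one
# "name<TAB>org/repo<TAB>revision" record per dependency.
# Assumptions: fields are separated by tabs or spaces, '#' starts a comment
# line, and a missing third field means HEAD.
list_deps() {
    local file="$1" name repo rev
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while read -r name repo rev || [ -n "$name" ]; do
        # Skip blank lines and comment lines.
        case "$name" in
            ''|'#'*) continue ;;
        esac
        printf '%s\t%s\t%s\n' "$name" "$repo" "${rev:-HEAD}"
    done < "$file"
}
```

The normalized output can then be consumed by autogen.sh or a CI job, for example to decide which repositories to clone into a parallel folder.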

TinoDidriksen commented 2 years ago

For distro packaging you need to either install separately or bundle into a single tarball. The option of fetching the extras at build time or into a parallel folder does not exist. The fact that giella-core's m4 files must be bundled is a bit of a pain.

I use both options. For nightly packages, the installed data dependencies are used because here we always want the latest of everything. For releases, data dependencies are bundled into the tarball because version drift will ruin things.

Which reminds me, it is important for those dependencies to note in configure.ac which tagged version they should require for releases. In Apertium we do this by a 3rd optional arg AP_CHECK_LING([1], [apertium-dan], [0.6.1]). This is an exact version - it won't accept older or newer.

And for listing deps, I'd say configure.ac is sufficient. It's easy enough to grep gt_USE_SHARED configure.ac and parse those lines with a trivial regex.
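
As a rough sketch of that approach: the snippet below extracts the first bracketed argument of each gt_USE_SHARED call. The exact argument list of the macro is not shown in this thread, so the assumed shape `gt_USE_SHARED([name], ...)`, one call per line, is a guess, and the function name `list_shared_deps` is hypothetical.

```shell
# Hypothetical helper: list the shared resources declared in configure.ac.
# Assumption: each dependency is declared on its own line as
#   gt_USE_SHARED([name], ...)
list_shared_deps() {
    local configure_ac="${1:-configure.ac}"
    # Print the first bracketed argument of every gt_USE_SHARED call that
    # starts a line; lines commented out with "dnl" or "#" do not match.
    sed -n 's/^[[:space:]]*gt_USE_SHARED(\[\([^]]*\)].*/\1/p' "$configure_ac"
}
```

Running `list_shared_deps` in the root of a language repo would then print one name per line, which a CI script could map to repositories or packages.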

flammie commented 2 years ago

> > I am also not thrilled with the shared packages installing to /usr/share/shared-smi - missing a giella/ imo ...
>
> I agree.
>
> On the other hand, should these packages be installed at all? Most/all of them are more like code fragments to be included early in the build phase, not precompiled binaries or libraries to be "linked" to at runtime. Just my humble five cents 🙂

It's a bit like header-only libraries in C/C++ in a way, but yeah, like Tino says, it's good for packaging and distro use and comes more or less for free in an autotools setting.

I've gone through the naming convention questions a bit, so the questions to agree upon are:

pkg-config names:
- languages: giella-qtz
- shared: shared-smi (or: giella-shared-smi?)
repo names:
- languages: lang-qtz (should it be giella-qtz?)
- shared: shared-smi (giella-shared-smi?)
installation directory:
- languages: $prefix/share/giella/qtz or $prefix/share/giella-qtz?
- shared: $prefix/share/giella/shared-smi or $prefix/share/shared-smi or ...
file names:
- some files are like: stems/abbreviations.lexc
- some are like: stems/sme-propernouns.lexc
- some converted target files are: generated_files/sme-smj-propernouns.lexc
- also multiple target rules of form generated_files/%.lexc don't work

TinoDidriksen commented 2 years ago

I say

pkg-config:
- languages: giella-qtz
- shared: giella-shared-smi
repo names:
- Fine as is, no need for giella prefix. Repo does not need to match install or package name.
installation folder:
- languages: $prefix/share/giella/qtz
- shared: $prefix/share/giella/shared-smi
file names:
- Doesn't matter for packaging, as long as root folders are good. You can maintain the same file and folder structure in source and install, which should simplify things.

snomos commented 2 years ago

> I say
>
> pkg-config:
> - languages: giella-qtz
> - shared: giella-shared-smi
> repo names:
> - Fine as is, no need for giella prefix. Repo does not need to match install or package name.
> installation folder:
> - languages: $prefix/share/giella/qtz
> - shared: $prefix/share/giella/shared-smi
> file names:
> - Doesn't matter for packaging, as long as root folders are good. You can maintain the same file and folder structure in source and install, which should simplify things.

I agree with all of this.

flammie commented 2 years ago

it should be good for testing now, the ci that reports on zulip seems to succeed but there are probably a number of corner cases that can still fail.

TinoDidriksen commented 2 years ago

Seems to work.

snomos commented 2 years ago

The only thing I would like to improve with the new shared repos is the date of the commits. They seem to now be from the date that @flammie did the split, and not from the actual date of the commit. Could that be fixed? Also, the history does not go all the way back to the start, but that could be a left-over thing from the svn-to-git conversion.

flammie commented 2 years ago

I used this: https://stackoverflow.com/questions/1365541/how-to-move-some-files-from-one-git-repo-to-another-not-a-clone-preserving-hi/11426261#11426261 to do the history, the commit dates look right to me on command line git but github seems to have different timing, maybe the --committer-date-is-author-date option was wrong? This method is also nice because you can basically sed the log for anomalies that break the thing like massive moves.

snomos commented 2 years ago

This is what it looks like in Tower, where the details reveal what is going wrong:

That is, I am the author (and @flammie the committer). It seems that Tower (and GitHub) uses the committer date (May 2022), whereas the CLI log uses the author date (2017). Ideally I would like committer = author (unless there is a real (PR) merge, which I don't think we've had so far for this repo or the parent repo), both regarding the person and the date.

Finally, we should also make sure that the history is complete. That is tracked in a separate project.

TinoDidriksen commented 2 years ago

It's trivial to fix since the author info is there and correct, but it will mean a force push to the repos.

Howto: https://riptutorial.com/git/example/21122/setting-git-committer-equal-to-commit-author

snomos commented 2 years ago

Done using this command for shared-smi (worked very well):

git filter-branch -f --commit-filter \
   'export GIT_COMMITTER_NAME=\"$GIT_AUTHOR_NAME\";
    export GIT_COMMITTER_EMAIL=\"$GIT_AUTHOR_EMAIL\";
    export GIT_COMMITTER_DATE=\"$GIT_AUTHOR_DATE\";
    git commit-tree $@' \
    -- --all

and force-pushed. Will also do the other repos, and finally clean up some wrong emails. I.e. more force-pushing coming up.
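
For reference, the same rewrite can also be sketched with `--env-filter` instead of `--commit-filter`, which avoids the nested quoting and the `git commit-tree` call. This is an alternative formulation, not the command that was actually run; the wrapper function name is hypothetical. Like the command above, it rewrites history and requires a force push afterwards.

```shell
# Sketch: make committer name, email and date equal to the author's on every
# commit of every ref. Run inside the repository to be rewritten.
sync_committer_to_author() {
    # --env-filter runs before each commit is recreated; the variables must be
    # re-exported for the new values to take effect.
    git filter-branch -f --env-filter '
        export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME"
        export GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
        export GIT_COMMITTER_DATE="$GIT_AUTHOR_DATE"
    ' -- --all
}
```

Afterwards, `git log --format='%an %ad | %cn %cd'` should show identical author and committer fields for each commit.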

snomos commented 2 years ago

Just for reference, emails are cleaned using this command:

git filter-branch --env-filter 'if [ "$GIT_AUTHOR_EMAIL" = "incorrect@email" ]; then
     GIT_AUTHOR_EMAIL=correct@email;
     GIT_AUTHOR_NAME="Correct Name";
     GIT_COMMITTER_EMAIL=$GIT_AUTHOR_EMAIL;
     GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME"; fi' -- --all

taken from https://stackoverflow.com/questions/4981126/how-to-amend-several-commits-in-git-to-change-author.

snomos commented 2 years ago

After the above, I filtered shared-mul using git filter-branch to get rid of src/filters/ (they are in giella-core now), and reordered one commit to get it into the correct order. Now all the dates are mangled again, but still with correct author date recorded. Only problem is, when I try to run the command above now, it ends with this error message:

Rewrite b5fcc6dba1a3915844f8641575bab14490229e5a (62/70) (3 seconds passed, remaining 0 predicted)    
Ref 'refs/heads/main' was deleted
fatal: Not a valid object name HEAD
zsh: command not found: export GIT_COMMITTER_NAME=\"$GIT_AUTHOR_NAME\";\n    export GIT_COMMITTER_EMAIL=\"$GIT_AUTHOR_EMAIL\";\n    export GIT_COMMITTER_DATE=\"$GIT_AUTHOR_DATE\";\n    git commit-tree $@

and the repo is busted. Of course one can live with wrong dates, it is just very irritating. Anyone any idea on how to fix the repo so that the dates are correct again?

TinoDidriksen commented 2 years ago

You somehow got 11 spaces after git filter-branch -f --commit-filter \ so it was git filter-branch -f --commit-filter \........... (spaces shown as dots) and this ruined the paste. I've edited your comment to remove the spaces.

snomos commented 2 years ago

Thanks, that fixed it, and now shared-mul is done.

snomos commented 1 year ago

This is now all done. Dependency management can be refined, but it works across all CI/build systems, which is good enough.

giellalt / giella-core

Split giella-shared in several repos #20
