Closed snomos closed 1 year ago
Sounds fine to me. Packaging-wise, dependencies are never optional, so the optional part is for y'all to figure out. And splitting the package is simple enough, especially if the repo is also split.
For dependency handling, there are at least the following alternatives:
- gut - it is cross-platform, developed for the GiellaLT infra, and already in use. It has the benefit of an existing config file under version control (in .gut/delta.toml), and it would be easy to extend that with a new config file in the same dir, for specifying dependencies. gut is also made for handling git repos, so cloning missing deps should be very easy to add support for.
There is of course the possibility to do some shell scripting, but that will tie us even more to Unix, while we should strive to become more platform agnostic.
A bit of a meta question, but why still bother? WSL exists and is great, and they're working on WSLg. MSYS is also fine. Even Microsoft's own packaging system vcpkg will download MSYS for several uses.
I see no benefit in catering to non-Unix these days. Windows 7, 8, and 8.1 are EOL, so all supported platforms either come with or can install a full Posix/Unix environment with ease.
the shared repos are now there and some use cases in langs-smj and myv (for urj-Cyrl).
Guessing some CI stuff should be adapted, as seen here, shared repos are missing. https://divvun-tc.thetc.se/tasks/PqgV9La7Toi1onN0N9OPKw/runs/0/logs/public/logs/live.log#L6809
How do we handle that? I would rather have a generic way that wouldn't require per-language lists of dependencies in the CI config. Of course, the easy solution for now is to clone all of the shared repos, but it'd be nice to have a way to avoid that in the future.
yeah we were thinking of some lightweight format for dependencies in style of requirements.txt or Cargo.toml somewhere, not sure if that will be easier for CI than what can be fetched from configure.ac at the moment though, I'm open for ideas at this point
It has to be in configure.ac at minimum, ideally with pkg-config figuring out location, with a fallback for those who have it in a parallel folder and an override for those who have it a 3rd place.
Edit: And I see that's almost what it is. Just missing a --with-shared=location override.
For now I've implemented the "clone everything every time" solution for CI.
The builders get packages from my repo, so when I package the shared deps then that's a way to get them. Which I have to do anyway, so I'll get on that right away.
> Edit: And I see that's almost what it is. Just missing a --with-shared=location override.
I added the with options, it seems to work nicely with dynamic variables but isn't very extensively tested
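The lookup order discussed above (explicit override, then pkg-config, then a parallel folder) could be sketched roughly like this. This is only an illustration and not the actual configure.ac logic: the function name `find_shared_dir`, the pkg-config package name `giella-<name>` and the pkg-config variable `dirname` are all assumptions.

```shell
#!/bin/sh
# Hypothetical sketch of the lookup order for a shared data dir:
#   1. explicit override (what --with-shared=DIR would pass in)
#   2. pkg-config, ASSUMING the .pc file exposes a variable called "dirname"
#   3. a checkout in a parallel folder (../shared-xxx)
find_shared_dir() {
    name=$1       # e.g. shared-smi
    override=$2   # may be empty
    if [ -n "$override" ]; then
        # An explicit override is never silently replaced by a fallback.
        [ -d "$override" ] && { printf '%s\n' "$override"; return 0; }
        return 1
    fi
    if command -v pkg-config >/dev/null 2>&1; then
        dir=$(pkg-config --variable=dirname "giella-$name" 2>/dev/null)
        if [ -n "$dir" ] && [ -d "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    fi
    if [ -d "../$name" ]; then
        printf '%s\n' "../$name"
        return 0
    fi
    return 1
}
```

Failing hard when an explicit override points at a missing directory is a design choice in this sketch; a real macro might prefer to warn and continue, as suggested elsewhere in this thread.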
Part of the idea was to generalize shared resources to also include an arbitrary list of lang-xxx repos. For this to work without having to clone or download every shared-xxx and lang-xxx repository we need some sort of (a simple) dependency management, as mentioned by @flammie above.
The details and implementation are not important, but it needs to meet the following details:
- easily run-able from a shell script like autogen.sh (so that new users can get all required deps by running the script)
- easily call-able from various CI systems
I have played around with something using toml files, but it hasn't been as easy and simple as I would like. Suggestions very welcome.
I've packaged the shared repos and tried it out with giella-smj and that works (builds, but fails tests), but the giella-sme and giella-sma packages do not install the needed source files - they only install the compiled binaries.
checking whether we can use shared smi... /usr/share/shared-smi
checking whether we can use shared sme... false
configure: WARNING: Could not find lang-sme data dir to set sme
checking whether we can use shared sma... false
configure: WARNING: Could not find lang-sma data dir to set sma
New apt-get packages:
https://github.com/giellalt/giella-core/blob/master/scripts/giellalt-get.bash here's like a rough sketchy example of how one could automatically detect and fetch dependencies...
yeah the installation of sharables in langs is still missing I'll do that next, basically should work as copypasta from shared makefile.ams. I wonder if pkg-config is also missing since the configure fails so early, the missing installation should only bounce at build time.
indeed the pkg-config name of single langs is giella- instead of lang-
Aha. Well it can't be lang-, because pkg-config is meant for installed use where the package should be named something meaningful in a void.
I am also not thrilled with the shared packages installing to /usr/share/shared-smi - missing a giella/ imo, but at least the -smi prevents conflicts.
mm fair point. It should be trivial to have a macro to check separate pkg config and directory names but I'll hold it off for a while if we can have a consensus on all the naming questions first since the template operations on all repos are quite heavy to run
> I am also not thrilled with the shared packages installing to /usr/share/shared-smi - missing a giella/ imo ...
I agree.
On the other hand, should these packages be installed at all? Most/all of them are more like code fragments to be included early in the build phase, not precompiled binaries or libraries to be "linked" to at runtime. Just my humble five cents 🙂
> Part of the idea was to generalize shared resources to also include an arbitrary list of lang-xxx repos. For this to work without having to clone or download every shared-xxx and lang-xxx repository we need some sort of (a simple) dependency management, as mentioned by @flammie above.
> The details and implementation are not important, but it needs to meet the following details:
> - easily run-able from a shell script like autogen.sh (so that new users can get all required deps by running the script)
> - easily call-able from various CI systems
> I have played around with something using toml files, but it hasn't been as easy and simple as I would like to. Suggestions very welcome.
What about making the deps list very simple, like a TSV file of the following format:
# Comment
# name org/repo revision/tag
shared-smi giellalt/shared-smi 0.3
where the tag/revision field is optional, and if left out, it means HEAD? The intention is to make it as easy as possible to parse the list in whatever context, and make sure the listed deps are available. Everything stored in a file named e.g. Dependencies.tsv in the root dir of the project. WDYT?
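To illustrate how little code such a list would need, here is a minimal sketch of a reader that could be called from autogen.sh or a CI job. It assumes the three-column layout proposed above, with fields separated by tabs or spaces; the function name `list_deps` is hypothetical and nothing like it exists in giella-core as far as this thread shows.

```shell
#!/bin/sh
# Hypothetical helper: print "name repo revision" for every dependency in a
# Dependencies.tsv-style file (ASSUMED layout: name, org/repo, optional
# revision/tag). A missing revision field is reported as HEAD.
list_deps() {
    # $1: path to the dependency list
    while read -r name repo rev _ || [ -n "$name" ]; do
        case "$name" in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        printf '%s %s %s\n' "$name" "$repo" "${rev:-HEAD}"
    done < "$1"
}

# Possible use: report dependencies that are not checked out in a
# parallel folder.
#   list_deps Dependencies.tsv | while read -r name repo rev; do
#       [ -d "../$name" ] || echo "missing: $name ($repo @ $rev)"
#   done
```

Keeping the reader free of side effects (it only prints) leaves the decision about cloning, warning or ignoring to the caller, which fits the requirement that a missing resource should not break the build.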
For distro packaging you need to either install separately or bundle into a single tarball. The option of fetching the extras at build time or into a parallel folder does not exist. The fact that giella-core's m4 files must be bundled is a bit of a pain.
I use both options. For nightly packages, the installed data dependencies are used because here we always want the latest of everything. For releases, data dependencies are bundled into the tarball because version drift will ruin things.
Which reminds me, it is important for those dependencies to note in configure.ac which tagged version they should require for releases. In Apertium we do this by a 3rd optional arg AP_CHECK_LING([1], [apertium-dan], [0.6.1]). This is an exact version - it won't accept older or newer.
And for listing deps, I'd say configure.ac is sufficient. It's easy enough to grep gt_USE_SHARED configure.ac and parse those lines with a trivial regex.
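A rough sketch of the grep-and-regex approach follows. The argument layout of gt_USE_SHARED is not shown in this thread, so the call shape `gt_USE_SHARED([smi], [shared-smi])` used here is purely an assumption, as is the function name `list_shared_deps`; the regex would need adjusting to the real macro.

```shell
#!/bin/sh
# Hypothetical sketch: list dependencies declared in configure.ac.
# ASSUMED call shape (not verified against giella-core):
#   gt_USE_SHARED([smi], [shared-smi])
# Prints the first two bracketed arguments of each call, space-separated.
list_shared_deps() {
    # $1: path to configure.ac
    # Only lines that start with the macro are used, so calls commented
    # out with "dnl" or "#" are ignored.
    grep '^[[:space:]]*gt_USE_SHARED(' "$1" |
        sed -n 's/^[^[]*\[\([^]]*\)][^[]*\[\([^]]*\)].*/\1 \2/p'
}
```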
> > I am also not thrilled with the shared packages installing to /usr/share/shared-smi - missing a giella/ imo ...
> I agree.
> On the other hand, should these packages be installed at all? Most/all of them are more like code fragments to be included early in the build phase, not precompiled binaries or libraries to be "linked" to at runtime. Just my humble five cents 🙂
It's a bit like header only libraries in C/C++ in a way, but yeah like Tino says it's good for packaging and distro use and comes quite for free in autotools setting.
I've gone through the naming convention questions a bit, so the questions to agree upon are:
- pkg-config
- repo names
- installation folder
- file names
I say
- pkg-config:
  - languages: giella-qtz
  - shared: giella-shared-smi
- repo names:
  - Fine as is, no need for giella prefix. Repo does not need to match install or package name.
- installation folder:
  - languages: $prefix/share/giella/qtz
  - shared: $prefix/share/giella/shared-smi
- file names:
  - Doesn't matter for packaging, as long as root folders are good. You can maintain the same file and folder structure in source and install, which should simplify things.
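The proposed naming scheme can be expressed as a small mapping from repo name to pkg-config name and install dir. The sketch below only restates the convention listed above; the function name `giella_names` and the default prefix are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch: map a repo name to the proposed pkg-config name and
# installation folder.
#   lang-qtz   -> giella-qtz         $prefix/share/giella/qtz
#   shared-smi -> giella-shared-smi  $prefix/share/giella/shared-smi
giella_names() {
    repo=$1
    prefix=${2:-/usr/local}   # assumed default, override as needed
    case "$repo" in
        lang-*)   short=${repo#lang-} ;;  # languages drop the lang- prefix
        shared-*) short=$repo ;;          # shared repos keep their full name
        *)        return 1 ;;             # anything else is not covered here
    esac
    printf 'giella-%s %s/share/giella/%s\n' "$short" "$prefix" "$short"
}
```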
I agree with all of this.
it should be good for testing now, the ci that reports on zulip seems to succeed but there are probably a number of corner cases that can still fail.
Seems to work.
The only thing I would like to improve with the new shared repos is the date of the commits. They seem to now be from the date that @flammie did the split, and not from the actual date of the commit. Could that be fixed? Also, the history does not go all the way back to the start, but that could be a left-over thing from the svn-to-git conversion.
I used this: https://stackoverflow.com/questions/1365541/how-to-move-some-files-from-one-git-repo-to-another-not-a-clone-preserving-hi/11426261#11426261 to do the history, the commit dates look right to me on command line git but github seems to have different timing, maybe the --committer-date-is-author-date option was wrong? This method is also nice because you can basically sed the log for anomalies that break the thing like massive moves.
This is what it looks like in Tower, where the details reveal what is going wrong:
That is, I am the author (and @flammie the committer). It seems that Tower (and GitHub) uses the committer date (May 2022), whereas the CLI log uses the author date (2017). Ideally I would like committer = author (unless there is a real (PR) merge, which I don't think we've had so far for this repo or the parent repo), both regarding the person and the date.
Finally, we should also make sure that the history is complete. That is tracked in a separate project.
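The author/committer mismatch described above can be checked from the command line by printing both dates side by side, since a plain `git log` only shows the author date. A small sketch, with the wrapper name `show_commit_dates` being hypothetical:

```shell
#!/bin/sh
# Print short hash, author date and committer date for each commit, so that
# commits where the two dates differ stand out.
show_commit_dates() {
    # $1: repository path (defaults to the current directory)
    git -C "${1:-.}" log --date=short \
        --format='%h  author: %ad  committer: %cd'
}
```

Adding `%an` and `%cn` to the format string would show the author and committer names in the same way.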
It's trivial to fix since the author info is there and correct, but it will mean a force push to the repos.
Howto: https://riptutorial.com/git/example/21122/setting-git-committer-equal-to-commit-author
Done using this command for shared-smi (worked very well):
git filter-branch -f --commit-filter \
'export GIT_COMMITTER_NAME=\"$GIT_AUTHOR_NAME\";
export GIT_COMMITTER_EMAIL=\"$GIT_AUTHOR_EMAIL\";
export GIT_COMMITTER_DATE=\"$GIT_AUTHOR_DATE\";
git commit-tree $@' \
-- --all
and force-pushed. Will also do the other repos, and finally clean up some wrong emails. Ie more force-pushing coming up.
Just for reference, emails are cleaned using this command:
git filter-branch --env-filter 'if [ "$GIT_AUTHOR_EMAIL" = "incorrect@email" ]; then
GIT_AUTHOR_EMAIL=correct@email;
GIT_AUTHOR_NAME="Correct Name";
GIT_COMMITTER_EMAIL=$GIT_AUTHOR_EMAIL;
GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME"; fi' -- --all
taken from https://stackoverflow.com/questions/4981126/how-to-amend-several-commits-in-git-to-change-author.
After the above, I filtered shared-mul using git filter-branch to get rid of src/filters/ (they are in giella-core now), and reordered one commit to get it into the correct order. Now all the dates are mangled again, but still with correct author date recorded. Only problem is, when I try to run the command above now, it ends with this error message:
Rewrite b5fcc6dba1a3915844f8641575bab14490229e5a (62/70) (3 seconds passed, remaining 0 predicted)
Ref 'refs/heads/main' was deleted
fatal: Not a valid object name HEAD
zsh: command not found: export GIT_COMMITTER_NAME=\"$GIT_AUTHOR_NAME\";\n export GIT_COMMITTER_EMAIL=\"$GIT_AUTHOR_EMAIL\";\n export GIT_COMMITTER_DATE=\"$GIT_AUTHOR_DATE\";\n git commit-tree $@
and the repo is busted. Of course one can live with wrong dates, it is just very irritating. Anyone any idea on how to fix the repo so that the dates are correct again?
You somehow got 11 spaces after git filter-branch -f --commit-filter \ so it was git filter-branch -f --commit-filter \........... (spaces shown as dots) and this ruined the paste. I've edited your comment to remove the spaces.
Thanks, that fixed it, and now shared-mul is done.
This is now all done. Dependency management can be refined, but it works across all CI/build systems, which is good enough.
Background
giella-shared contains today a mixture of data for many different languages:
Core idea
Ideally we would only have giella-core as a required dependency (thus needing to move the filters there), and everything else as separate repositories that can be subscribed on an as-needed/wanted basis.
By generalising sharing resources, it would also be straightforward to share content across language repositories, like including sma and sme proper nouns in smj (with some filtering and restrictions). Technically there would be no difference between getting content from lang-sme and shared-smi.
Naming
shared-, parallel to lang-, keyboard- etc. It does not have to be what is suggested here, other suggestions are welcome.
smi and urj
Concrete example
The present giella-shared would after a split become (with check marks for the actual split):
- shared-smi: the present shared Sámi resources
- shared-mul: the present shared symbols, url's and punctuation lexicons (mul = multiple languages)
- shared-eng: present shared English resources (like names)
- shared-urj-Cyrl: shared resources for Uralic languages written in Cyrillic
- giella-core/fst-filters/: fst filters moved here, since they are a prerequisite for compiling fst's
Another example: lang-sme as a source for North Sámi names when used in another Sámi language, like place names. Non-Sámi names in lang-sme would be filtered out, and generic last elements could be (automatically) adapted to Lule Sámi spelling and inflection as needed. This is relevant both for text analysis and parsing in general, but especially for TTS, where there is a need to get a best possible transcription and pronunciation of whatever is thrown at the system. Place names from related neighbouring languages will certainly be a pain point for many minority languages in such a context.
By treating all repos the same as a potential source for lexical and other resources, we get a more flexible and powerful infrastructure.
Restrictions
Ideally the shared resources should never be required: without access to them the result should only be a smaller analyser with worse coverage. This will make giella-core the only required external dependency.
As far as possible, the resources in each repo should be independently compilable and testable, kind of like independent code libraries.
Benefits
Considerations
versioning
dependency management
We need a straightforward and simple system to declare dependency on a list of other repositories, kind of like Rust cargo lists. But as noted above, the system should be robust enough to not break if a resource is not available, only give a warning.
CI
Dependency management needs to be automatic, at least for CI systems. We need at least:
- configure.ac, at least for now
- ./autogen.sh in a directory, using the same cloning scheme as the depending repo: svn, git-ssh or git-https)
Cleanup
Comments welcome!
@flammie and I discussed this today, the notes above are based on that. We would very much like feedback on these ideas from anyone, but especially from @TinoDidriksen @bbqsrc @Eijebong @Trondtr @aarppe