Closed by danieldk 5 years ago.
Everything in one place sounds nice.
> > GitHub releases has size limitations (2GB per file). We do not hit them anymore since I have started quantizing embeddings, but it may prove to be annoying in the future.
>
> We distribute embeddings via GitHub releases?

No, currently through https://blob.danieldk.eu/. These are used in the Nix builds and the Docker images. But we could.

> > GitHub has been finicky with downloads before. Before Releases they had a download feature, which they suddenly cancelled, and there was no download option between this cancellation and when they introduced Releases.
>
> Probably a good idea to keep track of our releases locally too.
But the larger problem is that they could break reproducibility if they cancelled large file releases or changed the URLs. We can currently exactly reproduce sticker + models + dependencies by pinning to a nixpkgs commit and a commit from my Nix package set. E.g.
nix-build \
  -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/e6ad5e75f3bfaab5e7b7f0f128bf13d534879e65.tar.gz \
  https://git.sr.ht/~danieldk/nix-packages/archive/c7a09d0e3720f81e392db6d5d2509840963c07a7.tar.gz \
  -A dockerImages.sticker.de-ner-ud
This will give you the exact NER model, sticker version, TensorFlow version, and glibc version that was used for the NER model submitted to CLARIN-D, even if you build it in, say, five years. But of course, this hinges on the URLs being available down the line (anything outside sticker + our models is unproblematic, since Nix's public binary caches cache everything from nixpkgs).
> I wouldn't expect to find pre-trained models on the releases page of a GH repository.

E.g. the spaCy people do this. Of course, this would not be the primary interface for users: they could use the pre-built Docker images or Nix derivations. We could even make standalone binaries that include all dependencies, including the model + embeddings. For the Python module, we could add a small loader API that would fetch the model.
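To make that last point a bit more concrete, here is a minimal sketch of what such a loader could look like (this is not an existing sticker-python API; the model name, URLs, and hash below are made up for illustration). It tries a list of mirror URLs in order and verifies a pinned sha256, so a broken or hostile mirror cannot silently serve different data:

# Hypothetical loader sketch: fetch a model archive from a list of mirrors,
# verify its checksum, and cache it locally. All names and hashes are made up.
import hashlib
import urllib.request
from pathlib import Path

# Made-up registry; this metadata could eventually live in a sticker-models repo.
MODELS = {
    "nl-pos-ud": {
        "urls": [
            "https://models.danieldk.eu/sticker/nl-pos-ud-20190822.tar.gz",
            "https://github.com/whatever/sticker-models/releases/download/nl-pos-ud-20190822/nl-pos-ud-20190822.tar.gz",
        ],
        # Pinned sha256 (hex) of the archive; placeholder value.
        "sha256": "0" * 64,
    },
}

def _sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fetch_model(name: str, cache_dir: Path = Path.home() / ".cache" / "sticker") -> Path:
    """Download a model archive into the cache, trying mirrors in order."""
    meta = MODELS[name]
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / Path(meta["urls"][0]).name

    # Reuse the cached copy if it is present and its hash still matches.
    if target.exists() and _sha256(target) == meta["sha256"]:
        return target

    last_error = None
    for url in meta["urls"]:
        try:
            urllib.request.urlretrieve(url, target)
        except OSError as exc:  # mirror unreachable, try the next one
            last_error = exc
            continue
        if _sha256(target) == meta["sha256"]:
            return target
        last_error = ValueError(f"checksum mismatch for {url}")
    raise RuntimeError(f"could not fetch model {name!r}") from last_error

A loader along these lines would mainly be a convenience for Python users who do not go through the Docker images or Nix derivations.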
> If we keep a local copy while centralizing to GitHub, exact step-by-step reproducibility may break and someone with access to the local copy is needed to make the models available again.

Indeed. And I think this kind of reproducibility would be nice to preserve, because then a paper could report the exact tagger, parser, etc. that was used for preprocessing.

> If the links break, we cannot repair them unless we control them; moving to GH while keeping this kind of control may require a redirection layer which we could point somewhere else if the store is gone.

Yes, we'd need some redirection from a URL under our control.

> Given that we may work on other things by that time, it's probably a good thing to have storage where we can guarantee persistence.

Probably the ideal solution would be to deposit the models in a CLARIN repository and get PIDs for the files.
Just realized that there may be a nice solution to increase redundancy. Nix derivations are currently used for reproducible builds and the Docker containers. Nix's fetchurl function allows for the specification of multiple URLs for fallback. So, e.g.:
{
  src = fetchurl {
    urls = [
      "https://models.danieldk.eu/sticker/nl-pos-ud-20190822.tar.gz"
      "https://github.com/whatever/sticker-models/releases/download/nl-pos-ud-20190822/nl-pos-ud-20190822.tar.gz"
    ];
    sha256 = "0ywa0kmpsh1cmdcl4ya0q67wcjq4m6g2n79a1kjgrqmhydc7d59p";
  };
}
If my server goes down, it falls back to GitHub. If GitHub releases breaks, I could just redirect the URI. If some evil entity gets control over my domain, they couldn't provide bogus data since the expected hash is listed.
Fixed, as can be seen in the URL :). I used stickeritis, as sticker already exists. Once I have done a bit more planning, I'll also create the sticker-models repo.
We currently have sticker and sticker-python. I have also been pondering whether I should move the models from blob.danieldk.eu to a sticker-models repo (where the repo would contain metadata and the associated releases would store the models).

In favor:

Against:
- sticker is already used. So we'd need an organization with a different name (I already reserved glumarko, which is Esperanto for 'sticker').