Closed by danieldk 5 years ago.
Everything in one place sounds nice.
> > GitHub releases has size limitations (2GB per file). We do not hit them anymore since I have started quantizing embeddings, but it may prove to be annoying in the future.
>
> We distribute embeddings via GitHub releases?

No, currently through https://blob.danieldk.eu/. These are used in the Nix builds and the Docker images. But we could.

> > GitHub has been finicky with downloads before. Before Releases they had a download feature, which they suddenly cancelled, and there was no download option between this cancellation and when they introduced Releases.
>
> Probably a good idea to keep track of our releases locally too.
But the larger problem is that they could break reproducibility if they cancelled large file releases or changed the URLs. We can currently exactly reproduce sticker + models + dependencies by pinning to a nixpkgs commit and a commit from my Nix package set. E.g.
nix-build \
  -I nixpkgs=https://github.com/NixOS/nixpkgs/archive/e6ad5e75f3bfaab5e7b7f0f128bf13d534879e65.tar.gz \
  https://git.sr.ht/~danieldk/nix-packages/archive/c7a09d0e3720f81e392db6d5d2509840963c07a7.tar.gz \
  -A dockerImages.sticker.de-ner-ud
This will give you the exact NER model, sticker version, TensorFlow version, and glibc version that was used for the NER model submitted to CLARIN-D, even if you build it in, say, five years. But of course, this hinges on the URLs being available down the line (anything outside sticker + our models is unproblematic, since Nix's public binary caches cache everything from nixpkgs).
> I wouldn't expect to find pre-trained models on the releases page of a GH repository.

E.g. the spaCy people do this. Of course, this would not be the primary interface for users: they could use the pre-built Docker images or Nix derivations. We could even make standalone binaries that include all dependencies, including the model + embeddings. For the Python module, we could add a small loader API that would fetch the model.
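To make that last point a bit more concrete, here is a minimal sketch of what such a loader could look like (this is not an existing sticker-python API; the model name, URLs, and hash below are made up for illustration). It tries a list of mirror URLs in order and verifies a pinned sha256, so a broken or hostile mirror cannot silently serve different data:

# Hypothetical loader sketch: fetch a model archive from a list of mirrors,
# verify its checksum, and cache it locally. All names and hashes are made up.
import hashlib
import urllib.request
from pathlib import Path

# Made-up registry; this metadata could eventually live in a sticker-models repo.
MODELS = {
    "nl-pos-ud": {
        "urls": [
            "https://models.danieldk.eu/sticker/nl-pos-ud-20190822.tar.gz",
            "https://github.com/whatever/sticker-models/releases/download/nl-pos-ud-20190822/nl-pos-ud-20190822.tar.gz",
        ],
        # Pinned sha256 (hex) of the archive; placeholder value.
        "sha256": "0" * 64,
    },
}

def _sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def fetch_model(name: str, cache_dir: Path = Path.home() / ".cache" / "sticker") -> Path:
    """Download a model archive into the cache, trying mirrors in order."""
    meta = MODELS[name]
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / Path(meta["urls"][0]).name

    # Reuse the cached copy if it is present and its hash still matches.
    if target.exists() and _sha256(target) == meta["sha256"]:
        return target

    last_error = None
    for url in meta["urls"]:
        try:
            urllib.request.urlretrieve(url, target)
        except OSError as exc:  # mirror unreachable, try the next one
            last_error = exc
            continue
        if _sha256(target) == meta["sha256"]:
            return target
        last_error = ValueError(f"checksum mismatch for {url}")
    raise RuntimeError(f"could not fetch model {name!r}") from last_error

A loader along these lines would mainly be a convenience for Python users who do not go through the Docker images or Nix derivations.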
> If we keep a local copy while centralizing to GitHub, exact step-by-step reproducibility may break and someone with access to the local copy is needed to make the models available again.

Indeed. And I think this kind of reproducibility would be nice to preserve, because then a paper could report the exact tagger, parser, etc. that was used for preprocessing.

> If the links break, we cannot repair them unless we control them; moving to GH while keeping this kind of control may require a redirection layer which we could point somewhere else if the store is gone.

Yes, we'd need some redirection from a URL under our control.

> Given that we may work on other things by that time, it's probably a good thing to have storage where we can guarantee persistence.

Probably the ideal solution would be to deposit the models in a CLARIN repository and get PIDs for the files.
Just realized that there may be a nice solution to increase redundancy. Nix derivations are currently used for reproducible builds and the Docker containers. Nix's fetchurl function allows for the specification of multiple URLs for fallback. So, e.g.:
{
  src = fetchurl {
    urls = [
      "https://models.danieldk.eu/sticker/nl-pos-ud-20190822.tar.gz"
      "https://github.com/whatever/sticker-models/releases/download/nl-pos-ud-20190822/nl-pos-ud-20190822.tar.gz"
    ];
    sha256 = "0ywa0kmpsh1cmdcl4ya0q67wcjq4m6g2n79a1kjgrqmhydc7d59p";
  };
}
If my server goes down, it falls back to GitHub. If GitHub releases breaks, I could just redirect the URI. If some evil entity gets control over my domain, they couldn't provide bogus data since the expected hash is listed.
Fixed, as can be seen in the URL :). I used stickeritis, as sticker already exists. Once I have done a bit more planning, I'll also create the sticker-models repo.
We currently have sticker and sticker-python. I have also been pondering whether I should move the models from blob.danieldk.eu to a sticker-models repo (where the repo would contain metadata and the associated releases would store the models).

In favor:

Against:
- sticker is already used. So we'd need an organization with a different name (I already reserved glumarko, which is Esperanto for 'sticker').