replicate / cog

Containers for machine learning
https://cog.run
Apache License 2.0

Copy ~/.cache/x into Docker #320

Open yeldarby opened 2 years ago

yeldarby commented 2 years ago

Every time I do cog predict, the huggingface package downloads the 500MB GPT2 pretrained weights into ~/.cache/huggingface, which is quite slow and bandwidth-intensive.

I can't figure out how to bundle these into the Docker image.

So far I've tried

Update: Fixed this for my use case by overriding the transformers cache location with an environment variable:

import os
# Set before `transformers` is imported so the override takes effect.
os.environ['TRANSFORMERS_CACHE'] = '/src/cache'

Since /src is mounted from my local filesystem, it looks like the weights were saved there the first time I ran it and loaded from there on subsequent runs.
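For reference, here's a minimal sketch of how that workaround can sit in a predict.py. The BasePredictor layout and the GPT-2 calls are illustrative assumptions rather than code from this issue; the key point is just that the env var is set before transformers is imported.

```python
import os

# Must run before `transformers` is imported so the cache override is picked up.
os.environ['TRANSFORMERS_CACHE'] = '/src/cache'

from cog import BasePredictor, Input
from transformers import GPT2LMHeadModel, GPT2Tokenizer


class Predictor(BasePredictor):
    def setup(self):
        # The first run downloads into /src/cache (mounted from the host),
        # so later `cog predict` runs reuse the files instead of re-downloading.
        self.tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
        self.model = GPT2LMHeadModel.from_pretrained("gpt2")

    def predict(self, prompt: str = Input(default="Hello")) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt")
        outputs = self.model.generate(**inputs, max_new_tokens=20)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
```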

zeke commented 2 years ago

Hi @yeldarby. Thanks for opening this issue and following up with the fix you found.

It looks like this issue is specific to the huggingface package, so I'm not sure what generic changes we could make here to prevent other users of Cog from running into this issue. Do you have any suggestions on how we might improve behavior or documentation around this in a generalizable way?

yeldarby commented 2 years ago

It's not just huggingface; other libraries like openai/clip and torch also store data in ~/.cache.

Ideally there would be a way to mount volumes at custom locations where data persists between runs, or a way to add COPY lines to the Dockerfile to put files in the proper location on the container's filesystem.

floer32 commented 2 years ago

As a user I want something like this too! In Docker terms, I'm leaning more towards ENV than COPY as a way to address it.

To restate it in those terms, as @yeldarby mentions above:

Given:

import os
os.environ['TRANSFORMERS_CACHE'] = '/src/cache'

Then:

> Since /src is mounted from my local filesystem, it looks like the weights were saved there the first time I ran it and loaded from there on subsequent runs.

Maybe cog could do something like this by default, but more generically. In Dockerfile terms, it'd be something like:

ENV TRANSFORMERS_CACHE=/src/cache

So that's huggingface's TRANSFORMERS_CACHE. But what about ~/.cache?

~/.cache is the conventional default value for $XDG_CACHE_HOME in Linux environments.

Of course some libraries may hardcode ~/.cache, but many libraries respect $XDG_CACHE_HOME, including pytorch, huggingface, pip ... (Beyond ML/Python, there are many, many more ...)
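As a minimal sketch of that convention (the helper below is illustrative, not any library's actual code), the lookup those libraries do is roughly:

```python
import os
from pathlib import Path


def default_cache_dir(subdir: str) -> Path:
    # XDG convention: use $XDG_CACHE_HOME if it's set, otherwise fall back to ~/.cache.
    base = os.environ.get("XDG_CACHE_HOME") or os.path.join(os.path.expanduser("~"), ".cache")
    return Path(base) / subdir


# With XDG_CACHE_HOME=/src/.cache this resolves to /src/.cache/huggingface;
# without it, ~/.cache/huggingface.
print(default_cache_dir("huggingface"))
```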

Security note: host's ~/.cache cannot be brought into the container wholesale

It's very important that the host's ~/.cache does not get automatically/implicitly pulled into the container wholesale, because a user could have private/sensitive/unrelated data in there (aside from the bloat).

But inside the container, cog might want to control ~~`~/.cache`~~ `$XDG_CACHE_HOME`.


Brainstorming

💡 maybe cog.yaml could set a default $XDG_CACHE_HOME

If cog's docker code could set:

ENV XDG_CACHE_HOME=/src/.cache

... early in the build, so it's always set when predict is running. Then any setup code running inside a cog container that downloads to $XDG_CACHE_HOME (or a subfolder of it) would download models etc. into /src/.cache instead of ~/.cache.

This could make a big difference, because /src can be retained between runs of cog predict (whereas cog does not want to retain all of $HOME).
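A quick way to see that effect with pytorch (assuming the image really did export XDG_CACHE_HOME=/src/.cache; the in-process assignment below only simulates that for illustration):

```python
import os

# Simulate the proposed image-level default; a real build would set this via ENV.
os.environ.setdefault("XDG_CACHE_HOME", "/src/.cache")

import torch

# With TORCH_HOME unset, torch falls back to $XDG_CACHE_HOME/torch,
# so hub downloads during setup would land under /src/.cache/torch/hub.
print(torch.hub.get_dir())
```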

Caveat: XDG_CACHE_HOME would only affect libraries that use XDG_CACHE_HOME, but it's a popular convention

The reason for bringing up XDG_CACHE_HOME is that at least pytorch and huggingface will respect it when they are choosing defaults for TORCH_HOME/TRANSFORMERS_CACHE/etc.

However, a given library may not respect XDG_CACHE_HOME, or it may hardcode its cache to ~/.cache/<subdir> or something similar. For those cases, the library-specific env var is more important.

TRANSFORMERS_CACHE, TORCH_HOME, etc.

A basic solution for adding ENV instructions would go a long way for handling library-specific environment variables, and that could be done regardless of the other ideas in my comment.

Maybe cog could eventually start including some ENV "presets" for popular ML libraries (a rough sketch follows this list). Thinking about that:

* Carefully, of course.
* There's precedent: `cog` already "knows" about making sure that `pip` cache is handled cleanly for dockerized usage. That works really well because the user _doesn't_ need to control pip's caching at all. OTOH a user _should_ know if `TRANSFORMERS_CACHE`/etc. are being overridden.
* In my opinion, `cog.yaml` is probably where library-specific `ENV` overrides should live. `cog.yaml` is already the explicit interface for how a cog container builds/runs/predicts; and if #291 is implemented, then a user might have _some_ custom `ENV` stuff, yet still want `cog`'s smarts/inferences about their dependencies.
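Purely to illustrate what "presets" could mean, here's a hypothetical sketch (nothing cog actually implements; the env var names are the libraries' real ones, but the mapping, paths, and helper are made up):

```python
# Hypothetical presets cog could apply when it detects these packages,
# steering each library's cache under the persisted /src directory.
CACHE_ENV_PRESETS = {
    "torch": {"TORCH_HOME": "/src/.cache/torch"},
    "transformers": {"TRANSFORMERS_CACHE": "/src/.cache/huggingface/transformers"},
}


def env_lines(packages):
    # Emit Dockerfile-style ENV instructions for whichever packages are detected.
    for pkg in packages:
        for name, value in CACHE_ENV_PRESETS.get(pkg, {}).items():
            yield f"ENV {name}={value}"


print("\n".join(env_lines(["torch", "transformers"])))
```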

TLDR

Wait, what about running preparation code on the host, then having it cached in the container?

First note that you don't have to. You can have preparation code that runs in `predict.py` upon the first run, and if it saves cache files inside of `/src` (which is the `WORKDIR` of the cog container), then the files will be present on your host in the corresponding directory. They'll be in the container in subsequent runs, because `cog` mounts the working directory to `/src`, so subsequent `cog predict` runs should restore from cache. And when you do `cog push`, it would include files present in the working directory.

But if you really do want to run prep code on the host and reuse the cached files during `cog predict`, then you just need to tell your prep code where to store files: point it at a subdirectory of the project. How you do that is up to you. On Linux you could do `XDG_CACHE_HOME=./cache python custom_prepare.py`, but on any platform you could do `os.environ['XDG_CACHE_HOME'] = os.path.join(os.getcwd(), 'cache')` in your Python code. The same principle works for other library-specific environment variables, though that's not needed for [PyTorch](https://pytorch.org/docs/stable/hub.html#:~:text=XDG_CACHE_HOME) or [Huggingface](https://huggingface.co/transformers/v4.0.1/installation.html#caching-models), since they both respect `XDG_CACHE_HOME`. Many libraries do.
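And a minimal sketch of that host-side variant (the file name `custom_prepare.py` comes from the command above; the GPT-2 download is an illustrative assumption):

```python
# custom_prepare.py -- run on the host to pre-download weights into ./cache,
# which sits inside the project directory and therefore inside /src in the container.
import os

# Must be set before the libraries are imported so they resolve their cache from it.
os.environ["XDG_CACHE_HOME"] = os.path.join(os.getcwd(), "cache")

from transformers import GPT2LMHeadModel  # noqa: E402

# Downloads land under ./cache/huggingface/... instead of ~/.cache/huggingface/...
GPT2LMHeadModel.from_pretrained("gpt2")
```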
floer32 commented 2 years ago

(Re-opened the PR to add support for ENV, and a default XDG_CACHE_HOME.)