Open yeldarby opened 2 years ago
Hi @yeldarby. Thanks for opening this issue and following up with the fix you found.
It looks like this issue is specific to the huggingface package, so I'm not sure what generic changes we could make here to prevent other users of Cog from running into this issue. Do you have any suggestions on how we might improve behavior or documentation around this in a generalizable way?
It's not just huggingface; other libraries like openai/clip and torch also store data in ~/.cache.
Ideally there would be a way to mount volumes in custom locations where data will persist between runs. Alternatively, being able to add COPY lines to the Dockerfile would let users move files into the proper location on the container's filesystem.
As a user I want something like this too. In Docker terms I'm leaning more towards ENV than COPY as a way to address it.
As @yeldarby mentions above, given:
import os
os.environ['TRANSFORMERS_CACHE'] = '/src/cache'
Then:
Since /src is mounted from my local filesystem it looks like it saved the weights in there the first time I ran it and loaded them the next time.
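A minimal sketch of that workaround, assuming the variable has to be set before `transformers` is imported (the library appears to read its cache location at import time). `configure_cache` is a hypothetical helper name, not part of cog or transformers:

```python
import os


def configure_cache(cache_dir: str = "/src/cache") -> str:
    """Point the transformers cache at cache_dir (hypothetical helper).

    The default mirrors the '/src/cache' path used in this thread; /src is
    assumed to be the directory mounted from the local filesystem.
    """
    # Create the directory first so the first download has somewhere to land.
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["TRANSFORMERS_CACHE"] = cache_dir
    return cache_dir


# In predict.py, call configure_cache() at the top of the file,
# before any `import transformers` line.
```

Keeping the call ahead of the library import is the important part; setting the variable afterwards may have no effect.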
Maybe cog could do something like this, but more generic, by default. In Dockerfile terms, it'd be like:
ENV TRANSFORMERS_CACHE=/src/cache
TRANSFORMERS_CACHE. But what about ~/.cache?
~/.cache is the conventional default value for $XDG_CACHE_HOME in Linux environments.
Of course some libraries may hardcode ~/.cache, but many libraries respect $XDG_CACHE_HOME, including pytorch, huggingface, pip ... (Beyond ML/Python, there are many, many more ...)
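For illustration, here is roughly how a library that follows the convention might locate its cache root. This is a sketch of the XDG rule (use `$XDG_CACHE_HOME` if it is set and non-empty, otherwise fall back to `$HOME/.cache`), not code taken from any of the libraries named above; `xdg_cache_home` is a hypothetical function name:

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def xdg_cache_home(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the cache root per the XDG convention (illustrative sketch)."""
    env = os.environ if env is None else env
    value = env.get("XDG_CACHE_HOME", "")
    if value:
        # An explicit, non-empty setting wins.
        return Path(value)
    # Unset or empty: fall back to $HOME/.cache.
    home = env.get("HOME") or str(Path.home())
    return Path(home) / ".cache"
```

With `XDG_CACHE_HOME=/src/.cache` in the environment this returns `/src/.cache`; without it, the result sits under the home directory.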
~/.cache cannot be brought into the container wholesale
It's very important that the host's ~/.cache does not get automatically/implicitly pulled in as a whole, because a user could have private/sensitive/unrelated data in there. (Aside from bloat.)
But inside the container, cog might want to control $XDG_CACHE_HOME.
cog.yaml could set a default $XDG_CACHE_HOME
If cog's docker code could set:
ENV XDG_CACHE_HOME=/src/.cache
... early in the build, so it's always set when predict is running... Then any setup code running within a cog container that downloads to XDG_CACHE_HOME (or a subfolder) will download the models etc. into /src/.cache instead of ~/.cache.
This could make a big difference because /src can be retained between runs of cog predict (whereas cog does not want to retain all of ~/$HOME).
XDG_CACHE_HOME would only affect libraries that use XDG_CACHE_HOME, but it's a popular convention
The reason for bringing up XDG_CACHE_HOME is that at least pytorch and huggingface will respect it when they are choosing defaults for TORCH_HOME/TRANSFORMERS_CACHE/etc.
However, some libraries may not respect XDG_CACHE_HOME, or may hardcode <LIBRARY_SPECIFIC_CACHE> to ~/.cache/<subdir> or something. For those cases, the library-specific env var is more important.
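The precedence described here can be sketched as follows: the library-specific variable wins, then `$XDG_CACHE_HOME`, then `~/.cache`. This is an assumption-laden illustration, since real libraries differ in detail; `resolve_library_cache` is a hypothetical name, and the `torch` subdirectory follows PyTorch's documented default of `$XDG_CACHE_HOME/torch` for `TORCH_HOME`:

```python
import os
import posixpath
from typing import Mapping, Optional


def resolve_library_cache(
    specific_var: str,
    subdir: str,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve a library's cache directory (illustrative precedence only)."""
    env = os.environ if env is None else env
    # 1. A library-specific variable such as TORCH_HOME takes priority.
    specific = env.get(specific_var)
    if specific:
        return specific
    # 2. Otherwise use $XDG_CACHE_HOME, 3. falling back to $HOME/.cache.
    home = env.get("HOME") or os.path.expanduser("~")
    base = env.get("XDG_CACHE_HOME") or posixpath.join(home, ".cache")
    return posixpath.join(base, subdir)
```

A library that hardcodes `~/.cache/<subdir>` skips steps 1 and 2 entirely, which is why neither variable can help in that case.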
TRANSFORMERS_CACHE, TORCH_HOME, etc.
A basic solution for adding ENV instructions would go a long way for handling library-specific environment variables, and that could be done regardless of the other ideas in my comment.
Maybe cog could eventually start including some ENV "presets" for popular ML libraries.
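As a rough idea of what such presets could look like, the sketch below turns a small table of per-library variables into Dockerfile `ENV` lines. Everything here is hypothetical: `ENV_PRESETS`, `env_lines`, and the subdirectory values are illustrative and do not describe anything cog actually ships:

```python
from typing import Dict, Iterable, List

# Hypothetical presets: library name -> {variable: path template}.
# The subdirectory names are illustrative, not the libraries' real defaults.
ENV_PRESETS: Dict[str, Dict[str, str]] = {
    "transformers": {"TRANSFORMERS_CACHE": "{cache}/huggingface"},
    "torch": {"TORCH_HOME": "{cache}/torch"},
}


def env_lines(libraries: Iterable[str], cache: str = "/src/.cache") -> List[str]:
    """Build Dockerfile ENV lines for the requested libraries."""
    # Always set the generic variable first, then any library-specific ones.
    lines = [f"ENV XDG_CACHE_HOME={cache}"]
    for name in libraries:
        # Unknown libraries simply contribute no extra lines.
        for var, template in ENV_PRESETS.get(name, {}).items():
            lines.append(f"ENV {var}={template.format(cache=cache)}")
    return lines
```

Calling `env_lines(["torch"])` yields the generic `XDG_CACHE_HOME` line followed by a `TORCH_HOME` line under the same cache root.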
cog.yaml could provide a way to add ENV directives (#291)
XDG_CACHE_HOME might look like a mouthful, but it's really useful!
(Re-opened the PR to add support for ENV, and a default XDG_CACHE_HOME.)
Every time I do cog predict, the huggingface package is downloading the 500MB GPT2 pretrained weights into ~/.cache/huggingface, which is quite slow & bandwidth intensive. I can't figure out how to bundle these into the Docker image.
So far I've tried:
run in cog.yaml (doesn't work because it doesn't have access to the file yet)
Copying ~/.cache/huggingface into my working dir and having my predict.py copy the files at the top (which doesn't work for unknown reasons; possibly everything outside of /src is read-only):
os.system("cp -r /src/cache/huggingface /root/.cache/huggingface")
Update: Fixed this for my use-case by overriding the transformers cache location with an environment variable:
import os
os.environ['TRANSFORMERS_CACHE'] = '/src/cache'
Since /src is mounted from my local filesystem it looks like it saved the weights in there the first time I ran it and loaded them the next time.