huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Installing datasets and transformers in a tensorflow docker image throws Permission Error on 'import transformers' #1581

Closed · eduardofv closed this 3 years ago

eduardofv commented 3 years ago

I am using a docker container, based on the latest tensorflow-gpu image, to run transformers and datasets (4.0.1 and 1.1.3 respectively; Dockerfile attached below). Importing transformers throws a PermissionError when trying to access /.cache:

$ docker run --gpus=all --rm -it -u $(id -u):$(id -g) -v $(pwd)/data:/root/data -v $(pwd):/root -v $(pwd)/models/:/root/models -v $(pwd)/saved_models/:/root/saved_models -e "HOST_HOSTNAME=$(hostname)" hf-error:latest /bin/bash

________                               _______________                
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ / 
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/

You are running this container as user with ID 1000 and group 1000,
which should map to the ID and group for your user on the Docker host. Great!

tf-docker /root > python
Python 3.6.9 (default, Oct  8 2020, 12:12:24) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformers
2020-12-15 23:53:21.165827: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/transformers/__init__.py", line 22, in <module>
    from .integrations import (  # isort:skip
  File "/usr/local/lib/python3.6/dist-packages/transformers/integrations.py", line 5, in <module>
    from .trainer_utils import EvaluationStrategy
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer_utils.py", line 25, in <module>
    from .file_utils import is_tf_available, is_torch_available, is_torch_tpu_available
  File "/usr/local/lib/python3.6/dist-packages/transformers/file_utils.py", line 88, in <module>
    import datasets  # noqa: F401
  File "/usr/local/lib/python3.6/dist-packages/datasets/__init__.py", line 26, in <module>
    from .arrow_dataset import Dataset, concatenate_datasets
  File "/usr/local/lib/python3.6/dist-packages/datasets/arrow_dataset.py", line 40, in <module>
    from .arrow_reader import ArrowReader
  File "/usr/local/lib/python3.6/dist-packages/datasets/arrow_reader.py", line 31, in <module>
    from .utils import cached_path, logging
  File "/usr/local/lib/python3.6/dist-packages/datasets/utils/__init__.py", line 20, in <module>
    from .download_manager import DownloadManager, GenerateMode
  File "/usr/local/lib/python3.6/dist-packages/datasets/utils/download_manager.py", line 25, in <module>
    from .file_utils import HF_DATASETS_CACHE, cached_path, get_from_cache, hash_url_to_filename
  File "/usr/local/lib/python3.6/dist-packages/datasets/utils/file_utils.py", line 118, in <module>
    os.makedirs(HF_MODULES_CACHE, exist_ok=True)
  File "/usr/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/lib/python3.6/os.py", line 210, in makedirs
    makedirs(head, mode, exist_ok)
  File "/usr/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/.cache'

I've pinned the problem down to RUN pip install datasets: commenting that line out lets you import transformers correctly. Another workaround I've found is creating the directory and granting permissions to it directly in the Dockerfile.

FROM tensorflow/tensorflow:latest-gpu-jupyter
WORKDIR /root

EXPOSE 80
EXPOSE 8888
EXPOSE 6006

ENV SHELL /bin/bash
ENV PATH="/root/.local/bin:${PATH}"

ENV CUDA_CACHE_PATH="/root/cache/cuda"
ENV CUDA_CACHE_MAXSIZE="4294967296"

ENV TFHUB_CACHE_DIR="/root/cache/tfhub"

RUN pip install --upgrade pip

RUN apt update -y && apt upgrade -y

RUN pip install transformers

# Installing datasets will throw the error; try commenting this line out and rebuilding
RUN pip install datasets

# Another workaround is to create the directory and give permissions explicitly
#RUN mkdir /.cache
#RUN chmod 777 /.cache

lhoestq commented 3 years ago

Thanks for reporting! You can override the directory in which cache files are stored using, for example:

ENV HF_HOME="/root/cache/hf_cache_home"

This way both transformers and datasets will use this directory instead of the default .cache.
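For reference, here is a minimal sketch of why the default path ends up being /.cache in this container (simplified; the real resolution logic lives in datasets/utils/file_utils.py and the exact lookup order here is an assumption):

```python
import os

def resolve_hf_cache_home():
    # Simplified sketch of the default cache resolution: HF_HOME wins if set;
    # otherwise the path is derived from XDG_CACHE_HOME, which itself defaults
    # to ~/.cache. Exact precedence in the real library may differ.
    xdg_cache = os.environ.get(
        "XDG_CACHE_HOME", os.path.join(os.path.expanduser("~"), ".cache")
    )
    return os.environ.get("HF_HOME", os.path.join(xdg_cache, "huggingface"))

# In a container where the running user (uid 1000) has no home directory,
# HOME falls back to "/", so the default becomes "/.cache/huggingface" --
# a path a non-root user cannot create, hence the PermissionError raised
# at import time by os.makedirs.
for var in ("HF_HOME", "XDG_CACHE_HOME"):
    os.environ.pop(var, None)
os.environ["HOME"] = "/"
print(resolve_hf_cache_home())  # -> /.cache/huggingface
```

Setting HF_HOME in the Dockerfile short-circuits this fallback entirely, which is why the ENV line above works.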

eduardofv commented 3 years ago

Great, thanks. I hadn't seen documentation about that ENV variable; it looks like the obvious solution.

tangzhy commented 3 years ago

> Thanks for reporting! You can override the directory in which cache files are stored using, for example:
>
> ENV HF_HOME="/root/cache/hf_cache_home"
>
> This way both transformers and datasets will use this directory instead of the default .cache.

Can we disable caching directly?

lhoestq commented 3 years ago

Hi! Unfortunately no, since we need this directory to load datasets. When you load a dataset, the raw data files are downloaded into the cache directory under /downloads. Then the dataset is built and saved as Arrow data inside /.

However you can specify the directory of your choice, and it can be a temporary directory if you want to clean everything up at one point.
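One way to follow that suggestion is to point the cache at a throwaway directory before importing datasets and delete it afterwards. A sketch under the assumption that HF_HOME (as suggested above) is honored; note the variable must be set before the import, since the cache path is resolved and created at import time:

```python
import os
import shutil
import tempfile

# Create a throwaway directory and use it as the Hugging Face cache, so all
# downloads and Arrow files can be cleaned up in one go.
tmp_cache = tempfile.mkdtemp(prefix="hf_cache_")
os.environ["HF_HOME"] = tmp_cache

# import datasets                    # must come after HF_HOME is set
# ds = datasets.load_dataset(...)    # raw files and Arrow data land under tmp_cache
# ... work with the dataset ...

# Remove the entire cache when finished.
shutil.rmtree(tmp_cache)
```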

eduardofv commented 3 years ago

I'm closing this to keep the issue list a bit cleaner.