openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Docker container encountered `git` errors when running benchmark #580

Closed · suzhoum closed this issue 11 months ago

suzhoum commented 11 months ago

I need to clone automlbenchmark in my container and run the benchmarks inside it. This had been running fine until just last week, so I wonder what changed to cause the `git` errors. It errors out when running:

python3 automlbenchmark/runbenchmark.py AutoGluon:stable small test -t vehicle -f 0 
Running benchmark `AutoGluon:stable` on `small` framework in `local` mode.
Loading frameworks definitions from ['/app/ag_bench_runs/tabular/ag_bench_test/automlbenchmark/resources/frameworks.yaml'].
Loading benchmark constraint definitions from ['/app/ag_bench_runs/tabular/ag_bench_test/automlbenchmark/resources/constraints.yaml'].
Loading benchmark definitions from /app/ag_bench_runs/tabular/ag_bench_test/automlbenchmark/resources/benchmarks/small.yaml.
fatal: not a git repository (or any of the parent directories): .git
[MONITORING] [local.small.test.vehicle.0.AutoGluon] CPU Utilization: 15.0%

--------------------------------------------------
Starting job local.small.test.vehicle.0.AutoGluon.
Assigning 4 cores (total=8) for new task vehicle.
[MONITORING] [local.small.test.vehicle.0.AutoGluon] Memory Usage: 9.5%
Assigning 26594 MB (total=31641 MB) for new vehicle task.
[MONITORING] [local.small.test.vehicle.0.AutoGluon] Disk Usage: 89.3%
Running task vehicle on framework AutoGluon with config:
TaskConfig({'framework': 'AutoGluon', 'framework_params': {}, 'framework_version': '0.8.2', 'type': 'classification', 'name': 'vehicle', 'openml_task_id': 53, 'test_server': False, 'fold': 0, 'metric': 'logloss', 'metrics': ['logloss', 'acc', 'balacc'], 'seed': 407703350, 'job_timeout_seconds': 1200, 'max_runtime_seconds': 600, 'cores': 4, 'max_mem_size_mb': 26594, 'min_vol_size_mb': -1, 'input_dir': '/root/.cache/openml', 'output_dir': '/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/results/autogluon.small.test.local.20230725T183716', 'output_predictions_file': '/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/results/autogluon.small.test.local.20230725T183716/predictions/vehicle/0/predictions.csv', 'tag': None, 'command': 'automlbenchmark/runbenchmark.py AutoGluon:stable small test -t vehicle -f 0 -s skip', 'git_info': {'repo': 'NA', 'branch': 'NA', 'commit': 'NA', 'tags': [], 'status': ['## No commits yet on master', '?? __init__.py', '?? __pycache__/', '?? ag_bench_runs/', '?? autogluon-bench/', '?? aws/', '?? entrypoint.sh', '?? gpu_utilization.sh', '?? setup.sh']}, 'measure_inference_time': False, 'ext': {}, 'quantile_levels': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], 'type_': 'multiclass', 'output_metadata_file': '/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/results/autogluon.small.test.local.20230725T183716/predictions/vehicle/0/metadata.json'})
File: /root/.cache/openml/org/openml/www/datasets/54/dataset_54.pq
Traceback (most recent call last):
  File "/app/ag_bench_runs/tabular/ag_bench_zs/.venv/lib/python3.9/site-packages/openml/datasets/dataset.py", line 489, in _cache_compressed_file_from_file
    data = pd.read_parquet(data_file)
  File "/app/ag_bench_runs/tabular/ag_bench_zs/.venv/lib/python3.9/site-packages/pandas/io/parquet.py", line 503, in read_parquet
    return impl.read(
  File "/app/ag_bench_runs/tabular/ag_bench_zs/.venv/lib/python3.9/site-packages/pandas/io/parquet.py", line 251, in read
    result = self.api.parquet.read_table(
  File "/app/ag_bench_runs/tabular/ag_bench_zs/.venv/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2926, in read_table
    dataset = _ParquetDatasetV2(
  File "/app/ag_bench_runs/tabular/ag_bench_zs/.venv/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 2466, in __init__
    [fragment], schema=schema or fragment.physical_schema,
  File "pyarrow/_dataset.pyx", line 1004, in pyarrow._dataset.Fragment.physical_schema.__get__
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/amlb/benchmark.py", line 578, in run
    meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
  File "/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/frameworks/AutoGluon/__init__.py", line 16, in run
    return run_autogluon_tabular(dataset, config)
  File "/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/frameworks/AutoGluon/__init__.py", line 22, in run_autogluon_tabular
    train=dict(path=dataset.train.data_path('parquet')),
  File "/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/amlb/datasets/openml.py", line 264, in data_path
    return self._get_data(format)
  File "/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/amlb/datasets/openml.py", line 278, in _get_data
    self.dataset._load_data(fmt)
  File "/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/amlb/datasets/openml.py", line 235, in _load_data
    train, test = splitter.split()
  File "/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/amlb/utils/process.py", line 744, in profiler
    return fn(*args, **kwargs)
  File "/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/amlb/datasets/openml.py", line 415, in split
    X = self.ds._load_full_data('dataframe')
  File "/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/amlb/datasets/openml.py", line 240, in _load_full_data
    X, *_ = self._oml_dataset.get_data(dataset_format=fmt)
  File "/app/ag_bench_runs/tabular/ag_bench_zs/.venv/lib/python3.9/site-packages/openml/datasets/dataset.py", line 704, in get_data
    data, categorical, attribute_names = self._load_data()
  File "/app/ag_bench_runs/tabular/ag_bench_zs/.venv/lib/python3.9/site-packages/openml/datasets/dataset.py", line 529, in _load_data
    return self._cache_compressed_file_from_file(file_to_load)
  File "/app/ag_bench_runs/tabular/ag_bench_zs/.venv/lib/python3.9/site-packages/openml/datasets/dataset.py", line 491, in _cache_compressed_file_from_file
    raise Exception(f"File: {data_file}") from e
Exception: File: /root/.cache/openml/org/openml/www/datasets/54/dataset_54.pq
Loading metadata from `/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/results/autogluon.small.test.local.20230725T183716/predictions/vehicle/0/metadata.json`.
Metric scores: { 'acc': nan,
  'app_version': 'dev [NA, NA, NA]',
  'balacc': nan,
  'constraint': 'test',
  'duration': nan,
  'fold': 0,
  'framework': 'AutoGluon',
  'id': 'openml.org/t/53',
  'info': 'Exception: File: '
          '/root/.cache/openml/org/openml/www/datasets/54/dataset_54.pq',
  'logloss': nan,
  'metric': 'neg_logloss',
  'mode': 'local',
  'models_count': nan,
  'params': '',
  'predict_duration': nan,
  'result': nan,
  'seed': 407703350,
  'task': 'vehicle',
  'training_duration': nan,
  'type': 'multiclass',
  'utc': '2023-07-25T18:37:16',
  'version': '0.8.2'}
Job `local.small.test.vehicle.0.AutoGluon` executed in 0.021 seconds.
Scores saved to `/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/results/autogluon.small.test.local.20230725T183716/scores/results.csv`.
Scores saved to `/app/ag_bench_runs/tabular/ag_bench_zs/automlbenchmark/results/results.csv`.
All jobs executed in 0.045 seconds.
[MONITORING] [local.small.test.vehicle.0.AutoGluon] CPU Utilization: 14.3%
[MONITORING] [local.small.test.vehicle.0.AutoGluon] Memory Usage: 9.5%
[MONITORING] [local.small.test.vehicle.0.AutoGluon] Disk Usage: 89.3%
Processing results for autogluon.small.test.local.20230725T183716
Summing up scores for current run:
             id    task  fold framework constraint      metric  duration      seed                                                                          info
openml.org/t/53 vehicle     0 AutoGluon       test neg_logloss      0.02 407703350 Exception: File: /root/.cache/openml/org/openml/www/datasets/54/dataset_54.pq
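
The job itself dies on the corrupted Parquet download rather than on `git`. If I understand the format correctly, a valid Parquet file both starts and ends with the 4-byte magic `PAR1`, so the cached file from the traceback can be sanity-checked directly:

# Both commands should print "PAR1" for an intact Parquet file;
# path taken from the traceback above.
head -c 4 /root/.cache/openml/org/openml/www/datasets/54/dataset_54.pq; echo
tail -c 4 /root/.cache/openml/org/openml/www/datasets/54/dataset_54.pq; echo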

This, however, runs fine:

python3 automlbenchmark/runbenchmark.py AutoGluon:stable -s only

The Dockerfile is simple:

# Base image: AWS Deep Learning Container with PyTorch 1.13.1 (GPU, Python 3.9, CUDA 11.7)
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-ec2

WORKDIR /app/
COPY entrypoint.sh /app/

# Install cron, unzip and curl, then the AWS CLI v2
RUN apt-get update && apt-get install -y cron unzip curl \
    && rm -rf /var/lib/apt/lists/* \
    && mkdir -p /var/spool/cron/crontabs \
    && touch /var/log/cron.log \
    && curl "https://d1vvhvl2y92vvt.cloudfront.net/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
    && unzip awscliv2.zip \
    && ./aws/install \
    && rm awscliv2.zip

ENTRYPOINT ["./entrypoint.sh"]
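
For context, I build and run the image roughly like this (the tag name is just illustrative):

# Build the image and run it; entrypoint.sh does the actual work.
docker build -t ag-bench .
docker run --rm ag-bench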

In entrypoint.sh:

DIR="ag_bench_runs/tabular/ag_bench_test"

if [ ! -d $DIR ]; then
  mkdir -p $DIR
fi

# create virtual env
python3 -m venv $DIR/.venv
source $DIR/.venv/bin/activate

# install latest AMLB
pip install --upgrade pip
pip install --upgrade setuptools wheel
git clone --depth 1 --branch stable https://github.com/openml/automlbenchmark.git $DIR/automlbenchmark
pip install -r $DIR/automlbenchmark/requirements.txt

python $DIR/automlbenchmark/runbenchmark.py AutoGluon:stable small test -t vehicle -f 0
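
(I suspect the `fatal: not a git repository` line is separate from the real failure: `runbenchmark.py` is invoked from `/app` rather than from inside the clone, and the TaskConfig above shows `'git_info': {'repo': 'NA', ...}`, so the message presumably comes from AMLB trying to collect git metadata for the run. Running from inside the clone seems like it should silence that warning:)

# Hypothetical variant of the last line above: run from inside the clone
# so AMLB can pick up the repository's git metadata.
cd "$DIR/automlbenchmark"
python runbenchmark.py AutoGluon:stable small test -t vehicle -f 0
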
PGijsbers commented 11 months ago

Thanks for the report. It is related to the server issues: our infrastructure is up and running again, so there is no 503 error anymore, but now we have this problem. This has to be fixed server-side; I have forwarded the report/request. Your report of the problems starting last week matches up with the server problems at OpenML.

suzhoum commented 11 months ago

Thanks for your response @PGijsbers! One thing I'm unsure about is why AMLB works on my dev machine (EC2) but not in Docker. Is there any workaround for Docker at the moment?

PGijsbers commented 11 months ago

Did you upgrade your dev machine to OpenML Python 0.14.1? Your Docker container will be at 0.13.1. Also, the server issues are clearing up, so hopefully both will soon work automagically regardless :)
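
If in doubt, you can check what a given environment actually resolved to:

python3 -c "import openml; print(openml.__version__)"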

suzhoum commented 11 months ago

I was actually using 0.13.1 on my dev machine as well, since I created a fresh virtual env for it and installed the default packages from requirements.txt.

PGijsbers commented 11 months ago

Did you run the command in your local environment a while back? Maybe the OpenML cache is somehow not shared with the Docker container, so the local environment loads a correct cached file while Docker downloads a new, corrupted one. I can try to reproduce it later this week, though there is a chance the server issues are resolved before then.
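
If cache sharing turns out to be the difference, a possible stopgap would be to mount the known-good host cache into the container, along these lines (image name illustrative; adjust the host path to wherever your OpenML cache lives, e.g. ~/.openml or ~/.cache/openml depending on the openml version):

# Reuse the host's known-good OpenML cache instead of re-downloading.
docker run --rm -v "$HOME/.cache/openml:/root/.cache/openml" ag-bench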

suzhoum commented 11 months ago

Oh, that totally makes sense now. Yes, I have been running the benchmark in my local environment for a few months, so the cache should still be available, while the container is built fresh.

PGijsbers commented 11 months ago

OpenML server issues should mostly be gone now. I just ran `python3 automlbenchmark/runbenchmark.py AutoGluon:stable small test -t vehicle -f 0` locally after clearing my cache (`rm -rf ~/.openml/org/openml/www/datasets/54`), and it completed successfully:

Processing results for autogluon.small.test.local.20230730T142453
Summing up scores for current run:
             id    task  fold framework constraint    result      metric  duration       seed
openml.org/t/53 vehicle     0 AutoGluon       test -0.404008 neg_logloss      69.0 1666932321

Can you see if the problem still occurs?

suzhoum commented 11 months ago

It's working fine now!