openml / openml-python

OpenML's Python API for a World of Data and More 💫
http://openml.github.io/openml-python/

Process killed when pickling dataset #780

Open PGijsbers opened 5 years ago

PGijsbers commented 5 years ago

Several users seem to have difficulty downloading dataset 41081. The process is killed with SIGKILL while pickling the dataset to disk here.

We deduce this from the fact that the file is created but never populated with the dataset (its size is 0 bytes). This is reinforced by the fact that the failure also occurs when the ARFF file has been downloaded beforehand, and that it does not occur if a 'healthy' dataset.pkl.py3 file is present.
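For anyone who wants to check whether their cache is in this state, here is a minimal sketch that looks for 0-byte pickle files. The helper name `find_empty_pickles` is hypothetical, and the cache root is an assumption: pass in whatever directory your openml cache actually lives in.

```python
from pathlib import Path


def find_empty_pickles(cache_root, pattern="dataset.pkl.py3"):
    """Return cached pickle files under cache_root that are 0 bytes long.

    cache_root is whatever directory holds your cached datasets (an
    assumption here; the location is not stated in this issue).
    """
    root = Path(cache_root).expanduser()
    if not root.is_dir():
        return []
    # A file that was created but never written to has size 0.
    return sorted(
        path
        for path in root.rglob(pattern)
        if path.is_file() and path.stat().st_size == 0
    )
```

Calling `find_empty_pickles("/path/to/your/cache")` returns a list of `Path` objects. Deleting those files only removes the empty leftovers; it does not address whatever kills the process.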

The report originates from this issue, which also describes two user environments that reproduced the error consistently.

PGijsbers commented 5 years ago

Before diving deep into recreating every aspect of the users' environments, I tried to reproduce this in my own environment. No failure occurred with either Python 3.7.3 or 3.7.4.

joaquinvanschoren commented 5 years ago

Any idea on what is causing this?

PGijsbers commented 5 years ago

No idea; I have not been able to reproduce it yet. There have been some changes to pickle in Python 3.7.4, but those supposedly fixed a memory leak. I don't have time to dig into this this week.

yorickvanzweeden commented 5 years ago

I am currently experiencing this with Python 3.7.4. It occurs both with and without a Conda environment.

Code run:

import openml as oml
SVHN = oml.datasets.get_dataset(41081)

This results in a memory leak, and the process is eventually killed. I am using Ubuntu 18.04 with kernel 4.15.0-64-generic.
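To check whether the pickling step is where memory grows, one option is to measure peak allocation around `pickle.dump` with the standard-library `tracemalloc` module. This is only a diagnostic sketch: the helper name `peak_bytes_while_pickling` is hypothetical, it pickles into an in-memory buffer rather than to disk, and it has not been run against dataset 41081.

```python
import io
import pickle
import tracemalloc


def peak_bytes_while_pickling(obj, protocol=pickle.HIGHEST_PROTOCOL):
    """Return (peak traced bytes, pickled size in bytes) for pickling obj."""
    buffer = io.BytesIO()
    tracemalloc.start()
    try:
        pickle.dump(obj, buffer, protocol=protocol)
        # get_traced_memory() returns (current, peak) since start().
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak, buffer.getbuffer().nbytes
```

Comparing the two numbers for a small object and for a larger one gives a rough idea of how much more memory pickling needs than the serialized size itself.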

PGijsbers commented 5 years ago

I managed to mostly recreate this problem in Docker. To reproduce it, start a container from an image built with the following Dockerfile:

FROM ubuntu:18.04

ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update

RUN apt-get install software-properties-common -y
RUN apt-get update

#RUN apt-get install add-apt-repository
#RUN add-apt-repository universe

# Install Python3.7.4, regular universe version seems to only get 3.7.3?

RUN apt-get install build-essential checkinstall -y
RUN apt-get install libreadline-gplv2-dev libncursesw5-dev libssl-dev libsqlite3-dev tk-dev libgdbm-dev libc6-dev libbz2-dev libffi-dev zlib1g-dev wget -y

RUN wget https://www.python.org/ftp/python/3.7.4/Python-3.7.4.tgz
RUN tar xzf Python-3.7.4.tgz

WORKDIR Python-3.7.4
RUN ./configure --enable-optimizations
RUN make altinstall

# Install OpenML
RUN apt-get install git -y
WORKDIR ~
RUN git clone https://github.com/openml/openml-python.git
WORKDIR openml-python
RUN python3.7 -m pip install -e .

Then run:

root@c04371750d76:/Python-3.7.4/~/openml-python# python3.7
Python 3.7.4 (default, Sep 23 2019, 14:59:26)
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import openml as oml
/usr/local/lib/python3.7/site-packages/pandas/compat/__init__.py:84: UserWarning: Could not import the lzma module. Your installed Python is incomplete. Attempting to use lzma compression will result in a RuntimeError.
  warnings.warn(msg)
>>> SVHN = oml.datasets.get_dataset(41081)
Killed

However, downloading the dataset again kills the process again rather than raising an EOFError. I also tried installing commit 8efcf9d42e9d325307c5fed190fc4caf00cabb7c specifically, which should be the 0.9 release (we forgot to tag it), but the result was the same. For whatever reason, the ARFF file is not saved in this setup.

So I'll have to verify that what is killing the process in my case is still the pickling. If so, I'll look for a way to pickle the dataset without triggering the memory leak. That said, I'll first work on the workaround in #779, because I think it is unlikely we can fix this elegantly, and I'm hoping the problem will 'fix itself' in a later Python release.
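Separately from the leak itself, the 0-byte cache file could be avoided by writing the pickle to a temporary file and renaming it into place only once the dump has finished. The sketch below illustrates that idea; it is not what openml-python does, `dump_atomically` is a hypothetical name, and it does nothing about the memory use.

```python
import os
import pickle
import tempfile


def dump_atomically(obj, target_path):
    """Pickle obj to a temp file, then rename it over target_path."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # Create the temp file next to the target so both are on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)
        # The target only appears once the dump has completed.
        os.replace(tmp_path, target_path)
    except BaseException:
        # Remove the partial temp file on ordinary failures. A SIGKILL skips
        # this cleanup, but then only a stray temp file is left behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, a process killed during the dump leaves no file at `target_path`, so a later run would see a missing cache entry instead of an empty one.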

PGijsbers commented 5 years ago

I am having trouble identifying which datasets are susceptible to the freeze, but I have not encountered any issues outside of pickling. To avoid spending too much time on an issue that will hopefully be resolved in the next Python version (it seems to be a pickle issue) and for which a workaround exists (see #787), I am going to work on other issues. If a subsequent Python version solves the problem, I think we should consider adding a warning message for 3.7.4 users and closing this issue.

To anyone still experiencing this issue: make sure you install the development version of openml-python, which has a fix that loads the data from the ARFF file on the second try. If this does not work for you either, please let us know in this issue.
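For readers curious what "load from ARFF on the second try" could look like, here is a rough sketch of the general pattern. It is not the actual openml-python code: `load_with_fallback` is a hypothetical name, and `parse_arff` is a placeholder for whatever callable you use to read the ARFF file.

```python
import pickle


def load_with_fallback(pickle_path, arff_path, parse_arff):
    """Try the cached pickle first; fall back to parsing the ARFF file.

    parse_arff is a caller-supplied function (hypothetical here) that takes
    a path and returns the loaded dataset.
    """
    try:
        with open(pickle_path, "rb") as fh:
            return pickle.load(fh)
    except (EOFError, FileNotFoundError, pickle.UnpicklingError):
        # An empty pickle file raises EOFError on load; a missing or
        # corrupt one is handled the same way.
        return parse_arff(arff_path)
```

The fallback is slower than reading a healthy pickle, but it avoids failing on a cache file that a killed process left empty.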