vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Problems with vaex to support Python3 #369

Closed fprada closed 5 years ago

fprada commented 5 years ago

After we installed vaex using conda, the original Python 3 was replaced by Python 2 when we call ipython. In principle vaex supports Python 3, so how can we avoid this?

maartenbreddels commented 5 years ago

What I suspect is happening:

Now when you execute $ ipython, you may end up in the root Python environment, which is Python 2. You can check this by executing $ which ipython (which will probably not be, e.g., ~/anaconda/envs/vaex/bin/ipython). Is this correct?
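The same check can also be made from inside the interpreter itself. This is a minimal sketch using only the standard library; `describe_interpreter` is a hypothetical helper name, not part of vaex or conda.

```python
import sys


def describe_interpreter():
    """Return the path and version of the interpreter running this code."""
    return {
        "executable": sys.executable,  # path of the running python binary
        "prefix": sys.prefix,          # root of the active environment
        "major": sys.version_info[0],  # 2 or 3
        "version": "%d.%d.%d" % sys.version_info[:3],
    }


info = describe_interpreter()
for key in ("executable", "prefix", "version"):
    print("%s: %s" % (key, info[key]))
if info["major"] < 3:
    print("This session is running Python 2, not Python 3.")
```

Running this inside ipython shows whether the session lives in the expected environment directory or in the root installation.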

fprada commented 5 years ago

Hi again, here are more details.

We have installed the latest Anaconda version with Python 3 as root.

When we follow your instructions:

conda install -c maartenbreddels vaex

Vaex is installed successfully, but it downgrades our conda Python version to 2.7.16.

Then I can invoke Python from conda, which is version 2.7.16, and I can import vaex with "import vaex", but everything runs in Python 2.7.16.

Can we use vaex in Python 3?

Regards.

P.S. Here is the output we get when we try to install vaex with conda:


[root@skun6 ~]# conda install -c maartenbreddels vaex
WARNING: The conda.compat module is deprecated and will be removed in a future release.
Collecting package metadata: done
Solving environment: done

Package Plan

environment location: /usr/local/anaconda3

added / updated specs:

The following packages will be downloaded:

package                    |            build
---------------------------|-----------------
_ipyw_jlab_nb_ext_conf-0.1.0|           py27_0           4 KB
_libgcc_mutex-0.1          |             main           3 KB
alabaster-0.7.12           |           py27_0          17 KB
anaconda-client-1.7.2      |           py27_0         140 KB
anaconda-navigator-1.9.7   |           py27_0         4.8 MB
anaconda-project-0.8.3     |             py_0         212 KB
aplus-0.11.0               |           py27_0           9 KB  maartenbreddels
asn1crypto-0.24.0          |           py27_0         155 KB
astroid-1.6.5              |           py27_0         402 KB
astropy-2.0.9              |   py27hdd07704_0         6.8 MB
atomicwrites-1.3.0         |           py27_1          13 KB
attrdict-2.0.0             |           py27_0          18 KB  maartenbreddels
attrs-19.1.0               |           py27_1          56 KB
babel-2.7.0                |             py_0         5.8 MB
backcall-0.1.0             |           py27_0          19 KB
backports-1.0              |             py_2         139 KB
backports.functools_lru_cache-1.5|             py_2           9 KB
backports.os-0.1.1         |           py27_0          15 KB
backports.shutil_get_terminal_size-1.0.0|           py27_2           8 KB
backports.tempfile-1.0     |             py_1          12 KB
backports.weakref-1.0.post1|             py_1           7 KB
backports_abc-0.5          |             py_0          13 KB
beautifulsoup4-4.6.3       |           py27_0         135 KB
bitarray-0.9.3             |   py27h7b6447c_0          60 KB
bkcharts-0.2               |           py27_0         124 KB
bleach-3.1.0               |           py27_0         228 KB
bokeh-1.3.0                |           py27_0         4.0 MB
boto-2.49.0                |           py27_0         1.4 MB
bottleneck-1.2.1           |   py27h035aef0_1         127 KB
ca-certificates-2019.5.15  |                0         133 KB
cachetools-1.1.6           |           py27_0          24 KB  maartenbreddels
certifi-2019.6.16          |           py27_1         156 KB
cffi-1.12.3                |   py27h2e261b9_0         218 KB
chardet-3.0.4              |        py27_1003         186 KB
click-7.0                  |           py27_0         116 KB
cloudpickle-1.2.1          |             py_0          28 KB
clyent-1.2.2               |           py27_1          18 KB
colorama-0.4.1             |           py27_0          24 KB
conda-4.7.10               |           py27_0         3.0 MB
conda-build-3.18.9         |           py27_0         530 KB
conda-package-handling-1.3.11|           py27_0         260 KB
conda-verify-3.4.2         |             py_1          25 KB
configparser-3.7.4         |           py27_0          41 KB
contextlib2-0.5.5          |           py27_0          15 KB
cryptography-2.7           |   py27h1ba5d50_0         602 KB
cycler-0.10.0              |           py27_0          13 KB
cython-0.29.12             |   py27he6710b0_0         2.2 MB
cytoolz-0.10.0             |   py27h7b6447c_0         422 KB
dask-1.2.2                 |             py_0          11 KB
dask-core-1.2.2            |             py_0         539 KB
decorator-4.4.0            |           py27_1          18 KB
defusedxml-0.6.0           |             py_0          23 KB
distributed-1.28.1         |           py27_0         852 KB
docutils-0.15.1            |           py27_0         743 KB
entrypoints-0.3            |           py27_0          12 KB
enum34-1.1.6               |           py27_1          57 KB
et_xmlfile-1.0.1           |           py27_0          20 KB
fastcache-1.1.0            |   py27h7b6447c_0          31 KB
filelock-3.0.12            |             py_0          12 KB
flask-1.1.1                |             py_0          73 KB
funcsigs-1.0.2             |           py27_0          20 KB
functools32-3.2.3.2        |           py27_1          23 KB
future-0.17.1              |           py27_0         710 KB
futures-3.3.0              |           py27_0          28 KB
gevent-1.4.0               |   py27h7b6447c_0         2.5 MB
glob2-0.7                  |             py_0          14 KB
gmpy2-2.0.8                |   py27h10f8cd9_2         168 KB
greenlet-0.4.15            |   py27h7b6447c_0          20 KB
h5py-2.9.0                 |   py27h7918eee_0         1.1 MB
heapdict-1.0.0             |           py27_2           8 KB
html5lib-1.0.1             |           py27_0         189 KB
idna-2.8                   |           py27_0         133 KB
imageio-2.5.0              |           py27_0         3.3 MB
imagesize-1.1.0            |           py27_0           9 KB
ipaddress-1.0.22           |           py27_0          32 KB
ipykernel-4.10.0           |           py27_0         145 KB
ipython-5.8.0              |           py27_0         1.0 MB
ipython_genutils-0.2.0     |           py27_0          38 KB
ipywidgets-7.5.0           |             py_0         107 KB
isort-4.3.21               |           py27_0          68 KB
itsdangerous-1.1.0         |           py27_0          26 KB
jdcal-1.4.1                |             py_0          11 KB
jedi-0.13.3                |           py27_0         233 KB
jinja2-2.10.1              |           py27_0         181 KB
jprops-1.0                 |           py27_0           9 KB  maartenbreddels
jsonschema-3.0.1           |           py27_0          86 KB
jupyter-1.0.0              |           py27_7           6 KB
jupyter_client-5.3.1       |             py_0          69 KB
jupyter_console-5.2.0      |           py27_1          35 KB
jupyter_core-4.5.0         |             py_0          48 KB
jupyterlab-0.33.11         |           py27_0        10.0 MB
jupyterlab_launcher-0.11.2 |   py27h28b3542_0          32 KB
keyring-18.0.0             |           py27_0          54 KB
kiwisolver-1.1.0           |   py27he6710b0_0          91 KB
lazy-object-proxy-1.4.1    |   py27h7b6447c_0          29 KB
linecache2-1.0.0           |           py27_0          24 KB
llvmlite-0.29.0            |   py27hd408876_0        17.7 MB
locket-0.2.0               |           py27_1           8 KB
lxml-4.3.4                 |   py27hefd8a0e_0         1.4 MB
markupsafe-1.1.1           |   py27h7b6447c_0          29 KB
matplotlib-2.2.3           |   py27hb69df0a_0         6.5 MB
mccabe-0.6.1               |           py27_1          13 KB
mistune-0.8.4              |   py27h7b6447c_0          53 KB
mkl-service-2.0.2          |   py27h7b6447c_0          67 KB
mkl_fft-1.0.12             |   py27ha843d7b_0         163 KB
mkl_random-1.0.2           |   py27hd81dba3_0         383 KB
more-itertools-5.0.0       |           py27_0          86 KB
mpmath-1.1.0               |           py27_0         972 KB
msgpack-python-0.6.1       |   py27hfd86e86_1          90 KB
multipledispatch-0.6.0     |           py27_0          21 KB
navigator-updater-0.2.1    |           py27_0         1.2 MB
nbconvert-5.5.0            |             py_0         381 KB
nbformat-4.4.0             |           py27_0         139 KB
networkx-2.2               |           py27_1         2.0 MB
nltk-3.4.4                 |           py27_0         2.1 MB
nose-1.3.7                 |           py27_2         213 KB
notebook-5.7.8             |           py27_0         7.2 MB
numba-0.45.0               |   py27h962f231_0         3.0 MB
numexpr-2.6.9              |   py27h9e4a6bb_0         193 KB
numpy-1.16.4               |   py27h7e9f1db_0          49 KB
numpy-base-1.16.4          |   py27hde5b4d6_0         4.3 MB
numpydoc-0.9.1             |             py_0          31 KB
olefile-0.46               |           py27_0          48 KB
openpyxl-2.6.2             |             py_0         157 KB
openssl-1.1.1c             |       h7b6447c_1         3.8 MB
packaging-19.0             |           py27_0          37 KB
pandas-0.24.2              |   py27he6710b0_0        10.9 MB
pandocfilters-1.4.2        |           py27_1          13 KB
parso-0.5.0                |             py_0          67 KB
partd-1.0.0                |             py_0          19 KB
path.py-11.1.0             |           py27_0          52 KB
pathlib2-2.3.4             |           py27_0          35 KB
patsy-0.5.1                |           py27_0         375 KB
pep8-1.7.1                 |           py27_0          51 KB
pexpect-4.7.0              |           py27_0          80 KB
pickleshare-0.7.5          |           py27_0          12 KB
pillow-6.1.0               |   py27h34e0f95_0         631 KB
pip-19.1.1                 |           py27_0         1.8 MB
pkginfo-1.5.0.1            |           py27_0          41 KB
pluggy-0.11.0              |             py_0          20 KB
ply-3.11                   |           py27_0          79 KB
progressbar2-3.6.0         |           py27_0          25 KB  maartenbreddels
prometheus_client-0.7.1    |             py_0          42 KB
prompt_toolkit-1.0.15      |           py27_0         333 KB
psutil-5.6.3               |   py27h7b6447c_0         321 KB
ptyprocess-0.6.0           |           py27_0          22 KB
py-1.8.0                   |           py27_0         137 KB
py-lief-0.9.0              |   py27h7725739_2         1.6 MB
pycodestyle-2.5.0          |           py27_0          60 KB
pycosat-0.6.3              |   py27h14c3975_0         103 KB
pycparser-2.19             |           py27_0         173 KB
pycrypto-2.6.1             |   py27h14c3975_9         460 KB
pycurl-7.43.0.2            |   py27h1ba5d50_0         184 KB
pyflakes-2.1.1             |           py27_0         100 KB
pygments-2.4.2             |             py_0         664 KB
pylint-1.9.2               |           py27_0         772 KB
pyodbc-4.0.26              |   py27he6710b0_0          71 KB
pyopengl-3.1.1a1           |           py27_0         1.3 MB
pyopenssl-19.0.0           |           py27_0          80 KB
pyparsing-2.4.0            |             py_0          58 KB
pyqt-5.9.2                 |   py27h05f1152_2         5.4 MB
pyrsistent-0.14.11         |   py27h7b6447c_0          88 KB
pysocks-1.7.0              |           py27_0          29 KB
pytables-3.5.1             |   py27h71ec239_0         1.4 MB
pytest-4.5.0               |           py27_0         358 KB
pytest-arraydiff-0.3       |   py27h39e3cac_0          15 KB
pytest-astropy-0.5.0       |           py27_0           6 KB
pytest-doctestplus-0.3.0   |           py27_0          23 KB
pytest-openfiles-0.3.2     |           py27_0          11 KB
pytest-remotedata-0.3.1    |           py27_0          13 KB
python-2.7.16              |       h9bab390_0        12.8 MB
python-dateutil-2.8.0      |           py27_0         279 KB
python-libarchive-c-2.8    |          py27_11          22 KB
pytz-2019.1                |             py_0         236 KB
pywavelets-1.0.3           |   py27hdd07704_1         4.4 MB
pyyaml-5.1.1               |   py27h7b6447c_0         177 KB
pyzmq-18.0.0               |   py27he6710b0_0         463 KB
qtawesome-0.5.7            |           py27_1         615 KB
qtconsole-4.5.2            |             py_0          92 KB
qtpy-1.8.0                 |             py_0          38 KB
requests-2.22.0            |           py27_0          89 KB
rope-0.14.0                |             py_0         113 KB
ruamel_yaml-0.15.46        |   py27h14c3975_0         241 KB
scandir-1.10.0             |   py27h7b6447c_0          32 KB
scikit-image-0.14.2        |   py27he6710b0_0        24.0 MB
scikit-learn-0.20.3        |   py27hd81dba3_0         5.8 MB
scipy-1.2.1                |   py27h7c811a0_0        17.6 MB
seaborn-0.9.0              |           py27_0         374 KB
send2trash-1.5.0           |           py27_0          16 KB
setuptools-41.0.1          |           py27_0         640 KB
simplegeneric-0.8.1        |           py27_2           9 KB
singledispatch-3.4.0.3     |           py27_0          15 KB
sip-4.19.8                 |   py27hf484d3e_0         291 KB
six-1.12.0                 |           py27_0          22 KB
snowballstemmer-1.9.0      |             py_0          53 KB
sortedcollections-1.1.2    |           py27_0          17 KB
sortedcontainers-2.1.0     |           py27_0          44 KB
sphinx-1.8.5               |           py27_0         1.9 MB
sphinxcontrib-1.0          |           py27_1           3 KB
sphinxcontrib-websupport-1.1.2|             py_0          35 KB
spyder-3.3.6               |           py27_0         2.5 MB
spyder-kernels-0.5.1       |           py27_0          68 KB
sqlalchemy-1.3.5           |   py27h7b6447c_0         1.7 MB
statsmodels-0.10.1         |   py27hdd07704_0         9.6 MB
subprocess32-3.5.4         |   py27h7b6447c_0          49 KB
sympy-1.4                  |           py27_0         9.9 MB
tblib-1.4.0                |             py_0          14 KB
terminado-0.8.2            |           py27_0          22 KB
testpath-0.4.2             |           py27_0          91 KB
toolz-0.10.0               |             py_0          50 KB
tornado-5.1.1              |   py27h7b6447c_0         643 KB
tqdm-4.32.1                |             py_0          48 KB
traceback2-1.4.0           |           py27_0          30 KB
traitlets-4.3.2            |           py27_0         128 KB
typing-3.7.4               |           py27_0          49 KB
unicodecsv-0.14.1          |           py27_0          24 KB
unittest2-1.1.0            |           py27_0         143 KB
urllib3-1.24.2             |           py27_0         151 KB
vaex-1.0.0b2               |           py27_0         1.0 MB  maartenbreddels
wcwidth-0.1.7              |           py27_0          25 KB
webencodings-0.5.1         |           py27_1          19 KB
werkzeug-0.15.4            |             py_0         262 KB
wheel-0.33.4               |           py27_0          39 KB
widgetsnbextension-3.5.0   |           py27_0         1.8 MB
wrapt-1.11.2               |   py27h7b6447c_0          48 KB
wurlitzer-1.0.2            |           py27_0          12 KB
xlrd-1.2.0                 |           py27_0         187 KB
xlsxwriter-1.1.8           |             py_0         105 KB
xlwt-1.3.0                 |           py27_0         160 KB
zict-1.0.0                 |             py_0          12 KB
zipp-0.5.1                 |             py_0           8 KB
------------------------------------------------------------
                                       Total:       234.9 MB

The following NEW packages will be INSTALLED:

_libgcc_mutex pkgs/main/linux-64::_libgcc_mutex-0.1-main
aplus maartenbreddels/linux-64::aplus-0.11.0-py27_0
attrdict maartenbreddels/linux-64::attrdict-2.0.0-py27_0
backports.functoo~ pkgs/main/noarch::backports.functools_lru_cache-1.5-py_2
backports.tempfile pkgs/main/noarch::backports.tempfile-1.0-py_1
backports.weakref pkgs/main/noarch::backports.weakref-1.0.post1-py_1
backports_abc pkgs/main/noarch::backports_abc-0.5-py_0
cachetools maartenbreddels/linux-64::cachetools-1.1.6-py27_0
conda-package-han~ pkgs/main/linux-64::conda-package-handling-1.3.11-py27_0
configparser pkgs/main/linux-64::configparser-3.7.4-py27_0
enum34 pkgs/main/linux-64::enum34-1.1.6-py27_1
funcsigs pkgs/main/linux-64::funcsigs-1.0.2-py27_0
functools32 pkgs/main/linux-64::functools32-3.2.3.2-py27_1
futures pkgs/main/linux-64::futures-3.3.0-py27_0
ipaddress pkgs/main/linux-64::ipaddress-1.0.22-py27_0
jprops maartenbreddels/linux-64::jprops-1.0-py27_0
jupyterlab_launch~ pkgs/main/linux-64::jupyterlab_launcher-0.11.2-py27h28b3542_0
linecache2 pkgs/main/linux-64::linecache2-1.0.0-py27_0
progressbar2 maartenbreddels/linux-64::progressbar2-3.6.0-py27_0
pyopengl pkgs/main/linux-64::pyopengl-3.1.1a1-py27_0
scandir pkgs/main/linux-64::scandir-1.10.0-py27h7b6447c_0
subprocess32 pkgs/main/linux-64::subprocess32-3.5.4-py27h7b6447c_0
traceback2 pkgs/main/linux-64::traceback2-1.4.0-py27_0
typing pkgs/main/linux-64::typing-3.7.4-py27_0
unittest2 pkgs/main/linux-64::unittest2-1.1.0-py27_0
vaex maartenbreddels/linux-64::vaex-1.0.0b2-py27_0

The following packages will be REMOVED:

anaconda-2019.03-py37_0
importlib_metadata-0.8-py37_0
jeepney-0.4-py37_0
jupyterlab_server-0.2.0-py37_0
secretstorage-3.1.1-py37_0
soupsieve-1.8-py37_0

The following packages will be UPDATED:

anaconda-project pkgs/main/linux-64::anaconda-project-~ --> pkgs/main/noarch::anaconda-project-0.8.3-py_0
babel pkgs/main/linux-64::babel-2.6.0-py37_0 --> pkgs/main/noarch::babel-2.7.0-py_0
backports pkgs/main/linux-64::backports-1.0-py3~ --> pkgs/main/noarch::backports-1.0-py_2
bitarray 0.8.3-py37h14c3975_0 --> 0.9.3-py27h7b6447c_0
bokeh 1.0.4-py37_0 --> 1.3.0-py27_0
ca-certificates 2019.1.23-0 --> 2019.5.15-0
certifi 2019.3.9-py37_0 --> 2019.6.16-py27_1
cffi 1.12.2-py37h2e261b9_1 --> 1.12.3-py27h2e261b9_0
chardet 3.0.4-py37_1 --> 3.0.4-py27_1003
cloudpickle pkgs/main/linux-64::cloudpickle-0.8.0~ --> pkgs/main/noarch::cloudpickle-1.2.1-py_0
conda 4.6.11-py37_0 --> 4.7.10-py27_0
conda-build 3.17.8-py37_0 --> 3.18.9-py27_0
conda-verify pkgs/main/linux-64::conda-verify-3.1.~ --> pkgs/main/noarch::conda-verify-3.4.2-py_1
cryptography 2.6.1-py37h1ba5d50_0 --> 2.7-py27h1ba5d50_0
cython 0.29.6-py37he6710b0_0 --> 0.29.12-py27he6710b0_0
cytoolz 0.9.0.1-py37h14c3975_1 --> 0.10.0-py27h7b6447c_0
dask pkgs/main/linux-64::dask-1.1.4-py37_1 --> pkgs/main/noarch::dask-1.2.2-py_0
dask-core pkgs/main/linux-64::dask-core-1.1.4-p~ --> pkgs/main/noarch::dask-core-1.2.2-py_0
defusedxml pkgs/main/linux-64::defusedxml-0.5.0-~ --> pkgs/main/noarch::defusedxml-0.6.0-py_0
distributed 1.26.0-py37_1 --> 1.28.1-py27_0
docutils 0.14-py37_0 --> 0.15.1-py27_0
fastcache 1.0.2-py37h14c3975_2 --> 1.1.0-py27h7b6447c_0
filelock pkgs/main/linux-64::filelock-3.0.10-p~ --> pkgs/main/noarch::filelock-3.0.12-py_0
flask pkgs/main/linux-64::flask-1.0.2-py37_1 --> pkgs/main/noarch::flask-1.1.1-py_0
glob2 pkgs/main/linux-64::glob2-0.6-py37_1 --> pkgs/main/noarch::glob2-0.7-py_0
ipywidgets pkgs/main/linux-64::ipywidgets-7.4.2-~ --> pkgs/main/noarch::ipywidgets-7.5.0-py_0
isort 4.3.16-py37_0 --> 4.3.21-py27_0
jdcal pkgs/main/linux-64::jdcal-1.4-py37_0 --> pkgs/main/noarch::jdcal-1.4.1-py_0
jinja2 2.10-py37_0 --> 2.10.1-py27_0
jupyter_client pkgs/main/linux-64::jupyter_client-5.~ --> pkgs/main/noarch::jupyter_client-5.3.1-py_0
jupyter_core pkgs/main/linux-64::jupyter_core-4.4.~ --> pkgs/main/noarch::jupyter_core-4.5.0-py_0
kiwisolver 1.0.1-py37hf484d3e_0 --> 1.1.0-py27he6710b0_0
lazy-object-proxy 1.3.1-py37h14c3975_2 --> 1.4.1-py27h7b6447c_0
llvmlite 0.28.0-py37hd408876_0 --> 0.29.0-py27hd408876_0
lxml 4.3.2-py37hefd8a0e_0 --> 4.3.4-py27hefd8a0e_0
mkl-service 1.1.2-py37he904b0f_5 --> 2.0.2-py27h7b6447c_0
mkl_fft 1.0.10-py37ha843d7b_0 --> 1.0.12-py27ha843d7b_0
nbconvert pkgs/main/linux-64::nbconvert-5.4.1-p~ --> pkgs/main/noarch::nbconvert-5.5.0-py_0
nltk 3.4-py37_1 --> 3.4.4-py27_0
numba 0.43.1-py37h962f231_0 --> 0.45.0-py27h962f231_0
numpy 1.16.2-py37h7e9f1db_0 --> 1.16.4-py27h7e9f1db_0
numpy-base 1.16.2-py37hde5b4d6_0 --> 1.16.4-py27hde5b4d6_0
numpydoc pkgs/main/linux-64::numpydoc-0.8.0-py~ --> pkgs/main/noarch::numpydoc-0.9.1-py_0
openpyxl pkgs/main/linux-64::openpyxl-2.6.1-py~ --> pkgs/main/noarch::openpyxl-2.6.2-py_0
openssl 1.1.1b-h7b6447c_1 --> 1.1.1c-h7b6447c_1
parso pkgs/main/linux-64::parso-0.3.4-py37_0 --> pkgs/main/noarch::parso-0.5.0-py_0
partd pkgs/main/linux-64::partd-0.3.10-py37~ --> pkgs/main/noarch::partd-1.0.0-py_0
pathlib2 2.3.3-py37_0 --> 2.3.4-py27_0
pexpect 4.6.0-py37_0 --> 4.7.0-py27_0
pillow 5.4.1-py37h34e0f95_0 --> 6.1.0-py27h34e0f95_0
pip 19.0.3-py37_0 --> 19.1.1-py27_0
pluggy pkgs/main/linux-64::pluggy-0.9.0-py37~ --> pkgs/main/noarch::pluggy-0.11.0-py_0
prometheus_client pkgs/main/linux-64::prometheus_client~ --> pkgs/main/noarch::prometheus_client-0.7.1-py_0
psutil 5.6.1-py37h7b6447c_0 --> 5.6.3-py27h7b6447c_0
pygments pkgs/main/linux-64::pygments-2.3.1-py~ --> pkgs/main/noarch::pygments-2.4.2-py_0
pyparsing pkgs/main/linux-64::pyparsing-2.3.1-p~ --> pkgs/main/noarch::pyparsing-2.4.0-py_0
pysocks 1.6.8-py37_0 --> 1.7.0-py27_0
pytest 4.3.1-py37_0 --> 4.5.0-py27_0
python-libarchive~ 2.8-py37_6 --> 2.8-py27_11
pytz pkgs/main/linux-64::pytz-2018.9-py37_0 --> pkgs/main/noarch::pytz-2019.1-py_0
pywavelets 1.0.2-py37hdd07704_0 --> 1.0.3-py27hdd07704_1
pyyaml 5.1-py37h7b6447c_0 --> 5.1.1-py27h7b6447c_0
qtconsole pkgs/main/linux-64::qtconsole-4.4.3-p~ --> pkgs/main/noarch::qtconsole-4.5.2-py_0
qtpy pkgs/main/linux-64::qtpy-1.7.0-py37_1 --> pkgs/main/noarch::qtpy-1.8.0-py_0
requests 2.21.0-py37_0 --> 2.22.0-py27_0
rope pkgs/main/linux-64::rope-0.12.0-py37_0 --> pkgs/main/noarch::rope-0.14.0-py_0
setuptools 40.8.0-py37_0 --> 41.0.1-py27_0
snowballstemmer pkgs/main/linux-64::snowballstemmer-1~ --> pkgs/main/noarch::snowballstemmer-1.9.0-py_0
sphinxcontrib-web~ pkgs/main/linux-64::sphinxcontrib-web~ --> pkgs/main/noarch::sphinxcontrib-websupport-1.1.2-py_0
spyder 3.3.3-py37_0 --> 3.3.6-py27_0
spyder-kernels 0.4.2-py37_0 --> 0.5.1-py27_0
sqlalchemy 1.3.1-py37h7b6447c_0 --> 1.3.5-py27h7b6447c_0
statsmodels 0.9.0-py37h035aef0_0 --> 0.10.1-py27hdd07704_0
sympy 1.3-py37_0 --> 1.4-py27_0
tblib pkgs/main/linux-64::tblib-1.3.2-py37_0 --> pkgs/main/noarch::tblib-1.4.0-py_0
terminado 0.8.1-py37_1 --> 0.8.2-py27_0
toolz pkgs/main/linux-64::toolz-0.9.0-py37_0 --> pkgs/main/noarch::toolz-0.10.0-py_0
tqdm pkgs/main/linux-64::tqdm-4.31.1-py37_1 --> pkgs/main/noarch::tqdm-4.32.1-py_0
urllib3 1.24.1-py37_0 --> 1.24.2-py27_0
werkzeug pkgs/main/linux-64::werkzeug-0.14.1-p~ --> pkgs/main/noarch::werkzeug-0.15.4-py_0
wheel 0.33.1-py37_0 --> 0.33.4-py27_0
widgetsnbextension 3.4.2-py37_0 --> 3.5.0-py27_0
wrapt 1.11.1-py37h7b6447c_0 --> 1.11.2-py27h7b6447c_0
xlsxwriter pkgs/main/linux-64::xlsxwriter-1.1.5-~ --> pkgs/main/noarch::xlsxwriter-1.1.8-py_0
zict pkgs/main/linux-64::zict-0.1.4-py37_0 --> pkgs/main/noarch::zict-1.0.0-py_0
zipp pkgs/main/linux-64::zipp-0.3.3-py37_1 --> pkgs/main/noarch::zipp-0.5.1-py_0

The following packages will be DOWNGRADED:

_ipyw_jlab_nb_ext~ 0.1.0-py37_0 --> 0.1.0-py27_0
alabaster 0.7.12-py37_0 --> 0.7.12-py27_0
anaconda-client 1.7.2-py37_0 --> 1.7.2-py27_0
anaconda-navigator 1.9.7-py37_0 --> 1.9.7-py27_0
asn1crypto 0.24.0-py37_0 --> 0.24.0-py27_0
astroid 2.2.5-py37_0 --> 1.6.5-py27_0
astropy 3.1.2-py37h7b6447c_0 --> 2.0.9-py27hdd07704_0
atomicwrites 1.3.0-py37_1 --> 1.3.0-py27_1
attrs 19.1.0-py37_1 --> 19.1.0-py27_1
backcall 0.1.0-py37_0 --> 0.1.0-py27_0
backports.os 0.1.1-py37_0 --> 0.1.1-py27_0
backports.shutil~ 1.0.0-py37_2 --> 1.0.0-py27_2
beautifulsoup4 4.7.1-py37_1 --> 4.6.3-py27_0
bkcharts 0.2-py37_0 --> 0.2-py27_0
bleach 3.1.0-py37_0 --> 3.1.0-py27_0
boto 2.49.0-py37_0 --> 2.49.0-py27_0
bottleneck 1.2.1-py37h035aef0_1 --> 1.2.1-py27h035aef0_1
click 7.0-py37_0 --> 7.0-py27_0
clyent 1.2.2-py37_1 --> 1.2.2-py27_1
colorama 0.4.1-py37_0 --> 0.4.1-py27_0
contextlib2 0.5.5-py37_0 --> 0.5.5-py27_0
cycler 0.10.0-py37_0 --> 0.10.0-py27_0
decorator 4.4.0-py37_1 --> 4.4.0-py27_1
entrypoints 0.3-py37_0 --> 0.3-py27_0
et_xmlfile 1.0.1-py37_0 --> 1.0.1-py27_0
future 0.17.1-py37_0 --> 0.17.1-py27_0
gevent 1.4.0-py37h7b6447c_0 --> 1.4.0-py27h7b6447c_0
gmpy2 2.0.8-py37h10f8cd9_2 --> 2.0.8-py27h10f8cd9_2
greenlet 0.4.15-py37h7b6447c_0 --> 0.4.15-py27h7b6447c_0
h5py 2.9.0-py37h7918eee_0 --> 2.9.0-py27h7918eee_0
heapdict 1.0.0-py37_2 --> 1.0.0-py27_2
html5lib 1.0.1-py37_0 --> 1.0.1-py27_0
idna 2.8-py37_0 --> 2.8-py27_0
imageio 2.5.0-py37_0 --> 2.5.0-py27_0
imagesize 1.1.0-py37_0 --> 1.1.0-py27_0
ipykernel 5.1.0-py37h39e3cac_0 --> 4.10.0-py27_0
ipython 7.4.0-py37h39e3cac_0 --> 5.8.0-py27_0
ipython_genutils 0.2.0-py37_0 --> 0.2.0-py27_0
itsdangerous 1.1.0-py37_0 --> 1.1.0-py27_0
jedi 0.13.3-py37_0 --> 0.13.3-py27_0
jsonschema 3.0.1-py37_0 --> 3.0.1-py27_0
jupyter 1.0.0-py37_7 --> 1.0.0-py27_7
jupyter_console 6.0.0-py37_0 --> 5.2.0-py27_1
jupyterlab 0.35.4-py37hf63ae98_0 --> 0.33.11-py27_0
keyring 18.0.0-py37_0 --> 18.0.0-py27_0
locket 0.2.0-py37_1 --> 0.2.0-py27_1
markupsafe 1.1.1-py37h7b6447c_0 --> 1.1.1-py27h7b6447c_0
matplotlib 3.0.3-py37h5429711_0 --> 2.2.3-py27hb69df0a_0
mccabe 0.6.1-py37_1 --> 0.6.1-py27_1
mistune 0.8.4-py37h7b6447c_0 --> 0.8.4-py27h7b6447c_0
mkl_random 1.0.2-py37hd81dba3_0 --> 1.0.2-py27hd81dba3_0
more-itertools 6.0.0-py37_0 --> 5.0.0-py27_0
mpmath 1.1.0-py37_0 --> 1.1.0-py27_0
msgpack-python 0.6.1-py37hfd86e86_1 --> 0.6.1-py27hfd86e86_1
multipledispatch 0.6.0-py37_0 --> 0.6.0-py27_0
navigator-updater 0.2.1-py37_0 --> 0.2.1-py27_0
nbformat 4.4.0-py37_0 --> 4.4.0-py27_0
networkx 2.2-py37_1 --> 2.2-py27_1
nose 1.3.7-py37_2 --> 1.3.7-py27_2
notebook 5.7.8-py37_0 --> 5.7.8-py27_0
numexpr 2.6.9-py37h9e4a6bb_0 --> 2.6.9-py27h9e4a6bb_0
olefile 0.46-py37_0 --> 0.46-py27_0
packaging 19.0-py37_0 --> 19.0-py27_0
pandas 0.24.2-py37he6710b0_0 --> 0.24.2-py27he6710b0_0
pandocfilters 1.4.2-py37_1 --> 1.4.2-py27_1
path.py 11.5.0-py37_0 --> 11.1.0-py27_0
patsy 0.5.1-py37_0 --> 0.5.1-py27_0
pep8 1.7.1-py37_0 --> 1.7.1-py27_0
pickleshare 0.7.5-py37_0 --> 0.7.5-py27_0
pkginfo 1.5.0.1-py37_0 --> 1.5.0.1-py27_0
ply 3.11-py37_0 --> 3.11-py27_0
prompt_toolkit 2.0.9-py37_0 --> 1.0.15-py27_0
ptyprocess 0.6.0-py37_0 --> 0.6.0-py27_0
py 1.8.0-py37_0 --> 1.8.0-py27_0
py-lief 0.9.0-py37h7725739_2 --> 0.9.0-py27h7725739_2
pycodestyle 2.5.0-py37_0 --> 2.5.0-py27_0
pycosat 0.6.3-py37h14c3975_0 --> 0.6.3-py27h14c3975_0
pycparser 2.19-py37_0 --> 2.19-py27_0
pycrypto 2.6.1-py37h14c3975_9 --> 2.6.1-py27h14c3975_9
pycurl 7.43.0.2-py37h1ba5d50_0 --> 7.43.0.2-py27h1ba5d50_0
pyflakes 2.1.1-py37_0 --> 2.1.1-py27_0
pylint 2.3.1-py37_0 --> 1.9.2-py27_0
pyodbc 4.0.26-py37he6710b0_0 --> 4.0.26-py27he6710b0_0
pyopenssl 19.0.0-py37_0 --> 19.0.0-py27_0
pyqt 5.9.2-py37h05f1152_2 --> 5.9.2-py27h05f1152_2
pyrsistent 0.14.11-py37h7b6447c_0 --> 0.14.11-py27h7b6447c_0
pytables 3.5.1-py37h71ec239_0 --> 3.5.1-py27h71ec239_0
pytest-arraydiff 0.3-py37h39e3cac_0 --> 0.3-py27h39e3cac_0
pytest-astropy 0.5.0-py37_0 --> 0.5.0-py27_0
pytest-doctestplus 0.3.0-py37_0 --> 0.3.0-py27_0
pytest-openfiles 0.3.2-py37_0 --> 0.3.2-py27_0
pytest-remotedata 0.3.1-py37_0 --> 0.3.1-py27_0
python 3.7.3-h0371630_0 --> 2.7.16-h9bab390_0
python-dateutil 2.8.0-py37_0 --> 2.8.0-py27_0
pyzmq 18.0.0-py37he6710b0_0 --> 18.0.0-py27he6710b0_0
qtawesome 0.5.7-py37_1 --> 0.5.7-py27_1
ruamel_yaml 0.15.46-py37h14c3975_0 --> 0.15.46-py27h14c3975_0
scikit-image 0.14.2-py37he6710b0_0 --> 0.14.2-py27he6710b0_0
scikit-learn 0.20.3-py37hd81dba3_0 --> 0.20.3-py27hd81dba3_0
scipy 1.2.1-py37h7c811a0_0 --> 1.2.1-py27h7c811a0_0
seaborn 0.9.0-py37_0 --> 0.9.0-py27_0
send2trash 1.5.0-py37_0 --> 1.5.0-py27_0
simplegeneric 0.8.1-py37_2 --> 0.8.1-py27_2
singledispatch 3.4.0.3-py37_0 --> 3.4.0.3-py27_0
sip 4.19.8-py37hf484d3e_0 --> 4.19.8-py27hf484d3e_0
six 1.12.0-py37_0 --> 1.12.0-py27_0
sortedcollections 1.1.2-py37_0 --> 1.1.2-py27_0
sortedcontainers 2.1.0-py37_0 --> 2.1.0-py27_0
sphinx 1.8.5-py37_0 --> 1.8.5-py27_0
sphinxcontrib 1.0-py37_1 --> 1.0-py27_1
testpath 0.4.2-py37_0 --> 0.4.2-py27_0
tornado 6.0.2-py37h7b6447c_0 --> 5.1.1-py27h7b6447c_0
traitlets 4.3.2-py37_0 --> 4.3.2-py27_0
unicodecsv 0.14.1-py37_0 --> 0.14.1-py27_0
wcwidth 0.1.7-py37_0 --> 0.1.7-py27_0
webencodings 0.5.1-py37_1 --> 0.5.1-py27_1
wurlitzer 1.0.2-py37_0 --> 1.0.2-py27_0
xlrd 1.2.0-py37_0 --> 1.2.0-py27_0
xlwt 1.3.0-py37_0 --> 1.3.0-py27_0

Proceed ([y]/n)?

JovanVeljanoski commented 5 years ago

Hi,

Can you please try installing from conda-forge?

conda install -c conda-forge vaex

Ideally from a clean environment.

Cheers, Jovan.

fprada commented 5 years ago

Thanks! Following your suggestions, it now seems to work. However, I get this error when reading my HDF5 table:

In [15]: ds = vaex.open('/home/users/dae/ishiyama/Uchuu/Rockstar/007/out_7.rockstar.0.hdf5')
ERROR:MainThread:vaex:error opening '/home/users/dae/ishiyama/Uchuu/Rockstar/007/out_7.rockstar.0.hdf5'

ValueError                                Traceback (most recent call last)
----> 1 ds = vaex.open('/home/users/dae/ishiyama/Uchuu/Rockstar/007/out_7.rockstar.0.hdf5')

~/.conda/envs/vaexenv/lib/python3.7/site-packages/vaex/__init__.py in open(path, convert, shuffle, copy_index, *args, **kwargs)
    189         ds = from_csv(path, copy_index=copy_index, **kwargs)
    190     else:
--> 191         ds = vaex.file.open(path, *args, **kwargs)
    192     if convert and ds:
    193         ds.export_hdf5(filename_hdf5, shuffle=shuffle)

~/.conda/envs/vaexenv/lib/python3.7/site-packages/vaex/file/__init__.py in open(path, *args, **kwargs)
     39         break
     40     if dataset_class:
---> 41         dataset = dataset_class(path, *args, **kwargs)
     42         return dataset
     43

~/.conda/envs/vaexenv/lib/python3.7/site-packages/vaex/hdf5/dataset.py in __init__(self, filename, write)
     84     self.h5table_root_name = None
     85     self._version = 1
---> 86     self._load()
     87
     88 def write_meta(self):

~/.conda/envs/vaexenv/lib/python3.7/site-packages/vaex/hdf5/dataset.py in _load(self)
    186     if len(root_datasets):
    187         # if we have datasets at the root, we assume 'version 1'
--> 188         self._load_columns(self.h5file)
    189         self.h5table_root_name = "/"
    190

~/.conda/envs/vaexenv/lib/python3.7/site-packages/vaex/hdf5/dataset.py in _load_columns(self, h5data, first)
    341             self.add_column(column_name, self._map_hdf5_array(data, column['mask']))
    342         else:
--> 343             self.add_column(column_name, self._map_hdf5_array(data))
    344     else:
    345         transposed = shape[1] < shape[0]

~/.conda/envs/vaexenv/lib/python3.7/site-packages/vaex/dataframe.py in add_column(self, name, f_or_array, dtype)
   2757     if len(self) == len(ar):
   2758         raise ValueError("Array is of length %s, while the length of the DataFrame is %s due to the filtering, the (unfiltered) length is %s." % (len(ar), len(self), self.length_unfiltered()))
-> 2759     raise ValueError("array is of length %s, while the length of the DataFrame is %s" % (len(ar), self.length_original()))
   2760 # assert self.length_unfiltered() == len(data), "columns should be of equal length, length should be %d, while it is %d" % ( self.length_unfiltered(), len(data))
   2761 self.columns[name] = f_or_array

ValueError: array is of length 16, while the length of the DataFrame is 69882354

In [16]:

JovanVeljanoski commented 5 years ago

Hi,

How did you create the HDF5 file? The data can be stored in multiple ways inside an HDF5 file.
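One way to see how the data are laid out is to list every dataset in the file together with its shape. The sketch below assumes h5py is installed; `list_datasets` is a hypothetical helper name, not a vaex function. Root-level datasets of different lengths would be consistent with the "array is of length 16" error, though that is only a guess without seeing the file.

```python
import h5py


def list_datasets(path):
    """Return (name, shape, dtype) for every dataset in an HDF5 file."""
    found = []

    def visit(name, obj):
        # visititems walks groups and datasets alike; keep only datasets.
        if isinstance(obj, h5py.Dataset):
            found.append((name, obj.shape, str(obj.dtype)))

    with h5py.File(path, "r") as h5file:
        h5file.visititems(visit)
    return found


# Example use (the path is a placeholder):
# for name, shape, dtype in list_datasets("halos.hdf5"):
#     print(name, shape, dtype)
```

If most datasets share one length and a few short ones sit next to them, those short arrays are probably metadata rather than columns.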

fprada commented 5 years ago

Hi,

The HDF5 files were created by our group with our own C-based code, starting from a huge ASCII file that is split and then converted into several HDF5 files. I guess we are not following a standard format? You can take a look at the code here: https://bitbucket.org/cnvega/rockstar_outputs/src/default/

Your help is very much welcome! Thanks.

JovanVeljanoski commented 5 years ago

Easiest / fastest way would probably be to use vaex to read the ascii file and output a single (or multiple) hdf5 files. Then you are guaranteed compatibility. Maybe I could help with this, if you send me a couple of lines from that ascii file?

In general, you can use vaex.read_csv, which uses pandas.read_csv to read a text file into memory. It does not have to be a csv file: it can be any ascii table, you can define the delimiter, and I think it does have support for standard ascii files.

I hope this helps.

fprada commented 5 years ago

Thank you! Indeed starting from the ascii certainly would be the best. It'd be great if you can help with this. Please find below the first lines of the ascii (halo catalog) which includes the header and data for 4 halos:

ID DescID Mvir Vmax Vrms Rvir Rs Np X Y Z VX VY VZ JX JY JZ Spin rs_klypin Mvir_all M200b M200c M500c M2500c Xoff Voff spin_bullock b_to_a c_to_a A[x] A[y] A[z] b_to_a(500c) c_to_a(500c) Ax Ay Az T/|U| M_pe_Behroozi M_pe_Diemer Halfmass_Radius rvmax PID

a = 0.537760

Om = 0.308900; Ol = 0.691100; h = 0.677400

FOF linking length: 0.280000

Unbound Threshold: 0.500000; FOF Refinement Threshold: 0.700000

Particle mass: 3.27018e+08 Msun/h

Box size: 2000.000000 Mpc/h

Force resolution assumed: 0.00427 Mpc/h

Units: Masses in Msun / h

Units: Positions in Mpc / h (comoving)

Units: Velocities in km / s (physical, peculiar)

Units: Halo Distances, Lengths, and Radii in kpc / h (comoving)

Units: Angular Momenta in (Msun/h) (Mpc/h) km/s (physical)

Units: Spins are dimensionless

Np is an internal debugging quantity.

Rockstar Version: 0.99.9-RC3+

11721 -1 3.270e+09 28.59 31.33 35.278 7.694 45 1.01573 0.49961 0.27360 9.30 -121.61 -692.93 2.487e+08 -1.555e+08 -8.470e+07 0.06865 7.69378 3.9242e+09 3.2702e+09 3.2702e+09 0.0000e+00 0.0000e+00 5.81215 11.02 0.12781 0.11185 0.04569 -6.87745 -1.79801 20.51143 0.00000 0.00000 0.00000 0.00000 0.00000 1.2660 3.083e+09 2.943e+09 22.007 28.796 11723

93688 -1 4.251e+09 30.67 38.40 38.502 9.779 39 3.55980 3.13285 1.20113 -109.68 -70.84 -368.92 3.505e+08 3.955e+08 -1.963e+08 0.11550 9.77894 4.5783e+09 4.2512e+09 2.9432e+09 0.0000e+00 0.0000e+00 13.06878 9.52 0.15239 0.09049 0.00000 1.25496 12.71236 13.98056 0.00000 0.00000 0.00000 0.00000 0.00000 1.5498 5.686e+09 4.251e+09 28.150 36.152 -1

11722 -1 9.810e+08 19.56 5.01 23.616 4.395 50 1.16682 0.36458 0.13502 48.84 -281.26 -530.28 2.162e+08 -9.036e+07 -8.889e+07 1.50052 4.39534 1.3081e+09 9.8105e+08 9.8105e+08 0.0000e+00 0.0000e+00 16.63480 44.30 0.78025 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 14.2525 9.997e+08 3.270e+08 18.805 20.511 11723

62160 -1 3.009e+10 61.51 55.45 73.920 13.338 106 2.04250 0.29428 4.37749 -116.57 -18.30 -463.31 4.077e+08 2.614e+09 -1.089e+09 0.03017 13.33807 3.0086e+10 3.0413e+10 2.7470e+10 1.9948e+10 7.8484e+09 6.86375 0.00 0.02965 0.74100 0.59323 0.55459 8.36537 3.35550 0.63043 0.55334 0.83416 7.19021 -0.82567 0.4740 2.806e+10 2.485e+10 34.376 52.545 -1

JovanVeljanoski commented 5 years ago

Ah, this is from ROCKSTAR, the clustering algorithm, right? It should be straightforward to read in the data then.

All you need to do is this

import vaex
names = ['ID', 'DescID', 'Mvir', 'Vmax', 'Vrms', 'Rvir', 'Rs', 'Np', 'X', 'Y', 'Z', 'VX', 'VY', 'VZ', 'JX', 'JY', 'JZ', 'Spin', 'rs_klypin', 'Mvir_all', 'M200b', 'M200c', 'M500c', 'M2500c', 'Xoff', 'Voff', 'spin_bullock', 'b_to_a', 'c_to_a', 'A[x]', 'A[y]', 'A[z]', 'b_to_a(500c)', 'c_to_a(500c)', 'Ax', 'Ay', 'Az', 'T/|U|', 'M_pe_Behroozi', 'M_pe_Diemer', 'Halfmass_Radius', 'rvmax', 'PID']

ds = vaex.read_csv(filepath_or_buffer='data.txt', delim_whitespace=True, comment='#', header=None, names=names, copy_index=False)

where data.txt is just the data you sent above copied to a plain text file, and names is a list with the names of each column, which I took from the data you sent.

Alternatively, you can set the header to be inferred. That requires the top non-comment line of the file to contain all column names. You can either edit the file to achieve this, or perhaps adjust the output of ROCKSTAR such that the header is a bit different.
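If editing the file is not an option, the column names could also be pulled out of the header line and passed as names, instead of typing them by hand. A minimal sketch, assuming the header is the first line of the file and starts with '#' (as the comment='#' argument above implies); read_header_names is a hypothetical helper, not part of vaex or pandas:

```python
import io


def read_header_names(lines, comment="#"):
    """Return the column names found on the first comment line of a catalog."""
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(comment):
            # Drop the leading comment marker, then split on whitespace.
            return stripped.lstrip(comment).split()
        if stripped:
            break  # reached a data row before any header line
    raise ValueError("no header line found")


# Tiny stand-in for the real catalog (assumed layout, shortened to 4 columns).
sample = io.StringIO(
    "#ID DescID Mvir T/|U|\n"
    "#a = 0.537760\n"
    "11721 -1 3.270e+09 1.2660\n"
)
names = read_header_names(sample)
print(names)
```

The resulting list can then be passed as names=names to vaex.read_csv, exactly as in the snippet above.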

Hope this helps. Please let me know if this works.

fprada commented 5 years ago

That's right! We are using ROCKSTAR to create the halo catalogs. In this case it is for a new two-trillion N-body simulation! So, ROCKSTAR provides an ASCII file for each time epoch. The ASCII file is huge, we have more than 4 billion halos! This is why we converted the ASCII file to hdf5, and also split it to help with the file transfer.

OK. Good. Let me then follow your advice and use vaex.read_csv ...

Thank you!

fprada commented 5 years ago

Hi Jovan,

I forgot to ask. Once we read the ASCII file in vaex how can we convert it into several hdf5 files?

Thanks!

JovanVeljanoski commented 5 years ago

Once you read everything in:

df.export_hdf5('/somewhere/on/disk/file.hdf5', progress=True)

You may want to read through https://vaex.readthedocs.io/en/latest/tutorial.html just to maximize the value from using vaex.
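The call above writes one file. To get several hdf5 files instead, one option is to slice the DataFrame by row range and export each slice. A rough sketch under that assumption; row_ranges and export_in_parts are hypothetical helper names, and the only vaex features used are len(df), slicing with df[start:stop], and export_hdf5:

```python
def row_ranges(total_rows, n_parts):
    """Split [0, total_rows) into n_parts contiguous, near-equal (start, stop) ranges."""
    base, extra = divmod(total_rows, n_parts)
    ranges, start = [], 0
    for i in range(n_parts):
        # The first `extra` parts get one additional row each.
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def export_in_parts(df, prefix, n_parts):
    """Export a vaex DataFrame as n_parts hdf5 files and return their paths."""
    paths = []
    for i, (start, stop) in enumerate(row_ranges(len(df), n_parts)):
        path = f"{prefix}_{i:03d}.hdf5"
        df[start:stop].export_hdf5(path)
        paths.append(path)
    return paths


print(row_ranges(10, 3))
```

Calling export_in_parts(df, '/somewhere/on/disk/file', 8) would then write file_000.hdf5 through file_007.hdf5.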

Cheers

fprada commented 5 years ago

Got it, thanks! Let me work on it. Keep in touch. Best

fprada commented 5 years ago

Hi,

vaex read the ASCII file well and it worked fine, great!

When I try to create an hdf5 version, following df.export_hdf5('/somewhere/on/disk/file.hdf5', progress=True), I get this error:


OSError Traceback (most recent call last)

in
----> 1 ds.export_hdf5("test.hdf5", progress=True)

~/.conda/envs/vaexenv/lib/python3.7/site-packages/vaex/dataframe.py in export_hdf5(self, path, column_names, byteorder, shuffle, selection, progress, virtual, sort, ascending)
   5066 """
   5067 import vaex.export
-> 5068 vaex.export.export_hdf5(self, path, column_names, byteorder, shuffle, selection, progress=progress, virtual=virtual, sort=sort, ascending=ascending)
   5069
   5070 def export_fits(self, path, column_names=None, shuffle=False, selection=False, progress=None, virtual=False, sort=None, ascending=True):

~/.conda/envs/vaexenv/lib/python3.7/site-packages/vaex/export.py in export_hdf5(dataset, path, column_names, byteorder, shuffle, selection, progress, virtual, sort, ascending)
    340 kwargs = locals()
    341 import vaex.hdf5.export
--> 342 vaex.hdf5.export.export_hdf5(**kwargs)
    343
    344

~/.conda/envs/vaexenv/lib/python3.7/site-packages/vaex/hdf5/export.py in export_hdf5(dataset, path, column_names, byteorder, shuffle, selection, progress, virtual, sort, ascending)
    124 selection = "default"
    125 # first open file using h5py api
--> 126 with h5py.File(path, "w") as h5file_output:
    127
    128 h5table_output = h5file_output.require_group("/table")

~/.conda/envs/vaexenv/lib/python3.7/site-packages/h5py/_hl/files.py in __init__(self, name, mode, driver, libver, userblock_size, swmr, rdcc_nslots, rdcc_nbytes, rdcc_w0, track_order, **kwds)
    392 fid = make_fid(name, mode, userblock_size,
    393 fapl, fcpl=make_fcpl(track_order=track_order),
--> 394 swmr=swmr)
    395
    396 if swmr_support:

~/.conda/envs/vaexenv/lib/python3.7/site-packages/h5py/_hl/files.py in make_fid(name, mode, userblock_size, fapl, fcpl, swmr)
    174 fid = h5f.create(name, h5f.ACC_EXCL, fapl=fapl, fcpl=fcpl)
    175 elif mode == 'w':
--> 176 fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
    177 elif mode == 'a':
    178 # Open in append mode (read/write).

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/h5f.pyx in h5py.h5f.create()

OSError: Unable to create file (unable to truncate a file which is already open)
maartenbreddels commented 5 years ago

Odd, are you maybe writing to a file you already opened? Can you change the filename?

fprada commented 5 years ago

Yeah, I changed the filename and the error persists. Yet, I've noticed that the hdf5 file is created in the directory ...

JovanVeljanoski commented 5 years ago

Hi @fprada

Can you tell me which version of h5py you have installed in the same env as vaex?

Can you try writing to a different directory altogether? You can also try exporting to arrow or parquet format.

Also, if that does not work, can you give us the output of df.dtypes?

Cheers

JovanVeljanoski commented 5 years ago

(Oops, sorry, closed it by mistake.)

On the positive side, I think I figured it out. I think some of the column names are too exotic for h5py, in particular things like 'T/|U|' and potentially 'A[x]' and 'c_to_a(500c)'.

I suggest renaming the columns so that they contain only letters (lower or upper case), numbers, and underscores. Other characters such as [, (, \, /, ? etc. may raise issues. I am not sure if this is due to vaex or h5py at this point.

Please try using simpler column names, and then exporting.
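One way to do the renaming without editing the list by hand is a small regular-expression pass over the names. This is only a sketch: sanitize_column_name and sanitize_all are hypothetical helpers written for illustration, not functions from vaex or h5py.

```python
import re


def sanitize_column_name(name):
    """Keep letters, digits and underscores; replace anything else with '_'."""
    cleaned = re.sub(r"[^0-9A-Za-z_]+", "_", name).strip("_")
    # Avoid empty names and names starting with a digit.
    if not cleaned or cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned


def sanitize_all(names):
    """Sanitize a list of names, adding a numeric suffix to repeated results."""
    seen = {}
    result = []
    for name in names:
        cleaned = sanitize_column_name(name)
        count = seen.get(cleaned, 0)
        seen[cleaned] = count + 1
        result.append(cleaned if count == 0 else f"{cleaned}_{count}")
    return result


print(sanitize_all(["A[x]", "Ax", "c_to_a(500c)", "T/|U|"]))
# → ['A_x', 'Ax', 'c_to_a_500c', 'T_U']
```

The sanitized list can be passed as names= when reading the file, so the awkward names never reach the export step.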

Cheers.

maartenbreddels commented 5 years ago

@fprada With the file example you gave I could reproduce your issue, but there were actually two issues.

The first time you run it, you see a different stacktrace than the second time (that confused me!).

The first time it got confused by 'T/|U|', which h5py interprets as a group 'T' with a dataset '|U|'. This should be fixed by #370 (I'll keep this open till it is released).

The second time it complains that the file is already open (which is the stacktrace you gave), I think we can improve that as well.

The workaround, for now, is what Jovan suggested:

import vaex
names = ['ID', 'DescID', 'Mvir', 'Vmax', 'Vrms', 'Rvir', 'Rs', 'Np', 'X', 'Y', 'Z', 'VX', 'VY', 'VZ', 'JX', 'JY', 'JZ', 'Spin', 'rs_klypin', 'Mvir_all', 'M200b', 'M200c', 'M500c', 'M2500c', 'Xoff', 'Voff', 'spin_bullock', 'b_to_a', 'c_to_a', 'A[x]', 'A[y]', 'A[z]', 'b_to_a(500c)', 'c_to_a(500c)', 'Ax', 'Ay', 'Az', 'T/|U|', 'M_pe_Behroozi', 'M_pe_Diemer', 'Halfmass_Radius', 'rvmax', 'PID']
names = [vaex.utils.find_valid_name(k) for k in names]
df = vaex.read_csv('somefile.csv', delim_whitespace=True, comment='#', header=None, names=names, copy_index=False)
fprada commented 5 years ago

Excellent, it works! Thanks very much Maarten and Jovan for your help. Now it creates the hdf5 file, and when I read it with vaex everything looks fine. Great.

Now, reading a much bigger ascii file (230 GB) with vaex takes really long (still reading after 1.5 hrs; it is using all 128 GB of RAM while running on 1 CPU). Is there a way to speed up the reading? Why does it take all that RAM?

There are 661592956 rows in the original ascii file. Note that this is a file with only 1/8 of the entire Rockstar data, which contains about 5 billion rows for one redshift snapshot of the simulation :-)

Let me also mention that when I exported the previous ascii file to hdf5, I noticed that its size is about the same. Our hdf5 file created with C is about half the size. Is there a way in vaex to reduce the size (some compression?) when exporting hdf5? This is our main interest in having the data in hdf5 instead of ascii.

I should mention that our interest in vaex is to provide efficient manipulation and analysis of our data for the entire astronomical community. We do plan to have a first data release soon. Thanks again for all your support!

JovanVeljanoski commented 5 years ago

Hi @fprada

I am happy to hear that it works.

Perhaps it is best to open another issue for any follow-up questions, so as not to divert this thread too much, but I will offer some advice here.

To your first point: you are trying to read a 230 GB file, but you only have 128 GB of RAM, so that sets a limit on how much you can effectively read into memory at one time. Your computer is probably using swap space on disk as additional RAM, but this is much slower, and is best avoided if possible.

How to deal with this: we will eventually provide support for converting larger-than-memory text (csv, ascii) files to hdf5 out of the box, but we are busy working on other stuff right now, so this will perhaps happen in a month or two.

In the meantime you can do the following: familiarize yourself with pandas.read_csv, it is what vaex uses to read csv/ascii files. You will see that pandas.read_csv supports reading chunks of files, so read only as many lines as you can fit into the RAM of your machine. Export that to hdf5. Then do the same with next portion of your massive text file and so on. You can write a loop/iterator to do this for you. At the end you will end up with a bunch of hdf5 files, which you can open all together with vaex.open_many, and the result will be a single DataFrame, just as if you opened a single massive hdf5 file. If you prefer to store a single hdf5 file, you can now export this DataFrame into a single hdf5 file and remove the smaller hdf5 files. There may be (small?) performance benefits to working with a single hdf5 file, but it should not matter much.
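The loop described above could look roughly like this. It is a sketch, not a tested recipe for a 230 GB file: convert_in_chunks and chunk_path are hypothetical names, the chunk size is an arbitrary placeholder to be tuned to the available RAM, and it assumes a whitespace-separated file with '#' comment lines as discussed earlier in this thread.

```python
def chunk_path(prefix, index):
    """Name of the hdf5 file that holds chunk number `index`."""
    return f"{prefix}_{index:04d}.hdf5"


def convert_in_chunks(text_path, names, prefix, rows_per_chunk=10_000_000):
    """Convert a large text table to several hdf5 files, one chunk at a time."""
    # Imported here so chunk_path stays usable without pandas/vaex installed.
    import pandas as pd
    import vaex

    # With chunksize set, read_csv returns an iterator of pandas DataFrames.
    reader = pd.read_csv(text_path, sep=r"\s+", comment="#", header=None,
                         names=names, chunksize=rows_per_chunk)
    paths = []
    for index, chunk in enumerate(reader):
        df = vaex.from_pandas(chunk, copy_index=False)
        path = chunk_path(prefix, index)
        df.export_hdf5(path)
        paths.append(path)
    return paths


# Intended use (needs pandas and vaex, and the real catalog on disk):
#   paths = convert_in_chunks('data.txt', names, 'halos')
#   df = vaex.open_many(paths)
```

Only one chunk is held in memory at a time, so peak memory is set by rows_per_chunk rather than by the size of the text file.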

Once you have the data in hdf5, regardless of the size, you can work with the entire data, as vaex does memory mapping, so you are not actually reading the 200+ GB into memory all at one time, as you would with a typical csv or ascii file. This is why we are converting the data to hdf5 (or arrow, or parquet).
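To illustrate what memory mapping means in general (this is not vaex's actual implementation, just the underlying idea shown with the Python standard library): the file is mapped into the address space, and only the bytes that are actually touched get read from disk.

```python
import mmap
import os
import struct
import tempfile

# Write 1000 float64 values to a temporary binary file.
values = [float(i) for i in range(1000)]
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(struct.pack(f"<{len(values)}d", *values))
    path = handle.name

# Map the file and read a single value at a known byte offset,
# without loading the rest of the file into a Python object.
with open(path, "rb") as handle:
    with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        (value_at_500,) = struct.unpack_from("<d", mapped, 500 * 8)

os.remove(path)
print(value_at_500)  # → 500.0
```

A text file cannot be used this way, because the byte offset of row N is not known without parsing everything before it; a binary column layout makes the offset a simple multiplication.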

About the size of the hdf5: when the ascii data is read by python, it is stored as float64 data type in memory, and as such it is exported to the hdf5 file, which takes more space than the few decimal places you have in the raw ascii file. What you could do is for instance use float32 if you do not need the extra precision. This way the data file will be smaller.
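As a back-of-the-envelope check, here is the raw column size for the file mentioned above, assuming all 43 columns are stored as floating point and ignoring any hdf5 overhead (both are assumptions made for this estimate). It also round-trips one of the mass values through float32 to show how much precision is given up.

```python
import struct

ROWS = 661_592_956   # rows in the 230 GB ascii file mentioned above
COLUMNS = 43         # columns in the ROCKSTAR catalog header


def column_bytes(rows, columns, fmt):
    """Raw size in bytes of rows x columns values of the given struct format."""
    return rows * columns * struct.calcsize(fmt)


size_float64 = column_bytes(ROWS, COLUMNS, "<d")  # 8 bytes per value
size_float32 = column_bytes(ROWS, COLUMNS, "<f")  # 4 bytes per value
print(size_float64 / 1e9, "GB as float64")
print(size_float32 / 1e9, "GB as float32")

# Round-trip a sample value through float32 to see the precision loss.
original = 3.9242e+09
(as_float32,) = struct.unpack("<f", struct.pack("<f", original))
relative_error = abs(as_float32 - original) / original
print(relative_error)
```

Under these assumptions the float64 estimate comes out at roughly 228 GB, close to the size of the ascii file, and float32 halves it. Integer identifiers such as ID, DescID and PID are better kept as integers, since float32 only represents integers exactly up to 2^24.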

We would be very grateful if you cite/mention the use of this project if it helps you out :)

fprada commented 5 years ago

Hi,

FYI, after more than 2.5 hr the reading hasn't finished ... Still going.

Thanks.

fprada commented 5 years ago

Thanks Jovan,

That's why we split the original big ascii file into several smaller hdf5 files. We have done that with our own C code. But unfortunately vaex cannot recognise our format. This is likely because of the issue you pointed out with the names of the columns. If we can solve this, then the best would be to use vaex.open_many to read the many hdf5 files.

I will take a look at the pandas.read_csv ...

It'd be a pleasure to acknowledge vaex. Hopefully we will make use of it once we are able to make it work for our application ;-) It is an amazing tool! Congratulations.

Best.

maartenbreddels commented 5 years ago

But unfortunately vaex cannot recognise our format.

It might be possible to get it compatible, but I'm not sure what is more work now.

Thanks for your positive words, glad you find it helpful.

I'll close this issue, feel free to open new ones for new issues.

cheers,

Maarten

fprada commented 5 years ago

Thank you! Thanks Maarten and Jovan. I'll be back ;-) Cheers, Francisco.