vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Fresh install of vaex cannot open files written to disk #2283

Open Ben-Epstein opened 1 year ago

Ben-Epstein commented 1 year ago

Description

In a fresh notebook without vaex installed, run the following:

!pip install vaex-core==4.15.0 vaex-hdf5==0.12.3

import vaex
import numpy as np

df = vaex.from_arrays(id=list(range(100_000)), emb=np.random.rand(100_000, 768))
df.export('file.hdf5')
df.export('file1.hdf5')

vaex.open("file*.hdf5")

The vaex.open call fails. The output was attached to the issue as a screenshot:

[image]

Ben-Epstein commented 1 year ago

I can work around this by adding the following to the top of the file:

import vaex
import vaex.hdf5.dataset
import vaex.arrow.opener

vaex.dataset.opener_classes = [
    vaex.hdf5.dataset.Hdf5MemoryMapped,
    vaex.hdf5.dataset.AmuseHdf5MemoryMapped,
    vaex.hdf5.dataset.Hdf5MemoryMappedGadget,
    vaex.arrow.opener.ArrowOpener,
    vaex.arrow.opener.FeatherOpener,
    vaex.arrow.opener.ParquetOpener,
]

vaex.open("file*.hdf5")

This seems to be an issue with registering the opener classes: it looks like none are being found. I added debug logging, and the "trying opener" line is never reached.
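For anyone retracing this step, here is a minimal sketch of turning on debug logging in a notebook with the standard library. It assumes vaex emits its messages through loggers named under "vaex", which this thread does not confirm.

```python
import logging

# Attach a stderr handler to the root logger. force=True (Python 3.8+)
# replaces any handlers the notebook environment installed earlier.
logging.basicConfig(
    level=logging.WARNING,
    format="%(name)s %(levelname)s %(message)s",
    force=True,
)

# Assumption: vaex loggers live under the "vaex" namespace, so lowering
# this one level also covers children such as "vaex.dataset".
vaex_logger = logging.getLogger("vaex")
vaex_logger.setLevel(logging.DEBUG)
```

Keeping the root logger at WARNING while lowering only the "vaex" logger avoids flooding the notebook with debug output from unrelated libraries.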

Ben-Epstein commented 1 year ago

Interestingly, while the workaround above works, I get a new error when actually using the dataframe:

import vaex
import numpy as np

df = vaex.from_arrays(id=list(range(100_000)), emb=np.random.rand(100_000, 768))
df.export('file.hdf5')
df.export('file1.hdf5')

df = vaex.open("file*.hdf5")
df["id"].sum()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[7], line 9
      6 df.export('file1.hdf5')
      8 df = vaex.open("file*.hdf5")
----> 9 df["id"].sum()

File ~.venv/lib/python3.8/site-packages/vaex/expression.py:923, in Expression.sum(self, axis, binby, limits, shape, selection, delay, progress)
    921     del kwargs['dtype']
    922     kwargs['expression'] = expression.expression
--> 923     return self.ds.sum(**kwargs)
    924 else:
    925     return expression

File ~.venv/lib/python3.8/site-packages/vaex/dataframe.py:1130, in DataFrame.sum(self, expression, binby, limits, shape, selection, delay, progress, edges, array_type)
   1107 @docsubst
   1108 @stat_1d
   1109 def sum(self, expression, binby=[], limits=None, shape=default_shape, selection=False, delay=False, progress=None, edges=False, array_type=None):
   1110     """Calculate the sum for the given expression, possible on a grid defined by binby
   1111 
   1112     Example:
   (...)
   1128     :return: {return_stat_scalar}
   1129     """
-> 1130     return self._compute_agg('sum', expression, binby, limits, shape, selection, delay, edges, progress, array_type=array_type)
   1131     @delayed
   1132     def finish(*sums):
   1133         return vaex.utils.unlistify(waslist, sums)

File ~venv/lib/python3.8/site-packages/vaex/dataframe.py:941, in DataFrame._compute_agg(self, name, expression, binby, limits, shape, selection, delay, edges, progress, extra_expressions, array_type)
    939 stats = [compute(expression, binners, selection=selection, edges=edges) for expression in expressions]
    940 var = finish(binners, *stats)
--> 941 return self._delay(delay, progressbar.exit_on(var))

File ~.venv/lib/python3.8/site-packages/vaex/dataframe.py:1780, in DataFrame._delay(self, delay, task, progressbar)
   1778     return task
   1779 else:
-> 1780     self.execute()
   1781     return task.get()

File ~.venv/lib/python3.8/site-packages/vaex/dataframe.py:421, in DataFrame.execute(self)
    419         print(repr(task))
    420 if self.executor.tasks:
--> 421     self.executor.execute()

File ~.venv/lib/python3.8/site-packages/vaex/execution.py:308, in ExecutorLocal.execute(self)
    307 def execute(self):
--> 308     for _ in self.execute_generator():
    309         pass

File ~.venv/lib/python3.8/site-packages/vaex/execution.py:378, in ExecutorLocal.execute_generator(self, use_async)
    376 run.nthreads = nthreads = self.thread_pool.nthreads
    377 task_checkers = vaex.tasks.create_checkers()
--> 378 memory_tracker = vaex.memory.create_tracker()
    379 vaex.memory.local.agg = memory_tracker
    380 # we track this for consistency

File ~.venv/lib/python3.8/site-packages/vaex/memory.py:37, in create_tracker()
     35 if cls is not None:
     36     return cls()
---> 37 raise ValueError(f"No memory tracker found with name {memory_tracker_type}")

ValueError: No memory tracker found with name default

@maartenbreddels any idea what's going on here, and whether it's a quick fix?

Ben-Epstein commented 1 year ago

I think this is probably related to https://github.com/vaexio/vaex/issues/2282

franz101 commented 1 year ago

After reproducing the error, it seems to me to be related to the difference between pip install vaex and pip install vaex-core, or to some other packages.

franz101 commented 1 year ago

Maybe related: https://medium.com/google-colab/colab-updated-to-python-3-8-4922f9970a72

franz101 commented 1 year ago

For reproduction: https://colab.research.google.com/drive/1EG9898VtmO19FwfZKd_LJzqkz_YVWwlE?usp=sharing

A diff of the pip freeze output shows the packages that are additionally installed:

aplus==0.11.0
blake3==0.3.1
commonmark==0.9.1
frozendict==2.3.4
nest-asyncio==1.5.6
rich==12.6.0

franz101 commented 1 year ago

Found the issue: in dataset.py, line 57, the loop

for entry in pkg_resources.iter_entry_points(group='vaex.dataset.opener')

finds no entries unless you restart the kernel.
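A quick way to check what is registered in the running kernel is to query the entry points directly. The sketch below uses importlib.metadata from the standard library rather than pkg_resources. The helper name list_entry_points is hypothetical, and only the group name 'vaex.dataset.opener' comes from this thread.

```python
from importlib import metadata


def list_entry_points(group):
    """Return the entry points registered under `group` as a list."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry_points() returns a selectable collection.
        return list(eps.select(group=group))
    # Python 3.8/3.9: entry_points() returns a dict keyed by group name.
    return list(eps.get(group, []))


openers = list_entry_points("vaex.dataset.opener")
print(f"{len(openers)} opener entry point(s) found")
for ep in openers:
    # ep.value is the "module:attribute" string the entry point refers to.
    print(ep.name, "->", ep.value)
```

If this prints zero entries right after the pip install but a non-zero count after a kernel restart, that would match the behaviour described above.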

franz101 commented 1 year ago

Here is a fix: https://git.smhi.se/climix/climix/-/merge_requests/165/diffs

JovanVeljanoski commented 1 year ago

@Ben-Epstein is that a bug in a specific version of the vaex-hdf5 package? What happens if you update to 0.13 or 0.14?

Ben-Epstein commented 1 year ago

@JovanVeljanoski I think it's because of the legacy importlib, since it happens for arrow as well. I think @franz101's fix is the correct one: https://github.com/vaexio/vaex/pull/2293

JovanVeljanoski commented 1 year ago

Pinging @maartenbreddels since he was working on something very similar recently.

franz101 commented 1 year ago

@JovanVeljanoski @maartenbreddels that issue was 100% related to pkg_resources not finding the lazy-loaded readers. With importlib it works now. The issue can be closed for now.

franz101 commented 1 year ago

Although, looking at the code, one would expect pkg_resources.iter_entry_points to return more than zero entries:

https://github.com/franz101/vaex-colab/blob/8f461d975ef6991fa7604e0e466b3552a8ad6dcf/packages/vaex-core/vaex/dataset.py#L68