Open tropicalpb opened 5 years ago
Are you already using dask? It will let you do all the computation out-of-core if necessary. That said, 140 million points shouldn't take 100 GB of peak memory. How are you loading the files, and are you loading columns you're not actually using? We generally recommend saving your data to Parquet files, which are much faster to load and support chunked reads. This way we routinely process 1-billion-point datasets on our 16 GB laptops.
We are already using dask. Is there any setting to enable out-of-core computation? We support custom filters on other columns, so reading them in is a must.
```python
fullpath = '/data/dir/*.csv'
frame = dd.read_csv(fullpath, dtype=dtype, parse_dates=datetimeop,
                    storage_options=None, assume_missing=True)
```
Dask works out of core by default, but will force everything to be in memory if there is a call to `df.persist()`. Most of our example notebooks use `.persist()` for speed, so if you want to work out of core, make sure you haven't copied that bit.
That said, working out of core with CSV files seems like a really bad idea -- you'd end up reprocessing the CSV file every single time you zoom or pan in any plot, so CSV-file processing will dwarf any computation actually done by Datashader. Converting to something more efficient like Parquet seems like a must unless you only ever want to do a single, static visualization per file.
Thanks, we changed to Parquet and memory usage came down significantly, but loading still only uses a couple of cores while most of the system's other 60 cores sit idle. We need to call persist() for interactive speed; is there any way to load in parallel?
How did you chunk the Parquet data? As long as the data is sufficiently chunked, I would expect the loading to happen in parallel.
The data is in Hadoop HDFS with a default block size of 512 MB. There are several hundred files, so even without further chunking, the number of files alone should allow parallel loading, but that doesn't appear to be the case.
We are dealing with a total of 140 million points from ~500 files. It is great that datashader can draw it out, but it also took 100 GB of peak memory and about 40 minutes on a couple of cores to get the data into a dataframe, while overall system CPU utilization was pretty sparse. Is there any way to both reduce the memory requirement and increase/parallelize the CPU usage?
Thanks