Open tropicalpb opened 5 years ago
Are you already using dask? It will let you do all the computation out-of-core if necessary. That said, 140 million points shouldn't take 100 GB of peak memory. How are you loading the files, and are you loading columns you're not actually using? We generally recommend saving your data to Parquet files, which are much faster to load and support chunked reads. This way we routinely process 1-billion-point datasets on our 16 GB laptops.
We are already using dask. Is there any setting to enable out-of-core computation? We support custom filters on other columns, so reading them in is a must.
```python
fullpath = '/data/dir/*.csv'
frame = dd.read_csv(fullpath, dtype=dtype, parse_dates=datetimeop,
                    storage_options=None, assume_missing=True)
```
Dask works out of core by default, but will force everything to be in memory if there is a call to `df.persist()`. Most of our example notebooks use `.persist()` for speed, so if you want to work out of core, make sure you haven't copied that bit.
That said, working out of core with CSV files seems like a really bad idea -- you'd end up reprocessing the CSV file every single time you zoom or pan in any plot, so CSV-file processing will dwarf any computation actually done by Datashader. Converting to something more efficient like Parquet seems like a must unless you only ever want to do a single, static visualization per file.
Thanks, we changed to Parquet and memory usage came down significantly, but loading still only uses a couple of cores while most of the system's other 60 cores sit idle. We need to call persist() for interactive speed; is there any way to load in parallel?
How did you chunk the Parquet data? As long as the data is sufficiently chunked, I would expect the loading to happen in parallel.
The data is in Hadoop HDFS with a default block size of 512 MB. There are several hundred files, so even without further chunking, the number of files alone should allow parallel loading, but that doesn't appear to be the case.
We are dealing with a total of 140 million points from ~500 files. It is great that datashader can draw it out, but it also took 100 GB of peak memory and about 40 minutes on a couple of cores to get the data into a dataframe, while overall system CPU utilization was pretty sparse. Is there any way to both reduce the memory requirement and increase/parallelize the CPU usage?
Thanks