valiantljk / h5spark

Supporting Hierarchical Data Format and Rich Parallel I/O Interface in Spark

parallelize along user specified dimension #11

Open valiantljk opened 8 years ago

valiantljk commented 8 years ago

Currently, H5Spark parallelizes I/O along the slowest-varying dimension, i.e., the dimension whose index changes least often in the on-disk layout. For example, for a 2D C array x[10][200], H5Spark partitions along the first dimension, so it can create at most 10 partitions, which also caps the degree of parallelism at 10.

To parallelize along an arbitrary user-specified dimension, the current code needs a small modification.
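
A minimal sketch of the idea, not the actual H5Spark code: the hypothetical helper `partition_along_axis` below splits a dataset of a given shape into per-partition hyperslabs (offset, count) along whichever axis the caller chooses. The function name, its parameters, and the even-split policy are assumptions made for illustration. Each resulting pair could in principle be handed to an HDF5 hyperslab read by one task.

```python
from typing import List, Tuple

# One hyperslab is described by a start offset and an element count per dimension.
Hyperslab = Tuple[Tuple[int, ...], Tuple[int, ...]]


def partition_along_axis(shape: Tuple[int, ...], axis: int,
                         num_partitions: int) -> List[Hyperslab]:
    """Split `shape` into contiguous blocks along `axis` (hypothetical helper).

    Every other dimension is taken in full. The number of blocks is capped
    at the extent of `axis`, because a block needs at least one index.
    """
    if not 0 <= axis < len(shape):
        raise ValueError("axis out of range for the given shape")
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")

    extent = shape[axis]
    if extent == 0:
        return []

    n = min(num_partitions, extent)
    base, extra = divmod(extent, n)  # the first `extra` blocks get one more index

    slabs: List[Hyperslab] = []
    start = 0
    for i in range(n):
        length = base + (1 if i < extra else 0)
        offset = tuple(start if d == axis else 0 for d in range(len(shape)))
        count = tuple(length if d == axis else shape[d] for d in range(len(shape)))
        slabs.append((offset, count))
        start += length
    return slabs


# The x[10][200] example: axis 0 is capped at 10 partitions, axis 1 is not.
rows = partition_along_axis((10, 200), axis=0, num_partitions=40)
cols = partition_along_axis((10, 200), axis=1, num_partitions=40)
print(len(rows), len(cols))
```

One trade-off to keep in mind: in a row-major (C) layout, a block taken along a faster-varying dimension is not contiguous on disk, so each task reads strided segments rather than one sequential range.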