valiantljk / h5spark

Supporting Hierarchical Data Format and Rich Parallel I/O Interface in Spark

Load Balancer in h5spark #7

Open valiantljk opened 8 years ago

valiantljk commented 8 years ago

When loading multiple files, the file sizes can follow a long-tailed distribution (see the figure) or an even one. In the even case we don't need to balance the load, but for long-tailed or otherwise skewed distributions we do need to design a proper load balancer, which involves at least two major steps:

  1. Profile the file sizes and represent the distribution in an RDD
  2. Call H5Spark's load balancer option when performing the h5read

The load balancer can be size-oriented or metadata-oriented. Currently we want to implement the disk-size-oriented load balancer, in which each executor gets roughly the same amount of data from disk. In the future we may consider a locality-based or metadata-oriented load balancer.
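A minimal sketch of the profiling step in plain Python (the function name `profile_file_sizes` is illustrative, not part of h5spark's API; in Spark the resulting pairs would then be parallelized into an RDD, e.g. via `sc.parallelize`):

```python
import os
import tempfile

def profile_file_sizes(paths):
    """Stat each file once and return (path, size_in_bytes) pairs.

    These pairs are the input a size-oriented balancer needs before
    deciding how to split files across executors.
    """
    return [(p, os.path.getsize(p)) for p in paths]

# Demo with throwaway files of known, skewed sizes.
with tempfile.TemporaryDirectory() as d:
    paths = []
    for i, size in enumerate([1024, 64, 8]):
        p = os.path.join(d, "part%d.h5" % i)
        with open(p, "wb") as f:
            f.write(b"\0" * size)
        paths.append(p)
    profile = profile_file_sizes(paths)
    print([s for _, s in profile])  # [1024, 64, 8]
```

Profiling is cheap relative to the reads themselves, since it only touches metadata (one `stat` per file), so it can run on the driver before the h5read is issued.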

Take the picture below as a motivating real case.

[figure: dayabay-muon-pre1 — file-size distribution]

eracah commented 8 years ago

Yeah, interesting. I think we could do something like sort the file sizes and then progressively assign the next-biggest file to the partition with the smallest current load. It's sort of related to the LPT (Longest Processing Time) algorithm, but keyed on disk size instead of time. That might work for this case, but if there are only a few REALLY big files then we might want a mixture of multi-file reading and single-file chunked reading.
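The greedy assignment described above can be sketched with a min-heap of partition loads (plain Python, names illustrative; this is the generic LPT heuristic, not h5spark code):

```python
import heapq

def lpt_assign(file_sizes, num_partitions):
    """Greedy LPT-style balancing: take files largest-first and always
    give the next file to the currently lightest partition.

    Returns (assignments, loads): the per-partition lists of file sizes
    and the total bytes assigned to each partition.
    """
    # Heap entries are (bytes_assigned_so_far, partition_id).
    heap = [(0, p) for p in range(num_partitions)]
    heapq.heapify(heap)
    assignments = [[] for _ in range(num_partitions)]
    for size in sorted(file_sizes, reverse=True):
        load, p = heapq.heappop(heap)
        assignments[p].append(size)
        heapq.heappush(heap, (load + size, p))
    loads = [sum(a) for a in assignments]
    return assignments, loads

# Long-tailed example: one big file plus several smaller ones.
assignments, loads = lpt_assign([7, 5, 4, 3, 2], 2)
print(loads)  # [10, 11]
```

LPT guarantees the heaviest partition is within 4/3 of the optimum for this kind of makespan problem, which is why a single dominant file breaks it: no assignment of whole files can balance the load, and that's where splitting the big file into chunked single-file reads would kick in.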

valiantljk commented 8 years ago

I like the idea of mixing multi-file and single-file reading. That seems to target a more complex file-size distribution pattern.