soiltechproject / fsm-docs

https://www.soiltechproject.org/

[Translation] CLHC for Horizontal Scaling #5

Open KipCrossing opened 4 years ago

KipCrossing commented 4 years ago

Problem:

Horizontal scaling involves adding nodes to process large datasets. A common method of processing these large datasets across many nodes is MapReduce, which the soiltech project currently uses.

The conditioned Latin hypercube sampling method, as outlined by Minasny and McBratney (2006), requires certain variables to be shared by all nodes. These shared variables include:

Ideally, for horizontally scaled MapReduce problems, nodes should not need to share anything while processing the data.

Solution:

The quality of the selected samples is determined by the objective function; hypothetically, this may be evaluated on any set of random samples of size N. Therefore, the entire dataset may be split into groups of N random samples, and the objective function can then be applied to each group. Further, new unique groups may be made by randomising the dataset again, so the process may be repeated to increase the probability of finding the 'best' sample set. The suggested method is as follows:

Note: this will be done using the methods provided by the Apache Spark library.

  1. Get quantile definitions by sorting
  2. Map a random number to each of the potential samples
  3. Sort the dataset based on the random number
  4. Reduce data into groups of N samples
  5. Calculate the objective function (OF) value for each group
  6. Get group with the best OF

Repeat steps (1) to (6) 10 times and choose the best sample group (more than 10 repetitions may be used to increase the odds).

This alternative method does not require any of the nodes to share variables. Further, there is no limit on the size of the dataset. Lastly, the entire dataset may be used to obtain the final group of samples.
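Steps (1) to (6) can be sketched on a single machine as below. This is a minimal illustration in plain Python, not the project's Spark code: the function names (`quantile_edges`, `objective`, `best_group`) are hypothetical, and the objective used here is an assumed, simplified version that only counts how many samples fall in each quantile stratum of each continuous variable (lower is better, 0 means exactly one sample per stratum). The quantile edges are computed once rather than on every repeat, since they do not change between shuffles.

```python
import random
from bisect import bisect_right


def quantile_edges(values, n):
    """Step 1: sort one variable and return the n-1 interior cut points
    that split it into n strata of (roughly) equal count."""
    s = sorted(values)
    return [s[(i * len(s)) // n] for i in range(1, n)]


def objective(group, edges_per_var):
    """Assumed simplified objective: for every variable and every stratum,
    add |number of samples in the stratum - 1|."""
    n = len(group)
    total = 0
    for j, edges in enumerate(edges_per_var):
        counts = [0] * n
        for row in group:
            # bisect_right maps a value to its stratum index in 0..n-1
            counts[bisect_right(edges, row[j])] += 1
        total += sum(abs(c - 1) for c in counts)
    return total


def best_group(data, n, repeats=10, seed=0):
    """Run steps 2-6 `repeats` times and keep the lowest-scoring group."""
    rng = random.Random(seed)
    n_vars = len(data[0])
    edges = [quantile_edges([row[j] for row in data], n) for j in range(n_vars)]
    best, best_of = None, float("inf")
    for _ in range(repeats):
        keyed = [(rng.random(), row) for row in data]   # step 2: attach random key
        keyed.sort(key=lambda kr: kr[0])                # step 3: sort by that key
        rows = [row for _, row in keyed]
        for start in range(0, len(rows) - n + 1, n):    # step 4: groups of n rows
            group = rows[start:start + n]
            score = objective(group, edges)             # step 5: score the group
            if score < best_of:                         # step 6: keep the best
                best, best_of = group, score
    return best, best_of
```

For example, `best_group(rows, n=5)` on a list of equal-length numeric tuples returns the chosen five rows together with their score. Each group is scored using only its own rows and the precomputed quantile edges, which is the property that lets the scoring step run independently on each node.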

KipCrossing commented 4 years ago

Comment below if you would like to view the code.

KipCrossing commented 4 years ago

Using the original method:

[Image: Quantiles_old2]

Using the new method:

[Image: Quantiles]