rajasekarv / vega

A new arguably faster implementation of Apache Spark from scratch in Rust
Apache License 2.0

Take sample #30

Closed iduartgomez closed 4 years ago

iduartgomez commented 4 years ago

Changelog

Comments

There is a little bit of unsafe in this PR (see utils/mod.rs:15). The algorithm, a pretty standard shuffle, is a pain to implement without unsafe, which would complicate it unnecessarily. The unsafe code does not leak and is well contained (and well tested), so it should be no problem: it just swaps two positions in place on a borrowed Vec, so there is no danger here.
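For readers without the diff at hand, here is a minimal sketch of the kind of in-place shuffle described above, assuming it is a Fisher-Yates shuffle whose swap goes through raw pointers. This is not the code from utils/mod.rs: `XorShift64` and `shuffle_in_place` are hypothetical names, and the toy generator only stands in for a real RNG so the example has no external dependencies.

```rust
use std::ptr;

/// Hypothetical toy xorshift64 generator, standing in for a real RNG crate.
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // xorshift must never start from an all-zero state.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Index in `0..bound`; the modulo is slightly biased, which is fine for a sketch.
    pub fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Fisher-Yates shuffle over a borrowed Vec, swapping two positions in place.
pub fn shuffle_in_place<T>(items: &mut Vec<T>, rng: &mut XorShift64) {
    let len = items.len();
    let base = items.as_mut_ptr();
    // Walk from the last index down to 1; an empty or one-element Vec is a no-op.
    for i in (1..len).rev() {
        let j = rng.below(i + 1);
        // SAFETY: i < len and j <= i, so both pointers stay inside the live
        // allocation, and `ptr::swap` explicitly allows the i == j overlap.
        unsafe {
            ptr::swap(base.add(i), base.add(j));
        }
    }
}
```

The unsafe block is confined to the single swap, and the bounds argument lives next to it. The safe `slice::swap` method performs the same exchange with bounds checks, for comparison.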

Another thing to mention is the implementation of RandomSampler. Due to the nature of the Rdds (which must be serializable, etc.), I couldn't embed the RNG struct in the concrete object as a field. The solution was instead to return a closure that takes the input of the Rdd iterator and applies any randomization and sampling per element. Functionally it is exactly the same as the Scala implementation, but it is more of a functional approach than an 'OO' one (in the Spark Scala implementation the RNG is owned by the sampler class).

This idiom will probably come in handy in the future when we are limited by the requirements of the Rdd structs and cannot directly use third-party structs.
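A rough sketch of the closure-returning idiom, under stated assumptions: `BernoulliSampler`, its fields, and the `sampler` method are hypothetical names, not the PR's actual RandomSampler API. The RNG is an inlined toy xorshift so the example runs without third-party crates. The struct holds only plain data (so it could be made serializable), while the RNG state is created inside the returned closure and owned by it.

```rust
/// Hypothetical sampler config: plain data only, no RNG field.
#[derive(Clone, Debug)]
pub struct BernoulliSampler {
    pub fraction: f64,
    pub seed: u64,
}

impl BernoulliSampler {
    /// Returns a closure that owns its own RNG state and samples the
    /// elements of whatever iterator it is handed.
    pub fn sampler<T: 'static>(&self) -> impl FnMut(Box<dyn Iterator<Item = T>>) -> Vec<T> {
        let fraction = self.fraction;
        // Toy xorshift64 state (assumption: stands in for a real RNG);
        // it must not be zero.
        let mut state: u64 = if self.seed == 0 {
            0x9E37_79B9_7F4A_7C15
        } else {
            self.seed
        };
        move |items: Box<dyn Iterator<Item = T>>| -> Vec<T> {
            items
                .filter(|_| {
                    state ^= state << 13;
                    state ^= state >> 7;
                    state ^= state << 17;
                    // Map the top 53 bits to a float in [0, 1).
                    let u = (state >> 11) as f64 / (1u64 << 53) as f64;
                    // Keep each element independently with probability `fraction`.
                    u < fraction
                })
                .collect()
        }
    }
}
```

Calling `sampler` twice on the same config yields two closures that start from the same seed, so they select the same elements from identical input. The config itself never stores RNG state.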

iduartgomez commented 4 years ago

Everything fixed.