rajasekarv / vega

A new, arguably faster, implementation of Apache Spark from scratch in Rust
Apache License 2.0
2.23k stars · 206 forks

Improve application configuration execution/deployment #54

Closed iduartgomez closed 4 years ago

iduartgomez commented 4 years ago

Right now the way we are doing configuration is a bit lacklustre: we use clap to parse many of the configuration parameters, passing them as command-line arguments. This creates a problem in user-created applications, where our arguments collide with the application's own command-line arguments.

Similarly, this already collides with cargo's own optional parameters; for example, something like `cargo test -- --test-threads=1` will fail.

We must provide a more elegant and ergonomic way to pass configuration parameters that does not collide with user (or generated, e.g. cargo) code. A first approach is to add/revamp the configuration file we are already using (hosts.conf) to include more configuration parameters, which we would eventually have to do anyway. Additionally, centralize all environment-variable configuration management (under env.rs) at initialization and document it, so the user can use those variables to set any required parameters.

Also, for local execution and testing, many of the defaults could be provided (e.g. for NS_LOCAL_IP) so they don't have to be supplied either by env variable or argument parameter (e.g. Spark itself assigns a free local IP if necessary when executing in local mode).
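A minimal sketch of what centralized env-var configuration with local-mode defaults could look like. The `Configuration` struct and the `127.0.0.1` fallback are illustrative assumptions; only the NS_LOCAL_IP and NS_LOG_LEVEL variable names come from the thread.

```rust
use std::env;

/// Hypothetical centralized configuration, reading all NS_* environment
/// variables in one place instead of scattering `env::var` calls around.
#[derive(Debug)]
struct Configuration {
    local_ip: String,
    log_level: String,
}

impl Configuration {
    fn from_env() -> Self {
        Configuration {
            // Assumed default: fall back to loopback when NS_LOCAL_IP is
            // unset, so local-mode runs need no explicit configuration.
            local_ip: env::var("NS_LOCAL_IP").unwrap_or_else(|_| "127.0.0.1".to_string()),
            // NS_LOG_LEVEL already exists in the project; "info" default is assumed.
            log_level: env::var("NS_LOG_LEVEL").unwrap_or_else(|_| "info".to_string()),
        }
    }
}

fn main() {
    let conf = Configuration::from_env();
    println!("local_ip={} log_level={}", conf.local_ip, conf.log_level);
}
```

Reading everything at one initialization point also makes the full set of supported variables easy to document.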

iduartgomez commented 4 years ago

Oh, on top of that, we probably want to replace most info! log traces with debug! and make this a configuration parameter too.

Additionally, all the internal tracing info could probably be placed under a feature flag, but that could be another issue/PR.

iduartgomez commented 4 years ago

The log level was already parametrized, I had forgotten about it: NS_LOG_LEVEL. I'll make a commit swapping info for debug; this should remove most of the noise when running tests by default.

Also provided a default NS_LOCAL_IP in case it is not specified, but only for local deployment mode.

EDIT: done in b78e3f5

iduartgomez commented 4 years ago

Re. #73 I better comment here.

@rajasekarv the current solution is not optimal either, as passing by argument collides with user program arguments and is hacky; we must find another way to pass the arguments to the workers.

Maybe we can pass the conf to the workers by creating a file in the temp work dir or something similar. What do you think about this option?
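The temp-dir idea could be sketched like this: the driver writes simple key=value pairs to a file under the system temp dir, and each worker reads them back on startup. The file name `vega_conf` and the key=value format are hypothetical choices for illustration, not anything the project actually uses.

```rust
use std::env;
use std::fs;
use std::path::PathBuf;

// Hypothetical location for the shared config file in the temp work dir.
fn conf_path() -> PathBuf {
    env::temp_dir().join("vega_conf")
}

// Driver side: serialize configuration as key=value lines.
fn write_conf(pairs: &[(&str, &str)]) -> std::io::Result<()> {
    let body: String = pairs
        .iter()
        .map(|(k, v)| format!("{}={}\n", k, v))
        .collect();
    fs::write(conf_path(), body)
}

// Worker side: read the pairs back, splitting each line at the first '='.
fn read_conf() -> std::io::Result<Vec<(String, String)>> {
    let body = fs::read_to_string(conf_path())?;
    Ok(body
        .lines()
        .filter_map(|line| {
            let mut parts = line.splitn(2, '=');
            Some((parts.next()?.to_string(), parts.next()?.to_string()))
        })
        .collect())
}

fn main() -> std::io::Result<()> {
    write_conf(&[("local_ip", "127.0.0.1"), ("deployment_mode", "local")])?;
    for (k, v) in read_conf()? {
        println!("{} = {}", k, v);
    }
    Ok(())
}
```

This keeps the user program's argv completely untouched, at the cost of agreeing on a file location between driver and workers.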

Re. "plus ENV variables should not represent program-specific arguments": until we have the real submit and master/slave deployment machinery, we must settle for something temporary that is not intrusive. So it's either this way or through a conf file (or both?).

rajasekarv commented 4 years ago

I think that is OK. We are using named arguments, so it should be fine with regard to collision with user arguments. Basically we are doing what spark-submit does, directly in the context. Even if we separate it out, the mechanism would remain the same.

iduartgomez commented 4 years ago

By collision I was referring more to the argument parsers, etc., but maybe this is not an issue if we parse manually instead of relying on something like clap, which is a self-contained solution and very intrusive when 'alien' (unexpected) arguments are present. I will explore which option is easiest/best and make a proper PR this time.
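Manual extraction could look something like the sketch below: pick out only arguments under an assumed `--ns.` prefix and silently ignore everything else, so user and cargo arguments pass through untouched. The prefix and key names are hypothetical, chosen only for illustration.

```rust
use std::env;

// Hypothetical manual parser: collect `--ns.key=value` pairs and ignore
// all other arguments, unlike clap, which would reject unknown flags.
fn extract_ns_args(args: &[String]) -> Vec<(String, String)> {
    args.iter()
        .filter_map(|arg| {
            let rest = arg.strip_prefix("--ns.")?;
            let mut parts = rest.splitn(2, '=');
            Some((parts.next()?.to_string(), parts.next()?.to_string()))
        })
        .collect()
}

fn main() {
    let args: Vec<String> = env::args().collect();
    for (k, v) in extract_ns_args(&args) {
        println!("{} = {}", k, v);
    }
}
```

With this approach, `my_app --ns.local_ip=10.0.0.5 --test-threads=1` would yield only the `local_ip` pair, leaving `--test-threads=1` for whichever parser the user (or cargo's test harness) runs afterwards.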

iduartgomez commented 4 years ago

Improved with #75