rajasekarv / vega

A new arguably faster implementation of Apache Spark from scratch in Rust
Apache License 2.0

Handling resource destruction when the program exits #26

Closed rajasekarv closed 4 years ago

rajasekarv commented 4 years ago

Ctrl-C handling, proper destruction of resources in case of panic, and removal of the explicit executor-drop logic. Instead of cloning the Context as we do currently, create a single context, wrap it in a reference count, and move the resource-destruction logic (deleting all temp files and closing all spawned processes) into the Drop trait.
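
A minimal sketch of what that could look like (field names such as `temp_dir` and `executors` are illustrative, not the crate's actual fields): the context is created once, handed out behind an `Arc`, and all cleanup lives in `Drop`, so it runs exactly once, when the last reference is released, including during unwinding after a panic.

```rust
use std::path::PathBuf;
use std::process::Child;
use std::sync::{Arc, Mutex};

pub struct Context {
    temp_dir: PathBuf,
    executors: Mutex<Vec<Child>>,
}

impl Context {
    pub fn new() -> Arc<Self> {
        Arc::new(Context {
            temp_dir: std::env::temp_dir().join("vega"),
            executors: Mutex::new(Vec::new()),
        })
    }
}

impl Drop for Context {
    fn drop(&mut self) {
        // Close all executor processes that were spawned for this run.
        for child in self.executors.lock().unwrap().iter_mut() {
            let _ = child.kill();
        }
        // Delete all temporary files created during the run.
        let _ = std::fs::remove_dir_all(&self.temp_dir);
    }
}
```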

iduartgomez commented 4 years ago

One thing that would probably help is implementing std::ops::Drop for Context and running any cleanup procedures there when the Context goes out of scope.

rajasekarv commented 4 years ago

That's the idea. But Drop gets triggered every time a Context goes out of scope, and that happens a lot right now. So when we create a context, we should return it wrapped in an Arc and use that throughout, so that it is dropped only once, at the end of the program.

gzsombor commented 4 years ago

Removing the global variables around the Env could probably be an improvement as well. As it stands, it is unclear what would happen if you created two contexts, and it looks strange to have to specify an environment variable just to run the integration tests.

rajasekarv commented 4 years ago

@gzsombor Spark does not allow multiple Spark contexts to be created by default, AFAIK. Even if you configure it to allow multiple contexts, they can't both be active at the same time. The only way to use a context across multiple threads is by sharing it through an Arc, which is now the default behavior here: creating a Spark context gives you an Arc, so you can just clone it and use it from multiple threads to parallelize jobs. But within a single application, only one Spark context is allowed to be active at a time.
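
For illustration, a small usage sketch reusing the hypothetical `Context` from the sketch above: each thread gets its own clone of the `Arc`, so one shared context drives several jobs and is destroyed only when the last clone is dropped.

```rust
use std::sync::Arc;
use std::thread;

fn run_parallel_jobs(ctx: Arc<Context>) {
    let handles: Vec<_> = (0..4)
        .map(|i| {
            let ctx = Arc::clone(&ctx);
            thread::spawn(move || {
                // Submit job `i` through the shared context here.
                println!("job {} using context at {:p}", i, Arc::as_ptr(&ctx));
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // When `ctx` and all clones are gone, Drop runs once and cleans up.
}
```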

gzsombor commented 4 years ago

Yes, supporting multiple contexts is maybe not the most important thing; I'm referring to the fact that the Env struct and the lazy_static variables are not the cleanest solution from an architectural standpoint. Maybe the Context could hold an explicit link to the Env (through an Arc), passed in at construction time if needed, or created automatically if not. Then the Env could be destroyed when it is no longer needed. What do you think, is it worth working on this?
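
A rough sketch of that alternative, with illustrative names only: the Context would hold an explicit `Arc<Env>` supplied at construction, or created with defaults when none is given, so the Env can be dropped once the last Context referencing it goes away.

```rust
use std::sync::Arc;

struct Env {
    // Configuration, shuffle manager, caches, etc. would live here.
}

struct Context {
    env: Arc<Env>,
}

impl Context {
    /// Use an explicitly supplied environment (e.g. one shared by tests).
    fn with_env(env: Arc<Env>) -> Arc<Self> {
        Arc::new(Context { env })
    }

    /// Create a default environment automatically when none is supplied.
    fn new() -> Arc<Self> {
        Self::with_env(Arc::new(Env {}))
    }
}
```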

rajasekarv commented 4 years ago

I thought about this for a while before settling on the current design. It is difficult to achieve in any other way as far as I have seen, so in the end I just accepted the Apache Spark approach. The reason is that the global static items are basically storage units (cache, map outputs) that need to be accessed by tasks which are completely independent from the executor's point of view (the driver only keeps track of the dependencies), and they should be accessible across multiple jobs and sessions. So apart from global static variables, I don't think there are many other choices. Feel free to experiment with it, and if you find another good way, we can discuss it and try to implement it.
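
For reference, a hedged sketch (not the project's actual code) of the kind of globally accessible, lazily initialized storage being described, using `lazy_static`: caches and map-output metadata that tasks reach independently of any particular context instance.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

use lazy_static::lazy_static;

lazy_static! {
    // Hypothetical block cache: (rdd_id, partition_index) -> serialized partition data.
    static ref CACHE: Mutex<HashMap<(usize, usize), Vec<u8>>> =
        Mutex::new(HashMap::new());
    // Hypothetical map-output tracker: shuffle_id -> addresses of the map outputs.
    static ref MAP_OUTPUTS: Mutex<HashMap<usize, Vec<String>>> =
        Mutex::new(HashMap::new());
}
```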

iduartgomez commented 4 years ago

This is mostly addressed now by #75; the only thing left is Ctrl-C handling, which I will probably look into soon.
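
One possible approach for the remaining Ctrl-C handling, sketched with the `ctrlc` crate (an assumption; the project may pick a different mechanism): the handler only signals the driver loop to stop, so the context still goes out of scope normally and its `Drop`-based cleanup runs before exit.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

fn main() {
    let running = Arc::new(AtomicBool::new(true));
    let r = Arc::clone(&running);

    // Ask the driver loop to stop instead of exiting immediately, so that
    // Drop-based cleanup still runs when the context goes out of scope.
    ctrlc::set_handler(move || r.store(false, Ordering::SeqCst))
        .expect("failed to install Ctrl-C handler");

    while running.load(Ordering::SeqCst) {
        // Run or poll jobs here.
        std::thread::sleep(std::time::Duration::from_millis(100));
    }
    // The context (and its Drop cleanup) would be dropped here.
}
```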