holdenk / spark-testing-base

Base classes to use when writing tests with Spark
Apache License 2.0

Testing Structured Streaming Use Cases #164

Open ghost opened 7 years ago

ghost commented 7 years ago

I'm wondering what it would take to support testing of Structured Streaming use cases. I'm soon going to need this for my current work (and for integration with Kafka). If you could describe, at a high level, the touch points for this and things to watch out for, a coworker and I could start work on it and hopefully send a PR soon.

holdenk commented 7 years ago

If you're still interested in this, you can take a look at the tests inside Spark's own Structured Streaming code and see if there is anything there we can factor out. Otherwise, something using an in-memory sink would probably be a good initial starting point.
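For readers unfamiliar with the in-memory approach, here is a minimal sketch of what such a test could look like. It assumes Spark 2.2 through 3.x on the classpath and uses `MemoryStream` (a class that lives in Spark's internal `execution.streaming` package, so its API may shift between versions) as the source and the built-in `memory` sink as the destination. The object name `MemorySinkSketch`, the method `runDoubling`, and the query name `doubled` are made up for illustration; this is not code from spark-testing-base.

```scala
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.apache.spark.sql.execution.streaming.MemoryStream

// Hypothetical example object; not part of spark-testing-base.
object MemorySinkSketch {

  // Feeds three integers through a streaming map and reads the results
  // back from the in-memory sink as a sorted list.
  def runDoubling(): List[Int] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("memory-sink-sketch")
      .getOrCreate()
    import spark.implicits._
    // MemoryStream needs an implicit SQLContext in Spark 2.x/3.x (assumption).
    implicit val sqlCtx: SQLContext = spark.sqlContext

    try {
      val input = MemoryStream[Int]
      val doubled = input.toDS().map(_ * 2)

      // The "memory" sink registers its output as a temporary table
      // named after the query.
      val query = doubled.writeStream
        .format("memory")
        .queryName("doubled")
        .outputMode("append")
        .start()

      try {
        input.addData(1, 2, 3)
        // Blocks until all data added so far has been processed.
        query.processAllAvailable()
        spark.table("doubled").as[Int].collect().sorted.toList
      } finally {
        query.stop()
      }
    } finally {
      spark.stop()
    }
  }
}
```

A test would then compare `MemorySinkSketch.runDoubling()` against the expected list. This covers only data-driven behavior; anything that depends on processing time still needs a controllable clock.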

assiotis commented 7 years ago

The StreamTest trait looks particularly useful but it is very very tightly coupled with the rest of the Spark code base.

A small first step would perhaps be figuring out how to test time-based jobs, especially in the context of stateful aggregations where you might be waiting for a timeout. The Spark code base can inject a ManualClock via the private StartQuery() method, but no such mechanism exists outside the spark package namespace. Would the Apache project need to expose that before any work on factoring out the StreamTest trait and including it here can begin?

holdenk commented 7 years ago

If you look in our code, we put a bunch of things in the org.apache.spark namespace for this reason. It's not great, but since I have yet to convince the core Spark devs that we need to expose reasonable test APIs, it's what we are left with.
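As a rough illustration of the namespace trick described above: `ManualClock` is declared `private[spark]`, so it is only visible to code whose package sits under `org.apache.spark`. The sketch below assumes that visibility still holds in the Spark version in use. The package `org.apache.spark.testing.sketch` and the object `ManualClockAccess` are hypothetical names, not the ones used in spark-testing-base.

```scala
// Hypothetical package, chosen only so that private[spark] members are visible.
package org.apache.spark.testing.sketch

import org.apache.spark.util.ManualClock

// Hypothetical helper; not part of spark-testing-base.
object ManualClockAccess {

  // Creates a clock starting at time 0, advances it by the given number of
  // milliseconds, and returns the clock's new time.
  def advancedBy(millis: Long): Long = {
    val clock = new ManualClock(0L)
    clock.advance(millis)
    clock.getTimeMillis()
  }
}
```

This only shows that the class becomes reachable. Wiring such a clock into a running streaming query still goes through Spark-internal entry points whose signatures differ between releases, which is the part that makes factoring out StreamTest awkward.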

eddy-geek commented 6 years ago

https://github.com/holdenk/spark-testing-base/commit/6ea1b36a5f8bd1201db523dae612fd1728bd572a is relevant to this issue.

Add 2.2-specific test base for Structured Streaming. While Structured Streaming is supported back to 2.0, I'd rather keep it simple: since we didn't previously support tests for the earlier versions, we can stick with supporting the versions where it is nominally GA.