Open ghost opened 7 years ago
If you're still interested in this, you can take a look at the tests inside Spark's own structured streaming code and see if there is anything there we can factor out. Otherwise, something using an InMemory sink would probably be a good starting point initially.
The StreamTest trait looks particularly useful, but it is very tightly coupled with the rest of the Spark code base.
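As a rough illustration of the in-memory sink idea mentioned above, here is a minimal sketch of a test that does not depend on StreamTest. It assumes a Spark 2.2 to 3.x classpath, where `MemoryStream` (an internal but public class in `org.apache.spark.sql.execution.streaming`) takes an implicit `SQLContext`. The object name, the helper `doubledViaMemorySink`, and the query name `doubled` are made up for this example and are not part of spark-testing-base.

```scala
import org.apache.spark.sql.{SparkSession, SQLContext}
import org.apache.spark.sql.execution.streaming.MemoryStream

// Hypothetical helper, for illustration only.
object MemorySinkSketch {
  def doubledViaMemorySink(values: Seq[Int]): Seq[Int] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("memory-sink-sketch")
      .getOrCreate()
    import spark.implicits._
    // MemoryStream needs an implicit SQLContext in these Spark versions (assumption).
    implicit val sqlContext: SQLContext = spark.sqlContext

    // In-memory source we can push test records into.
    val input = MemoryStream[Int]
    val doubled = input.toDS().map(_ * 2)

    // The "memory" sink registers the output as a temp table named after the query.
    val query = doubled.writeStream
      .format("memory")
      .queryName("doubled")
      .outputMode("append")
      .start()

    try {
      input.addData(values: _*)
      // Block until everything added so far has been processed.
      query.processAllAvailable()
      spark.table("doubled").as[Int].collect().sorted.toSeq
    } finally {
      query.stop()
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    assert(doubledViaMemorySink(Seq(1, 2, 3)) == Seq(2, 4, 6))
  }
}
```

This only covers the simple stateless case. Anything that depends on processing time or timeouts would still need some way to control the clock.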
A small first step would perhaps be figuring out how to test time-based jobs, especially in the context of stateful aggregations where you might be waiting for a timeout. The Spark code base can inject a ManualClock via the private StartQuery() method, but no such mechanism exists outside the spark package namespace. Would that be something the Apache project needs to expose before any work on factoring out the StreamTest trait and including it here can begin?
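To make the clock-injection idea concrete, here is a small self-contained sketch in plain Scala. It is not Spark's ManualClock and does not touch any Spark API. `Clock`, `ManualTestClock`, and `SessionCounter` are hypothetical names, and the counter is only a toy stand-in for a stateful aggregation with a processing-time timeout.

```scala
// Hypothetical names throughout; this is not Spark's ManualClock.
trait Clock { def nowMs: Long }

// A clock that only moves when the test tells it to.
final class ManualTestClock(private var current: Long = 0L) extends Clock {
  def nowMs: Long = synchronized { current }
  def advance(deltaMs: Long): Unit = synchronized { current += deltaMs }
}

// Toy stateful aggregation: counts events per key and expires idle keys.
final class SessionCounter(clock: Clock, timeoutMs: Long) {
  // key -> (event count, time of last event in ms)
  private val state = scala.collection.mutable.Map.empty[String, (Long, Long)]

  def record(key: String): Unit = {
    val (count, _) = state.getOrElse(key, (0L, 0L))
    state(key) = (count + 1, clock.nowMs)
  }

  // Removes and returns the counts of keys idle for at least timeoutMs.
  def expire(): Map[String, Long] = {
    val now = clock.nowMs
    val expired = state.filter { case (_, (_, seen)) => now - seen >= timeoutMs }.toMap
    expired.keys.foreach(k => state.remove(k))
    expired.map { case (k, (count, _)) => k -> count }
  }
}

object ManualClockSketch {
  def main(args: Array[String]): Unit = {
    val clock = new ManualTestClock()
    val counter = new SessionCounter(clock, timeoutMs = 10000L)

    counter.record("a")
    counter.record("a")
    clock.advance(5000L)
    counter.record("b")
    assert(counter.expire().isEmpty)            // nothing has been idle for 10s yet

    clock.advance(5000L)
    assert(counter.expire() == Map("a" -> 2L))  // "a" idle for 10s, "b" only 5s

    clock.advance(5000L)
    assert(counter.expire() == Map("b" -> 1L))  // "b" now idle for 10s
  }
}
```

The test never sleeps: the timeout fires because the test advances the clock, which is the property a manually driven clock would give a streaming test.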
So if you look in our code, we put a bunch of things in the org.apache.spark namespace for this reason. It's not great, but since I have yet to convince the core Spark devs that we need to expose reasonable test APIs, it's what we are left with.
https://github.com/holdenk/spark-testing-base/commit/6ea1b36a5f8bd1201db523dae612fd1728bd572a is relevant to this issue.
Add 2.2 specific test base for StructuredStreaming. While StructuredStreaming is supported back to 2.0, I'd rather keep it simple: since we didn't previously support tests for the earlier versions, we can stick with supporting the versions where it is nominally GA.
I'm wondering what it would take to support testing of structured streaming use cases. I'm soon going to need this for my current work (and integration with Kafka). If you could describe at a high level any touch points for this and things to watch out for, a coworker and I could start work on this and hopefully send a PR soon.