pgaref opened 8 years ago
Basically you want to make an input reader which pushes data instead of waiting for data to be polled? I'm not sure that requires a new type of source. A new component between the source and the consumer, which polls the source and notifies the consumer, should accomplish that (see the sketch below).
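For illustration, here is a minimal sketch of such an intermediate component, assuming a pull-based source interface. The class and method names are hypothetical, not existing SEEP APIs:

```java
// Hypothetical adapter that sits between a polled source and a consumer,
// turning pull-based reads into push-based notifications. All names here
// are assumptions for illustration, not part of SEEP.
interface PolledSource<T> {
    T poll(); // returns null when no record is currently available
}

interface PushConsumer<T> {
    void onRecord(T record);
}

final class PushAdapter<T> implements Runnable {
    private final PolledSource<T> source;
    private final PushConsumer<T> consumer;
    private volatile boolean running = true;

    PushAdapter(PolledSource<T> source, PushConsumer<T> consumer) {
        this.source = source;
        this.consumer = consumer;
    }

    @Override
    public void run() {
        while (running) {
            T record = source.poll();
            if (record != null) {
                consumer.onRecord(record); // notify the consumer as soon as data arrives
            } else {
                Thread.onSpinWait();       // nothing pending; back off briefly
            }
        }
    }

    void stop() { running = false; }
}
```

Running the adapter on its own thread (e.g. `new Thread(adapter).start()`) would give the consumer push semantics without changing the source itself.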
We just need a source that is not synthetic. For that we have the FileSource (which reads from a file); now we also have a TextFileSource, and there will soon be an HDFS source. That should cover 60% of the needs for a scheduled job. Let me know if it's otherwise.
@WJCIV I don't think we need to change the way data are handled. My main concerns are:
Sources should not be part of the schedule description
Both FileSource and TextFileSource are markers: they really only provide information to the next stage on how it should read data. Even if you write a custom source, the assumption right now is that each stage runs to completion.
We need to be able to specify which specific worker(s) will act as a source (this is a more generic SEEP feature: being able to statically specify where an operator will be deployed).
In scheduled mode, we assume data is distributed across all nodes.
I think you are facing a more involved use case. It may be worth having a chat to clarify exactly what you need. That way we can create more specific issues. In the meantime, this thread is very useful.
What happens when the data are not consumed fast enough?
Currently the size of a Dataset will just grow until we run out of memory. Maybe we should add a maximum size and rate-limit a reader, so it waits for something to be consumed and space to be freed?
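A minimal sketch of what that cap could look like, assuming a simple record-oriented buffer; the names below are illustrative, not the actual Dataset API:

```java
// Hedged sketch of the proposed cap: a bounded buffer whose writer blocks
// (is effectively rate-limited) until the consumer frees space.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class BoundedDataset<T> {
    private final BlockingQueue<T> records;

    BoundedDataset(int maxRecords) {
        this.records = new ArrayBlockingQueue<>(maxRecords);
    }

    void write(T record) throws InterruptedException {
        records.put(record);   // blocks when full: back-pressures the producer
    }

    T read() {
        return records.poll(); // returns null when nothing is pending
    }
}
```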
Or the other way round when the sources cannot produce enough load
That's a problem with the current code, and slightly less of a problem with the code in the branch I am editing right now. With the current code, the reading buffer in the Dataset catches up to the writing buffer and calls flip on it; the writer must then be considered obsolete to avoid potentially overwriting something. The code I am working on detects that the writer has been caught and allocates a new buffer for the writer, so the writer always stays one step ahead. However, if there is nothing to be consumed, the read still returns null (since there is no other record pending). In theory you could, as a consumer, know that you haven't actually reached the end of the input and keep polling when you get a null return. Then we have to worry about how you know you've actually reached the end of the stream.
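For clarity, a rough sketch of the mechanism described above, simplified to a single-threaded view with hypothetical names (the real code is concurrent and structured differently):

```java
// Sketch only: when the reader catches up to the buffer the writer is
// filling, it flips that buffer for reading and the writer moves on to a
// freshly allocated one, staying one step ahead. Assumes records fit in
// a single buffer.
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

final class DoubleBufferedDataset {
    private static final int BUFFER_SIZE = 4096;

    private final Deque<ByteBuffer> readable = new ArrayDeque<>();
    private ByteBuffer writeBuffer = ByteBuffer.allocate(BUFFER_SIZE);

    void write(byte[] record) {
        if (writeBuffer.remaining() < record.length) {
            sealWriteBuffer();          // full: hand it to the readable queue
        }
        writeBuffer.put(record);
    }

    ByteBuffer read() {
        if (readable.isEmpty()) {
            if (writeBuffer.position() == 0) {
                return null;            // nothing pending; caller may poll again
            }
            sealWriteBuffer();          // reader caught the writer: take its buffer
        }
        return readable.poll();
    }

    private void sealWriteBuffer() {
        writeBuffer.flip();             // make the written bytes readable
        readable.add(writeBuffer);
        writeBuffer = ByteBuffer.allocate(BUFFER_SIZE); // writer stays one step ahead
    }
}
```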
Currently the size of a Dataset will just grow until we run out of memory. Maybe we should add a maximum size and rate-limit a reader, so it waits for something to be consumed and space to be freed?
One natural way of dealing with this problem is to schedule the stage in rounds, rather than scheduling it a single time and waiting until completion. Can someone check what Spark's strategy is in 1.6? I think that's a good model to follow in this case.
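As a rough illustration of the rounds idea (all types below are hypothetical, and this is not how SEEP's or Spark's scheduler is actually structured):

```java
// Illustrative only: run a stage repeatedly over bounded input chunks,
// rather than once until the whole input is consumed, so memory use per
// round stays bounded.
import java.util.List;

interface Stage {
    void process(List<byte[]> chunk); // run the stage over one bounded chunk
}

interface InputChunker {
    List<byte[]> nextChunk(); // null when the input is exhausted
}

final class RoundScheduler {
    void runInRounds(Stage stage, InputChunker input) {
        List<byte[]> chunk;
        while ((chunk = input.nextChunk()) != null) {
            stage.process(chunk); // each round works on a memory-bounded slice
        }
    }
}
```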
Or the other way round when the sources cannot produce enough load
I don't understand this point.
We need a SEEP-Source implementation that is always available, meaning that it will never be scheduled. The Source receives data from the network or reads from a file and keeps everything in memory. It then sends the in-memory data to the next dependent stage whenever that stage is available (i.e., being scheduled).
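A minimal sketch of what such an always-available source could look like, assuming it is pinned to a worker and drained when the dependent stage runs; the names are assumptions, not existing SEEP classes:

```java
// Hypothetical always-on source: never rescheduled, it ingests records
// into memory and hands the buffered data to the dependent stage whenever
// that stage is scheduled.
import java.util.ArrayList;
import java.util.List;

final class AlwaysOnSource {
    private final List<byte[]> buffered = new ArrayList<>();

    // Called continuously as records arrive from the network or a file reader.
    synchronized void ingest(byte[] record) {
        buffered.add(record);
    }

    // Called when the dependent stage is scheduled: drain what we have so far.
    synchronized List<byte[]> drainForNextStage() {
        List<byte[]> batch = new ArrayList<>(buffered);
        buffered.clear();
        return batch;
    }
}
```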
For the basic implementation we need:
Request for comments: @raulcf @WJCIV