pgaref opened 8 years ago
Basically you want to make an input reader which pushes data instead of waiting for data to be polled? I'm not sure that requires a new type of source. A new component between the source and the consumer, which polls the source and notifies the consumer, should accomplish that (see the sketch below).
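For illustration, here is a minimal sketch of such an intermediate component, assuming a pull-based source interface. The class and method names are hypothetical, not existing SEEP APIs:

```java
// Hypothetical adapter that sits between a polled source and a consumer,
// turning pull-based reads into push-based notifications. All names here
// are assumptions for illustration, not part of SEEP.
interface PolledSource<T> {
    T poll(); // returns null when no record is currently available
}

interface PushConsumer<T> {
    void onRecord(T record);
}

final class PushAdapter<T> implements Runnable {
    private final PolledSource<T> source;
    private final PushConsumer<T> consumer;
    private volatile boolean running = true;

    PushAdapter(PolledSource<T> source, PushConsumer<T> consumer) {
        this.source = source;
        this.consumer = consumer;
    }

    @Override
    public void run() {
        while (running) {
            T record = source.poll();
            if (record != null) {
                consumer.onRecord(record); // notify the consumer as soon as data arrives
            } else {
                Thread.onSpinWait();       // nothing pending; back off briefly
            }
        }
    }

    void stop() { running = false; }
}
```

Running the adapter on its own thread (e.g. `new Thread(adapter).start()`) would give the consumer push semantics without changing the source itself.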
We just need a source that is not synthetic. For that we have the FileSource (which reads from a file); now we also have a TextFileSource, and there will soon be an HDFS source. That should cover 60% of the needs for a scheduled job. Let me know if it's otherwise.
@WJCIV I don't think we need to change the way data are handled. My main concerns are:
Sources should not be part of the schedule description
Both FileSource and TextFileSource are markers: they really only provide information to the next stage on how it should read data. Even if you write a custom source, the assumption right now is that each stage runs to completion.
We need to be able to specify which specific worker(s) will act as a source (this is a more generic SEEP feature: being able to statically specify where an operator will be deployed).
In scheduled mode, we assume data is distributed across all nodes.
I think you are facing a more involved use case. It may be worth having a chat to clarify exactly what you need. That way we can create more specific issues. In the meantime, this thread is very useful.
What happens when the data are not consumed fast enough?
Currently the size of a Dataset will just grow until we run out of memory. Maybe we should add a maximum size and rate-limit a reader, so it waits for something to be consumed and space to be freed?
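A minimal sketch of what that cap could look like, assuming a simple record-oriented buffer; the names below are illustrative, not the actual Dataset API:

```java
// Hedged sketch of the proposed cap: a bounded buffer whose writer blocks
// (is effectively rate-limited) until the consumer frees space.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class BoundedDataset<T> {
    private final BlockingQueue<T> records;

    BoundedDataset(int maxRecords) {
        this.records = new ArrayBlockingQueue<>(maxRecords);
    }

    void write(T record) throws InterruptedException {
        records.put(record);   // blocks when full: back-pressures the producer
    }

    T read() {
        return records.poll(); // returns null when nothing is pending
    }
}
```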
Or the other way round when the sources cannot produce enough load
That's a problem with the current code, and slightly less of a problem with the code in the branch I am editing right now. With the current code, the reading buffer in the Dataset catches up to the writing buffer and calls flip on it; the writer must then be considered obsolete to avoid potentially overwriting something. The code I am working on detects that the writer has been caught and allocates a new buffer for the writer, so the writer always stays one step ahead. However, if there is nothing to be consumed, the read still returns null (since there is no other record pending). In theory you could, as a consumer, know that you haven't actually reached the end of the input and keep polling when you get a null return. Then we have to worry about how you know you've actually reached the end of the stream.
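For clarity, a rough sketch of the mechanism described above, simplified to a single-threaded view with hypothetical names (the real code is concurrent and structured differently):

```java
// Sketch only: when the reader catches up to the buffer the writer is
// filling, it flips that buffer for reading and the writer moves on to a
// freshly allocated one, staying one step ahead. Assumes records fit in
// a single buffer.
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;

final class DoubleBufferedDataset {
    private static final int BUFFER_SIZE = 4096;

    private final Deque<ByteBuffer> readable = new ArrayDeque<>();
    private ByteBuffer writeBuffer = ByteBuffer.allocate(BUFFER_SIZE);

    void write(byte[] record) {
        if (writeBuffer.remaining() < record.length) {
            sealWriteBuffer();          // full: hand it to the readable queue
        }
        writeBuffer.put(record);
    }

    ByteBuffer read() {
        if (readable.isEmpty()) {
            if (writeBuffer.position() == 0) {
                return null;            // nothing pending; caller may poll again
            }
            sealWriteBuffer();          // reader caught the writer: take its buffer
        }
        return readable.poll();
    }

    private void sealWriteBuffer() {
        writeBuffer.flip();             // make the written bytes readable
        readable.add(writeBuffer);
        writeBuffer = ByteBuffer.allocate(BUFFER_SIZE); // writer stays one step ahead
    }
}
```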
Currently the size of a Dataset will just grow until we run out of memory. Maybe we should add a maximum size and rate-limit a reader, so it waits for something to be consumed and space to be freed?
One natural way of dealing with this problem is to schedule the stage in rounds, rather than scheduling it a single time and waiting until completion. Can someone check what Spark's strategy is in 1.6? I think that's a good model to follow in this case.
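As a rough illustration of the rounds idea (all types below are hypothetical, and this is not how SEEP's or Spark's scheduler is actually structured):

```java
// Illustrative only: run a stage repeatedly over bounded input chunks,
// rather than once until the whole input is consumed, so memory use per
// round stays bounded.
import java.util.List;

interface Stage {
    void process(List<byte[]> chunk); // run the stage over one bounded chunk
}

interface InputChunker {
    List<byte[]> nextChunk(); // null when the input is exhausted
}

final class RoundScheduler {
    void runInRounds(Stage stage, InputChunker input) {
        List<byte[]> chunk;
        while ((chunk = input.nextChunk()) != null) {
            stage.process(chunk); // each round works on a memory-bounded slice
        }
    }
}
```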
Or the other way round when the sources cannot produce enough load
I don't understand this point.
We need a SEEP-Source implementation that is always available, meaning that it will never be scheduled. The Source receives data from the network or reads from a file and keeps everything in memory. It then sends the in-memory data to the next dependent stage whenever that stage is available (i.e., being scheduled).
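A minimal sketch of what such an always-available source could look like, assuming it is pinned to a worker and drained when the dependent stage runs; the names are assumptions, not existing SEEP classes:

```java
// Hypothetical always-on source: never rescheduled, it ingests records
// into memory and hands the buffered data to the dependent stage whenever
// that stage is scheduled.
import java.util.ArrayList;
import java.util.List;

final class AlwaysOnSource {
    private final List<byte[]> buffered = new ArrayList<>();

    // Called continuously as records arrive from the network or a file reader.
    synchronized void ingest(byte[] record) {
        buffered.add(record);
    }

    // Called when the dependent stage is scheduled: drain what we have so far.
    synchronized List<byte[]> drainForNextStage() {
        List<byte[]> batch = new ArrayList<>(buffered);
        buffered.clear();
        return batch;
    }
}
```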
For the basic implementation we need:
Request for comments: @raulcf @WJCIV