Open jongwook opened 6 years ago
Sorry to take so long in responding to this! (What's 18 months between friends, anyway...)
I like the general idea here, but I'm a little wary of scope. Much of what you describe applies to any kind of iterable sequence, and could already be achieved using functionality in itertools. I'm not sure we need to do anything here.
The one big exception would be group
, which really needs to understand the shape of iterates, and which we already implement as buffer
. (shuffle
is a bit of a weird case, but I think you can reasonably implement that as a generic iterator.)
I'd love to have a monadic/functional methods on Streamer and its subclasses. Those include:
map()
to apply data transformation like normalizationflatmap()
for parsing files, augmenting datafilter()
to filter data based on some criteriashuffle()
to shuffle the dataset based on a buffergroup()
to make fixed-sized batches of data pointstake()
/skip()
to get the first few or other than the first few data pointsThis programming style and nomenclature are from the functional programming idioms, but they are becoming increasingly popular in both server-side asynchronous/event-based programming and data processing. Examples include ReactiveX in many languages, CompletableFuture in Java 8+, every collection type in Scala, RDD in Apache Spark, and most recently tf.data from TensorFlow.
Personally, I think this is the right API to deal with the flow of the large datasets we work on - It makes it simple to compose short and easy-to-debug operations to build a wide range of complex behaviors, and the data is treated immutable making the whole process less error-prone.
I can imagine the following kinds of use cases:
Streamer(filenames).flatmap(tf.python_io.tf_record_iterator)
will transform a Streamer of file names to a streamer of tensorflow Example protocol buffers.tf.parse_single_example
where it isn't a TensorFlow operation and we don't have to deal with Graph and Session quirksThe proof of concept implementation is in Python 3 (will break miserably in Python 2 for now) and has an implementation of
Streamer
class with desired methods, along with example usages. The part 4 shows how this API can perform multiplexing on its own, and part 5 shows how it can be incorporated to the current Class API without breaking it. It relies heavily on generator comprehensions and lambdas to enable lazy evaluation (which makes the examples like the sieve of Eratosthenes possible), but I hope this can be incorporated in a less hacky-looking way.The naming and arguments for each method and whether or not to include them are up to discussion. Please let me know how you'd think!