Closed reiase closed 2 years ago
This API requires further review and discussion @GuoRentong @fzliu
@reiase Is there any way we can split the high level pythonic pipeline creation and the datacollection? I feel like a user should be able to create a pipeline without having to use the extra complexity that comes from datacollection.
Background and Motivation
Features
Mixin
extensions; #672
Design
Design Goal:
Core API
DataCollection provides a set of functional programming APIs that allow the user to assemble a data processing pipeline by chaining functions and operators. The method-chaining style of API has proven to be very powerful in data science and is widely accepted by the community.
The core API of DataCollection is designed to stay close to Scala's and Spark's collection APIs, with additional improvements for accelerating some work in Python.
Creating
A DataCollection is always created from an Iterable.
Combining small data collections produces a larger one, which is useful when the data is collected from different sources:
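A minimal sketch of creating and combining collections. `MiniCollection` is a hypothetical stand-in written for this illustration, not the towhee DataCollection class, and using `+` to combine collections is an assumption:

```python
from itertools import chain
from typing import Iterable, Iterator


class MiniCollection:
    """Hypothetical stand-in for DataCollection (illustration only)."""

    def __init__(self, iterable: Iterable):
        # Any iterable works: list, range, generator, ...
        self._iterable = iterable

    def __iter__(self) -> Iterator:
        return iter(self._iterable)

    def __add__(self, other: "MiniCollection") -> "MiniCollection":
        # Assumed combining operator: lazily concatenate both sources.
        return MiniCollection(chain(self, other))

    def to_list(self) -> list:
        return list(self)


from_list = MiniCollection([1, 2, 3])       # created from a list
from_range = MiniCollection(range(4, 7))    # created from a range
combined_items = (from_list + from_range).to_list()
print(combined_items)  # [1, 2, 3, 4, 5, 6]
```

Because the combination is built on `itertools.chain`, neither source is copied until the result is consumed.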
Data collections can be zipped into a wider one:
Zipping two data collections is useful when benchmarking a model on a dataset, where both predictions and labels are needed to calculate model metrics.
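The benchmarking use case can be sketched as follows. The `MiniCollection` class and its `zip` method are hypothetical names chosen for this example, and the prediction and label values are made up:

```python
class MiniCollection:
    """Hypothetical stand-in for DataCollection (illustration only)."""

    def __init__(self, iterable):
        self._iterable = iterable

    def __iter__(self):
        return iter(self._iterable)

    def zip(self, other):
        # Inside a method, the bare name `zip` still resolves to the builtin.
        return MiniCollection(zip(self, other))


predictions = MiniCollection([1, 0, 1, 1])  # made-up model outputs
labels = MiniCollection([1, 0, 0, 1])       # made-up ground truth

# Each element of the zipped collection is a (prediction, label) pair.
pairs = list(predictions.zip(labels))
accuracy = sum(p == y for p, y in pairs) / len(pairs)
print(pairs)     # [(1, 1), (0, 0), (1, 0), (1, 1)]
print(accuracy)  # 0.75
```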
Collection API
The Collection API is a set of higher-order functional APIs that apply a function/operator to the collection and produce a new data collection.
map: apply a function/operator to each element and generate a new collection of the results
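A minimal sketch of `map`, assuming a lazy implementation on top of Python's builtin `map`. `MiniCollection` is a hypothetical stand-in, not the towhee class:

```python
class MiniCollection:
    """Hypothetical stand-in for DataCollection (illustration only)."""

    def __init__(self, iterable):
        self._iterable = iterable

    def __iter__(self):
        return iter(self._iterable)

    def map(self, fn):
        # Apply fn to every element; the result is a new collection.
        return MiniCollection(map(fn, self))

    def to_list(self):
        return list(self)


squares = MiniCollection([1, 2, 3]).map(lambda x: x * x).to_list()
print(squares)  # [1, 4, 9]
```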
Multi-line functions are also supported through decorator syntax:
Notice that the function name is used as the variable name for the resulting data collection. In other words, result is a data collection, not a function.
filter: apply a predicate to each element and generate a new collection that contains only the elements that pass it.
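The decorator form and `filter` can be sketched together. This assumes `map` simply returns a collection, so that decorating a function with it rebinds the function name to that collection. `MiniCollection` is a hypothetical stand-in:

```python
class MiniCollection:
    """Hypothetical stand-in for DataCollection (illustration only)."""

    def __init__(self, iterable):
        self._iterable = iterable

    def __iter__(self):
        return iter(self._iterable)

    def map(self, fn):
        return MiniCollection(map(fn, self))

    def filter(self, predicate):
        # Keep only the elements for which predicate(element) is truthy.
        return MiniCollection(filter(predicate, self))

    def to_list(self):
        return list(self)


dc = MiniCollection(range(6))


@dc.map
def result(x):
    # A multi-line function body, applied to every element of dc.
    y = x + 1
    return y * 2


# After decoration, `result` names a collection, not a function.
is_collection = isinstance(result, MiniCollection)
multiples_of_four = result.filter(lambda v: v % 4 == 0).to_list()
print(is_collection)      # True
print(multiples_of_four)  # [4, 8, 12]
```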
Advanced Collection API
Chaining functions
There are two ways to chain functions/operators:
chained collection API calls
the pipeline operator >>
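Both styles can be sketched side by side. The exact semantics of `>>` are not spelled out here, so this example assumes one plausible reading: `collection >> fn` is shorthand for `collection.map(fn)`. `MiniCollection`, `add_one`, and `double` are hypothetical names:

```python
class MiniCollection:
    """Hypothetical stand-in for DataCollection (illustration only)."""

    def __init__(self, iterable):
        self._iterable = iterable

    def __iter__(self):
        return iter(self._iterable)

    def map(self, fn):
        return MiniCollection(map(fn, self))

    def __rshift__(self, fn):
        # Assumption: `dc >> fn` behaves like `dc.map(fn)`.
        return self.map(fn)

    def to_list(self):
        return list(self)


def add_one(x):
    return x + 1


def double(x):
    return x * 2


# Style 1: chained collection API calls.
chained = MiniCollection(range(5)).map(add_one).map(double).to_list()

# Style 2: the pipeline operator.
piped = (MiniCollection(range(5)) >> add_one >> double).to_list()

print(chained)  # [2, 4, 6, 8, 10]
print(piped)    # [2, 4, 6, 8, 10]
```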
Use DataCollection with towhee
DataCollection supports a call dispatcher, which redirects method calls on a data collection to extended APIs. For example:
We can dispatch function calls to towhee operators, which makes it much easier to assemble a towhee pipeline.
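One way such a dispatcher could work is through `__getattr__`, which Python calls only when normal attribute lookup fails. The sketch below is an assumption about the mechanism, not towhee's implementation. The `OPERATORS` registry and the `scale` and `offset` operators are hypothetical:

```python
# Hypothetical registry: operator name -> factory that returns a callable.
OPERATORS = {
    "scale": lambda factor: (lambda x: x * factor),
    "offset": lambda delta: (lambda x: x + delta),
}


class MiniCollection:
    """Hypothetical stand-in for DataCollection (illustration only)."""

    def __init__(self, iterable):
        self._iterable = iterable

    def __iter__(self):
        return iter(self._iterable)

    def map(self, fn):
        return MiniCollection(map(fn, self))

    def to_list(self):
        return list(self)

    def __getattr__(self, name):
        # Reached only for names that are not regular attributes or methods.
        if name.startswith("_") or name not in OPERATORS:
            raise AttributeError(name)

        def dispatch(*args, **kwargs):
            # Build the operator from its arguments, then map it.
            op = OPERATORS[name](*args, **kwargs)
            return self.map(op)

        return dispatch


dispatched = MiniCollection([1, 2, 3]).scale(10).offset(1).to_list()
print(dispatched)  # [11, 21, 31]
```

With this design, adding a new chainable method only requires registering a factory; the collection class itself does not change.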
Examples for building an image search engine with towhee
Pros and Cons
No response
Anything else? (Additional Context)
No response