uccross / skyhookdm-ceph-cls

Skyhook Data Management: Storage and management of tabular data in Ceph.
https://www.skyhookdm.com
GNU Lesser General Public License v2.1

Add support for operations on List types to cls_tabular class #38

Open jlefevre opened 4 years ago

jlefevre commented 4 years ago

Scientific data often uses arrays; for example, arrays of non-uniform length appear very commonly in ROOT data. These arrays are currently represented as List< > types in Apache Arrow within Skyhook.

Adding support for simple list operations on Arrow data allows us to push down selection predicates/filters into storage. This is beneficial for our scientific data and for Arrow data more generally (lists are a native data type in Arrow). The most useful ops in our context are likely those for data reduction, such as filters or summary/aggregate methods (min/max/count/sum/first). Performing these in storage across billions of rows, rather than returning all data, will become important as we increase scientific data scales.
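As a rough illustration of the reductions mentioned above, the sketch below stores a jagged column as a flat list of values plus row offsets (the same general idea as Arrow's variable-size list layout) and computes one summary value per row. All names here (`row_slices`, `reduce_rows`, the op strings) are hypothetical and are not part of Skyhook or Arrow; returning `None` for min/max/first on an empty list is an assumption, not a decided behavior.

```python
from typing import Any, Iterator, List, Sequence


def row_slices(offsets: Sequence[int], values: Sequence[Any]) -> Iterator[Sequence[Any]]:
    """Yield one sub-list per row; row i spans values[offsets[i]:offsets[i+1]]."""
    for start, end in zip(offsets[:-1], offsets[1:]):
        yield values[start:end]


def reduce_rows(offsets: Sequence[int], values: Sequence[Any], op: str) -> List[Any]:
    """Apply a per-row reduction so only one value per row is returned."""
    # Hypothetical op names; empty rows give None for min/max/first (an assumption).
    ops = {
        "min": lambda xs: min(xs) if xs else None,
        "max": lambda xs: max(xs) if xs else None,
        "count": len,
        "sum": sum,
        "first": lambda xs: xs[0] if xs else None,
    }
    if op not in ops:
        raise ValueError(f"unsupported list op: {op}")
    fn = ops[op]
    return [fn(row) for row in row_slices(offsets, values)]


# Three rows of non-uniform length: [1.5, 2.0], [], [7.0, 3.0, 4.5]
values = [1.5, 2.0, 7.0, 3.0, 4.5]
offsets = [0, 2, 2, 5]

counts = reduce_rows(offsets, values, "count")
maxima = reduce_rows(offsets, values, "max")
```

The point of the sketch is the data reduction: five stored values shrink to three returned values per op, which is what makes these ops attractive candidates for pushdown.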

We need to identify (with help from data analysts and their workloads) some common and simple list ops used in scientific analysis, and then determine which of these can be implemented as pushdown predicates. Here is one example usage of array ops.

We can also use the Awkward Array library to examine and motivate specific, simple ops to consider, though Awkward has much more complex ops as well. We do not want to re-implement any Awkward-specific operations, only to understand the simple list ops that are generally useful for arrays, and for ROOT analysis in particular. Once this is well understood (which ops and how to offload them), we may want to consider including the Awkward Python library within Skyhook via a WebAssembly approach that we are developing. For example, here is a scientific Python package compiled for WebAssembly.

This will involve the following steps.

We have jagged array types defined here, but do not yet support predicate operations on this data type: link

Apply predicates will be called here, which will include the specified new ops: link

Add a case to apply predicates on the list/jagged array type: link

Add a compare (or similar) function that applies the new ops on lists. Here is an example of a simple integer compare op: link. And a regex op and a date op: link

Define any new operation types for list, here: link
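The steps above can be sketched in miniature as follows: a set of list operation types, a compare function that evaluates one op against one row's list, and a predicate pass that keeps only the matching rows. This is a standalone Python sketch, not the cls_tabular code; `ListOp`, `compare_list`, and `apply_list_predicate` are hypothetical names, and treating an empty list as non-matching for the "all" op is an assumption.

```python
from enum import Enum
from typing import List, Sequence


class ListOp(Enum):
    """Hypothetical list operation types; not Skyhook's actual op codes."""
    ANY_GT = "any_gt"   # at least one element > operand
    ALL_GT = "all_gt"   # every element > operand
    LEN_GE = "len_ge"   # list length >= operand


def compare_list(row: Sequence[float], op: ListOp, operand: float) -> bool:
    """Evaluate one list op against a single row's list value."""
    if op is ListOp.ANY_GT:
        return any(x > operand for x in row)
    if op is ListOp.ALL_GT:
        # all() is True for an empty list; here empty rows do not match (an assumption).
        return len(row) > 0 and all(x > operand for x in row)
    if op is ListOp.LEN_GE:
        return len(row) >= operand
    raise ValueError(f"unsupported list op: {op}")


def apply_list_predicate(rows: Sequence[Sequence[float]], op: ListOp, operand: float) -> List[int]:
    """Return the indices of rows whose list passes the predicate."""
    return [i for i, row in enumerate(rows) if compare_list(row, op, operand)]


# A jagged column with three rows of different lengths.
rows = [[10.0, 42.5], [], [5.0, 7.5, 9.0]]

passing_any = apply_list_predicate(rows, ListOp.ANY_GT, 9.5)
passing_all = apply_list_predicate(rows, ListOp.ALL_GT, 4.0)
passing_len = apply_list_predicate(rows, ListOp.LEN_GE, 3)
```

Keeping the op types separate from the compare function mirrors the structure described in the steps: new ops are declared in one place and dispatched in another, so adding an op does not change the predicate loop.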

Adding support for operations on list types will be useful in general in Skyhook for any supported data format, since lists often appear in JSON and log/sensor data.

carlosmalt commented 4 years ago

Hi Jeff, can you provide a bit more info on why rather than what? Who wants this functionality? I would also like to see a plan for how this work can be isolated from the effects of Awkward rewrites (one has already happened since the original release). Also, it would be nice if the debugging work that Jim and others already performed on equivalent functionality within Awkward didn't have to be repeated here. Perhaps this is a good use case for thinking about how to run libraries within Ceph/CLS using the outcomes of Saloni's work on WebAssembly isolates.

jlefevre commented 4 years ago

Good points, Carlos, thanks. I totally agree and have updated the beginning of the comment to add motivation. We should definitely not duplicate Awkward functionality or get into the business of closely tracking the Awkward library. It was really meant as motivation for the types of list ops that scientific analysts use. The simple ops can probably apply to lists in general, so I think this will be very useful. Regarding WebAssembly, yes, once we understand how to express and offload list predicates, we should definitely consider the feasibility of a WebAssembly approach for Awkward.

carlosmalt commented 4 years ago

Now that I know a bit more about Arrow, doesn't this duplicate Arrow compute and function functionalities?

jlefevre commented 4 years ago

Yes, we can use the array ops available in Arrow. Xiongfeng is working on this.