Closed ericemc3 closed 3 years ago
Great question. We should put this in the documentation, too, as I think plenty of others will also be asking this. Here is a first draft, let me know what you think...
Why are only `op` functions supported?
Any function that is callable within an Arquero table expression must be defined on the `op` object, either as a built-in function or added via the extensibility API. Why? Why can't one just use a function directly?
As described earlier, Arquero table expressions can look like normal JavaScript functions, but are treated specially: their source code is parsed and new custom functions are generated to process data. This process prevents the use of closures, such as referencing functions or values defined externally to the expression.
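To see why closures cannot survive this kind of processing, here is a minimal sketch in plain JavaScript. It is not Arquero's implementation, only a simplified stand-in for the "read the source, generate a new function" step; the `regenerate` helper, `makeFilter`, and the sample row are hypothetical names introduced for illustration.

```javascript
// Hypothetical helper, NOT Arquero's code: rebuild a function from its
// source text alone, as a parse-and-generate pipeline effectively does.
function regenerate(fn) {
  // fn.toString() yields only the source text; captured variables are lost.
  return new Function(`return (${fn.toString()});`)();
}

function makeFilter() {
  const prefix = '31'; // local variable, captured by the closure below
  return d => d.codgeo.startsWith(prefix);
}

const original = makeFilter();
const rebuilt = regenerate(original);
const row = { codgeo: '31555' }; // made-up sample row

// The original closure still sees `prefix`.
const originalResult = original(row);

// The rebuilt function has the same source but no access to `prefix`.
let rebuiltError = null;
try {
  rebuilt(row);
} catch (err) {
  rebuiltError = err; // a ReferenceError, since `prefix` is not in scope
}
console.log(originalResult, rebuiltError && rebuiltError.name);
```

The regenerated function is textually identical to the original, yet it fails because the variable it relied on lived outside the expression.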
But why do we do this? Here are a few reasons:
Performance. After parsing an expression, Arquero performs code generation, often creating more performant code in the process. This level of indirection also allows us to generate optimized expressions for certain inputs, such as Apache Arrow data.
Flexibility. Providing our own parsing also allows us to introduce new kinds of backing data without changing the API. For example, we could add support for different underlying data formats and storage layouts.
Portability. While a common use case of Arquero is to query data directly in the same JavaScript runtime, Arquero verbs can also be serialized as queries: one can specify verbs in one environment, but then send them to another environment for processing. For example, the arquero-worker package sends queries to a worker thread, while the arquero-sql package sends them to a backing database server. As custom methods may not be defined in those environments, Arquero is designed to make this translation between environments possible and easier to reason about.
Safety. Arquero table expressions do not let you call methods defined on input data values. For example, to trim a string you must call `op.trim(str)`, not `str.trim()`. Again, this aids portability: otherwise unsupported methods defined on input data elements might "sneak" in to the processing. Invoking arbitrary methods may also lead to security vulnerabilities when allowing untrusted third parties to submit queries into a system.
Discoverability. Defining all functions on a single object provides a single catalog of all available operations. In most IDEs, you can simply type `op.` (and perhaps hit the Tab key) to see a list of all available functions and benefit from auto-complete!
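The portability argument in the list above can be illustrated with a small plain-JavaScript sketch. This is not Arquero's serialization format or its `op` object: the catalog, the message shape, and the `revive` helper are all assumptions made up for this example. The point is only that an expression whose functions all come from a shared catalog can be shipped as text and rebuilt elsewhere.

```javascript
// Hypothetical stand-in for a shared function catalog; NOT Arquero's real `op`.
const op = {
  trim: s => s.trim(),
  substring: (s, start, end) => s.substring(start, end),
};

// "Sender" side: only the expression's source text travels, never a closure.
const expr = d => op.substring(op.trim(d.codgeo), 0, 2) === '31';
const message = JSON.stringify({ verb: 'filter', expr: expr.toString() });

// "Receiver" side: rebuild the function and supply the receiver's own catalog.
// Calling new Function on untrusted text is unsafe; a real system would parse
// and validate the expression instead. It is used here to keep the sketch short.
function revive(json, catalog) {
  const { expr: source } = JSON.parse(json);
  return new Function('op', `return (${source});`)(catalog);
}

const remoteFilter = revive(message, op);
const rows = [{ codgeo: ' 31555' }, { codgeo: '75056' }]; // made-up sample rows
const kept = rows.filter(remoteFilter).map(d => d.codgeo);
console.log(kept);
```

Because every function call goes through the catalog, the receiving side only needs its own implementation of the same names; a call such as `d.codgeo.trim()` would instead depend on whatever the data values happen to be in that environment.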
Of course, one might wish to make different trade-offs. Arquero is designed to support common use cases while also being applicable to more complex production setups. This goal comes with the cost of more rigid management of functions. That said, Arquero can be extended with custom variables, functions, and even new table methods or verbs! As starting points, see the `params`, `addFunction`, and `addTableMethod` functions to introduce external variables, register new `op` functions, or extend tables with new methods.
Thank you very much for these insights, which I find very enlightening. Arrow support, safety, and performance are three key criteria whose importance is easy to demonstrate. I now have enough material to synthesize and include in my presentation!
I consider Arquero a huge step forward for rich and efficient open dataflows on the web, and also a great boost for Arrow, D3, Vega-Lite-API, and data visualization in general. Thank you once again for your work, and also for helping, with Mike Bostock, Arquero and Observable to work well together.
I have other (falsely naive) questions; I should probably ask them somewhere else. Anyway, here they are:
- What was Arquero originally designed for: leveraging Arrow capabilities in JS, extending Vega's data-transformation features, or other motivations?
It began as a side project for fun during my academic sabbatical. Then it kind of steam-rolled into a full-fledged library. The goal was to build a more performant and adaptable JS query tool that extends what Vega can do and make it available outside of Vega specifications. I wanted my students and others working with Vega or D3 to be able to prepare/transform data comprehensively without having to move between different environments. The original focus was to support standard JS data structures first and foremost. Only later did I seek to push the API further by also providing direct Arrow support.
- Who is the team behind Arquero: only you, or other people as well?
It is primarily just me for the core library. @chanwutk and @suikac have been working on arquero-sql.
- What kind of feedback are you expecting from Arquero users?
All the standard stuff: feature requests, bug reports, documentation feedback, etc. I'd also love to hear from anyone using the library to learn more about what they are using it for.
Thank you for these additions.
This question may already be answered somewhere, but I haven't found it yet.
I am preparing a tutorial and I'd like to explain this point (which is not a problem for me), in case I am asked about it.
Why is it not possible to use:
tb.filter(d => d.codgeo.substr(0,2) == '31') // => yields an error message Invalid function call: "d.codgeo.substr(0,2)"
and why is this syntax necessary, with `op.`:
tb.filter(d => op.substring(d.codgeo, 0, 2) == '31')
note that:
tb.filter(d => d.codgeo.substring(0, 2) == '31')
does not yield an error (probably because `op.substring()` exists), but of course doesn't work.