terascope / teraslice

Scalable data processing pipelines in JavaScript
https://terascope.github.io/teraslice/
Apache License 2.0

Data-mate functions discussion #2553

Closed ciorg closed 1 month ago

ciorg commented 3 years ago

Initial thoughts on additional data-mate functions or changes. Related to ticket #2242

There are functions in ts-transforms (isIn, isNumeric, matches, etc.) that are imported from the Validator package but are not available in data-mate. It seems like all validations should draw from one source: any external packages should be integrated at the lowest level first and accessed from there.

Validation additions:

Validation Changes

All the other additional functions I had are found in ts-transforms from the validator module.

kstaken commented 3 years ago

Goals

The goals for these functions are to provide a consistent set of data validation and transform functions usable in all contexts:

  1. data engineering workflows
  2. ts-transforms (goal to phase out)
  3. xLucene
  4. DataFrames
  5. QPL Query directives
  6. QPL Variable directives
  7. QPL Transform directives (future replacement for ts-transforms)

Given this broad suite of use cases, changes have to be cognizant of the requirements of all of them. So an approach like "removing the use of parent context since that is not available in DataFrames" is not correct: breaking the library because of a limitation in one context is problematic.

The correct approach has 2 parts:

  1. DataFrames or other contexts should exclude operations that require things that aren't yet supported.
  2. DataFrames and other contexts should eventually gain the necessary support for the things that are excluded (if it makes sense to do so).
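As a sketch of the first part, a context could filter the shared function registry down to what it can support. The `FnConfig` shape here is a hypothetical minimal stand-in, loosely modeled on the `requires_parent_context` flag that appears in the config example later in this thread:

```typescript
// Hypothetical config shape; only the fields needed for filtering.
interface FnConfig {
    name: string;
    requires_parent_context: boolean;
}

// A DataFrame context would exclude anything needing parent context
// until that support exists.
function availableInDataFrames(fns: FnConfig[]): FnConfig[] {
    return fns.filter((fn) => !fn.requires_parent_context);
}
```

This keeps the library itself intact; each context opts out of what it cannot yet run.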

Typing

I think one of the key questions is going to be how strongly typed this library is. A large part of the reason for this library is to help validate the application of types, so it's worth considering whether we make that more explicit; it's definitely going to be an issue as we push further into the QPL context, where types are required.

Primitive functions

For DataFrames and QPL we're also going to need to define a suite of common base functions to manipulate particular types of data. Functionality that exists at the basic level in JavaScript, like the String-related functions, will need to be included in this library and adapted for the chaining workflow. This will be the case for all basic types.

What we surface here will be a balancing act: we don't want to expose every core function available in JavaScript, but we do want to expose enough that users have the power to solve the majority of data-level problems they will face.

Each category above will need to be considered in depth separately. Primary usage will be QPL so we'll want to focus on how they function and combine in that context. We'll also want to consider namespacing and future extensions.
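As a rough illustration of what "adapted for the chaining workflow" could mean for string primitives, here is a minimal sketch; the `chain` helper and the `normalize` example are hypothetical, not part of data-mate:

```typescript
type FieldFn = (value: string) => string;

// Compose single-value field functions left to right.
function chain(...fns: FieldFn[]): FieldFn {
    return (value) => fns.reduce((acc, fn) => fn(acc), value);
}

// Surfacing JavaScript's String basics as chainable field functions:
const normalize = chain(
    (s) => s.trim(),
    (s) => s.toLowerCase(),
);

// normalize('  Hello ') returns 'hello'
```

The open design question is which of these primitives to surface and under what names, not how to compose them.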

Selector syntax

On transform-type functions that may be operating on nested JSON data, we're going to want a consistent way to select and flatten the data. This may be related to the Expression syntax, or it may be something different; I'm not sure, but it is something we need to consider. The purpose is simply to select portions of the JSON tree for further processing.

In the reporting prototype code I had used this library: https://www.npmjs.com/package/@gizt/selector

Obviously in some scenarios we also use xLucene as a selector syntax, but it's really more of an expression language, as it doesn't have a clean ability to reach deep into JSON docs (one of the things we want to solve with xLucene integration with this lib). So you might be able to say something like select(path.sub.array[0].value): 100 AND select(path.sub2.array[0].value): add(select(path.sub.array[0].value), 100). Obviously a contrived example, but as we increase usage of xLucene in larger contexts like QPL we'll want more power to control it.

We also need to consider this in relation to regex, as it is technically a selector syntax too, although fundamentally different in usage and intention.
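For reference, the core of a dotted-path selector can be sketched in a few lines. This is only an illustration of the concept (single-value lookups, no wildcards), not the @gizt/selector API:

```typescript
// Resolve paths like 'path.sub.array[0].value' against nested data.
// Array indices are normalized into plain path segments first.
function select(path: string, data: unknown): unknown {
    const segments = path
        .replace(/\[(\d+)\]/g, '.$1')
        .split('.')
        .filter(Boolean);
    return segments.reduce<any>(
        (acc, key) => (acc == null ? undefined : acc[key]),
        data,
    );
}
```

A real selector syntax would add wildcards, slices, and multi-match flattening on top of this kind of traversal.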

Expression syntax

This is related to the concept of Selectors and to xLucene but brings in conditionals and a little more scripting type logic without exposing a full scripting language. We have looked to Jexl for this so far: https://www.npmjs.com/package/jexl

I'm not entirely enthused about the idea of having selectors, xLucene, regex and Jexl which opens the question if maybe xLucene in general should be extended to support all cases directly. That's outside the scope of this library but does affect how we approach some things here so I wanted to mention it.

Overall, in the QPL context the expression syntax would be a fallback that should almost never be used, but when you need it, it will be important. I believe it comes from this library rather than from a language-level feature, but I could be wrong.

kstaken commented 3 years ago

The model here, for influence, is database-level function sets. Obviously these are very mature systems with many, many functions, so we're not trying to clone their capabilities in general, but the functions we're developing are intended to serve the same role.

Systems most similar to QPL give some idea of what a function library will "eventually" need to cover.

More mature Relational Databases

Also, since one of our goals is going to be to build a SQL dialect for QPL, we should consider functions in reference to the SQL standard. It's hard to find a simple list of what the standard actually supports, but it appears fairly rudimentary. The PostgreSQL link does make reference to SQL92 functions, and an out-of-date copy of the '92 spec is here: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

peterdemartini commented 3 years ago

After thinking about the changes a bit more, I want to revise my list of suggestions:

Registry

Functions

Naming

I cannot express how vital naming is here, everything from the function names to the arguments. Breaking those later will be VERY difficult; if we don't get them right now, they will stick around forever. The naming should be consistent AND it should be intuitive for humans who do NOT know how to code.

Example 1

Note: this is roughly based on the current column trim function.

export type FieldTransformTrimArgs = {
    char?: string;
};

/**
 * Trim whitespace, or specific character, from the beginning and end of a string
 *
 * @example
 *
 *      trim()
 *          // '     left' => 'left'
 *          // 'right    ' => 'right'
 *          // '  center ' => 'center'
 *          // '         ' => ''
 *      trim(char: 'fast')
 *          // 'fast cars race fast' => ' cars race '
 *      trim(char: '.*')
 *          // '.*.*a regex test.*.*.*.*' => 'a regex test'
 *      trim(char: '\r')
 *          // '\t\r\rexample\r\r' => 'example'
 */
export const trimFieldTransformConfig: FieldTransformConfig<string, string, FieldTransformTrimArgs> = {
    type: TransformType.FIELD_TRANSFORM,
    process_mode: ProcessMode.INDIVIDUAL_VALUES,
    create: (args: FieldTransformTrimArgs, inputType?: DataTypeFieldConfig, parentContext?: Record<string, unknown>) => {
        validateArguments(trimFieldTransformConfig.argument_schema, args);
        // this is where you'd also throw if the input type
        // is not correct
        return trimFP(args.char);
    },
    description: 'Trim whitespace, or specific character, from the beginning and end of a string',
    argument_schema: {
        char: {
            type: FieldType.String,
            description: 'The character to trim'
        }
    },
    requires_parent_context: false,
    accepts: [FieldType.String],
    output_type: (inputType: DataTypeFieldConfig) => {
        if (isNotStringType(inputType.type)) return inputType;
        return { ...inputType, type: FieldType.String };
    }
};
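To make the documented examples concrete, here is a standalone sketch of the trim behavior they imply, where `char` is treated as a set of characters to strip from both ends (an assumption consistent with both the 'fast' and '.*' examples above; `trimChars` is a hypothetical helper, not the data-mate implementation, and edge cases like mixed whitespace may differ):

```typescript
// Strip any character in `chars` (default: whitespace) from both ends.
function trimChars(input: string, chars?: string): string {
    const set = new Set((chars ?? ' \t\r\n').split(''));
    let start = 0;
    let end = input.length;
    while (start < end && set.has(input[start])) start++;
    while (end > start && set.has(input[end - 1])) end--;
    return input.slice(start, end);
}
```

Note how the character-set reading reproduces both 'fast cars race fast' => ' cars race ' and the '.*' example, which is why the argument naming (char vs. chars) deserves the scrutiny discussed above.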
peterdemartini commented 3 years ago

SIDE NOTE: one thing we've learned is that we can't trust validator.js or other dependencies, so we need comprehensive tests of our own. When possible we should submit PRs to fix validator.js or the other dependencies.

peterdemartini commented 3 years ago

Some notes after our call:

Example

function simpleFunctionAdapter(
    fnDef: FunctionDefinition,
    input: Record<string, unknown>[],
    args: Record<string, unknown>,
): Record<string, unknown>[] {
    // this is where the function definition
    // execution behavior will change
    // depending on the process_mode
    return output; // placeholder for the adapted result
}
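The dispatch on process_mode sketched above could look roughly like the following, assuming a minimal function-definition shape with a per-value fn (both the shape and `adapterSketch` are hypothetical simplifications, not the actual adapter):

```typescript
type ProcessMode = 'INDIVIDUAL_VALUES' | 'FULL_VALUES';

interface MinimalFnDef {
    process_mode: ProcessMode;
    fn: (value: unknown) => unknown;
}

// INDIVIDUAL_VALUES maps over arrays and skips nils;
// FULL_VALUES hands the whole input to the function.
function adapterSketch(fnDef: MinimalFnDef) {
    return (input: unknown): unknown => {
        if (fnDef.process_mode === 'INDIVIDUAL_VALUES') {
            if (Array.isArray(input)) {
                return input.map((v) => (v == null ? null : fnDef.fn(v)));
            }
            if (input == null) return null;
        }
        return fnDef.fn(input);
    };
}
```

The open questions below are about exactly this branch: whether the function type (validator vs. transform) should also influence the nil handling.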

@jsnoble please double-check on parent-context use, because if we aren't using it in ts-transforms, we should remove it since it might be premature.

Also I can't remember if we added generic aggregation functions, BUT I'd imagine those are too premature to keep right now. The problem with those is that they might require very context-specific operations.

jsnoble commented 3 years ago

After further discussion with @peterdemartini, we came up with an initial game plan for how the adapters work, but there are many questions about expected behavior.

NOTE: we talked about removing jexl expression support from the extract Field-Transform and moving it into a Record-Transform, as it innately works on the entire object itself.

Questions:

initial spec process_mode:

    /** This indicates that it operates on non-nil individual values */
    INDIVIDUAL_VALUES = 'INDIVIDUAL_VALUES',
    /** This indicates that it operates on entire values, including nulls and arrays */
    FULL_VALUES = 'FULL_VALUES'

This is not flexible enough as is, and it might need to take the Function Definition Type into account to refine its behavior: validators should always be able to work on nil values, while transforms should never be able to. It's still a bit up in the air for aggregators with DataFrames; I need to work further with Peter on that.

list of possible behavior:

Question 1: Would it be explicit enough to take the function type into account along with the process_mode to determine the behavior, or should we be super explicit and require a unique process_mode for each possible scenario regardless of the type?

Question 2: How flexible should the general function adapter be in regard to input (not necessarily the DataFrame adapter, as that is more constrained)?

initial thoughts:

import { isBooleanConfig, functionAdapter } from '../src';

const isBoolean = functionAdapter(isBooleanConfig);

isBoolean(undefined) => false
isBoolean(false) => true
isBoolean('true') => false
isBoolean(true) => true
isBoolean([ true, false]) => [true, true]
isBoolean([ true, null,  false]) => [true, false, true]

const isBooleanWithField = functionAdapter(isBooleanConfig, 'someField');

isBooleanWithField({ someField: true }) => true
isBooleanWithField({ otherField: true }) => false

isBooleanWithField([{ someField: true }, { otherField: true }]) => [true, false]

const isBooleanWithNestedField = functionAdapter(isBooleanConfig, 'some.nested[0].field');
const data = {
    some: {
        nested: [{ field: true }]
    }
};

isBooleanWithNestedField(data) => true

// NOTE that if a nested field path starts with an [] selector, it does not work on the entire array;
// since isBooleanConfig's `process_mode` is set to `INDIVIDUAL_VALUES`, it has to execute on an array of arrays

const isBooleanWithNestedFieldArray = functionAdapter(isBooleanConfig, '[0]some.nested[0].field');

const data = [
    [
        {
            some: {
                nested: [{ field: true }]
            }
        },
        {
            some: {
                nested: [{ field: 'I will be missed' }]
            }
        }
    ],
    [
        {
            some: {
                nested: [{ field: 'true' }]
            }
        },
        {
            some: {
                nested: [{ field: 'I will be missed' }]
            }
        }
    ]
];

isBooleanWithNestedFieldArray(data) => [true, false]

// In order to check all the data, it would have to be flattened, and the field path would drop the leading [] selector
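The simple (single-level) cases above could be captured roughly like this; the `validatorAdapter` name and the treatment of nil values as false are assumptions read off the examples, not the final API, and nested-path selection is omitted:

```typescript
type Validator = (value: unknown) => boolean;

// Validators run on every value, nils included, and return false for them.
function validatorAdapter(validate: Validator, field?: string) {
    const pick = (record: unknown): unknown => {
        if (field == null) return record;
        if (record != null && typeof record === 'object') {
            return (record as Record<string, unknown>)[field];
        }
        return undefined;
    };
    return (input: unknown): boolean | boolean[] => {
        if (Array.isArray(input)) return input.map((v) => validate(pick(v)));
        return validate(pick(input));
    };
}
```

This matches the expected outputs above, e.g. [true, null, false] => [true, false, true] and a missing field evaluating to false.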
kstaken commented 3 years ago

@ciorg do any of the existing ts-transforms rules we run use jexl?

kstaken commented 3 years ago

Is there a reason Field-Aggregations are a separate concept from Field-Transforms? Seems to me an aggregation could just be considered a transform of Array input -> value output.

ciorg commented 3 years ago

no jexl expressions in our ts-transforms rules

jsnoble commented 3 years ago

Is there a reason Field-Aggregations are a separate concept from Field-Transforms? Seems to me an aggregation could just be considered a transform of Array input -> value output.

It can be viewed that way, but it depends on a few distinctions. In my comment about behaviors I mentioned that transformations cannot handle nil values while aggregations possibly could, like what happens in @countField, though it could be different with DataFrames.

kstaken commented 3 years ago

That still seems the same to me. In the case of an array input, the null values are in the array, [1, null, 2, 5, null], not the array input itself, so the handling of those nulls is up to the count operation, not the framework.

kstaken commented 1 month ago

These have been in production for a long time. I think we can close this now. Any issues will get new tickets.