terascope / teraslice

Scalable data processing pipelines in JavaScript
https://terascope.github.io/teraslice/
Apache License 2.0

Data-mate functions discussion #2553

Closed ciorg closed 1 month ago

ciorg commented 3 years ago

Initial thoughts on additional data-mate functions or changes. Related to ticket #2242

There are functions in ts-transforms (isIn, isNumeric, matches, etc.) that are imported from the Validator package but are not available in data-mate. It seems like all validations should draw from one source: any external packages should be integrated at the lowest level first and accessed from there.

Validation additions:

Validation Changes

All the other additional functions I had are found in ts-transforms from the validator module.

kstaken commented 3 years ago

Goals

The goals for these functions are to provide a consistent set of data validation and transform functions usable in all contexts:

  1. data engineering workflows
  2. ts-transforms (goal to phase out)
  3. xLucene
  4. DataFrames
  5. QPL Query directives
  6. QPL Variable directives
  7. QPL Transform directives (future replacement for ts-transforms)

Given this broad suite of use cases, changes have to be cognizant of the requirements of all of them. So an approach like "removing the use of parent context since that is not available in DataFrames" is not correct: breaking the library because of a limitation in one context is problematic.

The correct approach has 2 parts:

  1. DataFrames or other contexts should exclude operations that require things that aren't yet supported.
  2. DataFrames and other contexts should eventually gain the necessary support for the things that are excluded (if it makes sense to do so).
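As a sketch of the first part, a context could filter the shared function registry down to what it can support. The `FnConfig` shape here is a hypothetical minimal stand-in, loosely modeled on the `requires_parent_context` flag that appears in the config example later in this thread:

```typescript
// Hypothetical config shape; only the fields needed for filtering.
interface FnConfig {
    name: string;
    requires_parent_context: boolean;
}

// A DataFrame context would exclude anything needing parent context
// until that support exists.
function availableInDataFrames(fns: FnConfig[]): FnConfig[] {
    return fns.filter((fn) => !fn.requires_parent_context);
}
```

This keeps the library itself intact; each context opts out of what it cannot yet run.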

Typing

I think one of the key questions is going to be how strongly typed this library is. A large part of the reason for this library is to help validate the application of types, so it's worth considering whether we make that more explicit; it's definitely going to be an issue as we push further into the QPL context, where types are required.

Primitive functions

For DataFrames and QPL we're also going to need to define a suite of common base functions to manipulate particular types of data. Functionality that exists at the basic level in JavaScript, like the String-related functions, will need to be included in this library and adapted for the chaining workflow. This will be the case for all basic types.

What we surface here will be a balancing act: we don't want to expose every core function available in JavaScript, but we do want to expose enough that users have the power to solve the majority of data-level problems they will face.

Each category above will need to be considered in depth separately. Primary usage will be QPL so we'll want to focus on how they function and combine in that context. We'll also want to consider namespacing and future extensions.
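As a rough illustration of what "adapted for the chaining workflow" could mean for string primitives, here is a minimal sketch; the `chain` helper and the `normalize` example are hypothetical, not part of data-mate:

```typescript
type FieldFn = (value: string) => string;

// Compose single-value field functions left to right.
function chain(...fns: FieldFn[]): FieldFn {
    return (value) => fns.reduce((acc, fn) => fn(acc), value);
}

// Surfacing JavaScript's String basics as chainable field functions:
const normalize = chain(
    (s) => s.trim(),
    (s) => s.toLowerCase(),
);

// normalize('  Hello ') returns 'hello'
```

The open design question is which of these primitives to surface and under what names, not how to compose them.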

Selector syntax

On transform-type functions that may be operating on nested JSON data, we're going to want a consistent way to select and flatten the data. This may be related to the Expression syntax, or it may be something different; I'm not sure, but it is something we need to consider. The purpose is simply to select portions of the JSON tree for further processing.

In the reporting prototype code I had used this library: https://www.npmjs.com/package/@gizt/selector

Obviously in some scenarios we also use xLucene as a selector syntax, but it's really more of an expression language, as it doesn't have a clean ability to reach deep into JSON docs (one of the things we want to solve with xLucene integration with this lib). So you might be able to say something like select(path.sub.array[0].value): 100 AND select(path.sub2.array[0].value): add(select(path.sub.array[0].value), 100). Obviously a contrived example, but as we increase usage of xLucene in larger contexts like QPL we'll want more power to control it.

We also need to consider this in relation to regex, as it is technically a selector syntax too, although fundamentally different in usage and intention.
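For reference, the core of a dotted-path selector can be sketched in a few lines. This is only an illustration of the concept (single-value lookups, no wildcards), not the @gizt/selector API:

```typescript
// Resolve paths like 'path.sub.array[0].value' against nested data.
// Array indices are normalized into plain path segments first.
function select(path: string, data: unknown): unknown {
    const segments = path
        .replace(/\[(\d+)\]/g, '.$1')
        .split('.')
        .filter(Boolean);
    return segments.reduce<any>(
        (acc, key) => (acc == null ? undefined : acc[key]),
        data,
    );
}
```

A real selector syntax would add wildcards, slices, and multi-match flattening on top of this kind of traversal.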

Expression syntax

This is related to the concept of Selectors and to xLucene but brings in conditionals and a little more scripting type logic without exposing a full scripting language. We have looked to Jexl for this so far: https://www.npmjs.com/package/jexl

I'm not entirely enthused about the idea of having selectors, xLucene, regex and Jexl which opens the question if maybe xLucene in general should be extended to support all cases directly. That's outside the scope of this library but does affect how we approach some things here so I wanted to mention it.

Overall, in the QPL context the expression syntax would be a fallback that should almost never be used, but when you need it, it will be important. I believe it comes from this library rather than from a language-level feature, but I could be wrong.

kstaken commented 3 years ago

The model here, for influence, is database-level function sets. Obviously these are very mature systems with many, many functions, so we're not trying to clone their capabilities in general, but the functions we're developing are intended to serve the same role.

Systems most similar to QPL give some idea of what a function library will "eventually" need to cover.

More mature Relational Databases

Also, since one of our goals is going to be to build a SQL dialect for QPL, we should consider functions in reference to the SQL standard. It's hard to find a simple list of what the standard actually supports, but it appears fairly rudimentary. The PostgreSQL link does make reference to SQL92 functions, and an out-of-date copy of the '92 spec is here: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

peterdemartini commented 3 years ago

After thinking about the changes a bit more, I want to revise my list of suggestions:

Registry

Functions

Naming

I cannot express how vital naming is here, everything from the function names to the arguments. Breaking those later will be VERY difficult; if we don't get them right now, they will stick around forever. The naming should be consistent AND it should be intuitive for humans who do NOT know how to code.

Example 1

Note: this is roughly based on the current column trim function.

export type FieldTransformTrimArgs = {
    char?: string;
};

/**
 * Trim whitespace, or specific character, from the beginning and end of a string
 *
 * @example
 *
 *      trim()
 *          // '     left' => 'left'
 *          // 'right    ' => 'right'
 *          // '  center ' => 'center'
 *          // '         ' => ''
 *      trim(char: 'fast')
 *          // 'fast cars race fast' => ' cars race '
 *      trim(char: '.*')
 *          // '.*.*a regex test.*.*.*.*' => 'a regex test'
 *      trim(char: '\r')
 *          // '\t\r\rexample\r\r' => 'example'
 */
export const trimFieldTransformConfig: FieldTransformConfig<string, string, FieldTransformTrimArgs> = {
    type: TransformType.FIELD_TRANSFORM,
    process_mode: ProcessMode.INDIVIDUAL_VALUES,
    create: (args: FieldTransformTrimArgs, inputType?: DataTypeFieldConfig, parentContext?: Record<string, unknown>) => {
        validateArguments(trimFieldTransformConfig.argument_schema, args);
        // this is where you'd also throw if the input type
        // is not correct
        return trimFP(args.char);
    },
    description: 'Trim whitespace, or specific character, from the beginning and end of a string',
    argument_schema: {
        char: {
            type: FieldType.String,
            description: 'The character to trim'
        }
    },
    requires_parent_context: false,
    accepts: [FieldType.String],
    output_type: (inputType: DataTypeFieldConfig) => {
        if (isNotStringType(inputType.type)) return inputType;
        return { ...inputType, type: FieldType.String };
    }
};
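To make the documented examples concrete, here is a standalone sketch of the trim behavior they imply, where `char` is treated as a set of characters to strip from both ends (an assumption consistent with both the 'fast' and '.*' examples above; `trimChars` is a hypothetical helper, not the data-mate implementation, and edge cases like mixed whitespace may differ):

```typescript
// Strip any character in `chars` (default: whitespace) from both ends.
function trimChars(input: string, chars?: string): string {
    const set = new Set((chars ?? ' \t\r\n').split(''));
    let start = 0;
    let end = input.length;
    while (start < end && set.has(input[start])) start++;
    while (end > start && set.has(input[end - 1])) end--;
    return input.slice(start, end);
}
```

Note how the character-set reading reproduces both 'fast cars race fast' => ' cars race ' and the '.*' example, which is why the argument naming (char vs. chars) deserves the scrutiny discussed above.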
peterdemartini commented 3 years ago

SIDE NOTE: one thing we've learned is that we can't trust validator.js or other dependencies, so we need comprehensive tests of our own. When possible we should submit PRs to fix validator.js or the other dependencies.

peterdemartini commented 3 years ago

Some notes after our call:

Example

function simpleFunctionAdapter(
    fnDef: FunctionDefinition,
    input: Record<string, unknown>[],
    args: Record<string, unknown>,
): Record<string, unknown>[] {
    // this is where the function definition
    // execution behavior will change
    // depending on the process_mode
    return output; // placeholder for the adapted result
}
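The dispatch on process_mode sketched above could look roughly like the following, assuming a minimal function-definition shape with a per-value fn (both the shape and `adapterSketch` are hypothetical simplifications, not the actual adapter):

```typescript
type ProcessMode = 'INDIVIDUAL_VALUES' | 'FULL_VALUES';

interface MinimalFnDef {
    process_mode: ProcessMode;
    fn: (value: unknown) => unknown;
}

// INDIVIDUAL_VALUES maps over arrays and skips nils;
// FULL_VALUES hands the whole input to the function.
function adapterSketch(fnDef: MinimalFnDef) {
    return (input: unknown): unknown => {
        if (fnDef.process_mode === 'INDIVIDUAL_VALUES') {
            if (Array.isArray(input)) {
                return input.map((v) => (v == null ? null : fnDef.fn(v)));
            }
            if (input == null) return null;
        }
        return fnDef.fn(input);
    };
}
```

The open questions below are about exactly this branch: whether the function type (validator vs. transform) should also influence the nil handling.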

@jsnoble please double-check on parent-context use, because if we aren't using it in ts-transforms, we should remove it since it might be premature.

Also I can't remember if we added generic aggregation functions, BUT I'd imagine those are too premature to keep right now. The problem with those is that they might require very context-specific operations.

jsnoble commented 3 years ago

After further discussion with @peterdemartini, we came up with an initial game plan for how the adapters work, but there are many questions about expected behavior.

NOTE: we talked about removing jexl expression support from the extract Field-Transform and moving it into a Record-Transform, as it innately works on the entire object itself.

Questions:

initial spec process_mode:

    /** This indicates that it operates on non-nil individual values */
    INDIVIDUAL_VALUES = 'INDIVIDUAL_VALUES',
    /** This indicates that it operates on entire values, including nulls and arrays */
    FULL_VALUES = 'FULL_VALUES'

This is not flexible enough as is, and it might need to take the Function Definition Type into account to refine its behavior: validators should always be able to work on nil values, while transforms should never be able to. It's still a bit up in the air for aggregators with DataFrames; I need to work further with Peter on that.

list of possible behavior:

Question 1: Would it be explicit enough to take the function type into account along with the process_mode to determine the behavior, or should we be super explicit and require a unique process_mode for each possible scenario regardless of the type?

Question 2: How flexible should the general function adapter be in regard to input (not necessarily the DataFrame adapter, as that is more constrained)?

initial thoughts:

import { isBooleanConfig, functionAdapter } from '../src';

const isBoolean = functionAdapter(isBooleanConfig);

isBoolean(undefined) => false
isBoolean(false) => true
isBoolean('true') => false
isBoolean(true) => true
isBoolean([ true, false]) => [true, true]
isBoolean([ true, null,  false]) => [true, false, true]

const isBooleanWithField = functionAdapter(isBooleanConfig, 'someField');

isBooleanWithField({ someField: true }) => true
isBooleanWithField({ otherField: true }) => false

isBooleanWithField([{ someField: true }, { otherField: true }]) => [true, false]

const isBooleanWithNestedField = functionAdapter(isBooleanConfig, 'some.nested[0].field');
const data = {
    some: {
        nested: [{ field: true }]
    }
};

isBooleanWithNestedField(data) => true

// NOTE that if a nested field path starts with an [] selector, it does not work on the entire array;
// since isBooleanConfig's `process_mode` is set to `INDIVIDUAL_VALUES`, it has to execute on an array of arrays

const isBooleanWithNestedFieldArray = functionAdapter(isBooleanConfig, '[0]some.nested[0].field');

const data = [
    [
        {
            some: {
                nested: [{ field: true }]
            }
        },
        {
            some: {
                nested: [{ field: 'I will be missed' }]
            }
        }
    ],
    [
        {
            some: {
                nested: [{ field: 'true' }]
            }
        },
        {
            some: {
                nested: [{ field: 'I will be missed' }]
            }
        }
    ]
];

isBooleanWithNestedFieldArray(data) => [true, false]

// In order to check all the data, it would have to be flattened, and the field path would drop the leading [] selector
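The simple (single-level) cases above could be captured roughly like this; the `validatorAdapter` name and the treatment of nil values as false are assumptions read off the examples, not the final API, and nested-path selection is omitted:

```typescript
type Validator = (value: unknown) => boolean;

// Validators run on every value, nils included, and return false for them.
function validatorAdapter(validate: Validator, field?: string) {
    const pick = (record: unknown): unknown => {
        if (field == null) return record;
        if (record != null && typeof record === 'object') {
            return (record as Record<string, unknown>)[field];
        }
        return undefined;
    };
    return (input: unknown): boolean | boolean[] => {
        if (Array.isArray(input)) return input.map((v) => validate(pick(v)));
        return validate(pick(input));
    };
}
```

This matches the expected outputs above, e.g. [true, null, false] => [true, false, true] and a missing field evaluating to false.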
kstaken commented 3 years ago

@ciorg do any of the existing ts-transforms rules we run use jexl?

kstaken commented 3 years ago

Is there a reason Field-Aggregations are a separate concept from Field-Transforms? Seems to me an aggregation could just be considered a transform of Array input -> value output.

ciorg commented 3 years ago

no jexl expressions in our ts-transforms rules

jsnoble commented 3 years ago

Is there a reason Field-Aggregations are a separate concept from Field-Transforms? Seems to me an aggregation could just be considered a transform of Array input -> value output.

It can be viewed that way, but it depends on a few distinctions. In my comment about behaviors I mentioned that transformations cannot handle nil values while aggregations possibly could, like what happens in @countField, though it could be different with DataFrames.

kstaken commented 3 years ago

That still seems the same to me. In the case of an array input, the null values are in the array, [1, null, 2, 5, null], not the array input itself, so the handling of those nulls is up to the count operation, not the framework.

kstaken commented 1 month ago

These have been in production for a long time. I think we can close this now. Any issues will get new tickets.