The goal of these functions is to provide a consistent set of data validation and transform functions usable in all contexts.
Given this broad suite of use cases, changes have to be cognizant of the requirements for all of them. Something like "Removing the use of parent context since that is not available in DataFrames" is not the correct approach: breaking the library because of a limitation in one context is problematic.
The correct approach has 2 parts:
I think one of the key questions is going to be how strongly typed this library is. A big part of the reason for this library is to help validate the application of types, so it's worth considering whether we make that more explicit; it's definitely going to be an issue as we push further into the QPL context, where types are required.
For DataFrames and QPL we're also going to need to define a suite of common base functions to manipulate particular types of data. This means things that exist at the basic level in JavaScript, like String-related functions, will need to be included in this library and adapted for the chaining workflow. This will be the case for all basic types:
What we surface here will be a balancing act: we don't want to expose every core function available in JavaScript, but we do want to surface enough that users have the power to solve the majority of data-level problems they will face.
Each category above will need to be considered in depth separately. Primary usage will be QPL so we'll want to focus on how they function and combine in that context. We'll also want to consider namespacing and future extensions.
On transform-type functions that may be operating on nested JSON data, we're going to want a consistent way to select and flatten the data. This may be related to the Expression syntax or it may be something different; I'm not sure, but it is something we need to consider. The purpose is simply to select portions of the JSON tree for further processing.
In the reporting prototype code I had used this library: https://www.npmjs.com/package/@gizt/selector
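For illustration, here is a minimal sketch of what dot/bracket path selection over nested JSON looks like. This is a hand-rolled example, not the @gizt/selector API, and the `select` helper is hypothetical:

```ts
type JSONValue = string | number | boolean | null | JSONValue[] | { [key: string]: JSONValue };

// Walks a 'path.sub.array[0].value' style path through a nested document.
function select(path: string, doc: JSONValue): unknown {
    const parts = path.split(/[.[\]]+/).filter(Boolean);
    let current: any = doc;
    for (const part of parts) {
        if (current == null) return undefined;
        current = current[part];
    }
    return current;
}

select('path.sub.array[0].value', {
    path: { sub: { array: [{ value: 100 }] } }
}); // => 100
```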
Obviously in some scenarios we also use xLucene as a selector syntax, but it's really more of an expression language, as it doesn't have a clean ability to access deep into JSON docs (one of the things we want to solve with xLucene integration with this lib). So you might be able to say something like `select(path.sub.array[0].value): 100 AND select(path.sub2.array[0].value): add(select(path.sub.array[0].value), 100)`. Obviously a contrived example, but as we increase usage of xLucene in larger contexts like QPL we'll want more power to control it.
We also need to consider this in relation to regex as it is also technically a selector syntax although fundamentally different in usage and intention.
This is related to the concept of Selectors and to xLucene but brings in conditionals and a little more scripting-type logic without exposing a full scripting language. We have looked at Jexl for this so far: https://www.npmjs.com/package/jexl
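For reference, basic Jexl usage looks roughly like this (the `eval` call reflects the package's documented API as I understand it; the expression itself is just illustrative):

```ts
import jexl from 'jexl';

const context = {
    name: { first: 'Jane', last: 'Doe' },
    age: 42,
};

// Conditionals and light scripting logic without exposing a full language
const result = await jexl.eval('age > 21 ? name.first : "minor"', context);
// result => 'Jane'
```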
I'm not entirely enthused about the idea of having selectors, xLucene, regex and Jexl, which raises the question of whether xLucene in general should be extended to support all cases directly. That's outside the scope of this library but does affect how we approach some things here, so I wanted to mention it.
Overall, the expression syntax in the QPL context would be a fallback that should almost never be used, but when you need it, it will be important. I believe it comes from this library rather than being a language-level feature, but I could be wrong.
The model here, for influence, is database-level function sets. Obviously these are very mature systems with many, many functions, so we're not trying to clone these capabilities in general, but the functions we're developing are intended to serve the same role.

Systems most similar to QPL, and some idea of what a function library will "eventually" need to cover:

- More mature Relational Databases
Also, since one of our goals is to build a SQL dialect for QPL, we should consider functions in reference to the SQL standard. It's hard to find a simple list of what the standard actually supports, but it appears fairly rudimentary. The PostgreSQL link does make reference to SQL92 functions, and an out-of-date copy of the '92 spec is here: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
After thinking about the changes a bit more, I want to revise my list of suggestions:
- Rename `config` to `argument_schema` or `argument_config`, to be more explicit about what type of config it represents
- Add `requires_parent_context: boolean` to the config
- Add `process_mode: ProcessMode` to the config (only available to Field level functions), which can be used to determine the iteration method: `INDIVIDUAL_VALUES` (non-nil and unwrapped from arrays) or `ALL_VALUES` (all values, nil and array values included)
- Add `accepts: readonly FieldType[]` to the config; this will be used for defining the acceptable input field types.
  - Given the `String` field type, it will apply to all field types that are represented as a string
  - Given the `Number` field type, it will apply to all field types that are represented as a number
- Change `output_type: FieldType` to `output_type: DataTypeFieldConfig | ((inputFieldType: DataTypeFieldConfig) => DataTypeFieldConfig)`, which will allow making the return type dynamic. Returning the whole config will allow us to change `array` or other field type properties.
- `O(1)`, not `O(n)`. See example 1.
- Combine `parseDate` and `formatDate` into a single `toDate`
- `BigInt`
- `DateValue`
- `IPValue`
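To make the suggested shape concrete, here is a rough sketch of what the revised config could look like. The property names come from the list above; the interface name and the placeholder types are assumptions, not a settled API:

```ts
// Sketch only, not a settled API. FieldType / DataTypeFieldConfig are stand-ins
// for the existing types; `array` is included since the list above mentions it.
type FieldType = string;
interface DataTypeFieldConfig { type: FieldType; array?: boolean }

enum ProcessMode {
    /** operate on non-nil values, unwrapped from arrays */
    INDIVIDUAL_VALUES = 'INDIVIDUAL_VALUES',
    /** operate on all values, nil and array values included */
    ALL_VALUES = 'ALL_VALUES',
}

// Hypothetical name for the per-function config shape suggested above
interface FieldFunctionConfig<Args> {
    /** renamed from `config` to be explicit about what it represents */
    argument_schema: { [K in keyof Args]: { type: FieldType; description?: string } };
    requires_parent_context: boolean;
    /** only available to Field level functions */
    process_mode: ProcessMode;
    /** acceptable input field types */
    accepts: readonly FieldType[];
    /** a static type, or a function of the input type, for dynamic return types */
    output_type: DataTypeFieldConfig | ((inputFieldType: DataTypeFieldConfig) => DataTypeFieldConfig);
}
```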
I cannot express how vital naming is here, everything from the function naming to the arguments. Breaking those will be VERY difficult; if we don't get it right now, it will stick around forever. There should be consistency in the naming AND it should be intuitive for humans who do NOT know how to code.
Note this is roughly based on the current column trim function.
export type FieldTransformTrimArgs = {
    char?: string;
};
/**
* Trim whitespace, or specific character, from the beginning and end of a string
*
* @example
*
* trim()
* // ' left' => 'left'
* // 'right ' => 'right'
* // ' center ' => 'center'
* // ' ' => ''
* trim(char: 'fast')
* // 'fast cars race fast' => ' cars race '
* trim(char: '.*')
* // '.*.*a regex test.*.*.*.*' => 'a regex test'
* trim(char: '\r')
* // '\t\r\rexample\r\r' => 'example'
*/
export const trimFieldTransformConfig: FieldTransformConfig<string, string, FieldTransformTrimArgs> = {
    type: TransformType.FIELD_TRANSFORM,
    process_mode: ProcessMode.INDIVIDUAL_VALUES,
    create: (args: FieldTransformTrimArgs, inputType?: DataTypeFieldConfig, parentContext?: Record<string, unknown>) => {
        validateArguments(trimFieldTransformConfig.argument_schema, args);
        // this is where you'd also throw if the input type
        // is not correct
        return trimFP(args.char);
    },
    description: 'Trim whitespace, or specific character, from the beginning and end of a string',
    argument_schema: {
        char: {
            type: FieldType.String,
            description: 'The character to trim'
        }
    },
    requires_parent_context: false,
    accepts: [FieldType.String],
    output_type: (inputType: DataTypeFieldConfig) => {
        // non-string input types pass through unchanged; string-like types are normalized to String
        if (isNotStringType(inputType.type)) return inputType;
        return { ...inputType, type: FieldType.String };
    }
};
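For context, `trimFP` isn't defined in this snippet; here is one possible hand-rolled version plus usage. This is an assumption about its behavior for illustration, not the actual implementation:

```ts
// One possible interpretation of trimFP: returns a function that strips the given
// character sequence (or whitespace by default) from both ends of a string.
function trimFP(char?: string) {
    return (input: string): string => {
        if (char == null || char === '') return input.trim();
        let start = 0;
        let end = input.length;
        while (input.startsWith(char, start)) start += char.length;
        while (end - char.length >= start && input.endsWith(char, end)) end -= char.length;
        return input.slice(start, end);
    };
}

// Usage via the config's create function
const trim = trimFieldTransformConfig.create({});
trim('  center  ');   // => 'center'

const trimFast = trimFieldTransformConfig.create({ char: 'fast' });
trimFast('fast cars race fast'); // => ' cars race '
```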
SIDE NOTE: I think one thing we've learned is that we can't trust validator.js or other dependencies, so we need comprehensive tests of our own. When possible we should submit a PR to fix validator.js or the other dependencies.
Some notes after our call:
function simpleFunctionAdapter(
    fnDef: FunctionDefinition,
    input: Record<string, unknown>[],
    args: Record<string, unknown>,
): Record<string, unknown>[] {
    const output: Record<string, unknown>[] = [];
    // this is where the function definition
    // execution behavior will change
    // depending on the process_mode
    return output;
}
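As a discussion aid, here is a rough sketch of what that process_mode-dependent behavior could look like for a single field value; the helper name and exact rules are assumptions, not the final adapter design:

```ts
// Hypothetical helper: applies a field function to one value according to process_mode.
function applyWithProcessMode(
    fn: (value: unknown) => unknown,
    processMode: ProcessMode,
    value: unknown,
): unknown {
    if (processMode === ProcessMode.INDIVIDUAL_VALUES) {
        // skip nil values and unwrap arrays, calling fn on each non-nil item
        if (value == null) return value;
        if (Array.isArray(value)) return value.map((v) => (v == null ? v : fn(v)));
        return fn(value);
    }
    // otherwise hand the entire value (nil and arrays included) to fn
    return fn(value);
}
```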
@jsnoble double check on parent context use, because if we aren't using it in ts-transforms we should remove it, since it might be premature.
Also, I can't remember if we added generic aggregation functions, BUT I'd imagine those are too premature right now to keep. The problem with those is that they might require very context-specific operations.
After further discussion with @peterdemartini, we came up with an initial game plan on how adapters work, but there are many questions about expected behavior.
NOTE: we talked about removing jexl expressions support from the extract Field-Transform and sticking it into a Record-Transform, as it innately works on the entire object itself.
Questions:
initial spec for `process_mode`:

/** This indicates that it operates on non-nil individual values */
INDIVIDUAL_VALUES = 'INDIVIDUAL_VALUES',
/** This indicates that it operates on entire values, including nulls and arrays */
FULL_VALUES = 'FULL_VALUES'
This is not flexible enough as is, and might need to take the Function Definition Type into account to refine its behavior. Validators should always be able to work on nil values, while transforms should never be able to. It's still a little up in the air regarding aggregators with DataFrames; I need to work further with Peter on that.
list of possible behavior:

- Field-Validators with `INDIVIDUAL_VALUES` should accept nil, individual values, or call fn on each item in a list
- Field-Validators with `FULL_VALUES` should accept nil, or the whole value as is; used for validators like isObject or isArray
- Field-Transforms with `INDIVIDUAL_VALUES` should NOT accept nil, but call on individual values, or call fn on each item in a list with NON-NIL values, or it will throw
- Field-Transforms with `FULL_VALUES` should NOT accept nil, but call on the entire value, maybe for something like @toJSON
- Field-Aggregations only work with `FULL_VALUES`, can accept nil, but call on the entire value, like @max
- Record-Validators only work with `FULL_VALUES`, and can accept nil values as well
- Record-Transforms only work with `FULL_VALUES`, and should NOT accept nil values
- Record-Aggregations only work with `FULL_VALUES`, but require input to be an array of objects

Question 1: Would it be explicit enough to take into account the function type along with the process_mode to determine the behavior, or should we be super explicit and require a unique process_mode for each possible scenario despite the type?
Question 2: How flexible should the general function adapter be in regards to input (not necessarily the DataFrame adapter, as that is more constrained)?
initial thoughts:
import { isBooleanConfig, functionAdapter } from '../src';

const isBoolean = functionAdapter(isBooleanConfig);

isBoolean(undefined) => false
isBoolean(false) => true
isBoolean('true') => false
isBoolean(true) => true
isBoolean([true, false]) => [true, true]
isBoolean([true, null, false]) => [true, false, true]

const isBooleanWithField = functionAdapter(isBooleanConfig, 'someField');

isBooleanWithField({ someField: true }) => true
isBooleanWithField({ otherField: true }) => false
isBooleanWithField([{ someField: true }, { otherField: true }]) => [true, false]
const isBooleanWithNestedField = functionAdapter(isBooleanConfig, 'some.nested[0].field');

const data = {
    some: {
        nested: [{ field: true }]
    }
};

isBooleanWithNestedField(data) => true

// NOTE that if a nested field starts off with an [] selector, it does not work on the entire array; since isBooleanConfig's
// `process_mode` is set to `INDIVIDUAL_VALUES`, it has to execute on an array of arrays
const isBooleanWithNestedFieldArray = functionAdapter(isBooleanConfig, '[0]some.nested[0].field');
const data = [
    [
        {
            some: {
                nested: [{ field: true }]
            }
        },
        {
            some: {
                nested: [{ field: 'I will be missed' }]
            }
        }
    ],
    [
        {
            some: {
                nested: [{ field: 'true' }]
            }
        },
        {
            some: {
                nested: [{ field: 'I will be missed' }]
            }
        }
    ]
];

isBooleanWithNestedFieldArray(data) => [true, false]
// In order to check all the data, it would have to be flattened and the first [] selector taken off the field path
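One way the function type could combine with process_mode to settle nil handling (Question 1's first option). The enum and helper below are hypothetical sketches for discussion, not the actual definitions:

```ts
// Hypothetical: derive nil handling from the function definition type,
// per the behavior list earlier in this thread.
enum FunctionDefinitionType {
    FIELD_VALIDATION = 'FIELD_VALIDATION',
    FIELD_TRANSFORM = 'FIELD_TRANSFORM',
    FIELD_AGGREGATION = 'FIELD_AGGREGATION',
    RECORD_VALIDATION = 'RECORD_VALIDATION',
    RECORD_TRANSFORM = 'RECORD_TRANSFORM',
    RECORD_AGGREGATION = 'RECORD_AGGREGATION',
}

function acceptsNil(type: FunctionDefinitionType): boolean {
    switch (type) {
        // validators and field aggregations can be handed nil values
        case FunctionDefinitionType.FIELD_VALIDATION:
        case FunctionDefinitionType.RECORD_VALIDATION:
        case FunctionDefinitionType.FIELD_AGGREGATION:
            return true;
        // transforms (field or record) should never receive nil
        default:
            return false;
    }
}
```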
@ciorg do any of the existing ts-transforms rules we run use jexl?
Is there a reason `Field-Aggregations` are a separate concept from `Field-Transforms`? Seems to me an aggregation could just be considered a transform of Array input -> value output.
no jexl expressions in our ts-transforms rules
> Is there a reason `Field-Aggregations` are a separate concept from `Field-Transforms`? Seems to me an aggregation could just be considered a transform of Array input -> value output.
It can be viewed that way, but it could depend on a few distinctions. In my comment about behaviors I mentioned that transformations cannot handle nil values while aggregations possibly could, like what happens in @countField, though it could be different with DataFrames.
That still seems the same to me. In the case of an input array, the null values are in the array `[1, null, 2, 5, null]`, not the array input itself, so the handling of those nulls is up to the count operation, not the framework.
These have been in production for a long time. I think we can close this now. Any issues will get new tickets.
Initial thoughts on additional data-mate functions or changes. Related to ticket #2242
There are functions in ts-transforms (`isIn`, `isNumeric`, `matches`, etc.) that are imported from the Validator package but are not available in data-mate. It seems like all validations should draw from one source, and any external packages should go to the lowest level first and be accessed from there.

Validation additions:
Validation Changes
`bob@bob@bob.com`. This isn't the functionality I would expect.

All the other additional functions I had are found in ts-transforms from the validator module.