jvalue / jayvee

Jayvee is a domain-specific language and runtime for automated processing of data pipelines
https://jvalue.github.io/jayvee/

feat: enable transforming of value types #557

Status: Closed (TungstnBallon closed 4 months ago)

TungstnBallon commented 5 months ago

This PR allows parsing text into the built-in primitive value types decimal, integer, and boolean.

Example file:

pipeline CarsPipeline {

    CarsExtractor -> CarsTextFileInterpreter;

    CarsTextFileInterpreter
        -> CarsCSVInterpreter
        -> NameHeaderWriter
        -> CarsTableInterpreter
        -> DispTransform
        -> HpTransform
        -> CarsLoader;

    block CarsExtractor oftype HttpExtractor {
        url: "https://gist.githubusercontent.com/noamross/e5d3e859aa0c794be10b/raw/b999fb4425b54c63cab088c0ce2c0d6ce961a563/cars.csv";
    }

    block CarsTextFileInterpreter oftype TextFileInterpreter { }

    block CarsCSVInterpreter oftype CSVInterpreter {
        enclosing: '"';
    }

    block NameHeaderWriter oftype CellWriter {
        at: cell A1;

        write: ["name"];
    }

    block CarsTableInterpreter oftype TableInterpreter {
        header: true;
        columns: [
            "name" oftype text,
            "mpg" oftype decimal,
            "cyl" oftype integer,
            "disp" oftype text,
            "hp" oftype text,
            "drat" oftype decimal,
            "wt" oftype decimal,
            "qsec" oftype decimal,
            "vs" oftype integer,
            "am" oftype integer,
            "gear" oftype integer,
            "carb" oftype integer
        ];
    }

    transform parseDisp {
        from t oftype text;
        to d oftype decimal;
        d: asDecimal t;
    }

    block DispTransform oftype TableTransformer {
        inputColumns: ["disp"];
        outputColumn: "disp";
        use: parseDisp;
    }

    transform parseHp {
        from t oftype text;
        to i oftype integer;
        i: asInteger t;
    }

    block HpTransform oftype TableTransformer {
        inputColumns: ["hp"];
        outputColumn: "hp";
        use: parseHp;
    }

    block CarsLoader oftype SQLiteLoader {
        table: "Cars";
        file: "./cars.sqlite";
    }
}

closes #543
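
For readers unfamiliar with what `asDecimal` and `asInteger` do in the example above, here is a rough TypeScript sketch of strict text-to-primitive parsing. It is not Jayvee's actual implementation: the function names `parseDecimal`, `parseInteger`, and `parseBoolean`, and the exact text formats they accept, are assumptions made for illustration only.

```typescript
// Hypothetical helpers, NOT Jayvee's real implementation.
// Each returns undefined when the text cannot be parsed,
// so the caller can decide what to do with the row.

function parseDecimal(text: string): number | undefined {
  const trimmed = text.trim();
  // Assumed format: optional sign, digits, optional fractional part.
  if (!/^[+-]?(\d+\.?\d*|\.\d+)$/.test(trimmed)) {
    return undefined;
  }
  return Number.parseFloat(trimmed);
}

function parseInteger(text: string): number | undefined {
  const trimmed = text.trim();
  // Assumed format: optional sign followed by digits only.
  if (!/^[+-]?\d+$/.test(trimmed)) {
    return undefined;
  }
  return Number.parseInt(trimmed, 10);
}

function parseBoolean(text: string): boolean | undefined {
  // Assumed format: the words "true" / "false", case-insensitive.
  const lowered = text.trim().toLowerCase();
  if (lowered === 'true') return true;
  if (lowered === 'false') return false;
  return undefined;
}
```

Returning `undefined` instead of throwing keeps the decision about an unparseable cell (skip the row, log, abort) with the caller.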

TungstnBallon commented 5 months ago

> Cool, nice work!

Thanks :)

> Have you tried parsing something unparseable? What is the current behavior, and does the user get an understandable log?

Rows where parsing fails are excluded from the resulting table. The error is only visible with the -d flag, e.g.:

> nx run interpreter:run -d example/parse.jv

[CarsPipeline] Overview:
    Blocks (9 blocks with 2 pipes):
    -> OCarsExtractor (LocalFileExtractor)
        -> CarsTextFileInterpreter (TextFileInterpreter)
            -> CarsCSVInterpreter (CSVInterpreter)
                -> NameHeaderWriter (CellWriter)
                    -> CarsTableInterpreter (TableInterpreter)
                        -> DispTransform (TableTransformer)
                            -> HpTransform (TableTransformer)
                                -> CarsLoader (SQLiteLoader)

    [OCarsExtractor] Successfully extraced file /home/jonas/Downloads/parse.csv
    [OCarsExtractor] Execution duration: 1 ms.
    [CarsTextFileInterpreter] Decoding file content using encoding "utf-8"
    [CarsTextFileInterpreter] Splitting lines using line break /\r?\n/
    [CarsTextFileInterpreter] Lines were split successfully, the resulting text file has 33 lines
    [CarsTextFileInterpreter] Execution duration: 1 ms.
    [CarsCSVInterpreter] Parsing raw data as CSV using delimiter ","
    [CarsCSVInterpreter] Parsing raw data as CSV-sheet successful
    [CarsCSVInterpreter] Execution duration: 10 ms.
    [NameHeaderWriter] Writing "name" at cell A1
    [NameHeaderWriter] Execution duration: 1 ms.
    [CarsTableInterpreter] Matching header with provided column names
    [CarsTableInterpreter] Validating 32 row(s) according to the column types
    [CarsTableInterpreter] Validation completed, the resulting table has 32 row(s) and 12 column(s)
    [CarsTableInterpreter] Execution duration: 1 ms.
    [DispTransform] Column "disp" will be overwritten
    [DispTransform] Column "disp" will change its type from text to decimal
        [parseDisp] Invalid value in row 1: "NaN" does not match the type decimal
    [DispTransform] Execution duration: 1 ms.
    [HpTransform] Column "hp" will be overwritten
    [HpTransform] Column "hp" will change its type from text to integer
    [HpTransform] Execution duration: 1 ms.
    [CarsLoader] Opening database file ./cars.sqlite
    [CarsLoader] Dropping previous table "Cars" if it exists
    [CarsLoader] Creating table "Cars"
    [CarsLoader] Inserting 31 row(s) into table "Cars"
    [CarsLoader] The data was successfully loaded into the database
    [CarsLoader] Execution duration: 13 ms.
[CarsPipeline] Execution duration: 30 ms.

IMO the interpreter shouldn't crash in this case, but a more visible error is necessary. I don't really know how to do this though, so some pointers would be welcome.
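
To make the behavior described above concrete, here is a minimal TypeScript sketch of a column transform that drops rows whose value cannot be parsed and collects one message per dropped row. It does not reflect Jayvee's internals: `transformColumn`, `toDecimal`, the `Row` type, and the 1-based row numbering are all assumptions for illustration.

```typescript
// Hypothetical sketch, NOT Jayvee's real implementation.
type Row = Record<string, string | number>;

interface TransformResult {
  rows: Row[];        // rows that were transformed successfully
  messages: string[]; // one message per dropped row
}

// Toy parser: returns undefined for anything that is not a finite number.
function toDecimal(text: string): number | undefined {
  const value = Number(text);
  return text.trim() !== '' && Number.isFinite(value) ? value : undefined;
}

function transformColumn(
  rows: Row[],
  column: string,
  parse: (text: string) => number | undefined,
): TransformResult {
  const kept: Row[] = [];
  const messages: string[] = [];
  rows.forEach((row, index) => {
    const raw = String(row[column]);
    const parsed = parse(raw);
    if (parsed === undefined) {
      // Drop the row instead of aborting the whole pipeline.
      // Row numbers are 1-based here (an assumption).
      messages.push(`Dropping row ${index + 1}: could not parse "${raw}"`);
      return;
    }
    kept.push({ ...row, [column]: parsed });
  });
  return { rows: kept, messages };
}
```

Because the messages are returned rather than logged inside the loop, the caller could emit them at a warning level instead of debug, which would make dropped rows visible without the -d flag and without crashing the interpreter.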

TungstnBallon commented 4 months ago

Invalid value in row 1: "NaN" does not match the type decimal -> "can not be cast to type decimal".

The error message is now:

[parsefailer] Could not parse "Mazda RX4" into a Decimal
[parsefailer] Dropping row 1: Could not evaluate transform expression