jf-tech / omniparser

omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.
MIT License
931 stars 68 forks source link

adding ndjson format #218

Closed jose-sherpa closed 12 months ago

jose-sherpa commented 1 year ago

While the omniparser tool outputs JSON format currently, you will often need another tool or package to stream the JSON output. While I am aware this tool will only be used for JSON output, there is a type of JSON called NDJSON which stands for new line delimited JSON. This makes it easy to stream parse and process a JSON array with no added packages or complexity since you just read each line and parse them one by one. Since a strength of omniparser is to stream parse large files, we think it makes sense to make the output easily streamable without violating the output of JSON. It also results in a smaller file size.

http://ndjson.org/

jose-sherpa commented 1 year ago

If accepted I will add the tests and refactor

codecov[bot] commented 1 year ago

Codecov Report

All modified lines are covered by tests :white_check_mark:

Comparison is base (79a540b) 100.00% compared to head (b26cb6d) 100.00%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #218 +/- ## ========================================= Coverage 100.00% 100.00% ========================================= Files 53 53 Lines 3041 3041 ========================================= Hits 3041 3041 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

jose-sherpa commented 1 year ago

Oh and just curious @jf-tech , have you thought about updating the go version? From my experience it's completely backwards compatible and has had some good improvements on the performance side and new methods

jf-tech commented 1 year ago

@jose-sherpa

1) if the desired effect is just to change the output from the command line tool, not the actual output format (which is currently standard JSON) from the Transform.Read() interface, then I would be against putting the setting into the schema. It seems to be more of a command line option that would be the better place for the switch. What do you think?

Yes, maybe in the future, we actually want to introduce different output formats for the transform interface, offering callers the choice of the current JSON, and maybe protobuf, or XML, or something else. At that time, we need to introduce the proper `ParserSettings` option and an extensible and pluggable mechanism for hooking up various writers. But doesn't seem to be the case in your request.

2) As for go version, we pinned our go version to 1.14 against the underlying goja (the javascript engine version) at that time because that was the version we thoroughly tested against. Now most likely upgrading to a higher version is fine, but that might unknowingly break other dependent customers who might still be running on a lower go version. So our philosophy is unless we start to use some go features only available in a higher version, we will stick with the current one, well, until sometime in the future it's too antiquated or Google simply stops supporting it.

Thoughts? And again, thanks for chiming in and contributing!! Love it!

jose-sherpa commented 1 year ago

@jose-sherpa

  1. if the desired effect is just to change the output from the command line tool, not the actual output format (which is currently standard JSON) from the Transform.Read() interface, then I would be against putting the setting into the schema. It seems to be more of a command line option that would be the better place for the switch. What do you think? Yes, maybe in the future, we actually want to introduce different output formats for the transform interface, offering callers the choice of the current JSON, and maybe protobuf, or XML, or something else. At that time, we need to introduce the proper ParserSettings option and an extensible and pluggable mechanism for hooking up various writers. But doesn't seem to be the case in your request.
  2. As for go version, we pinned our go version to 1.14 against the underlying goja (the javascript engine version) at that time because that was the version we thoroughly tested against. Now most likely upgrading to a higher version is fine, but that might unknowingly break other dependent customers who might still be running on a lower go version. So our philosophy is unless we start to use some go features only available in a higher version, we will stick with the current one, well, until sometime in the future it's too antiquated or Google simply stops supporting it.

Thoughts? And again, thanks for chiming in and contributing!! Love it!

Thanks for the prompt reply! I think it makes sense to have it as a command line option, I'll make that change!

jose-sherpa commented 1 year ago

@jf-tech I'll make the changes and add tests, thank you!

jose-sherpa commented 1 year ago

@jf-tech I'm looking through the tests and I think this should be in extensions/omniv21/samples however when looking in there it seems to use SampleTestCommon which does it's own logic for making the json output. Is there a different test that tests through the command? Or should I make it so the output logic are in different functions that SampleTestCommon can use?

Maybe something like this will allow multiple places to use the same writing that the transform command uses:

package writer

import (
    "github.com/jf-tech/go-corelib/jsons"
    "github.com/jf-tech/go-corelib/strs"
    "github.com/jf-tech/omniparser"
    "io"
    "strings"
)

type Writer interface {
    Write(printf func(string, ...any) (int, error), println func(...any) (int, error)) error
}

func NewJSON(transform omniparser.Transform) Writer {
    return &writer{
        LParen: "[\n%s",
        Delim:  ",\n%s",
        RParen: "\n]",
        Empty:  "[]",
        DoOne: func() (string, error) {
            b, err := transform.Read()
            if err != nil {
                return "", err
            }

            return strings.Join(
                strs.NoErrMapSlice(
                    strings.Split(jsons.BPJ(string(b)), "\n"),
                    func(s string) string { return "\t" + s }),
                "\n"), nil
        },
    }
}

func NewNDJSON(transform omniparser.Transform) Writer {
    return &writer{
        LParen: "%s",
        Delim:  "\n%s",
        RParen: "",
        Empty:  "",
        DoOne: func() (string, error) {
            b, err := transform.Read()
            if err != nil {
                return "", err
            }

            return string(b), nil
        },
    }
}

type writer struct {
    LParen string
    Delim  string
    RParen string
    Empty  string
    DoOne  func() (string, error)
}

func (w *writer) Write(printf func(string, ...any) (int, error), println func(...any) (int, error)) error {
    record, err := w.DoOne()
    if err == io.EOF {
        _, _ = println(w.Empty)
        return nil
    }
    if err != nil {
        return err
    }

    _, _ = printf(w.LParen, record)
    for {
        record, err = w.DoOne()
        if err == io.EOF {
            break
        }
        if err != nil {
            return err
        }
        _, _ = printf(w.Delim, record)
    }
    _, _ = println(w.RParen)
    return nil
}

which reduces doTransform to:

func doTransform() error {
    schemaName := filepath.Base(schema)
    schemaReadCloser, err := openFile("schema", schema)
    if err != nil {
        return err
    }
    defer schemaReadCloser.Close()

    inputReadCloser := io.ReadCloser(nil)
    inputName := ""
    if strs.IsStrNonBlank(input) {
        inputName = filepath.Base(input)
        inputReadCloser, err = openFile("input", input)
        if err != nil {
            return err
        }
        defer inputReadCloser.Close()
    } else {
        inputName = "(stdin)"
        inputReadCloser = os.Stdin
        // Note we don't defer Close() on this since os/golang runtime owns it.
    }

    schema, err := omniparser.NewSchema(schemaName, schemaReadCloser)
    if err != nil {
        return err
    }

    transform, err := schema.NewTransform(inputName, inputReadCloser, &transformctx.Ctx{})
    if err != nil {
        return err
    }

    var w writer.Writer
    if ndjson {
        w = writer.NewNDJSON(transform)
    } else {
        w = writer.NewJSON(transform)
    }

    return w.Write(fmt.Printf, fmt.Println)
}
jf-tech commented 1 year ago

@jose-sherpa sorry for the delay. We were out/offline for several days.

First I'm not seeing the updated PR. Did you push your updates (based on my feedbacks) to origin? Second, I just checked, the dir cli actually has no tests, because originally the purpose of it to create a quick cli tool for interactive schema authoring and a HTTP endpoint for playground (which retired last year).

So I would say don't worry about the tests for changes in this directory. Just push your latest updates to the PR and we'll take a look again and approve.

jose-sherpa commented 1 year ago

@jf-tech hey sorry I was out of town, yeah I made the changes you suggested unless you see I missed something. The only commit I did not push was the commit containing the changes I outlined in my previous comment. If you like those changes let me know and I can push that commit