shuttle-hq / synth

The Declarative Data Generator
https://www.getsynth.com/
Apache License 2.0

Support JSON Lines format (.jsonl) #183

Closed mrsarm closed 2 years ago

mrsarm commented 3 years ago

Required Functionality Currently JSON is supported and CSV is going to be supported (https://github.com/getsynth/synth/issues/33). JSON Lines is like CSV but with the power of JSON: it is widely used to export big chunks of data with columns holding complex values not supported by CSV, and where plain JSON is inefficient. Because one line is one record, like CSV, it is much more efficient than JSON to import, and it is popular among other reasons because it's really easy to build JSON Lines from a JSON file that is filtered / post-processed with tools like jq.
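To illustrate the import-efficiency point (this is just a plain-Python sketch, not synth code): because each line is a complete record, a JSON Lines file can be parsed one record at a time without ever loading the whole document into memory, unlike a single JSON array.

```python
import json

def read_jsonl(path):
    """Yield one record per line; the whole file never needs to fit in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate trailing blank lines
                yield json.loads(line)
```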

Proposed Solution Support it, and because (as with CSV) only one type of data is supported per file, either assume that the file only has one type of record, or allow an argument that names the field that determines the type of record (this also applies to CSV). E.g. in CouchDB there are no tables or collections; everything is stored in the same database, but by convention the field "type" is used to discriminate what kind of record it is: users, transactions, claims...
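To make the discriminator-field idea concrete, here is a hypothetical sketch in plain Python (the function name and default field are assumptions for illustration, not synth's actual implementation): records are grouped into collections by a configurable field name.

```python
import json
from collections import defaultdict

def split_collections(lines, type_field="type"):
    """Group JSON Lines records into collections keyed by a discriminator field."""
    collections = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        collections[record[type_field]].append(record)
    return dict(collections)
```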

Use case As mentioned above, it's easy to generate and work with JSON Lines. E.g. here is a transformation example made with jq:

$ echo '{ "transactions": [ {"tx":123, "amount": 100, "cust_id": 444}, {"tx":123, "amount": 100, "cust_id": 444} ]  }' \
    | jq '.transactions[]|{tx:.tx,val:.amount}' -c
{"tx":123,"val":100}
{"tx":123,"val":100}

Another example: although the documentation of mongoexport says it exports to JSON, in reality the output format is JSON Lines.

christos-h commented 3 years ago

Hey @mrsarm thanks for this.

There is also the question of CouchDB integration in general, which we could build (perhaps worth opening a separate issue for that).

On the topic of JSON Lines I think this makes a lot of sense. I assume it would work on a per-collection basis, i.e.:

synth generate my_namespace --collection transactions --to jsonl
{"tx": 123, "val": 100}
...

I wonder how this would work for a namespace, with multiple collections.

mrsarm commented 3 years ago

Yes, the syntax you propose for generation makes sense, and should be supported when importing too. CSV and JSON Lines would be treated in the same way, using the --collection argument to set the collection/table.

I wonder how this would work for a namespace, with multiple collections.

Yes, for JSON Lines one of the fields can be used to determine the collection (a perfect match for data imported from or exported to Couch, where by convention the "type" field is used for that, but it can apply to other sources too). That isn't the case for CSV, though, where you have fixed columns, so each collection needs its own file.

bmoxb commented 3 years ago

I've just started working on implementing this. For namespaces with multiple collections I'm thinking of doing something like this (assuming two collections 'numbers' and 'bools' that generate random-length arrays of random numbers and random Boolean values respectively):

{"collection": "numbers", "data": [12, 68, 32]}
{"collection": "numbers", "data": [69, 27, 37, 43]}
{"collection": "bools", "data": [true, true, false, true]}
{"collection": "bools", "data": [false, true]}

Were --collection numbers specified, the output would look something like this instead:

[12, 68, 32]
[69, 27, 37, 43]
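For reference, a quick Python sketch of how a consumer could filter that proposed wrapper format down to a single collection (the "collection" and "data" keys mirror the example above; this is illustrative, not actual synth output handling):

```python
import json

def extract_collection(lines, name):
    """Yield the "data" payload of wrapper-format records in one collection."""
    for line in lines:
        record = json.loads(line)
        if record["collection"] == name:
            yield record["data"]
```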

mrsarm commented 3 years ago

Hi @WiredSound. Is there any special reason to use the array notation for the values?

I worked with JSON Lines data in 3 different scenarios in the past, and none of them exported the data like that, so in practice it may not be useful at all. That spec makes the JSON Lines format look like CSV, but people who handle JSON Lines use it precisely because they don't want the limitations of CSV in the first place, just the good part: one line → one record.

Here are real production examples from my past experience:

MongoDB export

The mongoexport documentation says it exports to JSON, but try it and you will see that it's actually JSON Lines, and the data is not exported with the array notation you mention. Instead it exports one file per collection.

You may think that having a direct integration with MongoDB makes this use case pointless, but I actually think it is more common for teams to work with exports of Mongo databases than to access them directly. Direct access means giving production credentials and unrestricted access to the data (at least in read-only mode) to a collaborator who is not the owner of the data. So in the past, at a company where we worked a lot with Mongo, if we needed to give a developer an export from production or even a staging environment, we used mongodump or mongoexport instead. mongoexport also lets you easily filter or edit the data once exported, so it is better, although less efficient, than mongodump.

Zyte (formerly Scrapinghub)

I worked at this company a few years ago; basically it's a "web scraping as a service" company. The best format to handle data was JSON Lines (and I guess it still is). Normally we delivered all the data in one JSON Lines file with only one type of record, e.g. products of a webstore. But when more than one type of record was needed, e.g. products and reviews, we discriminated the type of record with one of the fields in the record, normally a field named "type", although sometimes customers asked for a different field name.

CouchDB

As I mentioned earlier, the concept of a "table" or "collection" doesn't exist in Couch; everything is stored in the same bucket, and by convention a field named "type" is used to discriminate different types of records. But it's only a convention, so the synth CLI should allow setting the name of the field by argument.

bmoxb commented 3 years ago

Hi @mrsarm, sorry, I gave something of an odd example given the use case. I've just experimented with the mongoexport tool and think I have a better understanding of why JSON Lines support would be a useful feature.

I think directly adding a 'type' field to generated objects for namespaces with multiple collections is the best way forward too. Some sort of --collection-field-name type argument, or something along those lines, is a great idea also.

Thank you for the further explanation; I feel I have a better understanding of the aim now!