multiprocessio / datastation

App to easily query, script, and visualize data from every database, file, and API.
https://datastation.multiprocess.io
Other
2.9k stars 112 forks source link

Prevent over-eager merging of Scalars of same type into Varied kind #205

Closed fritzgrabo closed 2 years ago

fritzgrabo commented 2 years ago

Hi 👋

While working on ob-dsq, I noticed that getting the "Shape" of a JSON file converts shapes of a Scalar kind into a Varied kind when merging Object shapes, even when all sampled rows happen to have the same Scalar kind and type in the merged JSON attribute.

Here's an example to illustrate what I mean:

$ cat ids-irregular.json 
[
  {"id": 1, "name": "Jorge", "color": "blue"},
  {"id": 2, "name": "Mona", "language": "Ruby"}
]

$ dsq ids-irregular.json "SELECT * FROM {}"
[{"color":"blue","id":"1.0","language":null,"name":"Jorge"},
{"color":null,"id":"2.0","language":"Ruby","name":"Mona"}]

Note how the id column's shape got merged into a Varied kind (TEXT) even though all rows in the input file are numbers.

I'm not entirely sure if this was done on purpose to err on the side of safety when working with input files that have rows of varying structure? There's a comment that seems to indicated that, but I can read it both ways :)

In this PR, I propose to try hard to keep a column's shape untouched if all the sampled rows in the input happen to have the same Scalar kind and type.

$ dsq ids-irregular.json "SELECT * FROM {}"
[{"color":"blue","id":1,"language":null,"name":"Jorge"},
{"color":null,"id":2,"language":"Ruby","name":"Mona"}]

This could help with unexpected results in queries, too. For example, "SELECT * FROM {} WHERE id=1" yields an empty result in the unpatched version because 1 doesn't match "1.0".

Please let me know what you think. Thanks!

eatonphil commented 2 years ago

Hey! Great work finding the issue and proposing a fix! I think that you're right this was not intended. I'll take a closer look today to make sure but if all existing tests pass plus your new one then there's a good chance you're right.

eatonphil commented 2 years ago

Looks good! It's just failing because you don't have access to github actions secrets. Merging. Thank you!

eatonphil commented 2 years ago

Updated dsq to vendor in this change: https://github.com/multiprocessio/dsq/pull/40.

New version of dsq is out now, https://github.com/multiprocessio/dsq/releases/tag/0.11.0.

fritzgrabo commented 2 years ago

@eatonphil That's awesome, thank you! I'll update the affected examples in ob-dsq to reflect the fix later tonight.

PS: Really glad this was such a quick and painless fix in the end. It's also kind of a testament to the quality of the code of dsq and DataStation: I've had no prior exposure to go whatsoever (read or write) and could find and fix this in a single evening. Yay!

eatonphil commented 2 years ago

Wow that's awesome you could figure it out without knowing Go. And the shape library is basically the trickiest part of the code so again, great work!