neilotoole / sq

sq data wrangler
https://sq.io
MIT License
2.09k stars 33 forks source link

inspecting and selecting large json files #415

Open c-nv-s opened 7 months ago

c-nv-s commented 7 months ago

Describe the bug I was preparing to copy a json file to sqlite (similar to https://sq.io/docs/cookbook#import-json-array-to-database ) but could not because I encountered some buggy behaviour. sample data: https://github.com/lutangar/cities.json/raw/master/cities.json

# use jq to check it is valid json
cat cities.json | jq '.'

To Reproduce

$ sq add ./cities.json
$ sq inspect @cities
sq: failed to inspect @cities: invalid JSON: additional input expected

$ sq '@cities | .[]'
sq: invalid JSON: additional input expected

Expected behavior

Not actually sure to be honest, but definitely not what was received haha

sq version

"version": "v0.47.2", "commit": "135318f", "timestamp": "2024-01-30T05:13:40Z", "latest_version": "v0.48.1",

neilotoole commented 7 months ago

@c-nv-s Thanks, I'm looking into this now. What would I do without you? 🙏

neilotoole commented 7 months ago

@c-nv-s I think the issue is tackled in v0.48.3. Can you give it a whirl?

neilotoole commented 7 months ago

@c-nv-s... Hmmmn, I'm seeing different behavior.

$ sq inspect @cities
SOURCE   DRIVER  NAME               FQ NAME            SIZE    TABLES  VIEWS  LOCATION
@cities  json    cities.large.json  cities.large.json  19.9MB  1       0      /Users/neilotoole/work/sq/sq/drivers/json/testdata/cities.large.json

NAME  TYPE   ROWS    COLS
data  table  146994  name, lat, lng, country, admin1, admin2
$ sq '@cities.data | .[]'
name                                                             lat        lng         country  admin1    admin2
Sant Julià de Lòria                                              42.46372   1.49129     AD       6         NULL
Pas de la Casa                                                   42.54277   1.73361     AD       3         NULL
[SNIP]

So, @cities.data | .[] command works for me... We need to figure out why it's not working for you. I'll come back to that shortly.

Standalone period '.'

$ sq '@cities.data | .' 

So, that's not currently valid sq syntax. You just need to do @cities.data. The last segment (| .) is the column selector segment, where typically you'd have .name, .country etc. So, . doesn't work. BUT... now that I see you instinctively using ., I think I might add that, where . would just be an identity operator, effectively selecting all columns.

jsonl

Also, the jsonl example works just fine for me too.

Debugging

So, back to the hanging part.

What happens when you do the following (note that I've named the file cities.large.json because I also have a smaller version for debugging):

$ cat cities.large.json | sq '.data | .[0:3]'

This is what I see:

cat_cities_large_json

And then after processing, I get the expected result.

To debug this, can you provide the output of:

We'll get to the bottom of this!

c-nv-s commented 7 months ago

sorry neil please disregard that last comment, I deleted it because I was testing it in a container that was still on v47 I'm currently rerunning the operation with v48 so will let you know if it succeeds... it is taking its time

neilotoole commented 7 months ago

@c-nv-s If running in a container, you probably won't see the progress bar I suppose? And yes, it does take its time to ingest (although thankfully this is now a one-time cost with the ingest caching that was implemented a few versions ago). It takes about 35s on my beefy macbook, I imagine the container could be significantly less beefy?

Having been digging around in the JSON ingest process recently, I wouldn't be surprised if I could speed it up by an order of magnitude. That's a story for another day though.

c-nv-s commented 7 months ago

I do see the progress bar, that is working fine, and yes the container is resource limited so patience is required. performance is not a high priority requirement at present but a future review would obviously be welcomed.

neilotoole commented 7 months ago

@c-nv-s The JSON ingester definitely needs some serious love. Compare it with the CSV ingester (which admittedly is the simplest possible data format):

$ time cat cities.large.json | sq '.data | count'
count
146994
cat cities.large.json  0.00s user 0.03s system 0% cpu 34.013 total
sq '.data | count'  3.27s user 26.91s system 88% cpu 34.154 total

$ time cat cities.large.csv | sq '.data | count'
count
146994
cat cities.large.csv  0.00s user 0.00s system 2% cpu 0.271 total
sq '.data | count'  0.46s user 0.06s system 184% cpu 0.283 total

So, that's 34 seconds for JSON, and 0.28 seconds for CSV... 😬

There is a lot of low-hanging fruit for performance improvements in the JSON ingester. But the main reason it's so slow is that it's designed to deal with any JSON that's thrown at it: nested objects, arrays, fields that change type (e.g. from integer to text, which actually happens in that cities.json file). If the ingester knew ahead of time that the JSON was flat, fields were always the same type, etc, then it could go way faster.

I'm certain there's a good strategy to be found that combines the best of both worlds, but I haven't spent any time on it yet. Plus, I figured you'd rather see #143 happen first...

c-nv-s commented 7 months ago

yes, sadly it looks like I'm going to have to leave this operation running overnight to see the outcome which is a shame but not the end of the world. You are correct that right know 143 is still top of my wishlist and if this current fix works that will more than suffice for now, however as an alternative for the "safe ingest" approach maybe you could offer a type of "streaming ingest/insert" which ignores/logs errors for any schema mismatch if that makes sense?

c-nv-s commented 7 months ago

FYI it looks like the import of the jsonl still doesn't work. As you mentioned, the determination of the schema type is tricky because it changes from int64 to string but I haven't looked into how you've implemented any fallback.

sq '@cities.data | .[]' --insert @mysqlite.cities

sq: insert @mysqlite.cities failed: query against @cities: sql: Scan error on column index 0, name "admin1": converting driver.Value type string ("BRC") to a int64: invalid syntax

The schema on the sql was created before the attempt to import the data (see below for schema) To avoid Null Entry errors I also added default values of "00" for fields .admin1 and .admin2 where a value didn't exist:

cat cities.json | jq -c '.[] | def n: if . == "" then null else . end ; .admin1 = ((.admin1|n) // "00") | .admin2 = ((.admin2|n) // "00")' > cities.jsonl

the existing schema for the sqlite table on my system looks like this:

{                   
  "name": "cities",                                                                            
  "name_fq": "main.cities",
  "table_type": "table",
  "table_type_db": "table",
  "row_count": 0,
  "columns": [
    {
      "name": "admin1",
      "position": 0,
      "primary_key": false,
      "base_type": "TEXT",
      "column_type": "TEXT",
      "kind": "text",
      "nullable": false,
      "default_value": "''"
    },
    {
      "name": "country",
      "position": 1,
      "primary_key": false,
      "base_type": "TEXT",
      "column_type": "TEXT",
      "kind": "text",
      "nullable": false,
      "default_value": "''"
    },
    {
      "name": "created",
      "position": 2,
      "primary_key": false,
      "base_type": "TEXT",
      "column_type": "TEXT",
      "kind": "text",
      "nullable": false,
      "default_value": "strftime('%Y-%m-%d %H:%M:%fZ')"
    },
    {
      "name": "id",
      "position": 3,
      "primary_key": true,
      "base_type": "TEXT",
      "column_type": "TEXT",
      "kind": "text",
      "nullable": false,
      "default_value": "'r'||lower(hex(randomblob(7)))"
    },
   {                                                                                                                                                                                 [2/1933]
      "name": "lat",
      "position": 4,
      "primary_key": false,
      "base_type": "NUMERIC",
      "column_type": "NUMERIC",
      "kind": "decimal",
      "nullable": false,
      "default_value": "0"
    },
    {
      "name": "lng",
      "position": 5,
      "primary_key": false,
      "base_type": "NUMERIC",
      "column_type": "NUMERIC",
      "kind": "decimal",
      "nullable": false,
      "default_value": "0"
    },
    {
      "name": "name",
      "position": 6,
      "primary_key": false,
      "base_type": "TEXT",
      "column_type": "TEXT",
      "kind": "text",
      "nullable": false,
      "default_value": "''"
    },
    {
      "name": "updated",
      "position": 7,
      "primary_key": false,
      "base_type": "TEXT",
      "column_type": "TEXT",
      "kind": "text",
      "nullable": false,
      "default_value": "strftime('%Y-%m-%d %H:%M:%fZ')"
    },
    {
      "name": "admin2",
      "position": 8,
      "primary_key": false,
      "base_type": "TEXT",
      "column_type": "TEXT",
      "kind": "text",
      "nullable": false,
      "default_value": "''"
    }
  ]
}
neilotoole commented 6 months ago

@c-nv-s: Apologies for the delay in responding, family vacation for spring break.

So, the problem as I see it is that the admin1 field in the cities.json is sometimes non-integer. Specifically, there's at least that one value BRC. So, the sq json ingester switches the kind from int to text, and in the ingest cache DB, the type of the admin1 column will be TEXT. Then, when you try to insert the results from the ingest DB to your existing table, it can't be done, because BRC (text) can't be inserted into an int column.

There's a basic data incompatiblity here. You either need to change the type of your table's @mysqlite.cities.admin1 column to TEXT, or modify the cities.json data to get rid of the non-int values.

Also, something I'm a little confused about... Are you sure that the schema you're showing is for your target table (@mysqlite.cities)? That looks like the schema for the ingest table that sq builds (@cities.data), because it shows admin1 as TEXT). The error:

sq: insert @mysqlite.cities failed: query against @cities: sql: Scan error on column index 0, name "admin1": converting driver.Value type string ("BRC") to a int64: invalid syntax

indicates that the type of @myslite.cities.admin1 is integer, not text, if I'm understanding the situation correctly.

c-nv-s commented 6 months ago

no worries at all, I hope you enjoyed your break.

I did eventually manage to migrate all the data, however, your response leaves me with a couple of doubts about my previous mental model. The target schema is correct and the admin1 field was set as TEXT, so there was definitely confusion when the error message showed it was trying to convert to an INT. I initially assumed that it would first look at the target schema and then attempt to cast/coerce any values it encounters to that target type (that's why I previously suggested you could implement an optimistic version which basically attempts each row insert and just logs/ignores any errors it encounters. And would also prevent wasted time for slow machines to finish ingesting large data only to then receive an error on the insert). However after seeing the error I thought maybe it was just determining the target schema from the first row it ingests and then only after the whole ingestion it performs the insert.

Note that when I split up the file into 2 parts so that second part begins at the BRC row then it does work... which again lead me to suspect that either the target schema was still being determined by first row of the data, or your logic to switch schema types during the ingest phase was not working as intended.

I think I remember also testing it by just creating the target table but not specifying a schema, and in that case it worked, so there might also be a subtle warning about the differences in those situations. So the command does work, I'm happy to close the issue but that's because I understand the "quirks" of the command.

c-nv-s commented 5 months ago

just wanted to mention that a nice "alternative" for this would be if the --insert command accepted input from standard input i.e.

# Insert query results into a table in another data source.
$ sq --insert=@pg1.person '@my1.person | .username, .email'

could be

  # Insert json from stdin into a table in another data source.
  $ echo '{"username":"foobar","email":"foo@bar.com"}' | sq --insert=@pg1.person '@- | .username, .email'