c-nv-s opened this issue 7 months ago
@c-nv-s Thanks, I'm looking into this now. What would I do without you? 🙏
@c-nv-s I think the issue is tackled in v0.48.3. Can you give it a whirl?
@c-nv-s... Hmmm, I'm seeing different behavior.
$ sq inspect @cities
SOURCE DRIVER NAME FQ NAME SIZE TABLES VIEWS LOCATION
@cities json cities.large.json cities.large.json 19.9MB 1 0 /Users/neilotoole/work/sq/sq/drivers/json/testdata/cities.large.json
NAME TYPE ROWS COLS
data table 146994 name, lat, lng, country, admin1, admin2
$ sq '@cities.data | .[]'
name lat lng country admin1 admin2
Sant Julià de Lòria 42.46372 1.49129 AD 6 NULL
Pas de la Casa 42.54277 1.73361 AD 3 NULL
[SNIP]
So, the @cities.data | .[] command works for me... We need to figure out why it's not working for you. I'll come back to that shortly.
$ sq '@cities.data | .'
So, that's not currently valid sq syntax. You just need to do @cities.data. The last segment (| .) is the column selector segment, where typically you'd have .name, .country, etc. So, . doesn't work. BUT... now that I see you instinctively using ., I think I might add that, where . would just be an identity operator, effectively selecting all columns.
Also, the jsonl example works just fine for me too.
So, back to the hanging part.
What happens when you do the following (note that I've named the file cities.large.json because I also have a smaller version for debugging):
$ cat cities.large.json | sq '.data | .[0:3]'
This is what I see:
And then after processing, I get the expected result.
To debug this, can you provide the output of:
sq config ls -vt
sq version -y
We'll get to the bottom of this!
Sorry Neil, please disregard that last comment; I deleted it because I was testing in a container that was still on v0.47. I'm currently rerunning the operation with v0.48, so I'll let you know if it succeeds... it is taking its time.
@c-nv-s If running in a container, you probably won't see the progress bar I suppose? And yes, it does take its time to ingest (although thankfully this is now a one-time cost with the ingest caching that was implemented a few versions ago). It takes about 35s on my beefy macbook, I imagine the container could be significantly less beefy?
Having been digging around in the JSON ingest process recently, I wouldn't be surprised if I could speed it up by an order of magnitude. That's a story for another day though.
I do see the progress bar, that is working fine, and yes, the container is resource-limited, so patience is required. Performance is not a high-priority requirement at present, but a future review would obviously be welcomed.
@c-nv-s The JSON ingester definitely needs some serious love. Compare it with the CSV ingester (which admittedly is the simplest possible data format):
$ time cat cities.large.json | sq '.data | count'
count
146994
cat cities.large.json 0.00s user 0.03s system 0% cpu 34.013 total
sq '.data | count' 3.27s user 26.91s system 88% cpu 34.154 total
$ time cat cities.large.csv | sq '.data | count'
count
146994
cat cities.large.csv 0.00s user 0.00s system 2% cpu 0.271 total
sq '.data | count' 0.46s user 0.06s system 184% cpu 0.283 total
So, that's 34 seconds for JSON, and 0.28 seconds for CSV... 😬
There is a lot of low-hanging fruit for performance improvements in the JSON ingester. But the main reason it's so slow is that it's designed to deal with any JSON that's thrown at it: nested objects, arrays, fields that change type (e.g. from integer to text, which actually happens in that cities.json file). If the ingester knew ahead of time that the JSON was flat, fields were always the same type, etc, then it could go way faster.
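For illustration, the kind-switching idea described above can be sketched as: start by assuming the narrowest kind (int) and widen the first time a value won't parse. The names here (Kind, widen) are hypothetical for this sketch, not sq's actual implementation:

```go
package main

import (
	"fmt"
	"strconv"
)

// Kind is a simplified column-kind, ordered from narrowest to widest.
type Kind int

const (
	KindInt Kind = iota
	KindFloat
	KindText
)

// widen returns the narrowest kind that can hold both the current kind
// and the new raw value. Once a column has widened to text, it stays text.
func widen(cur Kind, raw string) Kind {
	if cur == KindText {
		return KindText
	}
	if _, err := strconv.ParseInt(raw, 10, 64); err == nil {
		return cur // an int value fits in both int and float columns
	}
	if _, err := strconv.ParseFloat(raw, 64); err == nil {
		if cur == KindInt {
			return KindFloat
		}
		return cur
	}
	return KindText // anything non-numeric forces text
}

func main() {
	k := KindInt
	// admin1 values from cities.json: mostly ints, but one "BRC".
	for _, v := range []string{"6", "3", "BRC", "7"} {
		k = widen(k, v)
	}
	fmt.Println(k == KindText)
	// prints: true
}
```

A streaming ingester has to scan every value this way because the deciding value ("BRC") may appear anywhere in the file, which is part of why a schema-unaware JSON ingest is slow.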
I'm certain there's a good strategy to be found that combines the best of both worlds, but I haven't spent any time on it yet. Plus, I figured you'd rather see #143 happen first...
Yes, sadly it looks like I'm going to have to leave this operation running overnight to see the outcome, which is a shame but not the end of the world. You are correct that right now #143 is still top of my wishlist, and if this current fix works, that will more than suffice for now. However, as an alternative to the "safe ingest" approach, maybe you could offer a type of "streaming ingest/insert" which ignores/logs errors for any schema mismatch, if that makes sense?
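For what it's worth, the "streaming ingest/insert" suggestion could look roughly like this: attempt each row, log any failure, and carry on. This is a hypothetical sketch (insertRow is a stand-in for a real database insert, and the type-mismatch check is simulated), not anything sq currently does:

```go
package main

import (
	"fmt"
	"log"
	"strconv"
)

// insertRow stands in for a database insert that may fail on a type
// mismatch, e.g. a text value arriving for an integer column.
// Hypothetical; sq's real insert path is different.
func insertRow(admin1 string) error {
	if _, err := strconv.ParseInt(admin1, 10, 64); err != nil {
		return fmt.Errorf("converting %q to int64: %w", admin1, err)
	}
	return nil
}

func main() {
	rows := []string{"6", "3", "BRC", "7"} // admin1 values, including the bad one
	var inserted, skipped int
	for _, r := range rows {
		if err := insertRow(r); err != nil {
			// Log and continue instead of aborting the whole insert.
			log.Printf("skipping row: %v", err)
			skipped++
			continue
		}
		inserted++
	}
	fmt.Printf("inserted %d, skipped %d\n", inserted, skipped)
	// prints: inserted 3, skipped 1
}
```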
FYI, it looks like the import of the jsonl still doesn't work. As you mentioned, the determination of the schema type is tricky because it changes from int64 to string, but I haven't looked into how you've implemented any fallback.
sq '@cities.data | .[]' --insert @mysqlite.cities
sq: insert @mysqlite.cities failed: query against @cities: sql: Scan error on column index 0, name "admin1": converting driver.Value type string ("BRC") to a int64: invalid syntax
The schema on the sql was created before the attempt to import the data (see below for schema)
To avoid Null Entry errors I also added default values of "00" for the .admin1 and .admin2 fields where a value didn't exist:
cat cities.json | jq -c '.[] | def n: if . == "" then null else . end ; .admin1 = ((.admin1|n) // "00") | .admin2 = ((.admin2|n) // "00")' > cities.jsonl
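For readers not fluent in jq, the same normalization can be sketched in Go. The normalize helper is hypothetical; the "00" defaults mirror the jq filter above:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// normalize mirrors the jq filter: empty or missing admin1/admin2
// values are replaced with the "00" default. Illustrative sketch only.
func normalize(rec map[string]any) map[string]any {
	for _, k := range []string{"admin1", "admin2"} {
		if v, ok := rec[k].(string); !ok || v == "" {
			rec[k] = "00"
		}
	}
	return rec
}

func main() {
	in := `{"name":"Pas de la Casa","admin1":"3","admin2":""}`
	var rec map[string]any
	if err := json.Unmarshal([]byte(in), &rec); err != nil {
		panic(err)
	}
	out, _ := json.Marshal(normalize(rec))
	fmt.Println(string(out))
	// prints: {"admin1":"3","admin2":"00","name":"Pas de la Casa"}
}
```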
the existing schema for the sqlite table on my system looks like this:
{
"name": "cities",
"name_fq": "main.cities",
"table_type": "table",
"table_type_db": "table",
"row_count": 0,
"columns": [
{
"name": "admin1",
"position": 0,
"primary_key": false,
"base_type": "TEXT",
"column_type": "TEXT",
"kind": "text",
"nullable": false,
"default_value": "''"
},
{
"name": "country",
"position": 1,
"primary_key": false,
"base_type": "TEXT",
"column_type": "TEXT",
"kind": "text",
"nullable": false,
"default_value": "''"
},
{
"name": "created",
"position": 2,
"primary_key": false,
"base_type": "TEXT",
"column_type": "TEXT",
"kind": "text",
"nullable": false,
"default_value": "strftime('%Y-%m-%d %H:%M:%fZ')"
},
{
"name": "id",
"position": 3,
"primary_key": true,
"base_type": "TEXT",
"column_type": "TEXT",
"kind": "text",
"nullable": false,
"default_value": "'r'||lower(hex(randomblob(7)))"
},
{
"name": "lat",
"position": 4,
"primary_key": false,
"base_type": "NUMERIC",
"column_type": "NUMERIC",
"kind": "decimal",
"nullable": false,
"default_value": "0"
},
{
"name": "lng",
"position": 5,
"primary_key": false,
"base_type": "NUMERIC",
"column_type": "NUMERIC",
"kind": "decimal",
"nullable": false,
"default_value": "0"
},
{
"name": "name",
"position": 6,
"primary_key": false,
"base_type": "TEXT",
"column_type": "TEXT",
"kind": "text",
"nullable": false,
"default_value": "''"
},
{
"name": "updated",
"position": 7,
"primary_key": false,
"base_type": "TEXT",
"column_type": "TEXT",
"kind": "text",
"nullable": false,
"default_value": "strftime('%Y-%m-%d %H:%M:%fZ')"
},
{
"name": "admin2",
"position": 8,
"primary_key": false,
"base_type": "TEXT",
"column_type": "TEXT",
"kind": "text",
"nullable": false,
"default_value": "''"
}
]
}
@c-nv-s: Apologies for the delay in responding, family vacation for spring break.
So, the problem as I see it is that the admin1 field in cities.json is sometimes non-integer. Specifically, there's at least that one value BRC. So, the sq JSON ingester switches the kind from int to text, and in the ingest cache DB, the type of the admin1 column will be TEXT. Then, when you try to insert the results from the ingest DB into your existing table, it can't be done, because BRC (text) can't be inserted into an int column.
There's a basic data incompatibility here. You either need to change the type of your table's @mysqlite.cities.admin1 column to TEXT, or modify the cities.json data to get rid of the non-int values.
Also, something I'm a little confused about... Are you sure that the schema you're showing is for your target table (@mysqlite.cities)? That looks like the schema for the ingest table that sq builds (@cities.data), because it shows admin1 as TEXT. The error:
sq: insert @mysqlite.cities failed: query against @cities: sql: Scan error on column index 0, name "admin1": converting driver.Value type string ("BRC") to a int64: invalid syntax
indicates that the type of @mysqlite.cities.admin1 is integer, not text, if I'm understanding the situation correctly.
No worries at all, I hope you enjoyed your break.
I did eventually manage to migrate all the data; however, your response leaves me with a couple of doubts about my previous mental model. The target schema is correct and the admin1 field was set as TEXT, so there was definitely confusion when the error message showed it was trying to convert to an INT. I initially assumed that it would first look at the target schema and then attempt to cast/coerce any values it encounters to that target type (that's why I previously suggested you could implement an optimistic version which basically attempts each row insert and just logs/ignores any errors it encounters; that would also prevent slow machines from wasting time ingesting large data only to then receive an error on the insert). However, after seeing the error, I thought maybe it was just determining the target schema from the first row it ingests, and only after the whole ingestion does it perform the insert.
Note that when I split up the file into two parts so that the second part begins at the BRC row, it does work... which again led me to suspect that either the target schema was still being determined by the first row of the data, or your logic to switch schema types during the ingest phase was not working as intended.
I think I remember also testing it by just creating the target table but not specifying a schema, and in that case it worked, so there might also be a subtle warning to add about the differences between those situations. So the command does work; I'm happy to close the issue, but that's because I now understand the "quirks" of the command.
Just wanted to mention that a nice "alternative" for this would be if the --insert command accepted input from standard input, i.e.
# Insert query results into a table in another data source.
$ sq --insert=@pg1.person '@my1.person | .username, .email'
could be
# Insert json from stdin into a table in another data source.
$ echo '{"username":"foobar","email":"foo@bar.com"}' | sq --insert=@pg1.person '@- | .username, .email'
Describe the bug
I was preparing to copy a json file to sqlite (similar to https://sq.io/docs/cookbook#import-json-array-to-database) but could not because I encountered some buggy behaviour. Sample data: https://github.com/lutangar/cities.json/raw/master/cities.json
To Reproduce
Expected behavior
Not actually sure to be honest, but definitely not what was received haha
sq version
"version": "v0.47.2", "commit": "135318f", "timestamp": "2024-01-30T05:13:40Z", "latest_version": "v0.48.1",