Closed: precipice closed this issue 11 months ago
If I uncheck "expand labels" in the Advanced CSV export dialog, the error does not occur. Re-checking that box and re-running the export causes the error again.
I tested a bunch of the text-indexed columns and found two that (in combination) cause the export to fail in a fairly minimal case: RAMP_INTERSECTION and CHP_VEHTYPE_AT_FAULT, both of which are usually an integer but sometimes a "-".
The table definitions are showing up in Datasette as:
[RAMP_INTERSECTION] TEXT REFERENCES [RAMP_INTERSECTION]([key]),
[CHP_VEHTYPE_AT_FAULT] TEXT REFERENCES [CHP_VEHTYPE_AT_FAULT]([key]),
Those definitions are what I would expect them to be. However, in the CHP_VEHTYPE_AT_FAULT lookup table single-digit numbers are zero-padded, while in the RAMP_INTERSECTION table they are not. Hmm. I'll also check whether the source data zero-pads either of them in use.
I'm going to look at the source data for these cases and then try to produce a much more minimal test case, too.
This could be because of "-" vs. "- " or similar.
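One way to check the whitespace hypothesis: SQLite compares TEXT values byte-for-byte, so a trailing space makes "- " a different key from "-". A minimal sqlite3 sketch (table and values are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE lookup (key TEXT PRIMARY KEY, name TEXT);
    INSERT INTO lookup VALUES ('-', 'Not Stated');
    CREATE TABLE facts (val TEXT);
    INSERT INTO facts VALUES ('-'), ('- ');
    """
)
# Exact string comparison: the trailing space makes '- ' miss the lookup
rows = conn.execute(
    "SELECT val, val IN (SELECT key FROM lookup) FROM facts"
).fetchall()
print(rows)  # [('-', 1), ('- ', 0)]
```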
Schema of that table:
CREATE TABLE "collisions" (
[CASE_ID] INTEGER PRIMARY KEY,
[COLLISION_DATE] TEXT,
[COLLISION_TIME] INTEGER,
[OFFICER_ID] TEXT,
[REPORTING_DISTRICT] TEXT,
[DAY_OF_WEEK] INTEGER REFERENCES [DAY_OF_WEEK]([id]),
[CNTY_CITY_LOC] INTEGER,
[BEAT_NUMBER] TEXT,
[PRIMARY_RD] TEXT,
[SECONDARY_RD] TEXT,
[DISTANCE] FLOAT,
[DIRECTION] TEXT,
[INTERSECTION] TEXT,
[WEATHER_1] TEXT REFERENCES [WEATHER_1]([key]),
[WEATHER_2] TEXT REFERENCES [WEATHER_2]([key]),
[STATE_HWY_IND] TEXT,
[CALTRANS_COUNTY] TEXT,
[CALTRANS_DISTRICT] INTEGER,
[STATE_ROUTE] INTEGER,
[POSTMILE] FLOAT,
[LOCATION_TYPE] TEXT REFERENCES [LOCATION_TYPE]([key]),
[RAMP_INTERSECTION] TEXT REFERENCES [RAMP_INTERSECTION]([key]),
[SIDE_OF_HWY] TEXT REFERENCES [SIDE_OF_HWY]([key]),
[TOW_AWAY] TEXT,
[COLLISION_SEVERITY] INTEGER REFERENCES [COLLISION_SEVERITY]([id]),
[NUMBER_KILLED] INTEGER,
[NUMBER_INJURED] INTEGER,
[PARTY_COUNT] INTEGER,
[PRIMARY_COLL_FACTOR] TEXT REFERENCES [PRIMARY_COLL_FACTOR]([key]),
[PCF_VIOL_CATEGORY] TEXT REFERENCES [PCF_VIOL_CATEGORY]([key]),
[PCF_VIOLATION] INTEGER,
[PCF_VIOL_SUBSECTION] TEXT,
[HIT_AND_RUN] TEXT,
[TYPE_OF_COLLISION] TEXT REFERENCES [TYPE_OF_COLLISION]([key]),
[MVIW] TEXT REFERENCES [MVIW]([key]),
[PED_ACTION] TEXT REFERENCES [PED_ACTION]([key]),
[ROAD_SURFACE] TEXT REFERENCES [ROAD_SURFACE]([key]),
[ROAD_COND_1] TEXT REFERENCES [ROAD_COND_1]([key]),
[ROAD_COND_2] TEXT REFERENCES [ROAD_COND_2]([key]),
[LIGHTING] TEXT REFERENCES [LIGHTING]([key]),
[CONTROL_DEVICE] TEXT REFERENCES [CONTROL_DEVICE]([key]),
[PEDESTRIAN_ACCIDENT] TEXT,
[BICYCLE_ACCIDENT] TEXT,
[MOTORCYCLE_ACCIDENT] TEXT,
[TRUCK_ACCIDENT] TEXT,
[NOT_PRIVATE_PROPERTY] TEXT,
[ALCOHOL_INVOLVED] TEXT,
[STWD_VEHTYPE_AT_FAULT] TEXT REFERENCES [STWD_VEHTYPE_AT_FAULT]([key]),
[CHP_VEHTYPE_AT_FAULT] TEXT REFERENCES [CHP_VEHTYPE_AT_FAULT]([key]),
[COUNT_SEVERE_INJ] INTEGER,
[COUNT_VISIBLE_INJ] INTEGER,
[COUNT_COMPLAINT_PAIN] INTEGER,
[COUNT_PED_KILLED] INTEGER,
[COUNT_PED_INJURED] INTEGER,
[COUNT_BICYCLIST_KILLED] INTEGER,
[COUNT_BICYCLIST_INJURED] INTEGER,
[COUNT_MC_KILLED] INTEGER,
[COUNT_MC_INJURED] INTEGER,
[PRIMARY_RAMP] TEXT REFERENCES [PRIMARY_RAMP]([key]),
[SECONDARY_RAMP] TEXT REFERENCES [SECONDARY_RAMP]([key]),
[LATITUDE] FLOAT,
[LONGITUDE] FLOAT,
[ADDRESS] TEXT,
[SEVERITY_INDEX] TEXT
);
Most keys are text, but some are integers:
[DAY_OF_WEEK] INTEGER REFERENCES [DAY_OF_WEEK]([id]),
[COLLISION_SEVERITY] INTEGER REFERENCES [COLLISION_SEVERITY]([id]),
Those look OK to me:
CREATE TABLE [DAY_OF_WEEK] (
[id] INTEGER PRIMARY KEY,
[name] TEXT
);
CREATE TABLE [COLLISION_SEVERITY] (
[id] INTEGER PRIMARY KEY,
[name] TEXT
);
Both of those tables have only integer values:
If it's useful, you can see the integer versus text key split here:
https://github.com/radical-bike-lobby/switrs-db/blob/main/import-bicycle-crashes.sh#L129
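The two lookup-table shapes look roughly like this. This is a toy sketch with illustrative code values, not the actual import script (which is linked above); only the 1-for-Monday mapping comes from the source data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Integer-keyed lookup: referenced via [id]
    CREATE TABLE DAY_OF_WEEK (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO DAY_OF_WEEK VALUES (1, 'Monday'), (2, 'Tuesday');

    -- Text-keyed lookup: referenced via [key]
    CREATE TABLE WEATHER_1 (key TEXT PRIMARY KEY, name TEXT);
    INSERT INTO WEATHER_1 VALUES ('A', 'Clear'), ('B', 'Cloudy');

    CREATE TABLE collisions (
        CASE_ID INTEGER PRIMARY KEY,
        DAY_OF_WEEK INTEGER REFERENCES DAY_OF_WEEK(id),
        WEATHER_1 TEXT REFERENCES WEATHER_1(key)
    );
    INSERT INTO collisions VALUES (4750401, 2, 'B');
    """
)
# Expanding labels is essentially this join
row = conn.execute(
    """
    SELECT d.name, w.name FROM collisions c
    JOIN DAY_OF_WEEK d ON c.DAY_OF_WEEK = d.id
    JOIN WEATHER_1 w ON c.WEATHER_1 = w.key
    """
).fetchone()
print(row)  # ('Tuesday', 'Cloudy')
```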
Generated this query to try to spot columns that have missing key references:
select 'DAY_OF_WEEK' as column, count(*) from collisions where DAY_OF_WEEK not in (select id from DAY_OF_WEEK) union all
select 'WEATHER_1' as column, count(*) from collisions where WEATHER_1 not in (select key from WEATHER_1) union all
select 'WEATHER_2' as column, count(*) from collisions where WEATHER_2 not in (select key from WEATHER_2) union all
select 'LOCATION_TYPE' as column, count(*) from collisions where LOCATION_TYPE not in (select key from LOCATION_TYPE) union all
select 'RAMP_INTERSECTION' as column, count(*) from collisions where RAMP_INTERSECTION not in (select key from RAMP_INTERSECTION) union all
select 'SIDE_OF_HWY' as column, count(*) from collisions where SIDE_OF_HWY not in (select key from SIDE_OF_HWY) union all
select 'COLLISION_SEVERITY' as column, count(*) from collisions where COLLISION_SEVERITY not in (select id from COLLISION_SEVERITY) union all
select 'PRIMARY_COLL_FACTOR' as column, count(*) from collisions where PRIMARY_COLL_FACTOR not in (select key from PRIMARY_COLL_FACTOR) union all
select 'PCF_VIOL_CATEGORY' as column, count(*) from collisions where PCF_VIOL_CATEGORY not in (select key from PCF_VIOL_CATEGORY) union all
select 'TYPE_OF_COLLISION' as column, count(*) from collisions where TYPE_OF_COLLISION not in (select key from TYPE_OF_COLLISION) union all
select 'MVIW' as column, count(*) from collisions where MVIW not in (select key from MVIW) union all
select 'PED_ACTION' as column, count(*) from collisions where PED_ACTION not in (select key from PED_ACTION) union all
select 'ROAD_SURFACE' as column, count(*) from collisions where ROAD_SURFACE not in (select key from ROAD_SURFACE) union all
select 'ROAD_COND_1' as column, count(*) from collisions where ROAD_COND_1 not in (select key from ROAD_COND_1) union all
select 'ROAD_COND_2' as column, count(*) from collisions where ROAD_COND_2 not in (select key from ROAD_COND_2) union all
select 'LIGHTING' as column, count(*) from collisions where LIGHTING not in (select key from LIGHTING) union all
select 'CONTROL_DEVICE' as column, count(*) from collisions where CONTROL_DEVICE not in (select key from CONTROL_DEVICE) union all
select 'STWD_VEHTYPE_AT_FAULT' as column, count(*) from collisions where STWD_VEHTYPE_AT_FAULT not in (select key from STWD_VEHTYPE_AT_FAULT) union all
select 'CHP_VEHTYPE_AT_FAULT' as column, count(*) from collisions where CHP_VEHTYPE_AT_FAULT not in (select key from CHP_VEHTYPE_AT_FAULT) union all
select 'PRIMARY_RAMP' as column, count(*) from collisions where PRIMARY_RAMP not in (select key from PRIMARY_RAMP) union all
select 'SECONDARY_RAMP' as column, count(*) from collisions where SECONDARY_RAMP not in (select key from SECONDARY_RAMP)
Running that here produces:
| column | count(*) |
|---|---|
| DAY_OF_WEEK | 0 |
| WEATHER_1 | 0 |
| WEATHER_2 | 0 |
| LOCATION_TYPE | 0 |
| RAMP_INTERSECTION | 14236 |
| SIDE_OF_HWY | 0 |
| COLLISION_SEVERITY | 0 |
| PRIMARY_COLL_FACTOR | 0 |
| PCF_VIOL_CATEGORY | 287 |
| TYPE_OF_COLLISION | 665 |
| MVIW | 17 |
| PED_ACTION | 0 |
| ROAD_SURFACE | 0 |
| ROAD_COND_1 | 0 |
| ROAD_COND_2 | 0 |
| LIGHTING | 0 |
| CONTROL_DEVICE | 0 |
| STWD_VEHTYPE_AT_FAULT | 0 |
| CHP_VEHTYPE_AT_FAULT | 1629 |
| PRIMARY_RAMP | 0 |
| SECONDARY_RAMP | 5 |
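An alternative to hand-writing that union-all query is SQLite's built-in PRAGMA foreign_key_check, which reports every row violating a declared foreign key regardless of whether enforcement was on at insert time. A minimal sketch (table contents are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Foreign key enforcement is off by default, so the bad row inserts cleanly
conn.executescript(
    """
    CREATE TABLE CHP_VEHTYPE_AT_FAULT (key TEXT PRIMARY KEY, name TEXT);
    INSERT INTO CHP_VEHTYPE_AT_FAULT VALUES ('48', 'example label');
    CREATE TABLE collisions (
        CASE_ID INTEGER PRIMARY KEY,
        CHP_VEHTYPE_AT_FAULT TEXT REFERENCES CHP_VEHTYPE_AT_FAULT(key)
    );
    INSERT INTO collisions VALUES (1, '48'), (2, '- ');
    """
)
# Each violation row: (table, rowid, referenced table, fk index)
violations = conn.execute("PRAGMA foreign_key_check").fetchall()
print(violations)  # [('collisions', 2, 'CHP_VEHTYPE_AT_FAULT', 0)]
```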
Looking at RAMP_INTERSECTION as an example: that table has empty strings instead of nulls for rows that are missing a value. Maybe that's the source of the Datasette bug?
https://rbl-collisions.fly.dev/records/collisions/4750401.json?_shape=array
[
{
"CASE_ID": 4750401,
"COLLISION_DATE": "2012-04-10",
"COLLISION_TIME": 1347,
"OFFICER_ID": "23",
"REPORTING_DISTRICT": "21",
"DAY_OF_WEEK": 2,
"CNTY_CITY_LOC": 103,
"BEAT_NUMBER": "016",
"PRIMARY_RD": "UNIVERSITY AV",
"SECONDARY_RD": "6TH ST",
"DISTANCE": 0,
"DIRECTION": "",
"INTERSECTION": "Y",
"WEATHER_1": "B",
"WEATHER_2": "C",
"STATE_HWY_IND": "N",
"CALTRANS_COUNTY": "",
"CALTRANS_DISTRICT": "",
"STATE_ROUTE": "",
"POSTMILE": "",
"LOCATION_TYPE": "",
"RAMP_INTERSECTION": "",
"SIDE_OF_HWY": "",
"TOW_AWAY": "Y",
"COLLISION_SEVERITY": 4,
"NUMBER_KILLED": 0,
"NUMBER_INJURED": 2,
"PARTY_COUNT": 2,
"PRIMARY_COLL_FACTOR": "A",
"PCF_VIOL_CATEGORY": "12",
"PCF_VIOLATION": 21453,
"PCF_VIOL_SUBSECTION": "A",
"HIT_AND_RUN": "N",
"TYPE_OF_COLLISION": "D",
"MVIW": "C",
"PED_ACTION": "A",
"ROAD_SURFACE": "B",
"ROAD_COND_1": "H",
"ROAD_COND_2": "-",
"LIGHTING": "A",
"CONTROL_DEVICE": "A",
"PEDESTRIAN_ACCIDENT": "",
"BICYCLE_ACCIDENT": "",
"MOTORCYCLE_ACCIDENT": "",
"TRUCK_ACCIDENT": "",
"NOT_PRIVATE_PROPERTY": "Y",
"ALCOHOL_INVOLVED": "",
"STWD_VEHTYPE_AT_FAULT": "J",
"CHP_VEHTYPE_AT_FAULT": "48",
"COUNT_SEVERE_INJ": 0,
"COUNT_VISIBLE_INJ": 0,
"COUNT_COMPLAINT_PAIN": 2,
"COUNT_PED_KILLED": 0,
"COUNT_PED_INJURED": 0,
"COUNT_BICYCLIST_KILLED": 0,
"COUNT_BICYCLIST_INJURED": 0,
"COUNT_MC_KILLED": 0,
"COUNT_MC_INJURED": 0,
"PRIMARY_RAMP": "-",
"SECONDARY_RAMP": "-",
"LATITUDE": "",
"LONGITUDE": "",
"ADDRESS": "UNIVERSITY AV and 6TH ST, Berkeley, CA",
"SEVERITY_INDEX": "4"
}
]
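One subtlety here: empty strings and NULLs behave differently in the counting query above. An empty string is a real value that fails the lookup, while NULL makes the NOT IN test evaluate to NULL, so NULL rows are never counted. A sketch of the difference (illustrative lookup contents):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE RAMP_INTERSECTION (key TEXT PRIMARY KEY, name TEXT);
    INSERT INTO RAMP_INTERSECTION VALUES ('1', 'example label');
    CREATE TABLE collisions (val TEXT);
    INSERT INTO collisions VALUES ('1'), (''), (NULL);
    """
)
count = conn.execute(
    "SELECT count(*) FROM collisions "
    "WHERE val NOT IN (SELECT key FROM RAMP_INTERSECTION)"
).fetchone()[0]
print(count)  # 1 -- only the empty string; the NULL row is skipped
```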
I grabbed a copy of the https://rbl-collisions.fly.dev/records.db DB and ran it locally. Here's where the error message comes from:
I added a breakpoint() to get a debugger at that point, and found that the error is an assertion error:
560 else:
561 new_row.append(cell)
562 await writer.writerow(new_row)
563 except Exception as ex:
564 breakpoint()
565 -> sys.stderr.write("Caught this error: {}\n".format(ex))
566 sys.stderr.flush()
567 await r.write(str(e))
568 return
569 await limited_writer.write(postamble)
570
(Pdb) ex
AssertionError()
(Pdb) type(ex)
<class 'AssertionError'>
On a hunch I applied this diff:
diff --git a/datasette/views/base.py b/datasette/views/base.py
index 0080b33c..000a2e75 100644
--- a/datasette/views/base.py
+++ b/datasette/views/base.py
@@ -554,16 +554,20 @@ async def stream_csv(datasette, fetch_data, request, database):
if cell is None:
new_row.extend(("", ""))
else:
- assert isinstance(cell, dict)
+ assert isinstance(
+ cell, dict
+ ), "cell {} should have been a dict".format(
+ repr(cell)
+ )
new_row.append(cell["value"])
new_row.append(cell["label"])
else:
new_row.append(cell)
await writer.writerow(new_row)
- except Exception as e:
- sys.stderr.write("Caught this error: {}\n".format(e))
+ except Exception as ex:
+ sys.stderr.write("Caught this error: {}\n".format(ex))
sys.stderr.flush()
- await r.write(str(e))
+ await r.write(str(ex))
return
await limited_writer.write(postamble)
And now the error message that is displayed says:
Caught this error: cell '- ' should have been a dict
Using assertions here is bad in the first place, because they are ignored if anyone runs Datasette with the python -O option.
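To illustrate: under -O, Python compiles assert statements away entirely, so the isinstance check would silently vanish. A quick demonstration using subprocesses:

```python
import subprocess
import sys

code = "assert False, 'never checked under -O'"

# Normal run: the assertion fires and the process exits non-zero
normal = subprocess.run([sys.executable, "-c", code], capture_output=True)
# Optimized run: asserts are stripped, so the process exits cleanly
optimized = subprocess.run([sys.executable, "-O", "-c", code], capture_output=True)

print(normal.returncode, optimized.returncode)  # 1 0
```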
So the core bug is here:
But I should clean up the way assertions are used here too.
The row that's breaking looks like this in the debugger:
{'ADDRESS': 'ASHBY AV and CLAIREMONT AV, Berkeley, CA',
'ALCOHOL_INVOLVED': '',
'BEAT_NUMBER': '008',
'BICYCLE_ACCIDENT': '',
'CALTRANS_COUNTY': 'ALA',
'CALTRANS_DISTRICT': 4,
'CASE_ID': 5101393,
'CHP_VEHTYPE_AT_FAULT': '- ',
'CNTY_CITY_LOC': 103,
Note that CHP_VEHTYPE_AT_FAULT is '- ', but that's a column which should be expanded as a foreign key.
SQL query:
select
CHP_VEHTYPE_AT_FAULT, count(*)
from
collisions
where
CHP_VEHTYPE_AT_FAULT not in (
select
key
from
CHP_VEHTYPE_AT_FAULT
)
group by CHP_VEHTYPE_AT_FAULT
Run here as JSON returns:
[
{
"CHP_VEHTYPE_AT_FAULT": "",
"count(*)": 561
},
{
"CHP_VEHTYPE_AT_FAULT": "- ",
"count(*)": 1065
},
{
"CHP_VEHTYPE_AT_FAULT": "33",
"count(*)": 1
},
{
"CHP_VEHTYPE_AT_FAULT": "91",
"count(*)": 1
},
{
"CHP_VEHTYPE_AT_FAULT": "93",
"count(*)": 1
}
]
So I think the root bug here is that Datasette CSV export, with expand labels enabled, fails if any of the rows have a foreign key that doesn't correctly resolve to the other table.
I think the correct behavior here is to output the value column with the value from the table, and leave the label column blank.
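Standalone, that per-cell fallback logic amounts to something like this (a sketch of the idea, not the actual Datasette code):

```python
def expand_cell(cell):
    """Turn one expandable cell into its (value, label) pair for the CSV row."""
    if cell is None:
        return ("", "")  # missing value: blank value and blank label
    if isinstance(cell, dict):
        return (cell["value"], cell["label"])  # foreign key resolved
    # Foreign key did not resolve: keep the raw value, leave the label blank
    return (cell, "")

print(expand_cell({"value": 2, "label": "Tuesday"}))  # (2, 'Tuesday')
print(expand_cell("- "))  # ('- ', '')
print(expand_cell(None))  # ('', '')
```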
I prototyped that up and here's what that export looks like with the CSV opened in Numbers:
Code:
diff --git a/datasette/views/base.py b/datasette/views/base.py
index 0080b33c..9ef403e0 100644
--- a/datasette/views/base.py
+++ b/datasette/views/base.py
@@ -554,16 +554,20 @@ async def stream_csv(datasette, fetch_data, request, database):
if cell is None:
new_row.extend(("", ""))
else:
- assert isinstance(cell, dict)
- new_row.append(cell["value"])
- new_row.append(cell["label"])
+ if not isinstance(
+ cell, dict
+ ):
+ new_row.extend((cell, ""))
+ else:
+ new_row.append(cell["value"])
+ new_row.append(cell["label"])
else:
new_row.append(cell)
await writer.writerow(new_row)
Need test coverage and I'll land this fix.
Wrote this test but it fails when I expect it to pass:
@pytest.mark.asyncio
async def test_table_csv_with_invalid_labels():
# https://github.com/simonw/datasette/issues/2214
ds = Datasette()
await ds.invoke_startup()
db = ds.add_memory_database("db_2214")
await db.execute_write_script(
"""
create table t1 (id integer primary key, name text);
insert into t1 (id, name) values (1, 'one');
insert into t1 (id, name) values (2, 'two');
create table t2 (textid text primary key, value text);
insert into t2 (textid, value) values ('a', 'alpha');
insert into t2 (textid, value) values ('b', 'beta');
create table if not exists maintable (
id integer primary key,
fk_integer integer references t2(id),
fk_text text references t3(textid)
);
insert into maintable (id, fk_integer, fk_text) values (1, 1, 'a');
insert into maintable (id, fk_integer, fk_text) values (2, 3, 'b'); -- invalid fk_integer
insert into maintable (id, fk_integer, fk_text) values (3, 2, 'c'); -- invalid fk_text
"""
)
response = await ds.client.get("/db_2214/maintable.csv?_labels=1")
assert response.status_code == 200
assert response.text == (
"id,fk_integer,fk_integer_label,fk_text,fk_text_label\r\n"
"1,1,one,a,alpha\r\n"
"2,3,,b,beta\r\n"
"3,2,two,c,\r\n"
)
Might be that the original bug only triggers under another circumstance, e.g. if NONE of the columns resolve correctly?
I got it to print cell, and saw this:
{'value': 1, 'label': '1'}
{'value': 'a', 'label': 'a'}
{'value': 3, 'label': '3'}
{'value': 'b', 'label': 'b'}
{'value': 2, 'label': '2'}
{'value': 'c', 'label': 'c'}
It was a bug in my test, now fixed.
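For the record, the REFERENCES clauses in the test schema pointed at the wrong tables: fk_integer at t2(id) and fk_text at a nonexistent t3. Presumably the fix points them at t1(id) and t2(textid); with that schema the two intended violations show up. A hypothetical reconstruction (not the actual committed test):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table t1 (id integer primary key, name text);
    insert into t1 (id, name) values (1, 'one'), (2, 'two');
    create table t2 (textid text primary key, value text);
    insert into t2 (textid, value) values ('a', 'alpha'), ('b', 'beta');
    create table maintable (
        id integer primary key,
        fk_integer integer references t1(id),
        fk_text text references t2(textid)
    );
    insert into maintable values (1, 1, 'a');
    insert into maintable values (2, 3, 'b');  -- invalid fk_integer
    insert into maintable values (3, 2, 'c');  -- invalid fk_text
    """
)
# Only the two intentionally broken rows should be reported
violations = conn.execute("PRAGMA foreign_key_check").fetchall()
print(violations)
```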
I'm going to back-port this to the 0.64.x branch and release the fix.
Fix now available in Datasette 0.64.6: https://docs.datasette.io/en/stable/changelog.html#v0-64-6
I installed datasette-0.64.6 and re-ran my test and was able to download the full export with no errors. Thank you so much for your help, @simonw!
I'm starting this issue without a clear reproduction in case someone else has seen this behavior, and to use the issue as a notebook for research.
I'm using Datasette with the SWITRS data set, which is a California Highway Patrol collection of traffic incident data from the past decade or so. I receive data from them in CSV and want to work with it in Datasette, then export it to CSV for mapping in Felt.com.
Their data makes extensive use of codes for incident column data (1 for Monday, and so on), some of it integer codes and some of it letter/text codes. The text codes are sometimes blank or "-". During import, I'm creating lookup tables for foreign key references to make the Datasette UI presentation of the data easier to read. If I import the data and set up the integer foreign keys, everything works fine, but if I set up the text foreign keys, CSV export starts to fail.
The foreign key configuration is as follows:
You can see the full code and import script here: https://github.com/radical-bike-lobby/switrs-db
If I run this code and then hit the CSV export link in the Datasette interface (the simple link or the "advanced" dialog), export fails after a small number of CSV rows are written. I am not seeing any detailed error messages but this appears in the logging output:
(No other output follows "error:" other than a blank line.) I've stared at the rows directly after the error occurs and can't yet see what is causing the problem. I'm going to set up a development environment to see if I get any more detailed error output, and then stare more at some problematic lines to see if I can find a simple reproduction.