simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0
9.6k stars 690 forks

CSV export with ?_labels=on fails at the first row containing a foreign key that does not resolve correctly #2214

Closed: precipice closed this issue 11 months ago

precipice commented 11 months ago

I'm starting this issue without a clear reproduction in case someone else has seen this behavior, and to use the issue as a notebook for research.

I'm using Datasette with the SWITRS data set, which is a California Highway Patrol collection of traffic incident data from the past decade or so. I receive data from them in CSV and want to work with it in Datasette, then export it to CSV for mapping in Felt.com.

Their data makes extensive use of codes for incident column data (1 for Monday, and so on), some integer codes and some letter/text codes. The text codes are sometimes blank or -. During import, I'm creating lookup tables for foreign key references to make the Datasette UI presentation of the data easier to read.

If I import the data and set up the integer foreign keys, everything works fine, but if I set up the text foreign keys, CSV export starts to fail.

The foreign key configuration is as follows:

# Some tables use integer ids, like sensible tables do. Let's import them first
# since we favor them.

for TABLE in DAY_OF_WEEK CHP_SHIFT POPULATION SPECIAL_COND BEAT_TYPE COLLISION_SEVERITY
do
    sqlite-utils create-table records.db $TABLE id integer name text --pk=id
    sqlite-utils insert records.db $TABLE lookup-tables/$TABLE.csv --csv
    sqlite-utils add-foreign-key records.db collisions $TABLE $TABLE id
    sqlite-utils create-index records.db collisions $TABLE
done

# *Other* tables use letter keys, like they were raised by WOLVES. Let's put them
# at the end of the import queue.

for TABLE in WEATHER_1 WEATHER_2 LOCATION_TYPE RAMP_INTERSECTION SIDE_OF_HWY \
PRIMARY_COLL_FACTOR PCF_CODE_OF_VIOL PCF_VIOL_CATEGORY TYPE_OF_COLLISION MVIW \
PED_ACTION ROAD_SURFACE ROAD_COND_1 ROAD_COND_2 LIGHTING CONTROL_DEVICE \
STWD_VEHTYPE_AT_FAULT CHP_VEHTYPE_AT_FAULT PRIMARY_RAMP SECONDARY_RAMP
do
    sqlite-utils create-table records.db $TABLE key text name text --pk=key
    sqlite-utils insert records.db $TABLE lookup-tables/$TABLE.csv --csv
    sqlite-utils add-foreign-key records.db collisions $TABLE $TABLE key
    sqlite-utils create-index records.db collisions $TABLE
done

You can see the full code and import script here: https://github.com/radical-bike-lobby/switrs-db

If I run this code and then hit the CSV export link in the Datasette interface (the simple link or the "advanced" dialog), export fails after a small number of CSV rows are written. I am not seeing any detailed error messages but this appears in the logging output:

INFO:     127.0.0.1:57885 - "GET /records/collisions.csv?_facet=PRIMARY_RD&PRIMARY_RD=ASHBY+AV&_labels=on&_size=max HTTP/1.1" 200 OK
Caught this error: 

(No other output follows "Caught this error:" other than a blank line.)

I've stared at the rows directly after the error occurs and can't yet see what is causing the problem. I'm going to set up a development environment and see if I get any more detailed error output, and then stare more at some problematic lines to see if I can get a simple reproduction.

precipice commented 11 months ago

If I uncheck expand labels in the Advanced CSV export dialog, the error does not occur. Re-checking that box and re-running the export does cause the error to occur.

(Screenshot: the advanced CSV export dialog, showing the "expand labels" checkbox.)

precipice commented 11 months ago

I tested a bunch of the text-indexed columns and found two that (in combination) cause the export to fail in a fairly minimal case. They are RAMP_INTERSECTION and CHP_VEHTYPE_AT_FAULT, both of which usually hold an int but are sometimes a -.

The table definitions are showing up in Datasette as:

[RAMP_INTERSECTION] TEXT REFERENCES [RAMP_INTERSECTION]([key]),
[CHP_VEHTYPE_AT_FAULT] TEXT REFERENCES [CHP_VEHTYPE_AT_FAULT]([key]),

Those definitions are what I would expect them to be. However, in the CHP_VEHTYPE_AT_FAULT lookup table, single-digit numbers are zero-padded, while in the RAMP_INTERSECTION table they are not. Hmm. I'll also check if the source data zero-pads either of them in use.

I'm going to look at the source data for these cases and then try to produce a much more minimal test case, too.
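One way to confirm that hunch: TEXT keys make '01' and '1' distinct values, so a padding mismatch silently breaks the lookup. A made-up mini-reproduction (table and column names are illustrative, not the real schema):

```python
import sqlite3

# With TEXT keys, a zero-padded lookup table will not match un-padded
# values (and vice versa), so the foreign key fails to resolve.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table lookup (key text primary key, name text);
    insert into lookup values ('01', 'padded one');
    create table facts (id integer primary key, code text references lookup(key));
    insert into facts values (1, '01'), (2, '1');
""")
rows = conn.execute("""
    select facts.id, facts.code, lookup.name
    from facts left join lookup on facts.code = lookup.key
    order by facts.id
""").fetchall()
# Row 1 resolves; row 2's un-padded '1' finds no label.
print(rows)  # [(1, '01', 'padded one'), (2, '1', None)]
```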

simonw commented 11 months ago

Demo here: https://rbl-collisions.fly.dev/records/collisions.csv?_labels=on&PRIMARY_RD=ASHBY+AV - compare with https://rbl-collisions.fly.dev/records/collisions?PRIMARY_RD=ASHBY+AV&_size=max

simonw commented 11 months ago

This could be because of "-" vs. "- " or similar.

simonw commented 11 months ago

Schema of that table:

CREATE TABLE "collisions" (
   [CASE_ID] INTEGER PRIMARY KEY,
   [COLLISION_DATE] TEXT,
   [COLLISION_TIME] INTEGER,
   [OFFICER_ID] TEXT,
   [REPORTING_DISTRICT] TEXT,
   [DAY_OF_WEEK] INTEGER REFERENCES [DAY_OF_WEEK]([id]),
   [CNTY_CITY_LOC] INTEGER,
   [BEAT_NUMBER] TEXT,
   [PRIMARY_RD] TEXT,
   [SECONDARY_RD] TEXT,
   [DISTANCE] FLOAT,
   [DIRECTION] TEXT,
   [INTERSECTION] TEXT,
   [WEATHER_1] TEXT REFERENCES [WEATHER_1]([key]),
   [WEATHER_2] TEXT REFERENCES [WEATHER_2]([key]),
   [STATE_HWY_IND] TEXT,
   [CALTRANS_COUNTY] TEXT,
   [CALTRANS_DISTRICT] INTEGER,
   [STATE_ROUTE] INTEGER,
   [POSTMILE] FLOAT,
   [LOCATION_TYPE] TEXT REFERENCES [LOCATION_TYPE]([key]),
   [RAMP_INTERSECTION] TEXT REFERENCES [RAMP_INTERSECTION]([key]),
   [SIDE_OF_HWY] TEXT REFERENCES [SIDE_OF_HWY]([key]),
   [TOW_AWAY] TEXT,
   [COLLISION_SEVERITY] INTEGER REFERENCES [COLLISION_SEVERITY]([id]),
   [NUMBER_KILLED] INTEGER,
   [NUMBER_INJURED] INTEGER,
   [PARTY_COUNT] INTEGER,
   [PRIMARY_COLL_FACTOR] TEXT REFERENCES [PRIMARY_COLL_FACTOR]([key]),
   [PCF_VIOL_CATEGORY] TEXT REFERENCES [PCF_VIOL_CATEGORY]([key]),
   [PCF_VIOLATION] INTEGER,
   [PCF_VIOL_SUBSECTION] TEXT,
   [HIT_AND_RUN] TEXT,
   [TYPE_OF_COLLISION] TEXT REFERENCES [TYPE_OF_COLLISION]([key]),
   [MVIW] TEXT REFERENCES [MVIW]([key]),
   [PED_ACTION] TEXT REFERENCES [PED_ACTION]([key]),
   [ROAD_SURFACE] TEXT REFERENCES [ROAD_SURFACE]([key]),
   [ROAD_COND_1] TEXT REFERENCES [ROAD_COND_1]([key]),
   [ROAD_COND_2] TEXT REFERENCES [ROAD_COND_2]([key]),
   [LIGHTING] TEXT REFERENCES [LIGHTING]([key]),
   [CONTROL_DEVICE] TEXT REFERENCES [CONTROL_DEVICE]([key]),
   [PEDESTRIAN_ACCIDENT] TEXT,
   [BICYCLE_ACCIDENT] TEXT,
   [MOTORCYCLE_ACCIDENT] TEXT,
   [TRUCK_ACCIDENT] TEXT,
   [NOT_PRIVATE_PROPERTY] TEXT,
   [ALCOHOL_INVOLVED] TEXT,
   [STWD_VEHTYPE_AT_FAULT] TEXT REFERENCES [STWD_VEHTYPE_AT_FAULT]([key]),
   [CHP_VEHTYPE_AT_FAULT] TEXT REFERENCES [CHP_VEHTYPE_AT_FAULT]([key]),
   [COUNT_SEVERE_INJ] INTEGER,
   [COUNT_VISIBLE_INJ] INTEGER,
   [COUNT_COMPLAINT_PAIN] INTEGER,
   [COUNT_PED_KILLED] INTEGER,
   [COUNT_PED_INJURED] INTEGER,
   [COUNT_BICYCLIST_KILLED] INTEGER,
   [COUNT_BICYCLIST_INJURED] INTEGER,
   [COUNT_MC_KILLED] INTEGER,
   [COUNT_MC_INJURED] INTEGER,
   [PRIMARY_RAMP] TEXT REFERENCES [PRIMARY_RAMP]([key]),
   [SECONDARY_RAMP] TEXT REFERENCES [SECONDARY_RAMP]([key]),
   [LATITUDE] FLOAT,
   [LONGITUDE] FLOAT,
   [ADDRESS] TEXT,
   [SEVERITY_INDEX] TEXT
);
simonw commented 11 months ago

Most keys are text, but some are integers:

   [DAY_OF_WEEK] INTEGER REFERENCES [DAY_OF_WEEK]([id]),
   [COLLISION_SEVERITY] INTEGER REFERENCES [COLLISION_SEVERITY]([id]),

Those look OK to me:

CREATE TABLE [DAY_OF_WEEK] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT
);

CREATE TABLE [COLLISION_SEVERITY] (
   [id] INTEGER PRIMARY KEY,
   [name] TEXT
);

Both of those tables have only integer keys.

precipice commented 11 months ago

If it's useful, you can see the integer versus text key split here:

https://github.com/radical-bike-lobby/switrs-db/blob/main/import-bicycle-crashes.sh#L129

simonw commented 11 months ago

Generated this query to try and spot columns that have missing key references:

select 'DAY_OF_WEEK' as column, count(*) from collisions where DAY_OF_WEEK not in (select id from DAY_OF_WEEK) union all
select 'WEATHER_1' as column, count(*) from collisions where WEATHER_1 not in (select key from WEATHER_1) union all
select 'WEATHER_2' as column, count(*) from collisions where WEATHER_2 not in (select key from WEATHER_2) union all
select 'LOCATION_TYPE' as column, count(*) from collisions where LOCATION_TYPE not in (select key from LOCATION_TYPE) union all
select 'RAMP_INTERSECTION' as column, count(*) from collisions where RAMP_INTERSECTION not in (select key from RAMP_INTERSECTION) union all
select 'SIDE_OF_HWY' as column, count(*) from collisions where SIDE_OF_HWY not in (select key from SIDE_OF_HWY) union all
select 'COLLISION_SEVERITY' as column, count(*) from collisions where COLLISION_SEVERITY not in (select id from COLLISION_SEVERITY) union all
select 'PRIMARY_COLL_FACTOR' as column, count(*) from collisions where PRIMARY_COLL_FACTOR not in (select key from PRIMARY_COLL_FACTOR) union all
select 'PCF_VIOL_CATEGORY' as column, count(*) from collisions where PCF_VIOL_CATEGORY not in (select key from PCF_VIOL_CATEGORY) union all
select 'TYPE_OF_COLLISION' as column, count(*) from collisions where TYPE_OF_COLLISION not in (select key from TYPE_OF_COLLISION) union all
select 'MVIW' as column, count(*) from collisions where MVIW not in (select key from MVIW) union all
select 'PED_ACTION' as column, count(*) from collisions where PED_ACTION not in (select key from PED_ACTION) union all
select 'ROAD_SURFACE' as column, count(*) from collisions where ROAD_SURFACE not in (select key from ROAD_SURFACE) union all
select 'ROAD_COND_1' as column, count(*) from collisions where ROAD_COND_1 not in (select key from ROAD_COND_1) union all
select 'ROAD_COND_2' as column, count(*) from collisions where ROAD_COND_2 not in (select key from ROAD_COND_2) union all
select 'LIGHTING' as column, count(*) from collisions where LIGHTING not in (select key from LIGHTING) union all
select 'CONTROL_DEVICE' as column, count(*) from collisions where CONTROL_DEVICE not in (select key from CONTROL_DEVICE) union all
select 'STWD_VEHTYPE_AT_FAULT' as column, count(*) from collisions where STWD_VEHTYPE_AT_FAULT not in (select key from STWD_VEHTYPE_AT_FAULT) union all
select 'CHP_VEHTYPE_AT_FAULT' as column, count(*) from collisions where CHP_VEHTYPE_AT_FAULT not in (select key from CHP_VEHTYPE_AT_FAULT) union all
select 'PRIMARY_RAMP' as column, count(*) from collisions where PRIMARY_RAMP not in (select key from PRIMARY_RAMP) union all
select 'SECONDARY_RAMP' as column, count(*) from collisions where SECONDARY_RAMP not in (select key from SECONDARY_RAMP)

Running that here produces:

column count(*)
DAY_OF_WEEK 0
WEATHER_1 0
WEATHER_2 0
LOCATION_TYPE 0
RAMP_INTERSECTION 14236
SIDE_OF_HWY 0
COLLISION_SEVERITY 0
PRIMARY_COLL_FACTOR 0
PCF_VIOL_CATEGORY 287
TYPE_OF_COLLISION 665
MVIW 17
PED_ACTION 0
ROAD_SURFACE 0
ROAD_COND_1 0
ROAD_COND_2 0
LIGHTING 0
CONTROL_DEVICE 0
STWD_VEHTYPE_AT_FAULT 0
CHP_VEHTYPE_AT_FAULT 1629
PRIMARY_RAMP 0
SECONDARY_RAMP 5
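A query like that can be generated rather than hand-written. A sketch, using a made-up subset of the (column, key-column) pairs from the schema:

```python
# Generate a UNION ALL query counting unresolved foreign key values,
# one SELECT per (column, key-column) pair. The pairs below are a
# small illustrative subset of the real schema.
COLUMNS = [
    ("DAY_OF_WEEK", "id"),
    ("WEATHER_1", "key"),
    ("CHP_VEHTYPE_AT_FAULT", "key"),
]

def unresolved_counts_sql(columns):
    parts = [
        "select '{col}' as column, count(*) from collisions "
        "where {col} not in (select {key} from {col})".format(col=col, key=key)
        for col, key in columns
    ]
    return " union all\n".join(parts)

print(unresolved_counts_sql(COLUMNS))
```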
simonw commented 11 months ago

Looking at RAMP_INTERSECTION as an example - that table has empty strings instead of nulls for rows that are missing a value. Maybe that's the source of the Datasette bug?

https://rbl-collisions.fly.dev/records/collisions/4750401.json?_shape=array

[
  {
    "CASE_ID": 4750401,
    "COLLISION_DATE": "2012-04-10",
    "COLLISION_TIME": 1347,
    "OFFICER_ID": "23",
    "REPORTING_DISTRICT": "21",
    "DAY_OF_WEEK": 2,
    "CNTY_CITY_LOC": 103,
    "BEAT_NUMBER": "016",
    "PRIMARY_RD": "UNIVERSITY AV",
    "SECONDARY_RD": "6TH ST",
    "DISTANCE": 0,
    "DIRECTION": "",
    "INTERSECTION": "Y",
    "WEATHER_1": "B",
    "WEATHER_2": "C",
    "STATE_HWY_IND": "N",
    "CALTRANS_COUNTY": "",
    "CALTRANS_DISTRICT": "",
    "STATE_ROUTE": "",
    "POSTMILE": "",
    "LOCATION_TYPE": "",
    "RAMP_INTERSECTION": "",
    "SIDE_OF_HWY": "",
    "TOW_AWAY": "Y",
    "COLLISION_SEVERITY": 4,
    "NUMBER_KILLED": 0,
    "NUMBER_INJURED": 2,
    "PARTY_COUNT": 2,
    "PRIMARY_COLL_FACTOR": "A",
    "PCF_VIOL_CATEGORY": "12",
    "PCF_VIOLATION": 21453,
    "PCF_VIOL_SUBSECTION": "A",
    "HIT_AND_RUN": "N",
    "TYPE_OF_COLLISION": "D",
    "MVIW": "C",
    "PED_ACTION": "A",
    "ROAD_SURFACE": "B",
    "ROAD_COND_1": "H",
    "ROAD_COND_2": "-",
    "LIGHTING": "A",
    "CONTROL_DEVICE": "A",
    "PEDESTRIAN_ACCIDENT": "",
    "BICYCLE_ACCIDENT": "",
    "MOTORCYCLE_ACCIDENT": "",
    "TRUCK_ACCIDENT": "",
    "NOT_PRIVATE_PROPERTY": "Y",
    "ALCOHOL_INVOLVED": "",
    "STWD_VEHTYPE_AT_FAULT": "J",
    "CHP_VEHTYPE_AT_FAULT": "48",
    "COUNT_SEVERE_INJ": 0,
    "COUNT_VISIBLE_INJ": 0,
    "COUNT_COMPLAINT_PAIN": 2,
    "COUNT_PED_KILLED": 0,
    "COUNT_PED_INJURED": 0,
    "COUNT_BICYCLIST_KILLED": 0,
    "COUNT_BICYCLIST_INJURED": 0,
    "COUNT_MC_KILLED": 0,
    "COUNT_MC_INJURED": 0,
    "PRIMARY_RAMP": "-",
    "SECONDARY_RAMP": "-",
    "LATITUDE": "",
    "LONGITUDE": "",
    "ADDRESS": "UNIVERSITY AV and 6TH ST, Berkeley, CA",
    "SEVERITY_INDEX": "4"
  }
]
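The failure mode can be sketched in a few lines of Python. The table and key names mirror the real schema, but the data here is made up:

```python
import sqlite3

# A TEXT foreign key whose value ('' or '- ') has no matching row in
# the lookup table gives label expansion nothing to resolve.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table RAMP_INTERSECTION (key text primary key, name text);
    insert into RAMP_INTERSECTION values ('1', 'Ramp exit');
    create table collisions (
        CASE_ID integer primary key,
        RAMP_INTERSECTION text references RAMP_INTERSECTION(key)
    );
    insert into collisions values (1, '1'), (2, ''), (3, '- ');
""")
rows = conn.execute("""
    select c.CASE_ID, c.RAMP_INTERSECTION, r.name
    from collisions c
    left join RAMP_INTERSECTION r on c.RAMP_INTERSECTION = r.key
    order by c.CASE_ID
""").fetchall()
# Only CASE_ID 1 gets a label; '' and '- ' come back unlabelled.
print(rows)
```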
simonw commented 11 months ago

I grabbed a copy of the https://rbl-collisions.fly.dev/records.db DB and ran it locally. Here's where the error message comes from:

https://github.com/simonw/datasette/blob/4284c74bc133ab494bf4b6dcd4a20b97b05ebb83/datasette/views/base.py#L553-L568

I added a breakpoint() to get a debugger at that point, and found that the error is an assertion error:

560                                 else:
561                                     new_row.append(cell)
562                             await writer.writerow(new_row)
563                 except Exception as ex:
564                     breakpoint()
565  ->                 sys.stderr.write("Caught this error: {}\n".format(ex))
566                     sys.stderr.flush()
567                     await r.write(str(e))
568                     return
569             await limited_writer.write(postamble)
570     
(Pdb) ex
AssertionError()
(Pdb) type(ex)
<class 'AssertionError'>
simonw commented 11 months ago

On a hunch I applied this diff:

diff --git a/datasette/views/base.py b/datasette/views/base.py
index 0080b33c..000a2e75 100644
--- a/datasette/views/base.py
+++ b/datasette/views/base.py
@@ -554,16 +554,20 @@ async def stream_csv(datasette, fetch_data, request, database):
                                 if cell is None:
                                     new_row.extend(("", ""))
                                 else:
-                                    assert isinstance(cell, dict)
+                                    assert isinstance(
+                                        cell, dict
+                                    ), "cell {} should have been a dict".format(
+                                        repr(cell)
+                                    )
                                     new_row.append(cell["value"])
                                     new_row.append(cell["label"])
                             else:
                                 new_row.append(cell)
                         await writer.writerow(new_row)
-            except Exception as e:
-                sys.stderr.write("Caught this error: {}\n".format(e))
+            except Exception as ex:
+                sys.stderr.write("Caught this error: {}\n".format(ex))
                 sys.stderr.flush()
-                await r.write(str(e))
+                await r.write(str(ex))
                 return
         await limited_writer.write(postamble)

And now the error message that is displayed says:

Caught this error: cell '- ' should have been a dict
simonw commented 11 months ago

Using assertions here is bad in the first place, because they would be ignored if anyone ran Datasette with the python -O option.
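For comparison, a sketch of the assert-based check versus an explicit exception. Function names here are illustrative, not Datasette's actual API:

```python
# Asserts vanish under `python -O`, so validation that matters at
# runtime should raise explicitly instead of asserting.
def expand_cell_with_assert(cell):
    assert isinstance(cell, dict)  # silently skipped under -O
    return (cell["value"], cell["label"])

def expand_cell_explicit(cell):
    # Survives -O and produces a useful message.
    if not isinstance(cell, dict):
        raise TypeError("cell {!r} should have been a dict".format(cell))
    return (cell["value"], cell["label"])
```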

simonw commented 11 months ago

So the core bug is here:

https://github.com/simonw/datasette/blob/4284c74bc133ab494bf4b6dcd4a20b97b05ebb83/datasette/views/base.py#L546-L562

But I should clean up the way assertions are used here too.

simonw commented 11 months ago

The row that's breaking looks like this in the debugger:

{'ADDRESS': 'ASHBY AV and CLAIREMONT AV, Berkeley, CA',
 'ALCOHOL_INVOLVED': '',
 'BEAT_NUMBER': '008',
 'BICYCLE_ACCIDENT': '',
 'CALTRANS_COUNTY': 'ALA',
 'CALTRANS_DISTRICT': 4,
 'CASE_ID': 5101393,
 'CHP_VEHTYPE_AT_FAULT': '- ',
 'CNTY_CITY_LOC': 103,

Note that CHP_VEHTYPE_AT_FAULT is '- ', but that's a column which should be expanded as a foreign key.

SQL query:

select
  CHP_VEHTYPE_AT_FAULT, count(*)
from
  collisions
where
  CHP_VEHTYPE_AT_FAULT not in (
    select
      key
    from
      CHP_VEHTYPE_AT_FAULT
  )
group by CHP_VEHTYPE_AT_FAULT

Run here as JSON returns:

[
  {
    "CHP_VEHTYPE_AT_FAULT": "",
    "count(*)": 561
  },
  {
    "CHP_VEHTYPE_AT_FAULT": "- ",
    "count(*)": 1065
  },
  {
    "CHP_VEHTYPE_AT_FAULT": "33",
    "count(*)": 1
  },
  {
    "CHP_VEHTYPE_AT_FAULT": "91",
    "count(*)": 1
  },
  {
    "CHP_VEHTYPE_AT_FAULT": "93",
    "count(*)": 1
  }
]
simonw commented 11 months ago

So I think the root bug here is that Datasette CSV export, with expand labels enabled, fails if any of the rows have a foreign key that doesn't correctly resolve to the other table.

simonw commented 11 months ago

I think the correct behavior here is to output the value column with the value from the table, and leave the label column blank.

I prototyped that up and here's what that export looks like with the CSV opened in Numbers:

(Screenshot: the exported CSV opened in Numbers, with blank label columns where keys did not resolve.)

Code:

diff --git a/datasette/views/base.py b/datasette/views/base.py
index 0080b33c..9ef403e0 100644
--- a/datasette/views/base.py
+++ b/datasette/views/base.py
@@ -554,16 +554,20 @@ async def stream_csv(datasette, fetch_data, request, database):
                                 if cell is None:
                                     new_row.extend(("", ""))
                                 else:
-                                    assert isinstance(cell, dict)
-                                    new_row.append(cell["value"])
-                                    new_row.append(cell["label"])
+                                    if not isinstance(
+                                        cell, dict
+                                    ):
+                                        new_row.extend((cell, ""))
+                                    else:
+                                        new_row.append(cell["value"])
+                                        new_row.append(cell["label"])
                             else:
                                 new_row.append(cell)
                         await writer.writerow(new_row)
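A standalone sketch of that fallback logic, with illustrative names rather than Datasette's actual function:

```python
# When a cell that should be an expanded foreign key isn't a dict
# (the key failed to resolve), emit the raw value with a blank label
# instead of asserting.
def expand_row(row, expandable):
    """row: list of cells; expandable: parallel list of bools marking
    which columns are label-expanded foreign keys."""
    new_row = []
    for cell, expand in zip(row, expandable):
        if not expand:
            new_row.append(cell)
        elif cell is None:
            new_row.extend(("", ""))
        elif isinstance(cell, dict):
            new_row.extend((cell["value"], cell["label"]))
        else:
            new_row.extend((cell, ""))  # unresolved key: value, blank label
    return new_row

print(expand_row([5101393, {"value": "2", "label": "Tuesday"}, "- "],
                 [False, True, True]))
# [5101393, '2', 'Tuesday', '- ', '']
```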
simonw commented 11 months ago

Need test coverage and I'll land this fix.

simonw commented 11 months ago

Wrote this test but it fails when I expect it to pass:

@pytest.mark.asyncio
async def test_table_csv_with_invalid_labels():
    # https://github.com/simonw/datasette/issues/2214
    ds = Datasette()
    await ds.invoke_startup()
    db = ds.add_memory_database("db_2214")
    await db.execute_write_script(
        """
        create table t1 (id integer primary key, name text);
        insert into t1 (id, name) values (1, 'one');
        insert into t1 (id, name) values (2, 'two');
        create table t2 (textid text primary key, value text);
        insert into t2 (textid, value) values ('a', 'alpha');
        insert into t2 (textid, value) values ('b', 'beta');
        create table if not exists maintable (
            id integer primary key,
            fk_integer integer references t2(id),
            fk_text text references t3(textid)
        );
        insert into maintable (id, fk_integer, fk_text) values (1, 1, 'a');
        insert into maintable (id, fk_integer, fk_text) values (2, 3, 'b'); -- invalid fk_integer
        insert into maintable (id, fk_integer, fk_text) values (3, 2, 'c'); -- invalid fk_text
    """
    )
    response = await ds.client.get("/db_2214/maintable.csv?_labels=1")
    assert response.status_code == 200
    assert response.text == (
        "id,fk_integer,fk_integer_label,fk_text,fk_text_label\r\n"
        "1,1,one,a,alpha\r\n"
        "2,3,,b,beta\r\n"
        "3,2,two,c,\r\n"
    )

It might be that the original bug only triggers in some other circumstance, e.g. if NONE of the columns resolve correctly?

simonw commented 11 months ago

I got it to print cell, and saw this:

{'value': 1, 'label': '1'}
{'value': 'a', 'label': 'a'}
{'value': 3, 'label': '3'}
{'value': 'b', 'label': 'b'}
{'value': 2, 'label': '2'}
{'value': 'c', 'label': 'c'}

It was a bug in my test, now fixed.

simonw commented 11 months ago

I'm going to back-port this to the 0.64.x branch and release the fix.

simonw commented 11 months ago

Fix now available in Datasette 0.64.6: https://docs.datasette.io/en/stable/changelog.html#v0-64-6

precipice commented 11 months ago

I installed datasette-0.64.6 and re-ran my test and was able to download the full export with no errors. Thank you so much for your help, @simonw!