simonw / datasette-app-support

Part of https://github.com/simonw/datasette-app

Importing this CSV from a URL only gets 13 rows, not 596 #23

Open simonw opened 2 years ago

simonw commented 2 years ago

https://raw.githubusercontent.com/okfn/dataportals.org/master/data/portals.csv

The "Open CSV from URL..." menu option only produced 13 rows - but using sqlite-utils insert portals.db portals portals.csv --csv on the command-line got all 596.

simonw commented 2 years ago

Still a bug against latest Datasette Desktop release.

simonw commented 2 years ago

Here's how that CSV file starts:

image

And in Datasette the data cuts off here:

image

Which is right where the first double-newline paragraph break in that CSV file occurs.

simonw commented 2 years ago

This is a datasette-app-support problem, moving the issue there.

simonw commented 2 years ago

I'm suspicious of this code: https://github.com/simonw/datasette-app-support/blob/d130884bee3db2b170c661340ca250d8b95d2cfc/datasette_app_support/utils.py#L69-L71

Maybe that AsyncDictReader(response.aiter_lines()) pattern can't cope with CSV files that include their own double-quoted newlines?
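To illustrate the suspicion, here is a minimal sketch using only the standard library. It does not reproduce the actual AsyncDictReader code; `parse_lines_independently`, `parse_whole` and `SAMPLE` are hypothetical names made up for this example. It compares parsing each physical line on its own against handing the whole text to `csv.reader`, for a CSV whose quoted field contains a blank line:

```python
import csv
import io

# Hypothetical sample: the second record has a quoted field with a blank line in it.
SAMPLE = (
    'name,description\r\n'
    'alpha,"first paragraph\r\n'
    '\r\n'
    'second paragraph"\r\n'
    'beta,plain\r\n'
)


def parse_lines_independently(text):
    """Parse every physical line as if it were a complete CSV record."""
    rows = []
    for line in text.splitlines():
        # Each call sees one line only, so a quoted field can never span lines.
        rows.extend(csv.reader([line]))
    return rows


def parse_whole(text):
    """Let csv.reader see the full stream so quoted newlines stay inside fields."""
    return list(csv.reader(io.StringIO(text, newline="")))


naive = parse_lines_independently(SAMPLE)
proper = parse_whole(SAMPLE)
print(len(naive), "rows when each line is parsed alone")
print(len(proper), "rows when the csv module sees the whole text")
```

The line-at-a-time version splits the quoted field into several broken rows, which is at least consistent with an import going wrong at the first paragraph break.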

simonw commented 2 years ago

This code is also relevant:

https://github.com/simonw/datasette-app-support/blob/d130884bee3db2b170c661340ca250d8b95d2cfc/datasette_app_support/utils.py#L17-L49

simonw commented 2 years ago

Here are my notes from when I wrote that AsyncDictReader class: https://github.com/simonw/datasette-app-support/issues/14#issuecomment-917693618

simonw commented 2 years ago

Maybe AsyncDictReader.__anext__() needs to be smart enough to watch out for unbalanced double quotes and consume another line if it spots one?
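A rough sketch of that idea, not the real AsyncDictReader: `logical_records`, `fake_lines` and `collect` are hypothetical helpers, and `fake_lines` stands in for whatever async line iterator the HTTP client provides. It assumes the default `"` quote character with quotes escaped by doubling, so a complete record always contains an even number of double quotes:

```python
import asyncio
import csv


async def logical_records(lines):
    """Join physical lines until the double quotes seen so far are balanced."""
    pending = []
    async for line in lines:
        # Strip any line ending; joined lines are re-separated with "\n",
        # so the original line endings inside quoted fields are not preserved.
        pending.append(line.rstrip("\r\n"))
        candidate = "\n".join(pending)
        if candidate.count('"') % 2 == 0:
            yield candidate
            pending = []
    if pending:
        # Unbalanced quotes at end of input: hand back what is left.
        yield "\n".join(pending)


async def fake_lines():
    """Stand-in for a streamed response, one physical line at a time."""
    for line in [
        'name,description',
        'alpha,"first paragraph',
        '',
        'second paragraph"',
        'beta,plain',
    ]:
        yield line


async def collect(lines):
    """Parse each complete record with the csv module and return all rows."""
    rows = []
    async for record in logical_records(lines):
        rows.extend(csv.reader([record]))
    return rows


rows = asyncio.run(collect(fake_lines()))
print(rows)
```

Counting quotes is a heuristic: it would break for dialects that use an escape character or a different quote character, which is part of why reimplementing CSV rules outside the csv module feels fragile.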

simonw commented 2 years ago

https://github.com/MKuranowski/aiocsv may be able to handle this for me.

simonw commented 2 years ago

aiocsv is designed to work with an aiofiles object with a .read() coroutine - I'm not sure how best to map that to an httpx streaming response.
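One possible mapping, sketched with the standard library only: a small adapter that exposes a `.read()` coroutine on top of any async iterator of text chunks. `AsyncTextReader` and `fake_chunks` are hypothetical names. Using httpx's `response.aiter_text()` as the chunk source, and passing the adapter to aiocsv, are both untested assumptions here:

```python
import asyncio


class AsyncTextReader:
    """Wrap an async iterator of str chunks behind an async read(size) method."""

    def __init__(self, chunks):
        self._chunks = chunks.__aiter__()
        self._buffer = ""
        self._exhausted = False

    async def read(self, size=-1):
        # Pull chunks until the buffer can satisfy the request or input ends.
        while not self._exhausted and (size < 0 or len(self._buffer) < size):
            try:
                self._buffer += await self._chunks.__anext__()
            except StopAsyncIteration:
                self._exhausted = True
        if size < 0:
            out, self._buffer = self._buffer, ""
        else:
            out, self._buffer = self._buffer[:size], self._buffer[size:]
        # An empty string signals end of input, as with file objects.
        return out


async def fake_chunks():
    """Stand-in for a streamed response body arriving in arbitrary pieces."""
    for chunk in ['name,desc', 'ription\nalpha,"first\n', '\nsecond"\n']:
        yield chunk


async def read_all_in_pieces(size):
    """Drain the adapter with fixed-size reads and return the pieces."""
    reader = AsyncTextReader(fake_chunks())
    pieces = []
    while True:
        piece = await reader.read(size)
        if not piece:
            break
        pieces.append(piece)
    return pieces


pieces = asyncio.run(read_all_in_pieces(8))
print("".join(pieces))
```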

simonw commented 2 years ago

I'm beginning to think it would be better for the app to either suck the entire CSV file into memory OR to save it to a temporary file on disk, then read it into a table. Much simpler that way - this problem with newlines has made me very suspicious of importers that don't directly use csv as it was intended to be used.

simonw commented 2 years ago

I'm going to go with the memory option. Datasette Desktop runs on Macs with a decent amount of RAM, and with swap.
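A minimal sketch of what the memory option could look like, assuming the whole CSV body is already available as a string (with httpx that would presumably be `response.text`, which is not shown here). `load_csv_text` and `SAMPLE` are hypothetical names, the CSV is assumed to have a header row, and this is not the code that ended up in datasette-app-support:

```python
import csv
import io
import sqlite3

# Hypothetical sample with a blank line inside a quoted field.
SAMPLE = (
    'name,description\r\n'
    'alpha,"first paragraph\r\n'
    '\r\n'
    'second paragraph"\r\n'
    'beta,plain\r\n'
)


def quote_identifier(name):
    """Quote a table or column name for use in SQLite SQL."""
    return '"{}"'.format(name.replace('"', '""'))


def load_csv_text(conn, table, text):
    """Parse CSV text held in memory and insert every row into a new table."""
    # newline="" leaves line endings untouched so csv can handle quoted newlines.
    reader = csv.DictReader(io.StringIO(text, newline=""))
    columns = reader.fieldnames or []
    column_sql = ", ".join(quote_identifier(c) for c in columns)
    conn.execute("CREATE TABLE {} ({})".format(quote_identifier(table), column_sql))
    placeholders = ", ".join("?" for _ in columns)
    rows = [[row[c] for c in columns] for row in reader]
    conn.executemany(
        "INSERT INTO {} VALUES ({})".format(quote_identifier(table), placeholders),
        rows,
    )
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = load_csv_text(conn, "portals", SAMPLE)
print(inserted, "rows inserted")
```

Because the csv module sees the full text, the quoted paragraph break stays inside a single field instead of ending the import early.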