The R code tipped me off to some very interesting HTTP headers:
~ % curl -s -i 'https://soda.demo.socrata.com/resource/4334-bgaj.csv' | more
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 08 Jun 2022 15:27:50 GMT
Content-Type: text/csv; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Access-Control-Allow-Origin: *
ETag: W/"YWxwaGEuNDQ0ODVfM180ODlGM3lneE10dkVrN2xyZDZaOTA5bVc1MGMtTQ---gzityd_r99IXNT3udfjmA3HF-r_x3Y--gzip--gzip"
X-SODA2-Fields: ["source","earthquake_id","version","datetime","magnitude","depth","number_of_stations","region","location"]
X-SODA2-Types: ["text","text","text","floating_timestamp","number","number","number","text","location"]
X-SODA2-Data-Out-Of-Date: false
X-SODA2-Truth-Last-Modified: Thu, 28 Mar 2019 05:43:31 GMT
X-SODA2-Secondary-Last-Modified: Thu, 28 Mar 2019 05:43:31 GMT
Last-Modified: Thu, 28 Mar 2019 05:43:31 GMT
Vary: Accept-Encoding
Age: 0
X-Socrata-Region: aws-us-east-1-fedramp-prod
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Socrata-RequestId: 7fa7ab1669297e734b4ed4d23212acac
"source","earthquake_id","version","datetime","magnitude","depth","number_of_stations","region","location"
"uu","09101857","2","2012-09-10T18:57:30.000","2.2","0.90","12","Utah","(37.1587, -112.7942)"
"ak","10555601","2","2012-09-10T13:16:13.000","1.1","11.60","10","Central Alaska","(63.1085, -151.4938)"
These in particular:
X-SODA2-Fields: ["source","earthquake_id","version","datetime","magnitude","depth","number_of_stations","region","location"]
X-SODA2-Types: ["text","text","text","floating_timestamp","number","number","number","text","location"]
I can use those to import from the CSV export - which should give me all of the data in one go without needing to paginate - while still storing the data in SQLite using the correct types.
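Here's a rough sketch (not the plugin's actual code) of how those headers could be turned into a column-to-type mapping - the Socrata-to-Python mapping and the helper name are my own assumptions:

```python
# Sketch only: read the X-SODA2-Fields / X-SODA2-Types headers from the CSV
# endpoint and derive a column -> Python type mapping. The mapping is an
# assumption, not something defined by Socrata.
import json
import httpx

SOCRATA_TO_PYTHON = {
    "number": float,           # Socrata only has a single "number" type
    "floating_timestamp": str,
    "text": str,
    "location": str,
}


def columns_from_headers(csv_url):
    # Open a streaming GET so the headers can be read without downloading
    # the (potentially huge) body
    with httpx.stream("GET", csv_url) as response:
        fields = json.loads(response.headers["X-SODA2-Fields"])
        types = json.loads(response.headers["X-SODA2-Types"])
    return {
        field: SOCRATA_TO_PYTHON.get(type_, str)
        for field, type_ in zip(fields, types)
    }
```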
I'm going to have this plugin offer a CLI interface as well, partly to exercise that plugin hook:
datasette socrata import data.db xxxx-xxxx -t table-name
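As a sketch, that could be wired up with Datasette's register_commands() plugin hook - the exact command layout here is illustrative, not the final interface:

```python
# Sketch only: register a "datasette socrata import ..." command via the
# register_commands() plugin hook. Names and options are illustrative.
import click
from datasette import hookimpl


@hookimpl
def register_commands(cli):
    @cli.group()
    def socrata():
        "Commands for importing Socrata datasets"

    @socrata.command(name="import")
    @click.argument("database")
    @click.argument("dataset_id")
    @click.option("-t", "--table", help="Table to import into")
    def import_(database, dataset_id, table):
        "Import a Socrata dataset into a SQLite database"
        # ... fetch the CSV export and write it into the database here
```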
https://stackoverflow.com/questions/24745820/socrata-get-data-types-of-dataset-columns has a note from 2014 that says "I'm checking with our engineering team to see whether X-SODA2-Types is really deprecated, as per the warning." - which is a little bit alarming!
https://data.lacity.org/api/id/yv23-pmwf.json returns 1000 JSON records. https://data.lacity.org/api/id/yv23-pmwf.csv also only returned 1000, which is worrying.
Also from that post: https://data.lacity.org/api/views/yv23-pmwf.json gives the metadata for that dataset.
Interesting to note that the metadata includes detailed column information:
"columns": [
{
"id": 422920034,
"name": "Assessor Book",
"dataTypeName": "text",
"description": "\"Book\" portion of the Los Angeles County Tax Assessor \"Book-Page-Parcel\" number.",
"fieldName": "assessor_book",
"position": 1,
"renderTypeName": "text",
I could use that to populate Datasette's own column metadata feature.
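A sketch of what that could look like - pulling the descriptions out into the "columns" block that Datasette table metadata accepts (the database and table names in the comment are placeholders):

```python
# Sketch: convert the Socrata "columns" array into per-column descriptions
# for Datasette's column metadata feature.
def column_descriptions(socrata_metadata):
    return {
        column["fieldName"]: column["description"]
        for column in socrata_metadata["columns"]
        if column.get("description")
    }


# Which could then be dropped into table metadata, e.g.
# {"databases": {"data": {"tables": {"permits": {"columns": {...}}}}}}
```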
https://data.lacity.org/City-Infrastructure-Service-Requests/Building-and-Safety-Permit-Information-Old/yv23-pmwf claims 1.47 million rows.
This URL DOES return them all - but the headers do not include those column types:
% curl -i 'https://data.lacity.org/api/views/yv23-pmwf/rows.csv' | more
Server: nginx
Date: Wed, 08 Jun 2022 15:43:38 GMT
Content-Type: text/csv; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Access-Control-Allow-Origin: *
Content-disposition: attachment; filename=Building_and_Safety_Permit_Information_Old.csv
Cache-Control: public, must-revalidate, max-age=21600
ETag: W/"YWxwaGEuMTE0NzZfMTFfMzYyMzJZMzJCY2tJeGw3bHhKWndGMEo2QjJxTkk2Sm8---gziprbqvNybkDQANU177Xuw32obgAxg--gzip--gzip"
X-SODA2-Data-Out-Of-Date: false
X-SODA2-Truth-Last-Modified: Mon, 06 Jun 2022 09:30:01 GMT
X-SODA2-Secondary-Last-Modified: Mon, 06 Jun 2022 09:30:01 GMT
Last-Modified: Mon, 06 Jun 2022 09:30:01 GMT
Vary: Accept-Encoding
Age: 0
X-Socrata-Region: aws-us-east-1-fedramp-prod
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Socrata-RequestId: 64bec141c81631817be15e3e9f294a29
So maybe I need to do a hit to get the column types, then a second hit to fetch the actual data.
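A sketch of that two-request flow (the helper name is hypothetical, and in practice the CSV would be streamed rather than loaded into memory):

```python
# Sketch: one request for the dataset metadata (which includes column
# details), then a second request for the full rows.csv export.
import csv
import io
import httpx


def fetch_dataset(domain, dataset_id):
    metadata = httpx.get(
        "https://{}/api/views/{}.json".format(domain, dataset_id)
    ).json()
    response = httpx.get(
        "https://{}/api/views/{}/rows.csv".format(domain, dataset_id),
        follow_redirects=True,
    )
    rows = list(csv.DictReader(io.StringIO(response.text)))
    return metadata, rows
```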
Here's a tool that uses that rows.csv endpoint to import all of the data for a table, in this case into Data.World:
This is frustrating: https://dev.socrata.com/docs/datatypes/ shows only one data type for numbers, called number - so it looks like there's no distinction in the Socrata world between integers and floating point values.
But I really want to be able to store data as integer columns where it makes sense to do so - I don't want to use floating point for columns that never actually need it.
... on that basis, I'm not going to use the type information. I'll use the "detect types" mechanism in sqlite-utils instead: https://github.com/simonw/sqlite-utils/blob/9fedfc69d7239ac49900051e1c48ee9cdd470d9e/sqlite_utils/utils.py#L222-L235
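That mechanism is exposed as TypeTracker, used roughly like this (the sample rows here just stand in for the parsed CSV):

```python
# Sketch: insert everything as text, let sqlite-utils watch the values,
# then transform the table to the detected types afterwards.
import sqlite_utils
from sqlite_utils.utils import TypeTracker

rows = [
    {"magnitude": "2.2", "depth": "0.90", "region": "Utah"},
    {"magnitude": "1.1", "depth": "11.60", "region": "Central Alaska"},
]

db = sqlite_utils.Database("data.db")
tracker = TypeTracker()
db["earthquakes"].insert_all(tracker.wrap(rows))
# Integers stay integers, floats become floats, everything else stays text
db["earthquakes"].transform(types=tracker.types)
```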
That's annoying: the CSV export doesn't include a Content-Length header, so it's not possible to show a progress bar OR to check ahead of time that the user has enough space on their instance before the download starts.
~ % curl -i https://data.bayareametro.gov/api/views/naaa-uryw/rows.csv | head -n 30
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 08 Jun 2022 17:27:39 GMT
Content-Type: text/csv; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Access-Control-Allow-Origin: *
Content-disposition: attachment; filename=Census_Block_Groups.csv
Cache-Control: public, must-revalidate, max-age=21600
ETag: W/"YWxwaGEuMzA4OTk4XzFfMjViamFVclBXUFZ1eGRJanlqQ3ZMZ2FDQVFyZnM---gzirB8dRypPgBZv5HeHKn8cn5NIBJFA--gzip--gzip"
X-SODA2-Data-Out-Of-Date: false
X-SODA2-Truth-Last-Modified: Sat, 07 Aug 2021 16:57:29 GMT
X-SODA2-Secondary-Last-Modified: Sat, 07 Aug 2021 16:57:29 GMT
Last-Modified: Sat, 07 Aug 2021 16:57:29 GMT
Vary: Accept-Encoding
Age: 0
X-Socrata-Region: aws-us-east-1-fedramp-prod
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Socrata-RequestId: 74912274691ecc3b84470207f952e136
I just pulled that Census_Block_Groups.csv file and it turned out to be over 1GB!
https://data.bayareametro.gov/api/views/naaa-uryw.json is the metadata for that. No indications of size that I can see.
https://data.bayareametro.gov/dataset/Parcels/4cta-duuv shows a visualization - it's a lot of VERY detailed tiny polygons.
I'm going to try streaming the incoming CSV directly into the SQLite table while also keeping a count of how many bytes have been consumed - if that count goes above a certain limit I'll cut off the import, show an error message and store a record in the database noting that this table should not be loaded again.
But... I'm going to leave the imported rows there. The table metadata will show the warning that the table was not fully imported.
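The cut-off could look roughly like this (a sketch - the limit is arbitrary and the exception class is hypothetical):

```python
# Sketch: stream the CSV export, count bytes as they arrive, and stop the
# import once a limit is exceeded. The 100 MB limit is arbitrary.
import csv
import httpx

LIMIT_BYTES = 100 * 1024 * 1024


class TooMuchData(Exception):
    pass


def stream_rows(csv_url):
    with httpx.stream("GET", csv_url, follow_redirects=True) as response:
        reader = csv.DictReader(response.iter_lines())
        for row in reader:
            if response.num_bytes_downloaded > LIMIT_BYTES:
                raise TooMuchData(
                    "Import cut off after {} bytes".format(
                        response.num_bytes_downloaded
                    )
                )
            yield row
```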
(For Datasette Cloud I need to check what happens if someone completely fills their volume - will their container still be able to boot? Hopefully running the import inside a transaction which can be rolled back on error will help avoid breaking things there.)
I'm going to have an option that can check for available disk space while running the import.
>>> shutil.disk_usage("/mnt")
usage(total=1035952128, used=8372224, free=957325312)
I'm going to try inserting 100 records at a time and checking available disk space after each commit.
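Something along these lines (a sketch - the batch size, mount point and free-space threshold are placeholders):

```python
# Sketch: insert rows in batches of 100 and check free disk space after
# each batch. The /mnt path and 5 MB threshold are placeholders.
import itertools
import shutil
import sqlite_utils

MINIMUM_FREE_BYTES = 5 * 1024 * 1024


def import_in_batches(db_path, table_name, rows, volume="/mnt"):
    db = sqlite_utils.Database(db_path)
    rows = iter(rows)
    while True:
        batch = list(itertools.islice(rows, 100))
        if not batch:
            break
        db[table_name].insert_all(batch)
        if shutil.disk_usage(volume).free < MINIMUM_FREE_BYTES:
            raise RuntimeError("Not enough free disk space to continue the import")
```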
Just the ID on its own is not enough - you always also need to know the domain. So I'm going to default to asking for a URL that I can extract that domain from. I'll also have a mode where you can pass domain= and id= (instead of url=), which I can use for interfaces that start from a search somewhere else.
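For the URL case, extracting the domain and the ID could look something like this (a sketch - the character class for a valid ID is an assumption):

```python
# Sketch: pull the domain and the xxxx-xxxx dataset ID out of a Socrata URL.
import re
from urllib.parse import urlparse

FOUR_BY_FOUR = re.compile(r"[a-z0-9]{4}-[a-z0-9]{4}", re.IGNORECASE)


def parse_socrata_url(url):
    parsed = urlparse(url)
    match = FOUR_BY_FOUR.search(parsed.path)
    if match is None:
        raise ValueError("Could not find an xxxx-xxxx identifier in that URL")
    return parsed.netloc, match.group(0)


# parse_socrata_url("https://data.lacity.org/api/views/yv23-pmwf.json")
# -> ("data.lacity.org", "yv23-pmwf")
```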
Quick check that the link ALWAYS includes the domain - it does: https://open-data.datasette.io/socrata?sql=select+*+from+resources+where+link+not+like+%27%25%27+%7C%7C+domain+%7C%7C+%27%25%27%3B returns 0 results.
Confirmed that 0 links start with http:// as opposed to https://: https://open-data.datasette.io/socrata?sql=select+*+from+resources+where+link+like+%27http%3A%25%27
An incorrect ID such as https://data.lacity.org/api/views/yv23-pmzf.json returns a 404 error (and a JSON error document).
I'm going to take some hints from the implementation of https://github.com/simonw/datasette-import-table/blob/0.3/datasette_import_table/__init__.py - which paginates through a table on another Datasette instance in an asyncio task, so it can keep running even after the initial response has been returned.
I'd forgotten how good the implementation of datasette-import-table was!
One catch: since I don't know the size of the data I'm importing here I can't show a progress bar, which is really annoying.
I can use this to get the row count: https://data.cityofchicago.org/resource/tt4n-kn4t.json?$select=count(*)
[{"count":"31101"}]
Looks like the count sometimes comes back like this: https://data.edmonton.ca/resource/24uj-dj8v.json?$select=count(*)
[{"count_1": "179119"}]
For incrementally importing a CSV from a URL I can use this code: https://github.com/simonw/datasettecloud-datasette/blob/484d4b8c61037349f965a7ee8794ae3b33221f48/plugins/importers.py
I think the datasette-app-support implementation is better.
Weird, my prototype is giving me this OperationalError('disk I/O error') error:
disk I/O error
Task exception was never retrieved
future: <Task finished name='Task-42' coro=<socrata.<locals>.run_the_import() done, defined at /Users/simon/Dropbox/Development/datasette-socrata/datasette_socrata/__init__.py:125> exception=OperationalError('disk I/O error')>
Traceback (most recent call last):
File "/Users/simon/Dropbox/Development/datasette-socrata/datasette_socrata/__init__.py", line 146, in run_the_import
await write_batch(batch)
File "/Users/simon/Dropbox/Development/datasette-socrata/datasette_socrata/__init__.py", line 136, in write_batch
return await database.execute_write_fn(_write, block=True)
File "/Users/simon/.local/share/virtualenvs/datasette-socrata-ocvQiLza/lib/python3.10/site-packages/datasette/database.py", line 150, in execute_write_fn
raise result
File "/Users/simon/.local/share/virtualenvs/datasette-socrata-ocvQiLza/lib/python3.10/site-packages/datasette/database.py", line 172, in _execute_writes
result = task.fn(conn)
File "/Users/simon/Dropbox/Development/datasette-socrata/datasette_socrata/__init__.py", line 132, in _write
db[table_name].insert_all(rows)
File "/Users/simon/.local/share/virtualenvs/datasette-socrata-ocvQiLza/lib/python3.10/site-packages/sqlite_utils/db.py", line 2956, in insert_all
self.insert_chunk(
File "/Users/simon/.local/share/virtualenvs/datasette-socrata-ocvQiLza/lib/python3.10/site-packages/sqlite_utils/db.py", line 2721, in insert_chunk
result = self.db.execute(query, params)
File "/Users/simon/.local/share/virtualenvs/datasette-socrata-ocvQiLza/lib/python3.10/site-packages/sqlite_utils/db.py", line 463, in execute
return self.conn.execute(sql, parameters)
I'm going to build a wheel and push it to S3 so I can test this out.
The fix for that OperationalError('disk I/O error') error turned out to be enabling WAL mode, which I'm currently doing in the startup hook.
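A minimal sketch of that (the database name "data" is an assumption):

```python
# Sketch: enable WAL mode from the startup() plugin hook.
from datasette import hookimpl


@hookimpl
def startup(datasette):
    async def enable_wal():
        db = datasette.get_database("data")  # database name is an assumption
        await db.execute_write("PRAGMA journal_mode=wal;")

    return enable_wal
```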
I deployed and tested this and it works! More work to follow in other tickets.
The first version of this plugin will allow users to import a Socrata dataset by entering the ID for that dataset - or pasting in a URL that includes the ID.
Socrata IDs are always xxxx-xxxx - four characters, a hyphen, four more characters. These are known as 4x4s - I found that in https://cran.r-project.org/web/packages/RSocrata/RSocrata.pdf
Their code is useful: https://github.com/Chicago/RSocrata/blob/3cd2e2be8515291e23b1175684cd85f2333532e5/R/RSocrata.R