simonw / datasette-socrata

Import data from Socrata into Datasette
Apache License 2.0

Initial plugin #1

Closed simonw closed 2 years ago

simonw commented 2 years ago

The first version of this plugin will allow users to import a Socrata dataset by entering the ID for that dataset - or pasting in a URL that includes the ID.

Socrata IDs are always xxxx-xxxx - four characters, a hyphen, then four more characters. These are known as "4x4s" - I found that term in https://cran.r-project.org/web/packages/RSocrata/RSocrata.pdf

Their code is useful: https://github.com/Chicago/RSocrata/blob/3cd2e2be8515291e23b1175684cd85f2333532e5/R/RSocrata.R
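
A minimal sketch of accepting either a bare ID or a pasted URL - assuming, based on the examples in this thread, that 4x4s are lowercase alphanumeric:

import re

FOUR_BY_FOUR = re.compile(r"^[a-z0-9]{4}-[a-z0-9]{4}$")

def extract_four_by_four(text):
    # Accept a bare ID like "4334-bgaj" ...
    if FOUR_BY_FOUR.match(text):
        return text
    # ... or pull the first xxxx-xxxx path segment out of a pasted URL
    match = re.search(r"/([a-z0-9]{4}-[a-z0-9]{4})(?:[/.?]|$)", text)
    return match.group(1) if match else None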

simonw commented 2 years ago

The R code tipped me off to some very interesting HTTP headers:

~ % curl -s -i 'https://soda.demo.socrata.com/resource/4334-bgaj.csv' | more
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 08 Jun 2022 15:27:50 GMT
Content-Type: text/csv; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Access-Control-Allow-Origin: *
ETag: W/"YWxwaGEuNDQ0ODVfM180ODlGM3lneE10dkVrN2xyZDZaOTA5bVc1MGMtTQ---gzityd_r99IXNT3udfjmA3HF-r_x3Y--gzip--gzip"
X-SODA2-Fields: ["source","earthquake_id","version","datetime","magnitude","depth","number_of_stations","region","location"]
X-SODA2-Types: ["text","text","text","floating_timestamp","number","number","number","text","location"]
X-SODA2-Data-Out-Of-Date: false
X-SODA2-Truth-Last-Modified: Thu, 28 Mar 2019 05:43:31 GMT
X-SODA2-Secondary-Last-Modified: Thu, 28 Mar 2019 05:43:31 GMT
Last-Modified: Thu, 28 Mar 2019 05:43:31 GMT
Vary: Accept-Encoding
Age: 0
X-Socrata-Region: aws-us-east-1-fedramp-prod
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Socrata-RequestId: 7fa7ab1669297e734b4ed4d23212acac

"source","earthquake_id","version","datetime","magnitude","depth","number_of_stations","region","location"
"uu","09101857","2","2012-09-10T18:57:30.000","2.2","0.90","12","Utah","(37.1587, -112.7942)"
"ak","10555601","2","2012-09-10T13:16:13.000","1.1","11.60","10","Central Alaska","(63.1085, -151.4938)"

These in particular:

X-SODA2-Fields: ["source","earthquake_id","version","datetime","magnitude","depth","number_of_stations","region","location"]
X-SODA2-Types: ["text","text","text","floating_timestamp","number","number","number","text","location"]

I can use those to import from the CSV export - which should give me all of the data in one go without needing to paginate - while still storing the data in SQLite using the correct types.
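
A sketch of reading those headers up front, with an assumed (and incomplete) mapping from Socrata types to SQLite types:

import json
import httpx

# Assumed mapping - Socrata's single "number" type is ambiguous (see below)
SOCRATA_TO_SQLITE = {
    "text": "TEXT",
    "number": "REAL",
    "floating_timestamp": "TEXT",
    "location": "TEXT",
}

def get_column_types(domain, dataset_id):
    url = "https://{}/resource/{}.csv".format(domain, dataset_id)
    # Streaming the GET means we can read the headers without downloading the body
    with httpx.stream("GET", url) as response:
        fields = json.loads(response.headers["X-SODA2-Fields"])
        types = json.loads(response.headers["X-SODA2-Types"])
    return {
        field: SOCRATA_TO_SQLITE.get(socrata_type, "TEXT")
        for field, socrata_type in zip(fields, types)
    }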

simonw commented 2 years ago

I'm going to have this plugin offer a CLI interface as well, partly to exercise that plugin hook:

datasette socrata import data.db xxxx-xxxx -t table-name
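
Assuming the hook in question is Datasette's register_commands(cli), the shape could be something like this sketch:

import click
from datasette import hookimpl

@hookimpl
def register_commands(cli):
    @cli.group()
    def socrata():
        "Commands for importing data from Socrata"

    @socrata.command(name="import")
    @click.argument("database")
    @click.argument("dataset_id")
    @click.option("-t", "--table", help="Table to import into")
    def import_(database, dataset_id, table):
        "Import a Socrata dataset into a SQLite database file"
        # The actual fetch-and-insert logic would go here
        click.echo("Importing {} into {}".format(dataset_id, database))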

simonw commented 2 years ago

https://stackoverflow.com/questions/24745820/socrata-get-data-types-of-dataset-columns has a note from 2014 that says "I'm checking with our engineering team to see whether X-SODA2-Types is really deprecated, as per the warning." which is a little bit alarming!

https://data.lacity.org/api/id/yv23-pmwf.json returns 1000 JSON records. https://data.lacity.org/api/id/yv23-pmwf.csv also only returns 1000 rows, which is worrying.

Also from that post: https://data.lacity.org/api/views/yv23-pmwf.json gives the metadata for that dataset.

Interesting to note that the metadata includes detailed column information:

  "columns": [
    {
      "id": 422920034,
      "name": "Assessor Book",
      "dataTypeName": "text",
      "description": "\"Book\" portion of the Los Angeles County Tax Assessor \"Book-Page-Parcel\" number.",
      "fieldName": "assessor_book",
      "position": 1,
      "renderTypeName": "text",

I could use that to populate Datasette's own column metadata feature.
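
Extracting a fieldName-to-description mapping from that document is simple enough - a sketch (how it then gets wired into Datasette's column metadata is a separate question):

import httpx

def get_column_descriptions(domain, dataset_id):
    # Uses the /api/views/{id}.json metadata endpoint shown above
    url = "https://{}/api/views/{}.json".format(domain, dataset_id)
    metadata = httpx.get(url).json()
    return {
        column["fieldName"]: column["description"]
        for column in metadata.get("columns", [])
        if column.get("description")
    }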

simonw commented 2 years ago

https://data.lacity.org/City-Infrastructure-Service-Requests/Building-and-Safety-Permit-Information-Old/yv23-pmwf claims 1.47 million rows.

This URL DOES return them all - but the headers do not include those column types:

% curl -i 'https://data.lacity.org/api/views/yv23-pmwf/rows.csv' | more
Server: nginx
Date: Wed, 08 Jun 2022 15:43:38 GMT
Content-Type: text/csv; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Access-Control-Allow-Origin: *
Content-disposition: attachment; filename=Building_and_Safety_Permit_Information_Old.csv
Cache-Control: public, must-revalidate, max-age=21600
ETag: W/"YWxwaGEuMTE0NzZfMTFfMzYyMzJZMzJCY2tJeGw3bHhKWndGMEo2QjJxTkk2Sm8---gziprbqvNybkDQANU177Xuw32obgAxg--gzip--gzip"
X-SODA2-Data-Out-Of-Date: false
X-SODA2-Truth-Last-Modified: Mon, 06 Jun 2022 09:30:01 GMT
X-SODA2-Secondary-Last-Modified: Mon, 06 Jun 2022 09:30:01 GMT
Last-Modified: Mon, 06 Jun 2022 09:30:01 GMT
Vary: Accept-Encoding
Age: 0
X-Socrata-Region: aws-us-east-1-fedramp-prod
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Socrata-RequestId: 64bec141c81631817be15e3e9f294a29

So maybe I need to do a hit to get the column types, then a second hit to fetch the actual data.

simonw commented 2 years ago

Here's a tool that uses that rows.csv endpoint to import all of the data for a table, in this case into Data.World:

https://github.com/hardworkingcoder/socrata-to-datadotworld

simonw commented 2 years ago

This is frustrating: https://dev.socrata.com/docs/datatypes/ shows only one data type for numbers, called number - so it looks like there's no distinction in Socrata world between integers and floating points.

But I really want to be able to store data as integer columns when it makes sense to do so - I don't want to use floating point for columns that only ever contain integers.

simonw commented 2 years ago

... on that basis, I'm not going to use the type information. I'll use the "detect types" mechanism in sqlite-utils instead: https://github.com/simonw/sqlite-utils/blob/9fedfc69d7239ac49900051e1c48ee9cdd470d9e/sqlite_utils/utils.py#L222-L235
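
One way to use that mechanism is the TypeTracker utility that sqlite-utils exposes - a sketch, loading the whole CSV into memory for simplicity:

import csv
import io
import sqlite_utils
from sqlite_utils.utils import TypeTracker

def import_csv_with_detected_types(db_path, table_name, csv_text):
    db = sqlite_utils.Database(db_path)
    rows = csv.DictReader(io.StringIO(csv_text))
    tracker = TypeTracker()
    # Everything is inserted as text first, while the tracker watches the values
    db[table_name].insert_all(tracker.wrap(rows))
    # Then the table is transformed to the detected column types
    db[table_name].transform(types=tracker.types)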

simonw commented 2 years ago

That's annoying: the CSV export doesn't include a Content-Length header (the response is chunked), so it's not possible to show a progress bar or to check that the user has enough space on their instance before the download starts.

~ % curl -i https://data.bayareametro.gov/api/views/naaa-uryw/rows.csv | head -n 30
HTTP/1.1 200 OK
Server: nginx
Date: Wed, 08 Jun 2022 17:27:39 GMT
Content-Type: text/csv; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Access-Control-Allow-Origin: *
Content-disposition: attachment; filename=Census_Block_Groups.csv
Cache-Control: public, must-revalidate, max-age=21600
ETag: W/"YWxwaGEuMzA4OTk4XzFfMjViamFVclBXUFZ1eGRJanlqQ3ZMZ2FDQVFyZnM---gzirB8dRypPgBZv5HeHKn8cn5NIBJFA--gzip--gzip"
X-SODA2-Data-Out-Of-Date: false
X-SODA2-Truth-Last-Modified: Sat, 07 Aug 2021 16:57:29 GMT
X-SODA2-Secondary-Last-Modified: Sat, 07 Aug 2021 16:57:29 GMT
Last-Modified: Sat, 07 Aug 2021 16:57:29 GMT
Vary: Accept-Encoding
Age: 0
X-Socrata-Region: aws-us-east-1-fedramp-prod
Strict-Transport-Security: max-age=31536000; includeSubDomains
X-Socrata-RequestId: 74912274691ecc3b84470207f952e136

I just pulled that Census_Block_Groups.csv file and it turned out to be over 1GB!

https://data.bayareametro.gov/api/views/naaa-uryw.json is the metadata for that. No indications of size that I can see.

https://data.bayareametro.gov/dataset/Parcels/4cta-duuv shows a visualization - it's a lot of VERY detailed tiny polygons.

simonw commented 2 years ago

I'm going to try streaming the incoming CSV directly into the SQLite table while keeping a count of how many bytes have been consumed - if that count goes above a certain limit I'll cut off the import, show an error message and store a record in the database so we don't attempt to load that table again.

But... I'm going to leave the imported rows there. The table metadata will show a warning that the table was not fully imported.
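
A rough sketch of the byte-counting cutoff, assuming an async httpx stream and a hypothetical limit - parsing the lines into CSV rows and writing them in batches would sit on top of this:

import httpx

LIMIT_BYTES = 100 * 1024 * 1024  # Assumed limit, for illustration only

class ImportTooLarge(Exception):
    pass

async def stream_csv_lines(url, limit_bytes=LIMIT_BYTES):
    # Yields decoded lines while tracking how many bytes have been consumed
    bytes_consumed = 0
    buffer = ""
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", url) as response:
            async for chunk in response.aiter_text():
                bytes_consumed += len(chunk.encode("utf-8"))
                if bytes_consumed > limit_bytes:
                    # Abandon the download but leave already-written rows in place
                    raise ImportTooLarge(url)
                buffer += chunk
                *lines, buffer = buffer.split("\n")
                for line in lines:
                    yield line
    if buffer:
        yield buffer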

simonw commented 2 years ago

(For Datasette Cloud I need to check what happens if someone completely fills their volume - will their container still be able to boot? Hopefully running the import inside a transaction which can be rolled back on error will help avoid breaking things there.)

simonw commented 2 years ago

I'm going to have an option that can check for available disk space while running the import.

>>> shutil.disk_usage("/mnt")
usage(total=1035952128, used=8372224, free=957325312)

I'm going to try inserting 100 records at a time and checking available disk space after each commit.
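
A sketch of that check, to run after each batch of 100 inserted rows - the threshold here is just an assumption:

import shutil

MIN_FREE_BYTES = 50 * 1024 * 1024  # Assumed threshold, for illustration only

def has_enough_disk_space(directory):
    # shutil.disk_usage().free is the space still available on that volume
    return shutil.disk_usage(directory).free > MIN_FREE_BYTES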

simonw commented 2 years ago

Just the ID on its own is not enough - you always also need to know the domain. So I'm going to default to asking for a URL that I can extract the domain from. I'll also have a mode where you can pass domain= and id= (instead of url=), which I can use for interfaces that start from a search somewhere else.
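
A sketch of pulling both pieces out of a pasted URL, reusing the 4x4 pattern from earlier:

import re
from urllib.parse import urlparse

def parse_socrata_url(url):
    # Returns (domain, dataset_id), or None if no 4x4 ID can be found
    parsed = urlparse(url)
    match = re.search(r"/([a-z0-9]{4}-[a-z0-9]{4})(?:[/.]|$)", parsed.path)
    if parsed.netloc and match:
        return parsed.netloc, match.group(1)
    return None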

simonw commented 2 years ago

Quick check that the link ALWAYS includes the domain - it does: https://open-data.datasette.io/socrata?sql=select+*+from+resources+where+link+not+like+%27%25%27+%7C%7C+domain+%7C%7C+%27%25%27%3B returns 0 results.

simonw commented 2 years ago

Confirmed that 0 links start with http:// as opposed to https://: https://open-data.datasette.io/socrata?sql=select+*+from+resources+where+link+like+%27http%3A%25%27

simonw commented 2 years ago

An incorrect ID such as https://data.lacity.org/api/views/yv23-pmzf.json returns a 404 error (and a JSON error document).

simonw commented 2 years ago

I'm going to take some hints from the implementation of https://github.com/simonw/datasette-import-table/blob/0.3/datasette_import_table/__init__.py - which iterates through another Datasette page by page in an asyncio task so it can keep running even after the initial response has been returned.
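
The key trick there is starting the long-running work as a background asyncio task so the view can return straight away - roughly this shape (the view name and redirect target here are illustrative):

import asyncio
from datasette.utils.asgi import Response

async def socrata_import_view(request, datasette):
    # ... validate the URL, create the table, record metadata first ...

    async def run_the_import():
        # The long-running fetch-and-insert loop lives here
        ...

    # Start the import in the background, then return a response immediately
    asyncio.ensure_future(run_the_import())
    return Response.redirect("/data")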

simonw commented 2 years ago

I'd forgotten how good the implementation of import table was!

One catch: since I don't know the size of the data I'm importing here I can't show a progress bar, which is really annoying.

simonw commented 2 years ago

I can use this to get the row count: https://data.cityofchicago.org/resource/tt4n-kn4t.json?$select=count(*)

[{"count":"31101"}]

simonw commented 2 years ago

Looks like the count sometimes comes back like this: https://data.edmonton.ca/resource/24uj-dj8v.json?$select=count(*)

[{"count_1": "179119"}]

simonw commented 2 years ago

For incrementally importing a CSV from a URL I can use this code: https://github.com/simonw/datasettecloud-datasette/blob/484d4b8c61037349f965a7ee8794ae3b33221f48/plugins/importers.py

Or this code: https://github.com/simonw/datasette-app-support/blob/49078df0b6a038f8e0fe1bad5dfd42da4a6caff7/datasette_app_support/utils.py#L52-L88

I think the datasette-app-support implementation is better.

simonw commented 2 years ago

Weird, my prototype is giving me this OperationalError('disk I/O error') error:

disk I/O error
Task exception was never retrieved
future: <Task finished name='Task-42' coro=<socrata.<locals>.run_the_import() done, defined at /Users/simon/Dropbox/Development/datasette-socrata/datasette_socrata/__init__.py:125> exception=OperationalError('disk I/O error')>
Traceback (most recent call last):
  File "/Users/simon/Dropbox/Development/datasette-socrata/datasette_socrata/__init__.py", line 146, in run_the_import
    await write_batch(batch)
  File "/Users/simon/Dropbox/Development/datasette-socrata/datasette_socrata/__init__.py", line 136, in write_batch
    return await database.execute_write_fn(_write, block=True)
  File "/Users/simon/.local/share/virtualenvs/datasette-socrata-ocvQiLza/lib/python3.10/site-packages/datasette/database.py", line 150, in execute_write_fn
    raise result
  File "/Users/simon/.local/share/virtualenvs/datasette-socrata-ocvQiLza/lib/python3.10/site-packages/datasette/database.py", line 172, in _execute_writes
    result = task.fn(conn)
  File "/Users/simon/Dropbox/Development/datasette-socrata/datasette_socrata/__init__.py", line 132, in _write
    db[table_name].insert_all(rows)
  File "/Users/simon/.local/share/virtualenvs/datasette-socrata-ocvQiLza/lib/python3.10/site-packages/sqlite_utils/db.py", line 2956, in insert_all
    self.insert_chunk(
  File "/Users/simon/.local/share/virtualenvs/datasette-socrata-ocvQiLza/lib/python3.10/site-packages/sqlite_utils/db.py", line 2721, in insert_chunk
    result = self.db.execute(query, params)
  File "/Users/simon/.local/share/virtualenvs/datasette-socrata-ocvQiLza/lib/python3.10/site-packages/sqlite_utils/db.py", line 463, in execute
    return self.conn.execute(sql, parameters)

simonw commented 2 years ago

I'm going to build a wheel and push it to S3 so I can test this out.

simonw commented 2 years ago

https://static.simonwillison.net/static/2022/datasette_socrata-0.1-py3-none-any.whl

simonw commented 2 years ago

The fix for that OperationalError('disk I/O error') error turned out to be enabling WAL mode, which I'm currently doing in the startup hook:

https://github.com/simonw/datasette-socrata/blob/e203c181371563e15256cb8b7d7b9df2a09e9216/datasette_socrata/__init__.py#L242-L246
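
For reference, the shape of that startup hook is roughly this - the linked code above is the actual implementation:

from datasette import hookimpl

@hookimpl
def startup(datasette):
    # Enable WAL mode for every mutable database when Datasette starts
    async def inner():
        for database in datasette.databases.values():
            if database.is_mutable:
                await database.execute_write("PRAGMA journal_mode=WAL;")

    return inner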

simonw commented 2 years ago

I deployed and tested this and it works! More work to follow in other tickets.