simonw / big-local-datasette

Publishing a Datasette of open projects from biglocalnews.org
https://biglocal.datasettes.com/
2 stars 0 forks source link

Script to create a SQLite database from Big Local open projects #1

Closed simonw closed 4 years ago

simonw commented 4 years ago

This will fetch projects and files into a table. It won't try to fetch actual CSV data - that comes next.

simonw commented 4 years ago
import requests
response = requests.post("https://api.biglocalnews.org/graphql", json={
    "query": """
    {
  openProjects {
    edges {
      node {
        name
        createdAt
        updatedAt
        contact
        description
        isOpen
        id
        files {
          name
          uri
          uriType
          createdAt
          updatedAt
        }
      }
    }
  }
}
    """}, headers={"Authorization": "JWT {}".format(jwt)}
)
simonw commented 4 years ago

Here's a recipe for streaming CSV from a URL into a table:

import sqlite_utils, csv, requests

def url_to_dicts(url):
    response = requests.get(url, stream=True)
    reader = csv.DictReader(line.decode('utf-8') for line in response.iter_lines())
    yield from reader

db = sqlite_utils.Database("/tmp/biglocal.db")
db["address_table"].insert_all(url_to_dicts(url))
simonw commented 4 years ago

A project looks like this:

{
    "id": "UHJvamVjdDphYzdhMWIxMy04NjJhLTRkMzAtOThkMC0wODY3MTE5ZWIzNTU=",
    "isOpen": true,
    "name": "COVID_National Health Security Preparedness Index",
    "updatedAt": "2020-03-14T23:46:55.915000+00:00",
    "contact": "biglocalnews@stanford.edu",
    "createdAt": "2020-03-13T22:16:49.234228+00:00",
    "description": "Data provided by Big Local News.\nData from the National Health Security Preparedness Index.\nThe overall index is useful for examining a state\u2019s readiness in dealing with any number of issues. There also are specific metrics that directly relate to states\u2019 abilities to respond to the Coronavirus pandemic. With that in mind, we have pulled out and processed key metrics -- from preparedness for surge testing to evaluating how many people in each state have access to paid time off. Our goal is to make it easier for journalists to access and analyze for their reporting. For more information, please start with the NHSPI_READ_ME file. Questions? Contact biglocalnews@stanford.edu.\n\n",
    "files": [
        {
            "createdAt": "2020-03-14T19:56:50.031000+00:00",
            "name": "M107_per_lhd_emgncy_coord.csv",
            "updatedAt": "2020-03-14T19:56:50.031000+00:00",
            "uri": "https://storage.googleapis.com/bln_prod/project/ac7a1b13-862a-4d30-98d0-0867119eb355/M107_per_lhd_emgncy_coord.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=bln-storage%40big-local-news-267923.iam.gserviceaccount.com%2F20200409%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200409T220013Z&X-Goog-Expires=86400&X-Goog-SignedHeaders=host&X-Goog-Signature=2c5f2a58f30ef73e6765a1aeec4b5191e7186bf869dee8f6e28fbac5275592ca58fe0311be0078a7bc874958fe16779a783fbd1a0a532c5dfa0959e4cb46ebd93f24c0cbb990ce4110472430b58aa25d6f8df4cf18cfaa77d742d21e5875ec8589da946b9fa2725eca34c5a9141ebb614d1cf09879b0718bb3ff8d51e9cb3cd0bb6e1b86af342fce82ed6a225fb85bc0a3d0b4f4d32a5f216a1a458b84e77614a3f4882bd3d12b66e345606fd170206818212083e04d4d1856b9792b5f9225d9714b970ff996efd6e64afbd54a6b474194b894f533f8e723279f8b6da430cf6cf8f2400b72a551cc09dfb914bc09cf2e32aafe3936abbc8a070366fe4942bc9a",
            "uriType": "download"
        },
        {
            "createdAt": "2020-03-14T19:56:50.627000+00:00",
            "name": "M298_aiir_per_capita.csv",
            "updatedAt": "2020-03-14T19:56:50.627000+00:00",
            "uri": "https://storage.googleapis.com/bln_prod/project/ac7a1b13-862a-4d30-98d0-0867119eb355/M298_aiir_per_capita.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=bln-storage%40big-local-news-267923.iam.gserviceaccount.com%2F20200409%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200409T220014Z&X-Goog-Expires=86400&X-Goog-SignedHeaders=host&X-Goog-Signature=03f5c97f2d6e5df03188bbcea420f4a60c598db53710dd268fad19dffdc0fc5ed1d61760ab165c7e898e5820fafd68f2e43c29aa2ea3cd068936997956c81cef83b3a811bc6049618d6c425da0bd8feb02e809a23a9c7b59c0ffc09f4fff0691ab684f092341fc265ce89aaff4f1c525214427a2173b5eaf9631cad8bebf08c597f23708a31028c633887019748c60980db24fa2bb50198c285d877db6b93c849092b0b3e2d7d85e5250cc5bdae5c04eedab8d55c340d74fb1dcb70c905a8e1ebd7db077005f698595a82daf79eb4cb274ca352936e224a214764a405ba13d5958a33057cd268f842fc6db0a25c7edb58e53741cb77a40481ded6062c5a50adc",
            "uriType": "download"
        },
        {
            "createdAt": "2020-03-14T19:56:50.268000+00:00",
            "name": "M530_per_employed_w_pto.csv",
            "updatedAt": "2020-03-14T19:56:50.268000+00:00",
            "uri": "https://storage.googleapis.com/bln_prod/project/ac7a1b13-862a-4d30-98d0-0867119eb355/M530_per_employed_w_pto.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=bln-storage%40big-local-news-267923.iam.gserviceaccount.com%2F20200409%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200409T220014Z&X-Goog-Expires=86400&X-Goog-SignedHeaders=host&X-Goog-Signature=24b25861a36f52e0112798b5b6b8599f71d14e054efad882a26f95651264e45e1bab325d3f9648e15c00f589b2567c16bc4b4c62f8d5706654ea16cdf1d7ae77a1c23a1290434ae1c767e390ff8da0ac9d133144d6455005fd71e21e3ad9a757d01f96f714a957effb754997b51264816ead5f0ab33bb96e744b8578c9e93dd87424dcd581a107fa816a127e57a3f325aa4b56e6902f42719c8723b718cce6becd891862b8da8837e22e3cca75823f549281ffbcc73479b0f995095f24f52c478e85d39565cea128c60ca557d7064cb605d62b571cd597290b6a98165b520251440fd0170a4171b5faab74f801b1b7c19934f6b64ddf92bf1dedf3770d5f2e49",
            "uriType": "download"
        },
        {
            "createdAt": "2020-03-14T19:56:50.031000+00:00",
            "name": "M8_surge_testing.csv",
            "updatedAt": "2020-03-14T19:56:50.031000+00:00",
            "uri": "https://storage.googleapis.com/bln_prod/project/ac7a1b13-862a-4d30-98d0-0867119eb355/M8_surge_testing.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=bln-storage%40big-local-news-267923.iam.gserviceaccount.com%2F20200409%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200409T220014Z&X-Goog-Expires=86400&X-Goog-SignedHeaders=host&X-Goog-Signature=793230d8acbfdb8dccaabe3a478f5bdf3470cbbf11a7937fe1397565c924b107f515a18d0c0c4bb40e5ffe7d14502b1a6aa49809648cf383d02a6e999d0ecadb02dd2b05d732674fb21e3564c544bfdf294b92c41f8bb9d328fa0b1eba0d3984fe865517f71895e7821e95ecf94bd6ff269de70b3e9c974460d0fc3326dc70410ff5ae610ac8b5f81bc63293c31a74b14fc109f0fc423c7e03c1a41057a6d98bfa29852eb06c7072a368b8263e0dc0b590f95b1b2629c1ac28402c753bfd0d2be89928de9981bae32d1ce858aee201049b0df853e62f5259b70d72a871066b2653daaafd07dd178da71820eb97fcd819d7f14c9019d51ee3209fe767f9d40576",
            "uriType": "download"
        },
        {
            "createdAt": "2020-03-14T23:46:55.915000+00:00",
            "name": "NHSPI_READ_ME.pdf",
            "updatedAt": "2020-03-14T23:46:55.915000+00:00",
            "uri": "https://storage.googleapis.com/bln_prod/project/ac7a1b13-862a-4d30-98d0-0867119eb355/NHSPI_READ_ME.pdf?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=bln-storage%40big-local-news-267923.iam.gserviceaccount.com%2F20200409%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200409T220014Z&X-Goog-Expires=86400&X-Goog-SignedHeaders=host&X-Goog-Signature=4ea30373cb69ff245fbd669fb7ebf2f705dca0e5ec72573ed40bf3313f4bd5184b8575aa22da5a6f7ac74d4fb7af9a278275d6e502eb1951bc610df5ab34b87aab88cff26e4191603713450c9765f5ac58288144073a3074811be49df61bdeb09d567fea1e52eac808ebf862fc1717713ecb99812b34327bb42dc7e27a1d911944827c3ad3ef7feb97663caf02d5ff9c6fc605d871cc74117f68bc9729c4aed038c7530b537df4bb379c9d205f5207bafc151c63aa53fb56673d6dbaa3e820b841711c713c4edbc8c7599e6c91bba335cc53d8af8561bf2d943029d91b6a9f1cc9e4e777fe58301fd6b5514985d56db12428016940469a2b137890a5a941cffa",
            "uriType": "download"
        },
        {
            "createdAt": "2020-03-14T22:15:35.280000+00:00",
            "name": "NHSPI_domains_subdomains.csv",
            "updatedAt": "2020-03-14T22:15:35.280000+00:00",
            "uri": "https://storage.googleapis.com/bln_prod/project/ac7a1b13-862a-4d30-98d0-0867119eb355/NHSPI_domains_subdomains.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=bln-storage%40big-local-news-267923.iam.gserviceaccount.com%2F20200409%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200409T220014Z&X-Goog-Expires=86400&X-Goog-SignedHeaders=host&X-Goog-Signature=848dfbe60555cc99d941b6b36d97a2fc77b2fabb7c6e0b68c6732868082b30f2a8a5c2ca856b3c00c7fe5af4afc34d6bc89ffe8736441c877e88116f7724995e58b61bb3e36de3247a6a126af0fc4ec5bb19b39c05ad174296c980f27a2bc6595703b3425e58de7e65090f9e6c975a48d6ee9a50309caf0435d5bae716b99da9c97e80b6fea25e9d55f591a1233105269102c6438ece47d61e818d7dc3ea266d6641dd7b88fa9428ab61098e90841285dab29661283c616f9f688506008b7e617571879d9ef679b9d99ddb7a24d0e84a38aefce7fa077e572c08469ddaf2123e29166a5968307cdd403faff56431b1ed967a1dde4dbc4e1641682c265a7c3a6b",
            "uriType": "download"
        },
        {
            "createdAt": "2020-03-14T22:04:38.226000+00:00",
            "name": "NHSPI_measure_metadata.csv",
            "updatedAt": "2020-03-14T22:04:38.226000+00:00",
            "uri": "https://storage.googleapis.com/bln_prod/project/ac7a1b13-862a-4d30-98d0-0867119eb355/NHSPI_measure_metadata.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=bln-storage%40big-local-news-267923.iam.gserviceaccount.com%2F20200409%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200409T220014Z&X-Goog-Expires=86400&X-Goog-SignedHeaders=host&X-Goog-Signature=0acc46d49cf66c898eec4d316aafcbe2d516046f953d52db66ddbda0338b16c8bb3406abe4fa0b8dc9e233a22c8bbf394b266b8a933fc79003909e4e0b3ddb150641ae5e5be171e5e5bf153b4f5454c9e75e12e2104de4532683624cbe68f2ee45565126d51509c81fff7caf70ae0cf8473de53ca93d43c47ee6e7692c1ede1da9775d196773f2198b18d2f785a0bb320dc43f86a56f7e6f12f8aed49542d529c00cc0a58724d87d58710ba8b541148fb7fb98df02f4ed8d0770feebef73c956c92aed404424a0884d6ec6d1b4e64b20daed4c4c28fc30dafe8911458ac860b9631580df5459cc76f071c0e0a1c5d84ce0df6bded16378e9e5eb777630c4190e",
            "uriType": "download"
        },
        {
            "createdAt": "2020-03-14T23:31:29.611000+00:00",
            "name": "NHSPI_orig_data_and_docs.zip",
            "updatedAt": "2020-03-14T23:31:29.611000+00:00",
            "uri": "https://storage.googleapis.com/bln_prod/project/ac7a1b13-862a-4d30-98d0-0867119eb355/NHSPI_orig_data_and_docs.zip?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=bln-storage%40big-local-news-267923.iam.gserviceaccount.com%2F20200409%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20200409T220014Z&X-Goog-Expires=86400&X-Goog-SignedHeaders=host&X-Goog-Signature=1e1c4390b1937226910e28d1f5b318ca84d10f352517960ed1400d9c54c022fe4aa55ee926633aee051fefdea31dcfd534f49923ce277b27fd49bd33dcaeca41a64f0c105fd957ee72b3f81956f5ce2849fe5986460272f4d17334877c3faa489a3fdd04c248860335d787ea8549572d29fb95c35bccb9a67bb76126dd5cf1b8a01c36c4ebf63654880f0bcfb8acf9201038c162c9323cb90b9d56fd46e38f9dba0fc78cd1add7fa38b8099d3186550a2a6d962c3a562a067e76d7d11625b3ebfdfda95a43e589e33d097bae62e8f0277fce6793f77f0baa4b010a2e5d9951abd17cad6cf9ca0e529820efb9b0b1151ad59ab6d0d846597c9cd451867dd39bb5",
            "uriType": "download"
        }
    ]
}
simonw commented 4 years ago

I'm going to use a HEAD request on each file to get its eTag and size - I will only import files smaller than a certain threshold and which have not been imported before.

I will start by populating tables of all of the files and projects. A second step can then fetch the CSV data and create tables for it.

simonw commented 4 years ago

biglocal__select_projects_id_as_project_id__projects_name_as_project_name__files_name_as_file_name__size__etag_from_files_join_projects_on_files_project___projects_id_where__ext_____p0_order_by_size_desc