walkerke / pygris

Use US Census shapefiles in Python (port of the R tigris package)
https://walker-data.com/pygris
MIT License

data.get_census() for 50+ variables #1

Closed by jbousquin 1 year ago

jbousquin commented 1 year ago

The Census API won't return more than 50 variables in a single request. The error is descriptive enough:

SyntaxError: Request failed. The Census Bureau error message is error: error: 'get' is limited to 50 variables

The suggested enhancement is to split the variables across multiple requests (similar to cenpy, but splitting over variables and concatenating on columns rather than rows).

jbousquin commented 1 year ago

Not the most elegant, but I'd suggest something like:

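    import io

    import numpy as np
    import pandas as pd
    import requests

    # `variables`, `params`, and `base` are assumed to be defined already, as inside get_census()
    data = []
    n_chunks = int(np.ceil(len(variables) / 50))  # the API allows at most 50 variables per request
    for chunk in np.array_split(variables, n_chunks):
        params.update({'get': ",".join(chunk)})

        req = requests.get(url=base, params=params)
        if req.status_code != 200:
            raise SyntaxError(f"Request failed. The Census Bureau error message is {req.text}")

        out = pd.read_json(io.StringIO(req.text))
        out.columns = out.iloc[0]  # the first row of the response holds the column names
        out = out[1:]

        data.append(out)  # collect the output from each chunk

    # Concatenate the chunks column-wise
    out = pd.concat(data, sort=False, axis=1)

walkerke commented 1 year ago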

@jbousquin makes sense to me! This is similar to what we do in tidycensus as well. I wrote get_census() as a minimal interface to the API that I'd refine if people started using it. I'll take a look at integrating this, or feel free to submit a PR if you'd like.

walkerke commented 1 year ago

I don't think the above solution will work as written; see https://github.com/hrecht/censusapi/issues/82 and https://github.com/walkerke/tidycensus/pull/165. The problem is that the sort order of rows is not always consistent across data pulls, so concatenating the chunks column-wise can line up values from different geographies in the same row.

I'll poke around at this and come up with a workable solution.
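
For illustration only, here is a minimal sketch of one way to sidestep the row-order problem: join the chunks on their shared geography columns instead of by row position. It is not necessarily what pygris ends up doing. The helper names `chunk_variables` and `merge_on_keys` are hypothetical, the sample values are made up, and the sketch assumes each response is the raw list-of-lists JSON with the header row first and that every chunk includes the same key columns (here `state`).

```python
def chunk_variables(variables, size=50):
    """Split variable names into lists of at most `size` names each."""
    return [variables[i:i + size] for i in range(0, len(variables), size)]


def merge_on_keys(tables, key_cols):
    """Merge header-first tables on shared key columns rather than row position.

    Each table is a list of lists whose first row holds the column names.
    Returns one dict per distinct key, in order of first appearance.
    """
    merged = {}  # maps a tuple of key values to the combined record
    for table in tables:
        header, *rows = table
        for row in rows:
            record = dict(zip(header, row))
            key = tuple(record[col] for col in key_cols)
            merged.setdefault(key, {}).update(record)
    return list(merged.values())


# Two made-up chunk responses whose rows came back in different orders.
chunk_a = [["B01001_001E", "state"], ["10", "01"], ["20", "02"]]
chunk_b = [["B19013_001E", "state"], ["200", "02"], ["100", "01"]]

rows = merge_on_keys([chunk_a, chunk_b], key_cols=["state"])
for row in rows:
    print(row)
```

Because each row is matched by its key values, the two chunks line up correctly even though `chunk_b` lists state 02 first. In pandas, merging the per-chunk DataFrames on the geography columns with `DataFrame.merge` would follow the same idea.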