trulia / choroplethr

choroplethr simplifies the creation of choropleths (thematic maps) in R

Can you obtain data for just one state #13

Closed pssguy closed 10 years ago

pssguy commented 10 years ago

The current get_acs_df downloads all US info, leaving the selection by state to the render_choropleth function.

This may be the appropriate option on occasion, but downloading zip and county data can take some time. Would it be possible to add a state (or vector of states) argument to get_acs_df, with a default of all? Provided, of course, that there is a time saving.

alamstein-trulia commented 10 years ago

I considered that, but I think that in this case the time savings will be minimal. While I haven’t profiled this specific case, in general the bulk of the time in these operations is opening and closing the connection to the server. This is just retrieving information - it’s not a computationally intensive operation for the server.

pssguy commented 10 years ago

Really? Though I can't remember what the user/system columns mean.

system.time(get_acs_df("B02001", "state", column_idx = 1))
   user  system elapsed
   0.31    0.09    0.92

system.time(get_acs_df("B02001", "county", column_idx = 1))
   user  system elapsed
   3.21    0.94    9.18

system.time(get_acs_df("B02001", "zip", column_idx = 1))
   user  system elapsed
  39.94   10.08   68.07

system.time(get_acs_df("B02001", "state", column_idx = 1))
   user  system elapsed
   0.58    0.03    0.77
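On the user/system question: in the output of R's system.time, user is CPU time spent in the R process itself, system is CPU time the operating system spent on its behalf, and elapsed is wall-clock time. A small sketch of the difference, using a sleep as a stand-in for waiting on a server (the sleep and the loop are illustrative, not choroplethr code):

```r
# Waiting (here a sleep standing in for a network round trip) adds to
# elapsed time but barely to user/system CPU time.
waiting <- system.time(Sys.sleep(0.5))

# CPU-bound work shows up in user time as well as in elapsed time.
computing <- system.time(for (i in 1:2e6) sqrt(i))

print(waiting)
print(computing)
```

A large gap between elapsed and user + system, as in the timings above, suggests that much of the time is spent waiting rather than computing.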

alamstein-trulia commented 10 years ago

That demonstrates that as we go from (50) states to (3k) counties to (30k) zips the time goes up.

The next step in the investigation is to rerun the code where get_acs_df just gets one state, such as CA. The code for get_acs_df is here:

https://github.com/trulia/choroplethr/blob/master/R/choroplethr_acs.R

As an experiment, you can rewrite the function make_geo so that it:
- given "state", returns a geography for CA
- given "county", returns all counties in the state of CA
- given "zip", returns all zips in the state of CA

make_geo calls geo.make, which is from the acs package. If you wind up doing this comparison please report the results back.
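A minimal sketch of that experiment, assuming the acs package is installed. The function name make_geo_for_state and its state argument are hypothetical and are not part of choroplethr. The state and county cases use geo.make's state and county arguments. The zip case is an assumption: this sketch does not know of a way to restrict zip.code to one state in geo.make, so it still requests every zip.

```r
# Hypothetical single-state variant of make_geo.
# Requires the acs package when called; nothing is downloaded here,
# geo.make only builds the geography description.
make_geo_for_state <- function(lod, state = "CA") {
  switch(lod,
    state  = acs::geo.make(state = state),
    county = acs::geo.make(state = state, county = "*"),
    # Assumption: no state filter for zips, so this is still all zips
    # and any per-state filtering would have to happen afterwards.
    zip    = acs::geo.make(zip.code = "*"),
    stop("unknown level of detail: ", lod)
  )
}
```

Timing get_acs_df with and without such a restricted geography would show whether the saving is worth an extra argument.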

pssguy commented 10 years ago

OK, I will look at this. However, there is currently an inconsistency between the state and county/zip returns. The state call returns just the column selected. The others return all 10 columns of the table, viz.

df_state <- get_acs_df("B02001", "state", column_idx = 3)
# nrow(df_state) 10

df_county <- get_acs_df("B02001", "county", column_idx = 3)
# nrow(df_county) 32210
df_zip <- get_acs_df("B02001", "zip", column_idx = 3)
# nrow(df_zip) 331200

So for a 10 column table I am getting 10x as much data as anticipated.

I have not yet looked at your suggestion above, but I am trying to weigh up, time-wise, whether it is better to download all columns of a table once and then use the selected column to draw the map, or to download just that column each time a different one is chosen and then draw the map. The former has a high initial cost but is then quick, whilst the latter spreads the cost more evenly. I may want to use a different procedure for different regions. For instance, downloading all zip columns is too time-consuming where there are many columns, whilst for state it is relatively trivial.

If something like column_idx=0 downloaded all the data, and any other value just the column required, then I could reach an optimal solution depending on the region selected and the number of columns the table has.
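The column_idx=0 convention proposed here could look roughly like the following. This is only a sketch on a made-up data frame. select_columns is a hypothetical helper, not a choroplethr function, and it selects columns after the fact rather than changing what is downloaded.

```r
# Toy stand-in for a downloaded table: one row per region, one column
# per table column. The values are invented.
acs_table <- data.frame(
  region = c("alabama", "alaska", "arizona"),
  col_1  = c(10, 20, 30),
  col_2  = c(1, 2, 3),
  col_3  = c(5, 6, 7)
)

select_columns <- function(df, column_idx = 0) {
  value_cols <- setdiff(names(df), "region")
  if (column_idx == 0) {
    return(df)                      # 0 means "keep every column"
  }
  stopifnot(column_idx >= 1, column_idx <= length(value_cols))
  df[, c("region", value_cols[column_idx]), drop = FALSE]
}

all_cols <- select_columns(acs_table)     # region plus all value columns
one_col  <- select_columns(acs_table, 3)  # region plus col_3 only
```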

tx

arilamstein commented 10 years ago

You are correct - there was a bug for county and zips. I just fixed it. Please try again. https://github.com/trulia/choroplethr/commit/b0d4dd731ede493a627a2a6418274b9a35dcffeb

arilamstein commented 10 years ago

If your primary concern is speed, I would recommend downloading all the ACS data and storing it locally. For example, in a MySQL database.
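A lighter-weight variant of the same idea, swapped in here for the MySQL suggestion, is to cache each downloaded result in a file and reuse it on later calls. The helper cached_fetch and the cache layout below are illustrative assumptions, and slow_download is a stand-in for the real call.

```r
# Cache the result of a slow download on disk and reuse it afterwards.
# `fetch` is any zero-argument function that performs the download.
cached_fetch <- function(key, fetch, cache_dir = tempdir()) {
  path <- file.path(cache_dir, paste0(key, ".rds"))
  if (file.exists(path)) {
    return(readRDS(path))           # cache hit: no network round trip
  }
  result <- fetch()
  saveRDS(result, path)
  result
}

# Stand-in for the slow call; counts how often it actually runs.
calls <- 0
slow_download <- function() {
  calls <<- calls + 1
  data.frame(region = c("a", "b"), value = c(1, 2))
}

cache_dir <- tempfile("acs_cache_")
dir.create(cache_dir)
first  <- cached_fetch("B02001_state", slow_download, cache_dir)
second <- cached_fetch("B02001_state", slow_download, cache_dir)
```

In real use, fetch would wrap the actual download, for example function() get_acs_df("B02001", "county", column_idx = 1), and cache_dir would be a permanent directory rather than a temporary one.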

pssguy commented 10 years ago

OK, the 'bug' is fixed. Now that I am downloading what I thought I had been, I can see that you were right about the connection time being the major factor. The number of columns (and presumably the number of states) will have little impact on download time. The big processing cost for me was that I was merging what I thought were two single columns but was in fact a 10x10 column merge, taking a factor of 100 longer.

So what this boils down to is that the best way for me is to download all columns at once and then process them, or at least to have that option. Is that feasible via the column_idx=0 option I mentioned above? In other words, the bug was actually the behaviour that suits me best. Because, of course, get_acs_df("B02001", "county") results in the selection prompt in the console which, for my purposes, is not desired.

Re downloading ACS data to a db: that is an option for the odd table, but for all ACS tables it would presumably be very large. I think for state and county info, at least, the casual user will be OK.
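The factor of 100 can be reproduced on toy data. The sketch below assumes a long format in which each region appears once per table column, as the pre-fix county and zip results did. The data are random and the names are made up.

```r
# Long-format toy data: each region appears once per table column.
n_regions <- 3
n_columns <- 10
make_long <- function() {
  data.frame(
    region = rep(seq_len(n_regions), each = n_columns),
    value  = runif(n_regions * n_columns)
  )
}
a <- make_long()
b <- make_long()

# Merging on region alone pairs every row of a region in `a` with every
# row of that region in `b`: 10 x 10 = 100 rows per region.
merged <- merge(a, b, by = "region")
nrow(merged)   # 300 for 3 regions, versus 3 if each had a single row
```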

arilamstein commented 10 years ago

Right now I am not interested in adding additional ACS support to choroplethr, so this will likely not get implemented any time soon. But this shouldn't stop you from writing code to do this yourself. If you are not sure how to proceed, I recommend looking at the file choroplethr_acs and also asking on the acs-r mailing list.

pssguy commented 10 years ago

Thanks for the suggestion. I have forked and reverted to the previous code, which gets what I want, albeit not that elegantly. I'll let you know if I have any success on that colour scale.

arilamstein commented 10 years ago

Excellent. I just looked at your commit. One thing that helped me when dealing with the acs objects is the str command (?str). The acs objects are S4 objects, which I hadn't encountered before. str helped me to visualize all the parts of the object much more easily.
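For anyone else new to S4, here is a small sketch of what str shows. The ToyEstimates class is invented for illustration and is not the class used by the acs package.

```r
library(methods)

# A made-up S4 class that bundles estimates with some metadata.
setClass("ToyEstimates", representation(
  endyear   = "numeric",
  estimate  = "matrix",
  col_names = "character"
))

obj <- new("ToyEstimates",
  endyear   = 2012,
  estimate  = matrix(c(100, 200, 300, 400), nrow = 2),
  col_names = c("total", "white_alone")
)

str(obj)            # prints each slot with its type and a preview
slotNames(obj)      # just the slot names
obj@estimate[1, ]   # slots are read with @ rather than $
```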

Other resources are the acs mailing list: http://mailman.mit.edu/mailman/listinfo/acs-r and the various manuals that Ezra has written (e.g. http://dusp.mit.edu/sites/all/files/attachments/publication/working_with_acs_R.pdf )
