walkerke / census-with-r-book

Source for Analyzing US Census Data: Methods, Maps, and Models in R by Kyle Walker, published with CRC Press
https://walker-data.com/census-r

Speedier database import #5

Open gergness opened 3 years ago

gergness commented 3 years ago

This looks great, so happy to see it all put together!

At lunch with @dtburk, we talked about how slow it sounded to import the 1910 full count data in section 11.2.2.

I was curious and so figured I'd report back, but please don't feel like you need to add another caveat about workflow. Also, I remain mystified by database administration, so there may be drawbacks I'm unaware of. Plus, DBeaver is a nice tool to show to help people get comfortable with databases.

Those caveats aside, in my experimentation I found chunked reading to be a lot faster than DBeaver's import. On my machine the DBeaver route took closer to 2.5 hours, while this takes ~15 minutes, loading only 10k rows at a time (it also works with both csv and fixed-width extracts).

library(DBI)
library(ipumsr)

# `conn` is an open DBI connection to the target database.
# Each chunk is appended to ipums.census1910 as it is read; the
# callback is run for its side effect only, so nothing accumulates in memory.
read_ipums_micro_chunked(
  "usa_00004.xml",
  IpumsSideEffectCallback$new(function(x, pos) {
    dbWriteTable(
      conn, Id(schema = "ipums", table = "census1910"), x, append = TRUE
    )
  }),
  chunk_size = 10000
)
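The snippet above assumes `conn` already points at an open DBI connection. A minimal, self-contained sketch of the same append pattern, using an in-memory SQLite database (RSQLite assumed installed) and a built-in data frame split into chunks in place of the IPUMS extract:

```r
library(DBI)
library(RSQLite)

conn <- dbConnect(SQLite(), ":memory:")

# Simulate chunked reading: split mtcars into 4 chunks and append each one,
# mirroring what the IpumsSideEffectCallback does per chunk of the extract.
chunks <- split(mtcars, rep(1:4, length.out = nrow(mtcars)))
for (x in chunks) {
  dbWriteTable(conn, "census1910", x, append = TRUE)
}

# All rows arrive in the table, regardless of how many chunks were written.
n <- dbGetQuery(conn, "SELECT COUNT(*) AS n FROM census1910")$n
dbDisconnect(conn)
```

With `append = TRUE`, `dbWriteTable()` creates the table on the first chunk and appends to it on every subsequent one, which is why the callback needs no special handling for the first batch.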
walkerke commented 2 years ago

Thanks @gergness! As your work is a significant part of the book, I really appreciate your feedback.

This is an excellent suggestion and I think it's worth a mention in the book. I'm not a DBA person myself either, which is why I like including the DBeaver section: it's my go-to for browsing databases, and non-DBA users are the general target there. However, a 2.5-hour load time is going to be unreasonable for some readers, so I think your example makes a lot of sense, probably in a callout box in that section. I'll include it in the next batch of edits.