I'm down to help out with the mass import of data to our DB. I think a prereq to this is figuring out our DB schema, though.
I'll start looking at this once I clean up the auth integration.
Do we have a MongoLab account name?
It's called "Rental App", though I'm not sure the best way to handle access / permissions to it since it seems to be under my account. I can distribute the login creds to you guys if that makes the most sense.
Also, looks like Zillow breaks their city rental data down by type of residence (e.g. 1-bedroom data, 2-bedroom data, etc.). Perhaps we should combine the different residence-type data for a city into a single record (or a record per year?), so that one query would fetch all of the data for that city.
Yeah I'd love to have the granularity of # bedrooms, but initially we could combine them. Or even just choose one of them initially. In any case, I'd vote to start with 2 bedrooms as our "standard" unit, but open to counterpoints.
On a side note, it's kind of a shame they don't have # bedrooms > 2 (unless they do..) since apartments like that are most popular in college towns.
Oh they do, up to 5+ it looks like. And yeah, the more I think about it, combining everything might not be a great idea. Something more like "select * where city='San Francisco' and residence_type=*" would give greater flexibility.
Oh great. I don't think it's a huge deal which # of bedrooms we choose at this point.
Based on your select statement, are you thinking of a SQL representation? I suppose it can be data dependent, since we might want to perform some aggregations that are better suited to JSON and offload those to the DB instead of constantly recomputing them server-side.
Nah, not saying we should use SQL, just needed an example of how we could still get all the data in a single query if we separated it into smaller records.
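In Mongo terms it'd be roughly this (just a sketch via pymongo; the "rentals" collection, the field names, and the connection string are all placeholders until we settle the schema):

```python
# Sketch only -- collection/field names are placeholders, not a settled schema.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # swap in the MongoLab URI
db = client["rental_app"]

# Everything for one city, across all residence types (the residence_type=* case):
sf_all = list(db.rentals.find({"city": "San Francisco"}))

# Or narrow it down when we want the per-bedroom granularity:
sf_2br = list(db.rentals.find({"city": "San Francisco", "residence_type": "2br"}))
```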
Got it. A lot of the data is probably suitable for a tabular format in any case. D3 can handle CSV data, so that isn't a limitation. For things like YoY trends I think tabular makes sense. When we want to do interesting visualizations that are overlays (crime heat maps, or any other heat maps for that matter), JSON will be the approach.
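For the aggregation-offloading idea, Mongo's aggregation pipeline should cover the recurring rollups so the server isn't recomputing them on every request. Rough sketch only, assuming the records end up with a city field, a real BSON date, and a price_sqft field:

```python
# Sketch: push an average-price-per-sqft-by-city-and-year rollup into Mongo
# instead of recomputing it server-side. Field/collection names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # swap in the MongoLab URI
db = client["rental_app"]

avg_by_city_year = db.rentals.aggregate([
    {"$group": {
        "_id": {"city": "$city", "year": {"$year": "$date"}},
        "avg_price_sqft": {"$avg": "$price_sqft"},
    }},
    {"$sort": {"_id.year": 1}},
])
```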
So here's what I'm thinking for now (to keep it simple):
{ 'state': 'CA', 'city': 'Los Angeles', 'date': '2010-02', 'price_sqft': 1.57208588957055 }
If we want to script the data insertion via Python, we can convert the date in the CSV (in the format "2010-02") to a Mongo-accepted format via:
datetime.datetime.strptime('2010-02', '%Y-%m')
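The full import script would then be something along these lines. This is a sketch: I'm assuming a long-format CSV with columns like "State", "City", "Date", and "PricePerSqft", so if the real Zillow file is wide-format (one column per month) we'd reshape it first, and the column names need to match whatever we actually download:

```python
# Sketch of the import script. Column names and the filename are assumptions;
# match them to the actual Zillow CSV before running.
import csv
import datetime

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")  # swap in the MongoLab URI
db = client["rental_app"]

def import_csv(path):
    docs = []
    with open(path) as f:
        for row in csv.DictReader(f):
            docs.append({
                "state": row["State"],
                "city": row["City"],
                # "2010-02" -> real datetime, so Mongo stores an actual date.
                "date": datetime.datetime.strptime(row["Date"], "%Y-%m"),
                "price_sqft": float(row["PricePerSqft"]),
            })
    if docs:
        db.rentals.insert_many(docs)

import_csv("zillow_rentals.csv")  # placeholder filename
```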
Sorry @mzamani1, missed this. A few comments:
I'll explain more coherently when I'm not about to fall asleep :).
@jplahn Yeah, that definitely sounds like a better approach. So the top-level fields would be something like: date, a 'data' object, and location. One thing I had missed was info on what type of property the data is for. Something like "residence_type" could be another top-level field, perhaps.
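So a record might look something like this (just a sketch of the shape, nothing settled):

```python
import datetime

# Possible record shape with the top-level fields above -- names and nesting
# are a proposal only.
doc = {
    "location": {"state": "CA", "city": "Los Angeles"},
    "residence_type": "2br",
    "date": datetime.datetime(2010, 2, 1),
    "data": {
        "price_sqft": 1.57208588957055,
    },
}
```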
Since a lot of the data we need can't be accessed via APIs, we'll need to download it and store it in our DB. Before we can do that, we need to figure out what it will look like. Zillow seems like the likely supplier of most of our data, and their data comes as CSV. That works with D3, but it may or may not be the best format.
As an extension of this, we'll also probably (definitely) want a tool to automate this process once we figure it out.
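Even a thin CLI wrapper would probably do as a starting point: pull down a CSV, then hand it to the import step sketched above. The URL handling below is a placeholder until we pick the exact Zillow files:

```python
# Sketch of the automation wrapper: fetch a CSV, then run the import step from
# the earlier sketch on it. URLs and filenames are placeholders.
import argparse
import urllib.request

def fetch(url, dest):
    # Download the CSV to a local file before importing it.
    urllib.request.urlretrieve(url, dest)
    return dest

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Fetch a Zillow rental CSV")
    parser.add_argument("url", help="direct link to the CSV (placeholder for now)")
    parser.add_argument("--dest", default="zillow_rentals.csv")
    args = parser.parse_args()
    path = fetch(args.url, args.dest)
    print("Downloaded to", path)  # next: run the import sketch on this file
```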