jplahn / rental_app

0 stars 0 forks source link

Define schema for different data #5

Open jplahn opened 8 years ago

jplahn commented 8 years ago

Since a lot of the data we need can't be accessed via APIs, we'll need to download and store the data in our DB. Before we can do that, we need to figure out what it will look like. Zillow seems like the likely supplier of most of our data and their data comes in csv. This works with D3, but it may or may not be the best format.

As an extension of this, we'll also probably (definitely) want a tool to automate this process once we figure it out.

natalie-zamani commented 8 years ago

I'm down to help out with the mass import of data to our DB. I think a prereq to this is to figure out or DB schema though.

jplahn commented 8 years ago

I'll start looking at this once I clean up the auth integration.

natalie-zamani commented 8 years ago

Do we have a MongoLab account name?

jplahn commented 8 years ago

It's called "Rental App", though I'm not sure the best way to handle access / permissions to it since it seems to be under my account. I can distribute the login creds to you guys if that makes the most sense.

natalie-zamani commented 8 years ago

Also, looks like Zillow breaks their city rental data down by type of residence (e.g. 1 bedroom data, 2 bedroom data, etc.) Perhaps we should combine different residence type data for a city into a single record (or a record for each year?), such that one query would fetch us all of the data for that city.

jplahn commented 8 years ago

Yeah I'd love to have the granularity of # bedrooms, but initially we could combine them. Or even just choose one of them initially. In any case, I'd vote to start with 2 bedrooms as our "standard" unit, but open to counterpoints.

On a side note, it's kind of a shame they don't have # bedrooms > 2 (unless they do..) since apartments like that are most popular in college towns.

natalie-zamani commented 8 years ago

Oh they do, up to 5+ it looks like. And yeah, the more I think about it, combining everything might not be a great idea. Something more like "select * where city="San Francisco" and residence_type=*" would be give greater flexibility.

http://www.zillow.com/research/data/#rental-data

jplahn commented 8 years ago

Oh great. I don't think it's a huge deal which one # bedrooms we choose at this point.

Based off your select statement, are you thinking a sql representation? I suppose it can be data dependent, since we might want to perform some aggregations that are better suited to JSON and offload those to the DB instead of constantly recomputing them server side.

natalie-zamani commented 8 years ago

Nah, not saying we should use sql, just needed to give an example of how we could get all data in a single query still if we separated the data into smaller records.

jplahn commented 8 years ago

Got it. A lot of the data probably is suitable for tabular format in any case. D3 can handle csv data, so that isn't a limitation. For things like YoY trends I think tabular makes sense. When we want to do interesting visualizations that are overlays (such as crime heat maps or any other heat maps for that matter), JSON will be the approach.

natalie-zamani commented 8 years ago

So here's what I'm thinking for now (to keep it simple): { 'state': 'CA', 'city': 'Los Angeles', 'date': '2010-02', 'price_sqft': 1.57208588957055 }

If we want to script the data insertion via Python, we can convert the date in the csv (in the format "2010-02") to a Mongo accepted format via: datetime.datetime.strptime('2010-02', '%Y-%m')

jplahn commented 8 years ago

Sorry @mzamani1, missed this. A few comments:

  1. I think we still need a more top level schema definition, i.e. something that defines the metadata around a particular dataset.
  2. We probably want to find a way to separate common elements (things like city, state, etc.) from data specific elements (like price_sqft or growth_yoy). So maybe we'd augment your data schema above with a 'data' object and shove 'price_sqft' into it and then move all the other data into something like 'location' and then include all of the metadata on the top level.

I'll explain more coherently when I'm not about to fall asleep :).

natalie-zamani commented 8 years ago

@jplahn Yeah, that definitely sounds like a better approach. So top level fields would be something like: date, 'data object', and location. One thing I had missed was the info on what type of property this data's for. Something like "residence_type" could be another top level field perhaps.