stat157 / recent-quakes

Stat 157 Homework 2 due on Monday 2013-10-21 at 11:59pm

Steps to Curate Data #8

Open davidwang001 opened 10 years ago

davidwang001 commented 10 years ago

I'm pretty new to python and I know others are as well.

It would be really helpful if someone could briefly list some steps to curate the data so I can have some direction in my google/stackoverflow searches.

Thanks!

kqdtran commented 10 years ago
  1. Get one of the URLs at http://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php
  2. Use urllib to read in the content of the url, which is in JSON format
  3. Extract data from the JSON content. If you use json.load, I believe the data is stored in a dictionary (key-value pairs), so you can access the values using bracket notation, e.g. quakes['geometry'].
  4. You can do anything with the data now: store it in a list, populate it into a Pandas object like a DataFrame, or cache it somewhere using cPickle
  5. ...
  6. Profit!
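
A minimal sketch of the steps above, assuming Python 3 (so urllib.request and pickle stand in for the urllib and cPickle mentioned in the list). The feed URL, the helper names (fetch_feed, save_cache, load_cache) and the two sample quakes are assumptions for illustration; the sample values are made up so the sketch runs without a network connection.

```python
import json
import os
import pickle
import tempfile
from urllib.request import urlopen

# Assumed feed URL (one of those listed on the USGS page); swap in the one you need.
FEED_URL = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.geojson"


def fetch_feed(url=FEED_URL):
    """Steps 1-2: download the feed and parse its JSON into a dictionary."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def save_cache(obj, path):
    """Step 4: write the curated object to a local pickle file."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_cache(path):
    """Read the curated object back from the local pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Made-up sample in the same shape as the feed, so nothing is downloaded here.
SAMPLE_TEXT = """
{"type": "FeatureCollection",
 "metadata": {"title": "sample feed"},
 "features": [
   {"type": "Feature",
    "properties": {"mag": 1.2, "place": "Sample Place A"},
    "geometry": {"type": "Point", "coordinates": [-122.8, 38.8, 2.1]}},
   {"type": "Feature",
    "properties": {"mag": 4.5, "place": "Sample Place B"},
    "geometry": {"type": "Point", "coordinates": [142.4, 38.3, 30.0]}}
 ]}
"""

feed = json.loads(SAMPLE_TEXT)   # step 3: JSON text becomes nested dicts and lists
features = feed["features"]      # one dictionary per earthquake

cache_path = os.path.join(tempfile.mkdtemp(), "quakes.pkl")
save_cache(features, cache_path)
cached = load_cache(cache_path)
print(len(cached), "quakes cached")
```

To work with live data, replace the json.loads(SAMPLE_TEXT) line with feed = fetch_feed().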

Let me know if that helps! ~

davidwang001 commented 10 years ago

Yes this helps a lot, thanks so much

tandrasfay commented 10 years ago

@kqdtran I am ok until trying to access the data using quakes['geometry']. What do you mean by populate the data into a pandas object? I can't get pd.DataFrame.from_dict(quake_vals,orient="index") to work. I keep getting AttributeError: 'list' object has no attribute 'values'. And if I try to subset inside the list to get into the dictionary, I get AttributeError: 'unicode' object has no attribute 'keys'.

kqdtran commented 10 years ago

@tandrasfay quakes['geometry'] is just an example :D, in this case, it happens to return a list of coordinates. You will want to turn that into some sort of dictionary, for example coords_dict = {'coords': quakes['geometry']} (avoid naming the variable dict, since that hides Python's built-in dict), which says "I want the string 'coords' to point to that list of coordinates" (think of it as a column header, and the list is a value of one row under that column). Then you should be able to turn that dictionary into a Pandas DataFrame object.

If you want to index a list (is that what you meant by 'subset'?), you can access the individual item using bracket notation, or you can just do for item in quakes['geometry'], which should then give you each of the three coordinates: longitude, latitude, and depth, in that order.
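
A small sketch of both ways to get at the coordinates. It assumes a single quake laid out as in standard GeoJSON, where the coordinate list sits under geometry['coordinates']; the quake dictionary and its numbers are made up for illustration.

```python
# One made-up quake in GeoJSON shape (values are illustrative only).
quake = {
    "geometry": {"type": "Point", "coordinates": [-122.8, 38.8, 2.1]},
}

coords = quake["geometry"]["coordinates"]   # a plain Python list

# Bracket notation picks one item by position.
first = coords[0]

# A for loop visits every item in order.
for item in coords:
    print(item)

# Unpacking gives each value a name, which reads better than coords[0], coords[1], ...
longitude, latitude, depth = coords
row = {"longitude": longitude, "latitude": latitude, "depth": depth}
print(row)
```

A list of such row dictionaries is one shape that pd.DataFrame accepts directly, with the keys becoming column headers.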

reenashah commented 10 years ago

Thanks for all your help @kqdtran! This is really helpful stuff. I was wondering if anyone knows how to implement the for loop to access individual items; I originally used bracket notation to get specific values of longitude, latitude, etc., but I'm realizing that this isn't very "reproducible".

Right now I'm stuck with something like this:

for 'place' in quakes['properties']

But I am unsure what to do after this to get the output I'm looking for :(

kqdtran commented 10 years ago

@reenashah No worries :-). You're right that using bracket notation isn't really reproducible, and is also error-prone should the data change. However, I discussed this method with Aaron in OH today, and he said it's fine to extract the data this way since we know how it's represented. It's a bit cumbersome/messy compared to CSV, but... well, it's real world data xD. Additionally, since we will have to cache the data once we're done curating it, even if the live json changes, our code should still work with the local data. So in a way, it's "reproducible".

Here's a trick that makes our lives easier: use pprint. It stands for pretty print, and it will show you the data in a much nicer format, compared to the standard print statement. For example:

from pprint import pprint

# d is the dictionary you got back from json.load
features = d['features'] # look it up by key, since we don't care about metadata, bbox, etc.
pprint(features)

You should see something like this:

[Screenshot: 2013-10-19_18-59-19]

So now that we know the structure of the data, we can loop over it with for quake in features, and then start extracting the data from each "quake" using square bracket notation. For a dictionary {}, the key goes inside the brackets, e.g. 'properties'. For a list [], it will be an index: 0, 1, 2, etc., or you can do for item in list. The data is nested in multiple levels, so we will have to grind it out the hard way to find what we need...
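
Here is a hedged sketch of that loop. The features list below is a made-up stand-in for the real feed, and extract_columns is a hypothetical helper name, not something from the assignment.

```python
# Made-up stand-in for the list of features from the feed.
features = [
    {"properties": {"mag": 1.2, "place": "Sample Place A"},
     "geometry": {"type": "Point", "coordinates": [-122.8, 38.8, 2.1]}},
    {"properties": {"mag": 4.5, "place": "Sample Place B"},
     "geometry": {"type": "Point", "coordinates": [142.4, 38.3, 30.0]}},
]


def extract_columns(features):
    """Collect one list per field, walking the nesting one level at a time."""
    columns = {"place": [], "mag": [], "longitude": [], "latitude": []}
    for quake in features:                       # each quake is a dictionary
        props = quake["properties"]              # dictionary: index by key
        coords = quake["geometry"]["coordinates"]  # list: index by position
        columns["place"].append(props["place"])
        columns["mag"].append(props["mag"])
        columns["longitude"].append(coords[0])
        columns["latitude"].append(coords[1])
    return columns


columns = extract_columns(features)
places = columns["place"]
print(places)
```

Note that the loop variable is a plain name (quake), not a quoted string; the string keys such as 'place' go inside the brackets. The resulting dictionary of equal-length lists can be passed to pd.DataFrame if you are using pandas.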

Let me know if that helps!

reenashah commented 10 years ago

OMG @kqdtran yes this was so incredibly helpful!! Thanks so much for taking the time to help out, really appreciate it!! :D

aculich commented 10 years ago

@kqdtran This is really fantastic work! Thanks for doing a great job figuring this out and sharing it with everyone else. Your work here truly captures the essence of the kind of collaboration we are trying to achieve!

Thanks!

-Aaron