Analyze - ML or Statistics or any fashion to analyze the data

rcliao commented 10 years ago

Let's use this issue to document what we have for the analysis part. @jkroening @surhorse

If you have any thoughts, please comment on this issue so that we can document everything in one place.

The data we have so far contains the following attributes:

Attribute Name	Description/Note
photo_id	Unique identifier
tags	List of tags the author added
geolocation	Longitude and latitude where the photo was taken; currently used mainly to plot it on the Google map
date_taken	When the photo was taken (photo metadata)
date_posted	When the photo was uploaded to the Flickr service
views	Number of views the photo had when we downloaded the data
locale	Where the photo was taken, as a string
county	County
region	Region
url	URL of the image on the Flickr server, used mainly to display the image in the visualization
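
For reference, here is a minimal sketch of how one record with these attributes could be represented in Python. The types are assumptions (the table does not state them), the geolocation is split into two floats for convenience, and every value in the example record is made up.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class Photo:
    """One photo record; field types are assumptions, not confirmed by the data."""
    photo_id: str
    tags: List[str]
    latitude: float       # geolocation, split into two floats (assumption)
    longitude: float
    date_taken: datetime
    date_posted: datetime
    views: int
    locale: str
    county: str
    region: str
    url: str


# Made-up example record, for illustration only.
example = Photo(
    photo_id="0000000001",
    tags=["beach", "sunset"],
    latitude=34.01,
    longitude=-118.49,
    date_taken=datetime(2013, 7, 4, 19, 30),
    date_posted=datetime(2013, 7, 6, 20, 0),
    views=120,
    locale="Santa Monica",
    county="Los Angeles",
    region="California",
    url="https://example.com/photo.jpg",  # placeholder, not a real Flickr URL
)

# Days between taking and posting the photo.
upload_delay_days = (example.date_posted - example.date_taken).days
```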

dwyoung514 commented 10 years ago

Up until now, our biggest struggle has been how to deal with tags, and we have yet to find a decent way to handle them. Instead, what if we completely disregarded tags and tried to come up with something based on what can be more easily quantified: geolocation, date_taken, date_posted, and views? A count of the photos can be an additional statistic, giving us 5 fields, which should be plenty from the photo perspective. (The locale can be substituted for geolocation if we need it for clustering purposes.) In short, we have plenty of information as it pertains to the photos.

I believe that we are having trouble finding things to analyze because we can't find a way to use the tags. I think finding outside data would help our analysis. For example, if we throw in data on which locations rate highly as good places to live, then we can start to make predictions based on score vs tags/locations/quantity of photos.
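
As a rough sketch of the tag-free statistics described above, photos could be grouped by locale, counting photos and summing views for each group. This assumes each photo is a dict with `locale` and `views` keys; the function name and the sample records are made up for illustration.

```python
from collections import defaultdict


def summarize_by_locale(photos):
    """Return {locale: {"photos": count, "views": total views}}."""
    summary = defaultdict(lambda: {"photos": 0, "views": 0})
    for photo in photos:
        stats = summary[photo["locale"]]
        stats["photos"] += 1
        stats["views"] += photo["views"]
    return dict(summary)


# Made-up sample records.
sample_photos = [
    {"locale": "Santa Monica", "views": 120},
    {"locale": "Santa Monica", "views": 30},
    {"locale": "Pasadena", "views": 45},
]
locale_summary = summarize_by_locale(sample_photos)
```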

What do you guys think? Can we make it with just what we have or would getting additional information help?

dwyoung514 commented 10 years ago

Another idea I thought about was to look at how a large difference between date_taken and date_posted affects views, assuming identical tags. We can remove the tags that are insignificant (those that appear fewer than a certain number of times) and take an average across all tags with the same delta. Then graph how much waiting to upload a picture matters, compared with uploading right away.

This could potentially give us nothing to predict though... the graph might be completely boring.
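
A minimal sketch of the delay analysis, to make the idea concrete. It ignores tags for simplicity, assumes each photo is a dict with `date_taken` and `date_posted` as `datetime` objects plus an integer `views`, and uses a made-up 7-day bucket size. The sample records are invented.

```python
from collections import defaultdict
from datetime import datetime


def mean_views_by_delay(photos, bucket_days=7):
    """Average views per upload-delay bucket (bucket key = first day of bucket)."""
    totals = defaultdict(lambda: [0, 0])  # bucket -> [sum of views, photo count]
    for photo in photos:
        delay_days = (photo["date_posted"] - photo["date_taken"]).days
        if delay_days < 0:
            continue  # skip inconsistent timestamps, e.g. a wrong camera clock
        bucket = (delay_days // bucket_days) * bucket_days
        totals[bucket][0] += photo["views"]
        totals[bucket][1] += 1
    return {b: total / count for b, (total, count) in sorted(totals.items())}


# Made-up sample records.
delay_sample = [
    {"date_taken": datetime(2013, 1, 1), "date_posted": datetime(2013, 1, 2), "views": 100},
    {"date_taken": datetime(2013, 1, 1), "date_posted": datetime(2013, 1, 4), "views": 50},
    {"date_taken": datetime(2013, 1, 1), "date_posted": datetime(2013, 1, 20), "views": 30},
]
delay_result = mean_views_by_delay(delay_sample)
```

The resulting dict maps each delay bucket to its mean view count, which is the series that would go on the graph.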

jkroening commented 10 years ago

I like this idea on how long people wait to upload, but let's not get too greedy. Forget tags for this date_taken, date_posted analysis.

As to the other comment: tags are a part of the photo data. It's called metadata, and it has intrinsic meaning, unlike the other data for photos, wherein meaning must be derived. That said, I don't mind reducing our reliance on tags as a data point. However, why can't we still use the count of top tags and a top 10-20 tags by total view count as a couple of cool graphs?
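
Something like the following could produce the numbers for those two graphs. It is only a sketch: it assumes each photo is a dict with a `tags` list and an integer `views`, and the function name and sample records are made up.

```python
from collections import Counter


def top_tags(photos, n=10):
    """Return (top n tags by photo count, top n tags by total views)."""
    counts = Counter()
    views = Counter()
    for photo in photos:
        for tag in set(photo["tags"]):  # count each tag once per photo
            counts[tag] += 1
            views[tag] += photo["views"]
    return counts.most_common(n), views.most_common(n)


# Made-up sample records.
tag_sample = [
    {"tags": ["beach", "sunset"], "views": 100},
    {"tags": ["beach"], "views": 20},
    {"tags": ["city"], "views": 500},
]
by_count, by_views = top_tags(tag_sample, n=2)
```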

As to whether we have enough: I'm certain it's going to be fine. But my idea is to make a map where you can type a word in a search box and, if it matches a tag in the data set, it will flash on the map where pictures were taken with that tag: the bigger the flash, the more views. It would have a slider at the bottom that passes through time over the year.
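
The data side of that map could look roughly like this. It is a sketch under assumptions: each photo is a dict with `tags`, `lat`, `lon`, `views`, and a `date_taken` datetime; the slider position is taken to be a month number; and the square-root scaling of the flash radius is just one plausible choice so that very popular photos do not swamp the map.

```python
import math
from datetime import datetime


def markers_for(photos, query, month, base_radius=2.0):
    """Markers for photos tagged with `query` and taken in the given month."""
    query = query.strip().lower()
    markers = []
    for photo in photos:
        if photo["date_taken"].month != month:
            continue
        if query not in (tag.lower() for tag in photo["tags"]):
            continue
        # Radius grows with views; sqrt scaling is an assumption.
        radius = base_radius * math.sqrt(1 + photo["views"])
        markers.append({"lat": photo["lat"], "lon": photo["lon"], "radius": radius})
    return markers


# Made-up sample records.
map_sample = [
    {"tags": ["Beach"], "lat": 34.0, "lon": -118.5, "views": 99,
     "date_taken": datetime(2013, 7, 4)},
    {"tags": ["beach"], "lat": 33.9, "lon": -118.4, "views": 10,
     "date_taken": datetime(2013, 8, 1)},
    {"tags": ["city"], "lat": 34.1, "lon": -118.2, "views": 500,
     "date_taken": datetime(2013, 7, 9)},
]
july_beach = markers_for(map_sample, "beach", month=7)
```

Moving the slider would then just mean calling `markers_for` again with a different `month`.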

Jonathan Kroening | jonathankroening.com

rcliao commented 10 years ago

It doesn't seem like we have time to do any crazy analysis at this moment, so I think we will just focus on the visualization. Therefore, I will close this issue and just pretend nothing happened here. :+1:

really-lazy-bone / beautiful-data

Analyze - ML or Statistics or any fashion to analyze the data #8