neuralinfo / Assignments


Assignment 3 Issues/Questions #20

Open neuralinfo opened 9 years ago

neuralinfo commented 9 years ago

Write your issues/questions about assignment 3 here

nickhamlin commented 9 years ago

The way the question is posed implies that we should calculate the lexical diversity for each individual tweet and store that information in Mongo. Is that correct and, if so, what information should the plot of lexical diversity contain? A histogram representing the lexical diversities of all the tweets in the corpus?

neuralinfo commented 9 years ago

More information related to your question has been added to the instructions. If you still need further clarification, please post your question here.
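
For reference, a minimal sketch of one way to compute per-tweet lexical diversity, store it back on each document, and plot the distribution as a histogram, assuming pymongo and matplotlib are installed (database, collection, and field names are hypothetical):

```python
from pymongo import MongoClient
import matplotlib.pyplot as plt

def lexical_diversity(text):
    """Ratio of unique tokens to total tokens in a tweet."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

client = MongoClient("localhost", 27017)
tweets = client["twitter_db"]["tweets"]  # hypothetical database/collection names

diversities = []
for doc in tweets.find({}, {"text": 1}):
    ld = lexical_diversity(doc.get("text", ""))
    diversities.append(ld)
    # store the score back on the tweet document
    tweets.update_one({"_id": doc["_id"]}, {"$set": {"lexical_diversity": ld}})

# histogram of lexical diversity over the whole corpus
plt.hist(diversities, bins=50)
plt.xlabel("Lexical diversity")
plt.ylabel("Number of tweets")
plt.title("Lexical diversity across the tweet corpus")
plt.show()
```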

nickhamlin commented 9 years ago

Thanks, that explanation helps. Another question came up in office hours with Luis: for those of us who chose to use the firehose for Assignment 2 and didn't necessarily store all the metadata for each tweet, how should we think about part 1? My inclination would be to use the REST API to go back and re-gather some tweets, store them in S3 and MongoDB, and then use that corpus for the rest of the assignment. However, since we're now over a week removed from the NBA Finals and the REST API won't let us go back in time beyond that, there will be very few tweets with those hashtags. If that's the case, should we use another hashtag about a more current event (I've been using #pride to test things out thus far) to ensure we get a good amount of data?

neuralinfo commented 9 years ago

In that case, you can use two different related hashtags about a more current event and use that corpus for the rest of the assignment. Make sure you indicate these changes in your readme file as part of your submission.

kchoi01 commented 9 years ago

For parts 1-1 and 2-2, do we need to continuously obtain tweets, store them, and do the analysis as the tweets come in? How long should we run this?

neuralinfo commented 9 years ago

You do need to gather all the available tweets associated with the hashtags, as indicated in the instructions. In most cases you can get one week's worth of tweets.

nickhamlin commented 9 years ago

Two clarifying questions regarding part 2.3: 1.) Since the Twitter API doesn't support historical querying of follower information, is it the case that we need to gather a list of all the followers for the users from 2.1, wait a week, then gather another list of followers for those same users, and compare the two to find the "unfollowers"?

2.) In the event that our users from 2.1 contain celebrities that have millions of followers, are we expected to store them all? For example, if @BarackObama supplies one of our top 30 retweets, do we need to store all 61.4 million of his followers?

neuralinfo commented 9 years ago

1) Yes. That is one way to find the difference in the list of followers. 2) You are supposed to store all the followers. However, in the case of celebrities, you can assume an upper bound for the number of followers that you decide to gather. That is your design decision; make sure you mention it in the readme file with sufficient explanation.
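
A minimal sketch of the snapshot-and-compare idea from 1), assuming the follower IDs from the first and second passes were stored as list fields in a MongoDB collection (collection and field names are hypothetical):

```python
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
snapshots = client["twitter_db"]["follower_snapshots"]  # hypothetical collection

def unfollowers(user_id):
    """Follower IDs present in the first snapshot but missing from the second."""
    first = snapshots.find_one({"user_id": user_id, "pass": 1})
    second = snapshots.find_one({"user_id": user_id, "pass": 2})
    if not first or not second:
        return set()
    return set(first["follower_ids"]) - set(second["follower_ids"])

print(unfollowers(813286))  # example user ID
```

If you cap the number of followers gathered for celebrity accounts, applying the same cap to both snapshots keeps the comparison meaningful.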

vincentchio commented 9 years ago

I am a bit confused about the difference between 1.1 and 1.2. In 1.1, are we storing the raw JSON returned by the Twitter API in MongoDB, whereas in 1.2 we are storing just the tweet text in MongoDB? Also, in 1.1, do we need to fetch the data again by calling the Twitter API, or can we simply reuse the raw data that we gathered for Assignment 2?

neuralinfo commented 9 years ago

1) In 1.1 you are loading the data from the source as you retrieve it (real-time data storage), whereas in 1.2 you are loading the data from a gathered/chunked dataset (offline storage). If you look at 2.1, you should be able to figure out what needs to be stored in 1.2.

2) In 1.1, you need to fetch the data again by calling the Twitter API.
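
To make the distinction concrete, here is a minimal sketch of the 1.1-style flow that inserts each tweet into MongoDB as it comes back from the REST API. It assumes tweepy (older versions expose the search endpoint as api.search) and pymongo; the placeholder credentials, hashtag, and database/collection names are hypothetical:

```python
import tweepy
from pymongo import MongoClient

# hypothetical placeholder credentials; substitute your own keys and tokens
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

tweets = MongoClient("localhost", 27017)["twitter_db"]["tweets"]  # hypothetical names

# 1.1 style: store each JSON document as it is retrieved from the REST API
for status in tweepy.Cursor(api.search, q="#NBAFinals2015", count=100).items(1000):
    tweets.insert_one(status._json)
```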

jamesgray007 commented 9 years ago

Follow-up to @vincentchio's question:

Task 1.1 -> I don't see anything in the instructions that sets a requirement for real-time retrieval, given that it states the use of the REST API. My interpretation of this requirement was taking the JSON I stored locally on my device and loading it into MongoDB. Please confirm whether loading locally stored JSON into MongoDB meets the expectation.

Task 1.2 -> My interpretation is loading data from S3 into MongoDB. I do not see an answer to @vincentchio's question about whether this is the entire JSON for each tweet or just the tweet "text".

The instructions also state we can re-use our JSON's from Assignment 2.

neuralinfo commented 9 years ago

In task 1.1, the instructions say: "write a Python program to automatically retrieve and store the JSON files returned by the Twitter REST API." If you are using the JSONs that you gathered for Assignment 2, you only need to load the locally stored JSON into MongoDB, since we do not want you to repeat the same task. However, if you are using different hashtags, you probably want to store the JSONs directly, as it is not a good design decision to store them locally and load them later.

For task 1.2, you need to load the data from S3 into MongoDB. However, you need to decide for yourself which fields to import in order to answer the analysis parts. For example, if you store the entire JSON rather than the tweet text, you probably impose more storage overhead if the extra information you imported (i.e., other fields/keys in the JSON files) is not necessary to answer the analysis parts. On the other hand, if you store only the tweet text, would you be able to answer analysis parts such as finding the usernames (users who authored the retweets) and the locations of users, as asked in 2.1, without incurring the storage overhead? These are design decisions that we want you to think about.
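
For illustration, a minimal sketch of loading only selected fields from JSON files in S3 into MongoDB, assuming boto3 and pymongo and one tweet object per S3 key (bucket name, prefix, and the particular field choices are hypothetical):

```python
import json

import boto3
from pymongo import MongoClient

s3 = boto3.resource("s3")
bucket = s3.Bucket("my-tweet-bucket")  # hypothetical bucket name
tweets = MongoClient("localhost", 27017)["twitter_db"]["tweets"]

for obj in bucket.objects.filter(Prefix="nbafinals2015/"):  # hypothetical prefix
    tweet = json.loads(obj.get()["Body"].read())
    # keep only the fields needed for the analysis parts
    tweets.insert_one({
        "id": tweet["id"],
        "text": tweet["text"],
        "user": tweet["user"]["screen_name"],
        "location": tweet["user"].get("location"),
        "retweet_count": tweet.get("retweet_count", 0),
    })
```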

hdanish commented 9 years ago

For question 2.3, since the tweets are based on question 2.1, which is in turn based on 1.2 (Assignment 2), will it be fine to essentially re-run a similar process at this point in time and compare the followers? It may be longer than a week's difference, but I would think the logic should be the same whether it's a one-week difference or a multi-week difference?

neuralinfo commented 9 years ago

That is fine. Make sure that you mention this in your readme file.

hdanish commented 9 years ago

When we are retrieving the 30 top retweets, how are we supposed to handle items that have been retweeted by many different users? It seems as though the retweet_count property of a retweet refers to the original tweet. So, for example, if LeBron had a tweet that was retweeted 5000 times, including by users A, B, and C, then the retweet count will show up as 5000 for each of the retweets by those users. If it's a particularly popular tweet, it's possible that the 30 retweets all reference the same original tweet. How are we meant to handle a situation like that, or is it fine that we will have the same tweet text over and over again but for different users?

neuralinfo commented 9 years ago

It is not acceptable to have the same tweet text over and over again. You need to find a way to check for and filter such tweets. You may want to look at "RT" at the front of a tweet or other features that Twitter provides.
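
One way to do the filtering is to key retweets by the ID of the original tweet, which the v1.1 JSON exposes in the retweeted_status field. A minimal sketch, assuming the raw tweets are already in MongoDB (database and collection names are hypothetical):

```python
from pymongo import MongoClient

tweets = MongoClient("localhost", 27017)["twitter_db"]["tweets"]  # hypothetical names

# group retweets by the ID of the original tweet they point back to
top_originals = {}
for doc in tweets.find({"retweeted_status": {"$exists": True}}):
    original = doc["retweeted_status"]
    top_originals[original["id"]] = {
        "text": original["text"],
        "retweet_count": original.get("retweet_count", 0),
    }

# rank the distinct originals and keep the top 30
top_30 = sorted(top_originals.values(),
                key=lambda t: t["retweet_count"], reverse=True)[:30]
```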

kchoi01 commented 9 years ago

For 2-2, are we just looking at the tweets we have collected, or do we have to go to Twitter and get all the tweets for each user?

neuralinfo commented 9 years ago

The instructions say "you need to find all the tweets of a particular user". If you have already stored them during the acquisition phase, you can use them. Otherwise, you need to get them from Twitter (the tweets that are available to you through the REST API).

hdanish commented 9 years ago

Do we need to include retweets as part of a user's tweets, or is this a design decision we can indicate in the readme?

kchoi01 commented 9 years ago

I guess I should clarify the question. Does "all the tweets of a user" mean all the tweets of that user with the hashtags "NBAFinals2015" and "Warriors" (i.e., the ones we have already collected), or ALL tweets by that user (i.e., do a new search with the user as the only search parameter)?

neuralinfo commented 9 years ago

@hdanish: it is your design decision.

neuralinfo commented 9 years ago

@kchoi01: ALL tweets by that user. The majority of the users only have 1 or 2 tweets with the hashtags "NBAFinals2015" and "Warriors".

hdanish commented 9 years ago

Just to confirm on the last point: if a user has 50k tweets, then we would need to retrieve all of them? Lots of users have tens of thousands of tweets, so we might be retrieving millions of tweets for all our users combined?

vincentchio commented 9 years ago

Adding to @hdanish's comment: I have 200,000 unique users in my db_restT. If I retrieve 1,000 tweets per user, it will take 34 days to complete the search at the current max rate limit (60,000 tweets/15 min). I hope that the number of tweets per user that we fetch can be adjusted according to one's situation. In my case, I plan to fetch at most 100 tweets per user, which would take 3.5 days. I hope the number of tweets per user is not a hard requirement but a soft one, depending on each student's case.

neuralinfo commented 9 years ago

@hdanish and @vincentchio: though your program should be able to retrieve all of a user's tweets, the number of tweets per user that you pull is not a hard requirement. You can come up with a reasonable upper bound for the maximum number of tweets that you pull. Make sure you mention this in your readme.
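
A minimal sketch of capping the number of tweets pulled per user, assuming tweepy and an authenticated API object like the one in the earlier sketch (the cap value and helper name are hypothetical design choices):

```python
import tweepy

MAX_TWEETS_PER_USER = 100  # design decision: document this upper bound in the readme

def user_tweets(api, screen_name, limit=MAX_TWEETS_PER_USER):
    """Fetch up to `limit` of a user's most recent tweets via the REST API."""
    cursor = tweepy.Cursor(api.user_timeline, screen_name=screen_name, count=200)
    return [status._json for status in cursor.items(limit)]
```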

dunmireg commented 9 years ago

In a similar vein, is it OK to limit the followers for task 2.3 to, say, the first 10,000 followers retrieved?

neuralinfo commented 9 years ago

This has already been answered. Check the answers to @nickhamlin's question in one of the above posts.

dhavalbhatt commented 9 years ago

Running mongod gives the error "MongoDB Insufficient free space for journal files". How do you recommend we deal with it? Thanks.

neuralinfo commented 9 years ago

Try running MongoDB without journaling: ./bin/mongod --config=myconf.conf --nojournal &

nminus1 commented 9 years ago

For 2.3, some of us will be unable to complete the data collection in time to compare across a week's time lag. Is the "one week" a hard requirement, or could the task be implemented with the best time lag we can manage within the time limitations?

neuralinfo commented 9 years ago

You can implement 2.3 with the best time lag you can manage; make sure you indicate this in your readme file.