neuralinfo / Assignments


Assignment 4 Issues/Questions #23

Open neuralinfo opened 9 years ago

neuralinfo commented 9 years ago

Write your issues/questions about assignment 4 here

holatuwol commented 9 years ago

The docs for the "until" parameter of the Twitter search API say it's limited to 7 days, but the assignment asks us to find tweets starting from over a month ago. Just wanted to make sure I wasn't missing some obvious way to get the required date range? https://dev.twitter.com/rest/reference/get/search/tweets

neuralinfo commented 9 years ago

As explained in the second paragraph of the data collection task, you need to come up with a strategy to get the historical data that is not available through the Twitter API. One possible way is to scrape the Twitter Advanced Search results for the requested time frame.
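
For illustration only, here is a rough sketch of fetching and parsing one page of search results with requests and BeautifulSoup; the query operators and the tweet-text CSS class are assumptions you would need to verify against the live result pages:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical sketch: fetch one page of search results for a date range.
# The query operators and the "tweet-text" class are assumptions to verify
# against the actual Advanced Search result pages.
params = {"q": "#FIFAWWC since:2015-06-06 until:2015-07-06"}
resp = requests.get("https://twitter.com/search", params=params)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for p in soup.select("p.tweet-text"):
    print(p.get_text())
```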

dunmireg commented 9 years ago

2 Questions:

1). Is the choice of what information to store in the CSV file our design decision? That is, I can say I am only interested in the user, the time, and the tweet text, and just dump the rest (which will then inform what kinds of queries I can run).

2). For the tweet hashtags, are we gathering the tweets that have #FIFAWWC or any of the team hashtags? So this would be 24 team hashtags plus #FIFAWWC.

nickhamlin commented 9 years ago

Similar to Ted's question: can you provide a ballpark for a "reasonable" number of tweets?

neuralinfo commented 9 years ago

@dunmireg 1) You can store only the information that you need for the analysis part. 2) #FIFAWWC plus the 24 team hashtags.
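
For example, a minimal sketch that keeps only a few assumed fields in the CSV; the exact columns are your design decision:

```python
import csv

# Hypothetical column choice; keep only the fields your analysis needs.
fields = ["user", "created_at", "text", "team_hashtag"]

with open("tweets.csv", "w") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerow({
        "user": "some_user",
        "created_at": "2015-06-16 18:00:00",
        "text": "What a goal #FIFAWWC #USA",
        "team_hashtag": "#USA",
    })
```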

neuralinfo commented 9 years ago

@nickhamlin: There is no minimum requirement for the number of tweets. You should do your best to acquire as many tweets as possible. We'd like to see how many tweets you can gather and what you, as a data scientist, think is a reasonable number of tweets for this assignment.

brk21 commented 9 years ago

There seems to be an image missing from the assignment. I tried this in two browsers. Could you please link to the image or re-embed it?

http://screencast.com/t/DPBkIQsLwi

Thanks!

neuralinfo commented 9 years ago

The image is there, and we checked that it is accessible from two different browsers. If you click on the missing-image icon, are you able to see the image?

brk21 commented 9 years ago

I was not able originally, but clicking now reveals it. Thank you!

cu8blank commented 9 years ago

When trying to use Scrapy on the Twitter Advanced Search results website, it only returns 20 results. How do you get Scrapy to return more results?

neuralinfo commented 9 years ago

You are probably only processing the first page of the search results.
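
As a rough sketch of following pagination in Scrapy (the selectors below are assumptions you would adapt to the actual page structure):

```python
import scrapy

class TwitterSearchSpider(scrapy.Spider):
    """Illustrative only; the CSS selectors are hypothetical."""
    name = "twitter_search"
    start_urls = ["https://twitter.com/search?q=%23FIFAWWC"]

    def parse(self, response):
        # Tweets on the current page (selector is an assumption).
        for text in response.css("p.tweet-text::text").extract():
            yield {"text": text}

        # Follow the next page of results if one is exposed
        # (the next-page link selector is also an assumption).
        next_page = response.css("a.next-page::attr(href)").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```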

makkenned commented 9 years ago

It occurs to me (partially because I'm taking W231 alongside this class) that running a Tweet scraper may be against the Twitter TOS if we don't have prior approval. I don't want to put the MIDS program or anyone else in a potentially awkward position so I figured I'd ask. Do we have approval from Twitter to scrape this data, or is there something I'm missing?

neuralinfo commented 9 years ago

You are getting a sample of Twitter data through the web interface, which is publicly available to anyone. You are not scraping the whole available data, and you are not using it for business services. However, based on the Twitter TOS, you are not allowed to publish the scraped tweets. That's why we asked you to put the DB that contains the collected tweets on S3.

Based on the Twitter TOS, crawling the Services is permissible if done in accordance with the provisions of robots.txt. However, some types of scraping require permission, and Twitter has not defined those in their TOS. Interpreting the Twitter TOS is tricky and can be good practice for you.

Depending on the technology that you use for this assignment, we suggest you contact Twitter directly and get their consent for your program. Explain that you are getting this small sample of data for a course assignment, provide a brief description of how you plan to do it, and mention that you are not going to make the tweets publicly available and that it is not for any business-related services. Do not postpone this request; do it as soon as you can. If you encounter any problems getting their consent, let us know.

jamesgray007 commented 9 years ago

Would you consider the tweet "text" to be the entire list of characters including @, #, etc.? I see that the tweet is composed of a tag and an array of tags, so I am wondering if I should concatenate those during data acquisition to enable the analysis. I see that I will need to capture each of the URLs separately for Analysis part 3.

As I review Analysis part 4, I am trying to interpret "each word" in the tweet text, so I would assume that could include hashtags (e.g. #USA).

neuralinfo commented 9 years ago

If you look at the note section of the assignment, it says: "Only consider English tweets and, for simplicity, get rid of all punctuation, i.e., any character other than a to z and space." As you mentioned, you need to capture each of the URLs separately for analysis part 3. For analysis part 4, each word in the tweet text should be considered regardless of where it appears. The only exception is words that are part of a URL.
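
A small sketch of that cleaning step:

```python
import re

def to_words(text):
    # Keep only a-z and space, as the note in the assignment describes;
    # URLs should be pulled out beforehand since their words do not count.
    return re.sub(r"[^a-z ]", "", text.lower()).split()

print(to_words("Great save!! #USA"))  # -> ['great', 'save', 'usa']
```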

dhavalbhatt commented 9 years ago

I have a question about question 2 under questions in Analysis part of the assignment.

"Draw a table with all the team support hashtags you can find and the number of support messages. What country did get apparently the most support?"

Does this mean we have to store all hashtags occurring with each team hashtag or just find the count of tweets downloaded for each team hashtag? I am storing the team hashtag in the csv for this purpose.

neuralinfo commented 9 years ago

How to get the data required to generate this table is your design decision. The table has two columns: 1) team hashtags and 2) the number of support messages. Here is an example:

| Team's Hashtag | Number of Support Messages |
| --- | --- |
| USA | 200k |
| CAN | 180k |
| GER | 210k |
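
One possible way to produce those counts, assuming the collection step stored a team hashtag column (the column name here is hypothetical):

```python
import csv
from collections import Counter

counts = Counter()
with open("tweets.csv") as f:
    for row in csv.DictReader(f):
        counts[row["team_hashtag"]] += 1  # hypothetical column name

for hashtag, n in counts.most_common():
    print(hashtag, n)
```
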
holatuwol commented 9 years ago

Do we need to write map-reduce jobs (similar to what we configured in a previous class), or would it be okay to use the map and reduce functions in Spark (similar to what we did in our most recent class)?

neuralinfo commented 9 years ago

You need to write map-reduce jobs and run them on EMR.
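
As a bare-bones sketch of the shape a streaming job takes (word-count style; the parsing would need to match your own data layout):

```python
#!/usr/bin/env python
# mapper.py -- emit one (word, 1) pair per word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts per word; Hadoop delivers the input sorted by key
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = word, 0
    total += int(count)

if current is not None:
    print("%s\t%d" % (current, total))
```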

kchoi01 commented 9 years ago

When looking for the top 20 URLs, should we include the URLs of the hashtags?

neuralinfo commented 9 years ago

You should get the URLs that users tweeted. If a URL is part of a tweet's text, you should include it.

kchoi01 commented 9 years ago

What I meant is the hashtag itself is also a URL. For example, if the tweet text is "Yay #USA", then #USA is pointing to https://twitter.com/hashtag/usa?src=hash. Should we include this also?

If we include that then most likely the top 20 URLs will be the hashtags.

neuralinfo commented 9 years ago

No, do not include such URLs; only the URLs that appear in the tweet text itself count.
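
For analysis part 3, a rough sketch of pulling only the links written in the tweet text (the regex is an approximation, and shortened t.co links may need to be resolved separately):

```python
import re

URL_RE = re.compile(r"https?://\S+")

def tweet_urls(text):
    # Only links that appear in the tweet text itself count; hashtag pages
    # such as https://twitter.com/hashtag/usa are not collected.
    return URL_RE.findall(text)

print(tweet_urls("Match report http://t.co/abc123 #USA #FIFAWWC"))
# -> ['http://t.co/abc123']
```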

nkrishnaswami commented 9 years ago

Is the query intended to be #FIFAWWC AND (code1 OR code2 OR ... )? Or #FIFAWWC OR code1 OR code2 OR ... ? If the latter, there seems to be a lot of noise (non-FIFA WWC content)

neuralinfo commented 9 years ago

The query should be #FIFAWWC AND (code1 OR code2 OR ...), or simply tweets that have #FIFAWWC, as we are interested in the tweets related to the FIFAWWC and the teams that participated in the event.
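
For example, one way to assemble that query string (only a few of the 24 team codes are shown):

```python
# Sample team codes; fill in the remaining ones.
team_codes = ["USA", "CAN", "GER", "JPN"]
query = "#FIFAWWC (" + " OR ".join("#" + code for code in team_codes) + ")"
print(query)  # #FIFAWWC (#USA OR #CAN OR #GER OR #JPN)
```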

howardwen commented 9 years ago

How do you bypass the infinite scroll feature on Twitter? When I use inspect element in Chrome, the Request URL changes as you scroll down, so it seems like you would have to iterate over the URL part with "last_note_ts=####". I then pasted the Request URL directly into the browser and it seems to return JSON. Are we supposed to use JSON files, or am I headed in the wrong direction?

neuralinfo commented 9 years ago

You need to find out which attribute(s) change as you scroll down, find the pattern, and feed that pattern to your crawler program. You can also inspect the JSON responses, see which attribute(s) change in the responses as you scroll down, and use those attribute(s) to limit the results.
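
A rough sketch of that idea with requests; the endpoint, the parameter names (e.g. last_note_ts or max_position), and the JSON field names are assumptions to confirm by inspecting the actual responses:

```python
import requests

# Hypothetical endpoint and field names; verify them by watching the
# requests the page issues as you scroll.
url = "https://twitter.com/i/search/timeline"
params = {"q": "#FIFAWWC", "max_position": ""}

while True:
    data = requests.get(url, params=params).json()
    items = data.get("items_html", "")
    if not items.strip():
        break  # no more results
    # ... parse the tweets out of items here ...
    cursor = data.get("min_position")
    if not cursor:
        break
    params["max_position"] = cursor  # cursor for the next "scroll"
```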

kchoi01 commented 9 years ago

Some tweets use the full country name (e.g. #Canada instead of #CAN). Do we need to get those also?

kchoi01 commented 9 years ago

Also, some have two teams in one hash (e.g. #CANvCHN). Do we need those?

neuralinfo commented 9 years ago

For the sake of simplicity, you can ignore the full country names as well as two teams in one hashtag.

jamesgray007 commented 9 years ago

For the Twitter search query: is it possible to construct the entire operator syntax as one statement, or are multiple calls required for each permutation (#FIFAWWC #USA, #FIFAWWC #GER)? I am also having trouble making the AND condition work in the Twitter hashtag box, as it comes up as OR on the results page.

neuralinfo commented 9 years ago

As we stated above, the query should be #FIFAWWC AND (code1 OR code2 OR ...), or simply tweets that have #FIFAWWC, as we are interested in the tweets related to the FIFAWWC and the teams that participated in the event. The following link may help you resolve your problem with AND: https://support.twitter.com/articles/71577

StephTruong commented 9 years ago

> Use EMR for running your map-reduce tasks and include the configuration of your cluster in the architecture design document.

What is the architecture design document? Can you provide an example?

neuralinfo commented 9 years ago

Information about the architecture design document is provided in Appendix II: Assignment Grading Guidelines of the syllabus.

sarmehta88 commented 9 years ago

I have done about 25 trials running EMR using the console. I even tried different versions of Hadoop, but for the most part stuck with the latest emr-4.0, and I keep getting this error: Caused by: java.io.IOException: Cannot run program "s3://sarumehtareducetest/code/mapper.py": error=2, No such file or directory

I have the shebang in my mapper.py and reducer.py, and I even tried using the -files argument with the S3 URIs for the mapper and reducer, so then -mapper mapper.py -reducer reducer.py.

And I'm still getting the PipeMapOutputThread error, code 1. Everything works well locally. I am not sure what I am doing wrong?

Thanks

neuralinfo commented 9 years ago

Have you made your mapper.py and reducer.py executable (i.e., chmod +x)?

sarmehta88 commented 9 years ago

Yes, I have chmod 777 the files, but I am uploading my mapper.py and reducer.py to an S3 bucket under /codes/, so I don't think it should matter?

neuralinfo commented 9 years ago

Depending on which OS you use, the permission problem can carry over to the uploaded files. What shebang are you using?

neuralinfo commented 9 years ago

If you are on Windows, remove the permissions and make sure both of your scripts do not have carriage return symbols. If they do, you need to strip them out before uploading to S3.
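
If dos2unix or an editor set to Unix line endings is not handy, here is a small sketch that strips them (run as, e.g., python strip_cr.py mapper.py):

```python
# strip_cr.py -- rewrite a script with Unix line endings before uploading
import sys

path = sys.argv[1]
with open(path, "rb") as f:
    data = f.read().replace(b"\r\n", b"\n")
with open(path, "wb") as f:
    f.write(data)
```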

cu8blank commented 9 years ago

I know that this was asked previously, but could you please provide an alternate link to the image in Task #5? It is currently not working and clicking on it doesn't find the image either.

neuralinfo commented 9 years ago

The image is now manually embedded in the assignment. If you still have issues viewing it, you can use the following link: https://github.com/MIDS-W205/Assignments/issues/24

sarmehta88 commented 9 years ago

I am using a Mac and the shebang I used was

`#!/usr/bin/python`

and I also tried it with `#!/usr/bin/env python`, but no luck.

neuralinfo commented 9 years ago

Are you using the CLI or the web interface?

neuralinfo commented 9 years ago

Send us what you use with --files, i.e., the complete command that you run.

sarmehta88 commented 9 years ago

I used both the CLI and the web interface.

For the CLI:

```
aws emr create-cluster --steps file://./mysteps.json --ami-version 3.8.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium --name "NLTK Test Cluster" --log-uri s3://sarumehtareducetest/cmdline/ --enable-debugging --tags Name=emr --bootstrap-actions Path=s3://sarumehtareducetest/bootstrap/bootstrap-nltk.sh,Name="Setup NLTK"
```

where mysteps.json is:

```json
[
  {
    "Name": "SaruMAP",
    "Type": "STREAMING",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-files", "s3://sarumehtareducetest/codes/mapper.py,s3://sarumehtareducetest/codes/reducer.py",
      "-mapper", "mapper.py",
      "-reducer", "reducer.py",
      "-input", " s3://sarumehtareducetest/input/",
      "-output", " s3://sarumehtareducetest/output777/"
    ]
  }
]
```

And for the web interface (I copied this from controller.gz):

```
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar -files s3://sarumehtareducetest/codes/mapper.py,s3://sarumehtareducetest/codes/reducer.py -mapper mapper.py -reducer reducer.py -input s3://sarumehtareducetest/input/ -output s3://sarumehtareducetest/output777/
```

sarmehta88 commented 9 years ago

When I ran it from the command line, I read the error logs and all it said was "Streaming Failed!" as the only line, so not much info about anything. I then checked the other files in my logs, but they did not specify what the error was.

neuralinfo commented 9 years ago

It seems that you have not specified a subdirectory for the output directory when you used the web interface. When you create output777 as an output folder, you need to specify a subfolder in the wizard where you give your output directory. Once you choose your output directory from S3, add a subfolder after that. It should look like s3://sarumehtareducetest/output777/part-1. Note that you do not need to create part-1, since it is created for you. Try this using the web interface and see whether you still get the streaming job failed error.

sarmehta88 commented 9 years ago

I did as you said and still no luck through the web interface as well as the aws emr command line. I now get the error: Error: java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.lang.String.substring(String.java:1911)

It is still not reading my input. I loaded a very basic mapper.py (I changed the shebang to the one below, since #!/usr/bin/python didn't seem to help):

```python
#!/usr/bin/env python
import sys

for line in sys.stdin:
    for word in line.split():
        print ("%s\t%i" % (word, 1))
```

I noticed I am also getting a 206 status code, which means it is finding the input:

```
INFO [main] com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem: Opening 's3://sarumehtareducetest/input/easytest.txt' for reading 2015-08-10 20:21:46,404 INFO [main] com.amazonaws.latency: StatusCode=[206], ServiceName=[Amazon S3], AWSRequestID=[D730BB5CEFCFC69E], ServiceEndpoint=[https://sarumehtareducetest.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[98.739], HttpRequestTime=[94.476], HttpClientReceiveResponseTime=[91.509], RequestSigningTime=[0.587], ResponseProcessingTime=[1.544], HttpClientSendRequestTime=[0.962],
```

neuralinfo commented 9 years ago

Do you have the correct permissions for reading/writing to your S3 bucket? For simplicity, you may want to give full access to everyone and see whether that works.

neuralinfo commented 9 years ago

Do you have an empty line somewhere or at the end of easytest.txt?