neuralinfo opened this issue 9 years ago
The "until" parameter of the Twitter search API says it's limited to 7 days, but the assignment asks us to find tweets starting from over a month ago. Just wanted to make sure I wasn't missing some obvious way to get the required date range? https://dev.twitter.com/rest/reference/get/search/tweets
As explained in the second paragraph of the data collection task, you need to come up with a strategy to get the historical data that is not available through the Twitter API. One possible way is to scrape the Twitter Advanced Search results for the requested time frame.
2 Questions:
1) Is the choice of what information to store in the CSV file our design decision? That is, can I say I am only interested in the user, the time, and the tweet text, and just drop the rest (which will then determine what kinds of queries I can run)?
2) For the tweet hashtags, are we gathering the tweets that have #FIFAWWC or any of the team hashtags? So this would be 24 team hashtags plus #FIFAWWC.
Similar to Ted's question: can you provide a ballpark for a "reasonable" number of tweets?
@dunmireg 1) You can store only the information that you need for the analysis part. 2) #FIFAWWC plus the 24 team hashtags.
@nickhamlin: There is no minimum requirement for the number of tweets. You should do your best to acquire as many tweets as possible. We'd like to see how many tweets you can gather and what you, as a data scientist, think a reasonable number of tweets is for this assignment.
There seems to be an image missing from the assignment. I tried this in two browsers. Could you please link to the image or re-embed it?
http://screencast.com/t/DPBkIQsLwi
Thanks!
The image is there, and its accessibility has been checked from two different browsers. If you click on the missing-image icon, are you able to see the image?
I was not able originally, but clicking now reveals it. Thank you!
When I try to use Scrapy on the Twitter Advanced Search results page, it only returns 20 results. How do I get Scrapy to return more results?
Probably you are only processing the first page of the search results.
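In case a concrete starting point helps, here is a minimal Scrapy sketch of that idea. The start URL, the CSS selectors, and the max_position/data-min-position cursor names are placeholders I made up, so inspect the real page (or the JSON responses behind the infinite scroll) and substitute whatever you actually find there:

```python
# Minimal Scrapy sketch: scrape one page of search results, then request the
# next "page" using whatever pagination cursor the site exposes.
# NOTE: URL, selectors, and parameter names below are hypothetical.
import scrapy


class TweetSearchSpider(scrapy.Spider):
    name = "tweet_search"
    # Hypothetical starting URL; build the real one from the Advanced Search form.
    start_urls = ["https://twitter.com/search?q=%23FIFAWWC&f=tweets"]

    def parse(self, response):
        # Placeholder selectors -- check the actual page markup.
        for tweet in response.css("div.tweet"):
            yield {
                "user": tweet.css(".username::text").get(),
                "time": tweet.css(".timestamp::attr(data-time)").get(),
                "text": " ".join(tweet.css(".tweet-text ::text").getall()),
            }

        # The first response only contains ~20 tweets; to get more, follow the
        # pagination cursor you observe in the network tab (name is assumed here).
        next_cursor = response.css("div.stream-container::attr(data-min-position)").get()
        if next_cursor:
            yield response.follow(
                f"{response.url}&max_position={next_cursor}", callback=self.parse
            )
```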
It occurs to me (partially because I'm taking W231 alongside this class) that running a Tweet scraper may be against the Twitter TOS if we don't have prior approval. I don't want to put the MIDS program or anyone else in a potentially awkward position so I figured I'd ask. Do we have approval from Twitter to scrape this data, or is there something I'm missing?
You are getting a sample of Twitter data through the web interface, which is publicly available to anyone. You are not scraping all of the available data, and you are not using it for business services. However, based on the Twitter TOS, you are not allowed to publish the scraped tweets; that's why we asked you to put the DB that contains the collected tweets on S3. Based on the Twitter TOS, crawling the Services is permissible if done in accordance with the provisions of the robots.txt. However, some types of scraping require permission, and Twitter has not defined those in their TOS. Interpreting the Twitter TOS is tricky and can be good practice for you.

Depending on the technology that you use for this assignment, we suggest you contact Twitter directly and get their consent for your program. You need to explain that you are getting this small sample of data for a course assignment, provide a brief description of how you plan to do it, and mention that you are not going to make the tweets publicly available and that it is not for any business-related services. Do not postpone this request; do it as soon as you can. If you encounter any problems getting their consent, let us know.
Would you consider the tweet "text" the entire list of characters including @, #, etc. ? I see that the tweet is composed of a
As I review analysis part 4, I am trying to interpret "each word" in the tweet text, so I would assume that could include hashtags (e.g. USA).
If you look at the note section of the assignment, it says: _Only consider English tweets and, for simplicity, get rid of all punctuation, i.e., any character other than a to z and space._ As you mentioned, you need to capture each of the URLs separately for analysis part 3. For analysis part 4, each word in the tweet text should be considered regardless of where it appears. The only exception to this is the words that are part of a URL.
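For what it's worth, here is one way to implement that cleaning rule in Python. The URL regex is an assumption about what URL forms appear in the tweets, so adjust it to your data:

```python
import re

def tweet_words(text):
    """Tokenize a tweet for analysis part 4: drop URLs first, then keep only a-z and spaces."""
    # Remove URLs so that words inside them are not counted (assumed URL pattern).
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Keep only lowercase letters and spaces, per the assignment note.
    text = re.sub(r"[^a-z ]", " ", text.lower())
    return text.split()

# Example: hashtags survive as plain words once '#' is stripped.
print(tweet_words("Yay #USA! Watch http://t.co/abc123"))  # ['yay', 'usa', 'watch']
```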
I have a question about question 2 under questions in Analysis part of the assignment.
"Draw a table with all the team support hashtags you can find and the number of support messages. What country did get apparently the most support?"
Does this mean we have to store all hashtags occurring with each team hashtag or just find the count of tweets downloaded for each team hashtag? I am storing the team hashtag in the csv for this purpose.
How to get the data required to generate this table is your design decision. The table has two columns: 1) teams hashtags and 2) the number of support messages. Here is an example:
| Team's Hashtag | Number of Support Messages |
|---|---|
| USA | 200k |
| CAN | 180k |
| GER | 210k |
Do we need to write map-reduce jobs (similar to what we configured in a previous class), or would it be okay to use the map and reduce functions in Spark (similar to what we did in our most recent class)?
You need to write map-reduce jobs and run them on EMR.
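For reference, a Hadoop Streaming job is just a pair of executable scripts that read stdin and write tab-separated key/value pairs to stdout. A minimal sketch (word count as a stand-in, not the assignment's actual analysis) might look like this:

```python
#!/usr/bin/env python
# mapper.py -- emits (word, 1) for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- sums the counts per key; Hadoop Streaming sorts the mapper
# output by key before it reaches the reducer.
import sys

current_key, current_count = None, 0
for line in sys.stdin:
    key, sep, value = line.rstrip("\n").partition("\t")
    if not sep:
        continue  # skip malformed lines without a tab separator
    if key == current_key:
        current_count += int(value)
    else:
        if current_key is not None:
            print("%s\t%d" % (current_key, current_count))
        current_key, current_count = key, int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, current_count))
```

You can sanity-check the pair locally with `cat sample.txt | ./mapper.py | sort | ./reducer.py` before running it as a streaming step on EMR.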
When looking for the top 20 URLs, should we include the URLs of the hashtags?
You should get the URLs tweeted by the users in their tweets. If it is part of their tweets, you should include it.
What I meant is the hashtag itself is also a URL. For example, if the tweet text is "Yay #USA", then #USA is pointing to https://twitter.com/hashtag/usa?src=hash. Should we include this also?
If we include that then most likely the top 20 URLs will be the hashtags.
No, do not include such hashtag URLs; only consider URLs that actually appear in the tweet text.
Is the query intended to be #FIFAWWC AND (code1 OR code2 OR ... )? Or #FIFAWWC OR code1 OR code2 OR ... ? If the latter, there seems to be a lot of noise (non-FIFA WWC content)
The query should be #FIFAWWC AND (code1 OR code2 OR ... ), or tweets that have #FIFAWWC, as we are interested in the tweets related to the FIFAWWC and the teams that participated in the event.
How do you bypass the infinite scroll feature on Twitter? When I use Inspect Element in Chrome, the request URL changes as you scroll down, so it seems like you would have to iterate over the URL part with "last_note_ts=####". I then pasted the request URL directly into the browser and it seems to return a JSON file. Are we supposed to use JSON files, or am I headed in the wrong direction?
You need to find out what attribute(s) change as you scroll down, find the pattern, and feed that pattern to your crawler program. You can also inspect the JSON responses, see what attribute(s) change in the responses as you scroll down, and use those attribute(s) to limit the results.
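For anyone stuck on the same thing, here is a rough sketch of that idea using requests. The endpoint URL, the "max_position" parameter, and the "items_html"/"min_position" response keys are assumptions, so substitute whatever attributes you actually observe changing in Chrome's network tab:

```python
# Sketch: follow the scroll endpoint by feeding the cursor from each JSON
# response back into the next request. All names below are hypothetical.
import requests

BASE_URL = "https://twitter.com/i/search/timeline"   # assumed AJAX endpoint
params = {"q": "#FIFAWWC", "f": "tweets"}
cursor = None

for page in range(50):                                # stop after a fixed number of pages
    if cursor:
        params["max_position"] = cursor
    resp = requests.get(BASE_URL, params=params, timeout=30)
    data = resp.json()
    html_chunk = data.get("items_html", "")           # parse tweets out of this chunk
    new_cursor = data.get("min_position")
    if not html_chunk.strip() or new_cursor == cursor:
        break                                         # no more results
    cursor = new_cursor
```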
Some tweets use the full country name (e.g. #Canada instead of #CAN). Do we need to get those also?
Also, some have two teams in one hash (e.g. #CANvCHN). Do we need those?
For the sake of simplicity, you can ignore the full country names as well as the two-teams-in-one-hash tags.
For the Twitter search query: is it possible to construct the entire operator syntax as one statement, or are multiple calls required for each permutation (#FIFAWWC #USA, #FIFAWWC #GER)? I am also having trouble making the AND condition work in the Twitter hashtag box, as it comes up as OR on the results page.
As stated above, the query should be #FIFAWWC AND (code1 OR code2 OR ... ), or tweets that have #FIFAWWC, as we are interested in the tweets related to the FIFAWWC and the teams that participated in the event. The following link may help you resolve your problem with AND: https://support.twitter.com/articles/71577
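In practice the whole thing can usually be built as one query string, since Twitter search treats a space as an implicit AND and an uppercase OR as a logical OR. A small sketch, with only a few of the 24 team codes filled in for illustration:

```python
# Build the single search query "#FIFAWWC (#XXX OR #YYY OR ...)".
team_codes = ["USA", "CAN", "GER", "JPN", "FRA", "ENG"]  # ...extend to all 24 teams
query = "#FIFAWWC (" + " OR ".join("#" + code for code in team_codes) + ")"
print(query)  # #FIFAWWC (#USA OR #CAN OR #GER OR #JPN OR #FRA OR #ENG)
```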
What is the architecture design document? Can you provide an example?
Information about architecture design document is provided in Appendix II: Assignment Grading Guidelines of the syllabus.
I have done about 25 trials running EMR from the console. I even tried different versions of Hadoop, but for the most part I'm sticking with the latest emr-4.0, and I keep getting this error: Caused by: java.io.IOException: Cannot run program "s3://sarumehtareducetest/code/mapper.py": error=2, No such file or directory
I have the shebang in my mapper.py and reducer.py, and I even tried using the -files argument with the S3 URIs for the mapper and reducer, followed by -mapper mapper.py -reducer reducer.py.
And I'm still getting the PipeMapOutputThread error, code 1... Everything works well locally... I am not sure what I am doing wrong?
Thanks
Have you made your mapper.py and reducer.py executable (i.e. chmod +x)?
Yes, I have run chmod 777 on the files, but I am uploading my mapper.py and reducer.py to an S3 bucket under /codes/, so I don't think it should matter?
Depending on which OS you use, the permission problem can carry over to the uploaded files. What shebang are you using?
If you are on Windows, remove the permissions and make sure both of your scripts do not have carriage return symbols. If they do, you need to strip them out before uploading to S3.
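If you would rather do it in Python than install dos2unix, something like this rewrites both scripts with Unix line endings before the upload (the file names are just the ones used in this thread):

```python
# Strip Windows line endings (CRLF -> LF) in place before uploading to S3.
for path in ("mapper.py", "reducer.py"):
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(data.replace(b"\r\n", b"\n"))
```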
I know that this was asked previously, but could you please provide an alternate link to the image in Task #5? It is currently not working and clicking on it doesn't find the image either.
The image is now manually embedded in the assignment. If you still have an issue viewing it, you can use the following link: https://github.com/MIDS-W205/Assignments/issues/24
I am using a Mac, and the shebang I used was #!/usr/bin/python,
and I also tried it with #!/usr/bin/env python, but no luck.
Are you using the CLI or the web interface?
Can you send us what you use with -files? The complete command that you run.
I used both the CLI and the web interface.
For the CLI:

```
aws emr create-cluster --steps file://./mysteps.json --ami-version 3.8.0 \
  --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m1.medium InstanceGroupType=CORE,InstanceCount=2,InstanceType=m1.medium \
  --name "NLTK Test Cluster" --log-uri s3://sarumehtareducetest/cmdline/ --enable-debugging \
  --tags Name=emr --bootstrap-actions Path=s3://sarumehtareducetest/bootstrap/bootstrap-nltk.sh,Name="Setup NLTK"
```

where mysteps.json is:

```json
[
  {
    "Name": "SaruMAP",
    "Type": "STREAMING",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-files", "s3://sarumehtareducetest/codes/mapper.py,s3://sarumehtareducetest/codes/reducer.py",
      "-mapper", "mapper.py",
      "-reducer", "reducer.py",
      "-input", " s3://sarumehtareducetest/input/",
      "-output", " s3://sarumehtareducetest/output777/"
    ]
  }
]
```
And for the web interface (I copied this from controller.gz):

```
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar -files s3://sarumehtareducetest/codes/mapper.py,s3://sarumehtareducetest/codes/reducer.py -mapper mapper.py -reducer reducer.py -input s3://sarumehtareducetest/input/ -output s3://sarumehtareducetest/output777/
```
When I ran it from the command line, I read the error logs and all they said was "Streaming Failed!" as the only line, so not much info about anything. I then checked the other files in my logs, but they did not specify what the error was.
It seems that you have not specified a subdirectory for the output directory when you used the web interface. When you create output777 as an output folder, you need to specify a subfolder in the wizard where you give your output directory. Once you choose your output directory from S3, add a subfolder after it, so it looks like s3://sarumehtareducetest/output777/part-1. Note that you do not need to create part-1; it is created for you. Try this using the web interface and see whether you still get the streaming job failed error.
I did as you said, and still no luck through the web interface as well as the AWS EMR command line. I now get the error: Error: java.lang.StringIndexOutOfBoundsException: String index out of range: -1 at java.lang.String.substring(String.java:1911)
It is still not reading my input, even though I loaded a very basic mapper.py (I changed the shebang to #!/usr/bin/env python since the other one, #!/usr/bin/python, didn't seem to help):
```python
#!/usr/bin/env python
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%i" % (word, 1))
```
(The code is indented; it just shows up unindented in this comment.) I also noticed I am getting a 206 status code, which means it is finding the input:

```
INFO [main] com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem: Opening 's3://sarumehtareducetest/input/easytest.txt' for reading
2015-08-10 20:21:46,404 INFO [main] com.amazonaws.latency: StatusCode=[206], ServiceName=[Amazon S3], AWSRequestID=[D730BB5CEFCFC69E], ServiceEndpoint=[https://sarumehtareducetest.s3.amazonaws.com], HttpClientPoolLeasedCount=0, RequestCount=1, HttpClientPoolPendingCount=0, HttpClientPoolAvailableCount=1, ClientExecuteTime=[98.739], HttpRequestTime=[94.476], HttpClientReceiveResponseTime=[91.509], RequestSigningTime=[0.587], ResponseProcessingTime=[1.544], HttpClientSendRequestTime=[0.962],
```
Do you have the correct permissions for reading/writing to your S3 bucket? For simplicity, you may want to give full access to everyone and see whether that works.
Do you have an empty line somewhere or at the end of easytest.txt?
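If an empty or malformed line does turn out to be the culprit, one defensive option is to skip blank input lines in the mapper. This is only a sketch of that guard under that assumption, not a confirmed fix for the exception above:

```python
#!/usr/bin/env python
# Defensive variant of the basic mapper: blank lines are skipped so the
# mapper never emits an empty record.
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    for word in line.split():
        print("%s\t%d" % (word, 1))
```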
Write your issues/questions about assignment 4 here