ubcecon / computing_and_datascience

Sandbox and workspace for computing and datascience infrastructure and course materials.
MIT License

Lecture on web scraping and (perhaps) text #87

Open jlperla opened 5 years ago

jlperla commented 5 years ago

The lecture is planned for November 14th

My goal is primarily to help people realize that scraping the web and doing text analysis is not scary! I don't want fear of it to keep them from getting creative in building new sources of data.

You guys can play around with the directory https://github.com/ubcecon/computing_and_datascience/tree/master/R_sandbox etc.

JasmineHao commented 5 years ago

Useful gadget for analyzing HTML https://selectorgadget.com/

jlperla commented 5 years ago

Also see https://www.datacamp.com/community/tutorials/r-web-scraping-rvest and https://ropensci.org/tutorials/rselenium_tutorial/

JasmineHao commented 5 years ago

Something that I think could be common to a whole class of websites: I cannot scrape https://www.ratebeer.com/ with rvest, perhaps because of cookies. We could leave this for later research.

jlperla commented 5 years ago

Yeah, I think there are a large number of sites where you really need to run Selenium... it both handles cookies and runs the JavaScript (at which point the page can be scraped by the other tools). It would be great if we could show a very minimal example of RSelenium, if it is relatively easy to show.

Do you guys want to grab the R scraping textbooks from my office?

chiyahn commented 5 years ago

Relevant repos I have found on GitHub so far:

Useful R packages for data cleaning:

Fun examples (not necessarily economics):

chiyahn commented 5 years ago

Jasmine and I had some discussion about how the lecture can be delivered:

jlperla commented 5 years ago

I think I really want to emphasize the web scraping rather than tidyverse transformations.

The goal should be to build people's confidence that they can (1) scrape numerical data from the web and (2) work with text as data.

It is more important for me to show the tools than anything else.

jlperla commented 5 years ago

If all we did was give a 1.25 hour presentation on how to scrape a couple of websites, I would be happy.

To be clear, we do not need an economic application for the data, just that we should be scraping data that could be applied to economic problems. For example, you could even take a World Bank (or whatever) page that has a "download" button and say "let's pretend it didn't have that button, I will show you how you could have gotten the data anyway."
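As a rough illustration of the "pretend there is no download button" idea, here is a minimal sketch in Python using only the standard library (the same idea carries over to rvest in R). The HTML fragment, the table contents, and the names `TableParser` and `scrape_table` are all hypothetical placeholders, not taken from any real page; a real site would need its own selectors and cleanup.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a site that displays a table
# but (we pretend) offers no download button. All values are placeholders.
SAMPLE_HTML = """
<table id="indicators">
  <tr><th>Country</th><th>Year</th><th>Value</th></tr>
  <tr><td>Country A</td><td>2017</td><td>1.0</td></tr>
  <tr><td>Country B</td><td>2017</td><td>2.0</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row currently being filled, if any
        self._cell = None  # text pieces of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return (header, records), with each record a dict keyed by header."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return header, [dict(zip(header, row)) for row in body]


header, records = scrape_table(SAMPLE_HTML)
for record in records:
    print(record)
```

In practice the HTML string would come from an HTTP request rather than a literal, and the cell values would still need converting from text to numbers.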

arnavs commented 5 years ago

Just so it's not forgotten, I wanted to link to the notes Jasmine Yang produced on this from a few months ago. If the issue has evolved since then, please feel free to disregard.

https://github.com/ubcecon/computing_and_datascience/blob/master/python_sandbox/Web-Scraping.md

chiyahn commented 5 years ago

Simple tutorial on using rvest I wrote yesterday: https://github.com/chiyahn/notes/blob/master/programming/data-mining/rvest/text-mining-with-rvest.md

schrimpf commented 5 years ago

Relatedly, I have attached code that scrapes the AER website to look at programming language usage. Patrick independently did the same thing, and his results are going to be part of the next AER annual report. His code is perhaps a bit nicer https://github.com/pbaylis/econ-program-usage
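For readers wondering what a tally of "programming language usage" might look like once file names have been scraped, here is one simple possibility, sketched in Python. It is purely an assumption for illustration and is not taken from either codebase mentioned above: the extension mapping, the file names, and the function `tally_languages` are all hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to language; extend as needed.
EXTENSION_TO_LANGUAGE = {
    ".do": "Stata",
    ".m": "MATLAB",
    ".r": "R",
    ".py": "Python",
    ".jl": "Julia",
}


def tally_languages(file_names):
    """Count files per language, ignoring extensions not in the mapping."""
    counts = Counter()
    for name in file_names:
        extension = PurePosixPath(name).suffix.lower()
        language = EXTENSION_TO_LANGUAGE.get(extension)
        if language is not None:
            counts[language] += 1
    return counts


# Placeholder file names standing in for scraped replication archives.
example_files = [
    "paper1/main.do",
    "paper1/figures.R",
    "paper2/solve.m",
    "paper2/readme.pdf",
    "paper3/clean.do",
]
counts = tally_languages(example_files)
print(counts.most_common())
```

Counting by extension is crude (a `.m` file could also be Objective-C, for instance), so a real analysis would likely need extra checks.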


jlperla commented 5 years ago

@pbaylis Can you give these guys your code to prepare as an Rmd file? I think it would be a nice example code to give people.

That said, I want to stress in class an example with data that is not "inside economics" so they don't think of this stuff as just a novelty.

pbaylis commented 5 years ago

I don't think it's actually all that clean but sure. Here's the repo. One note - I keep this code in a private repo because I don't want to be seen as encouraging people all over the internet to hammer the AER website (although thanks to Paul, it's considerably more gentle than it could have been). So it's important to talk about being a good scraping citizen when you do this sort of thing: test on a small subset until you know it works, don't parallelize downloading code, and include sleep time when downloading lots of large files or a bunch of websites (which, honestly, my code should do more of).
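The three habits above (test on a small subset, download sequentially, sleep between requests) can be sketched roughly as follows in Python. This is a minimal illustration, not the code in the attached repo: `polite_fetch_all`, its parameters, the two-second delay, and the example.com URLs are all hypothetical, and `fetch` stands for whatever function actually downloads a page.

```python
import time


def polite_fetch_all(urls, fetch, delay_seconds=2.0, limit=None, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between requests.

    fetch -- any callable taking a URL and returning its content
             (hypothetical here, e.g. a thin wrapper around an HTTP client)
    limit -- only fetch the first `limit` URLs, for testing on a small subset
    sleep -- injectable so the pause can be recorded or skipped in a dry run
    """
    subset = urls if limit is None else urls[:limit]
    results = {}
    for i, url in enumerate(subset):
        if i > 0:
            sleep(delay_seconds)   # pause before every request after the first
        results[url] = fetch(url)  # strictly sequential: no parallel downloads
    return results


# Dry run with a stand-in fetcher, so nothing touches the network.
example_urls = [f"https://example.com/page/{n}" for n in range(1, 6)]
pauses = []
pages = polite_fetch_all(
    example_urls,
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=2.0,
    limit=3,
    sleep=pauses.append,
)
print(len(pages), "pages fetched; pauses requested:", pauses)
```

Once the subset run works, dropping `limit` and passing a real `fetch` function runs the full job at the same gentle pace.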

econ-program-usage-master.zip

jlperla commented 5 years ago

@pbaylis Alright Debbie Downer. You environmental economists spend too much time thinking about ethics and the tragedy-of-the-commons. The optimal non-cooperative strategy here is slash-and-burn webscraping.

But we will pass on your bleeding heart messages of being good scraping citizens along with the code!

JasmineHao commented 5 years ago

rsDriver seems to have a connection issue, so to deal with cookies it looks like we need to install Docker to run RSelenium: https://stackoverflow.com/questions/45395849/cant-execute-rsdriver-connection-refused

jlperla commented 5 years ago

We have given the students basic Docker instruction, so we could conceivably pass the RSelenium example on to them...

But I don't think we should use that in the core demo in class (just supplementary links if they want to go further). Let's keep things simple. Also, it is more important to me that we show clean, simple examples than fancy stuff, especially if that stuff is tricky to set up.

Also @chiyahn and @jasminefish000 I want to make sure you guys are talking and planning things out together. If you are both off doing your own things for this lecture, there might be a lot of duplication of effort.

schrimpf commented 5 years ago

For what it's worth, I've had no problem using RSelenium without Docker on Linux.
