sebastian-c / overflow

An R package to assist people answering R questions on Stack Overflow

Scrape questions, answers, and code blocks from Stack Overflow #15

Open mrdwab opened 11 years ago

mrdwab commented 11 years ago

Not sure how this could be useful, but I needed some distraction yesterday, so I put together these functions: https://gist.github.com/mrdwab/5275438

The main functions are in StackOverflow.R:

Those functions depend on the print methods defined in SOPrintMethods.R and on the helper functions defined in SOPageLoaderHelper.R, so be sure that all three scripts are loaded before testing the functions.

Example usage:

QAList(15332195)
SOCodeBlocks(15332195)

mrdwab commented 11 years ago

@sebastian-c, I'm transferring your comment here since there don't seem to be comment notifications for GitHub Gists:


The last function seems like it could be really useful. Perhaps sourcing all code from a question? Consider the case where there are multiple code chunks in a reproducible example and several intermediate objects. I don't have the expertise, but that would also make a pretty neat Firefox/Chrome extension.
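A minimal sketch of the "source all code from a question" idea, assuming the code chunks have already been extracted into a character vector. The helper name source_chunks and the sample chunks are made up for illustration; they are not part of the gist.

```r
# Hypothetical helper (not from the gist): evaluate code chunks in order,
# all in one shared environment, so later chunks can use earlier objects.
source_chunks <- function(chunks, envir = new.env(parent = globalenv())) {
  for (chunk in chunks) {
    # parse() turns the text into expressions; eval() runs them in `envir`
    eval(parse(text = chunk), envir = envir)
  }
  invisible(envir)
}

# Made-up chunks standing in for the code blocks of a question
chunks <- c(
  "x <- c(1, 2, 3)",
  "y <- x * 2",
  "total <- sum(y)"
)

env <- source_chunks(chunks)
ls(env)  # names of the intermediate objects created by the chunks
```

Evaluating into a separate environment rather than the global workspace keeps the intermediate objects from a question apart from whatever is already loaded.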

One possible improvement might be to read it in with an HTML parser (the XML package?). I don't know much about the problems, but this answer seems to have some adamant opinions.
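A rough sketch of what the parser route could look like with the XML package. The HTML string is a made-up stand-in for a downloaded question page, and the assumption that code sits inside pre/code elements would need checking against the real page markup.

```r
library(XML)

# Made-up HTML standing in for a downloaded question page
html <- '<html><body>
<div><p>Question text</p>
<pre><code>x &lt;- 1:10
mean(x)</code></pre></div>
<div><pre><code>summary(x)</code></pre></div>
</body></html>'

# Parse the string as HTML rather than treating it as a file name
doc <- htmlParse(html, asText = TRUE)

# Collect the text of every <code> element nested directly in a <pre>;
# xmlValue() also decodes entities such as &lt;
code_blocks <- xpathSApply(doc, "//pre/code", xmlValue)
code_blocks
```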

mrdwab commented 11 years ago

@sebastian-c, I agree that it is pretty horrible to consider using regex to scrape a page, but that usually applies when pages aren't written in a regular way. In the case of SO, the template is pretty easy to parse using regular expressions, at least to get the code blocks and so on. However, as this was an afternoon project, I'm sure that there are better ways.
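For comparison, a minimal base R sketch of the regex approach. It is not the code from the gist, and it assumes every code block is wrapped in exactly `<pre><code>...</code></pre>`, which is the kind of regularity this relies on. The HTML string is made up.

```r
# Made-up HTML standing in for the body of a question page
html <- paste0(
  "<p>Question text</p>",
  "<pre><code>x &lt;- 1:10\nmean(x)</code></pre>",
  "<p>More text</p>",
  "<pre><code>summary(x)</code></pre>"
)

# Non-greedy match of everything between the opening and closing tags;
# (?s) lets "." match newlines (Perl-style regex)
pattern <- "(?s)<pre><code>(.*?)</code></pre>"
matches <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]

# Strip the surrounding tags, then decode the few entities handled here
# (&amp; goes last so that "&amp;lt;" is not decoded twice)
blocks <- gsub("^<pre><code>|</code></pre>$", "", matches)
blocks <- gsub("&lt;", "<", blocks, fixed = TRUE)
blocks <- gsub("&gt;", ">", blocks, fixed = TRUE)
blocks <- gsub("&amp;", "&", blocks, fixed = TRUE)
blocks
```

The sketch breaks as soon as the tags carry attributes or the nesting changes, which is where a parser would be more robust.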

I'll check out the XML package, but I've only used that in the past to scrape tables. There are tables used for the layout of questions and answers at SO, so it is possible that there might be a more direct solution than what I implemented.

Will keep you updated on any progress.

juba commented 11 years ago

Hi,

For what it's worth (i.e. not very much), I once wrote a small function to get the content of a given code block on a given SO question. The code is ugly and I don't find it useful, but it uses HTML parsing, so maybe it could be helpful.

I put the function here:

https://gist.github.com/juba/5299095

mrdwab commented 11 years ago

@juba, I get the error "Error in match.fun(FUN) : object 'getChildrenStrings' not found". I'm assuming I need to load library(XML) (which I've done). Anything else?

juba commented 11 years ago

@mrdwab Strange, it works here, and I checked that getChildrenStrings is indeed a function in the XML package. Is your package up to date? Maybe you can try XML::getChildrenStrings?

mrdwab commented 11 years ago

@juba, you're right. The r-cran-xml package seems to be a little out of date. I installed XML using install.packages() instead of apt-get, and it now works on my Ubuntu system.