Open mrdwab opened 11 years ago
@sebastian-c , I'm transferring your comment here since it doesn't seem like there are comment notifications for Github's Gists:
The last function seems like it could be really useful. Perhaps sourcing all code from a question? Consider the case where there are multiple code chunks in a reproducible example and several intermediate objects. I don't have the expertise, but that would also make a pretty neat Firefox/Chrome extension.
One possible improvement might be to read it in with an HTML parser (XML package?). I don't know much about the problems, but this answer seems to have some adamant opinions.
@sebastian-c, I agree that it is pretty horrible to consider using regex to scrape a page, but that usually applies when pages aren't written in a regular way. In the case of SO, the template is pretty easy to parse using regular expressions, at least to get the code blocks and so on. However, as this was an afternoon project, I'm sure that there are better ways.
I'll check out the XML package, but I've only used that in the past to scrape tables. There are tables used for the layout of questions and answers at SO, so it is possible that there might be a more direct solution than what I implemented.
Will keep you updated on any progress.
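To make the regex-versus-parser trade-off above concrete, here is a minimal sketch of both approaches. It is not the code from the gist: the toy HTML string, the assumption that code sits inside pre/code tags, and the function names extract_code_regex() and extract_code_xml() are all made up for illustration. The parser version assumes the XML package is installed and is skipped otherwise.

```r
# Toy HTML standing in for a downloaded page (an assumption, not the real SO template).
html <- paste0(
  "<div class='post-text'><p>Try this:</p>",
  "<pre><code>x &lt;- 1:3\nsum(x)</code></pre>",
  "<p>and then</p>",
  "<pre><code>mean(x)</code></pre></div>"
)

# Regex approach (hypothetical helper): grab what sits between <pre><code> and </code></pre>.
extract_code_regex <- function(html) {
  # (?s) lets "." match newlines; ".*?" is lazy so each block is matched separately.
  pattern <- "(?s)<pre[^>]*><code[^>]*>(.*?)</code></pre>"
  hits <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
  blocks <- sub(pattern, "\\1", hits, perl = TRUE)
  # A regex leaves HTML entities encoded, so the common ones are decoded by hand.
  blocks <- gsub("&lt;", "<", blocks, fixed = TRUE)
  blocks <- gsub("&gt;", ">", blocks, fixed = TRUE)
  gsub("&amp;", "&", blocks, fixed = TRUE)
}

# Parser approach (hypothetical helper): the XML package finds the nodes and decodes entities.
extract_code_xml <- function(html) {
  doc <- XML::htmlParse(html, asText = TRUE)
  XML::xpathSApply(doc, "//pre/code", XML::xmlValue)
}

regex_blocks <- extract_code_regex(html)
cat(regex_blocks, sep = "\n---\n")

if (requireNamespace("XML", quietly = TRUE)) {
  xml_blocks <- extract_code_xml(html)
  cat(xml_blocks, sep = "\n---\n")
}
```

The regex version only works as long as the markup stays this regular, which is the caveat raised above; the parser version does not care about attribute order or whitespace inside the tags.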
Hi,
For what it's worth (i.e. not very much), I once wrote a small function to get the content of a given code block on a given SO question. The code is ugly and I don't find it useful, but it uses HTML parsing, so maybe it could be helpful.
I put the function here :
@juba, I get an error: Error in match.fun(FUN) : object 'getChildrenStrings' not found. I'm assuming I need to load library(XML) (which I've done). Anything else?
@mrdwab Strange, it works here, and I checked that getChildrenStrings is indeed a function of the XML package. Is your package up to date? Maybe you can try with XML::getChildrenStrings?
@juba, You're right. The r-cran-xml repo seems to be a little out of date. Installed using install.packages() instead of apt-get and it works on my Ubuntu system.
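For anyone hitting the same error, a quick way to check whether the installed copy of XML is recent enough is to ask the package namespace directly. This is only a sketch: the variable names has_xml and has_fun are made up here, and which version first exported getChildrenStrings is not something this thread establishes.

```r
# Check whether the XML package is installed at all.
has_xml <- requireNamespace("XML", quietly = TRUE)

# Check whether the installed version exports getChildrenStrings.
has_fun <- has_xml && "getChildrenStrings" %in% getNamespaceExports("XML")

if (has_xml) {
  message("XML version: ", as.character(utils::packageVersion("XML")))
  message("exports getChildrenStrings: ", has_fun)
} else {
  message("XML is not installed; try install.packages(\"XML\")")
}
```

If has_fun comes back FALSE, installing from CRAN with install.packages("XML") rather than from the distribution's repository is the fix that worked above.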
Not sure how this could be useful, but I needed some distraction yesterday, so I put together these functions: https://gist.github.com/mrdwab/5275438
The main functions are in StackOverflow.R:
QAList() returns a custom-formatted list of the question and related answers
SOCodeBlocks() returns a custom-formatted list of the code blocks within the question and answers
Those functions depend on the print methods defined in SOPrintMethods.R and on the helper functions defined in SOPageLoaderHelper.R, so be sure that all three scripts are loaded before testing the functions.
Example usage: