Closed zachmayer closed 10 years ago
Yeah, I thought about a specific rm_html_tag but fear the scorn of those who shun regex use on HTML: http://stackoverflow.com/a/1732454/1000343
I think qdapRegex's rm_angle is similar enough that you'll find it useful for such needs without provoking the rebuke of HTML parsing purists:
library(qdapRegex)
# Read a slice of the page's lines and collapse them into one string
x <- paste(readLines("http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags")[100:1000], collapse=" ")
rm_angle(x)
If this suits your needs, you could close the issue. If not, could you explain why rm_angle will not work?
Haha, true. But if you're looking for a quick-and-dirty solution for string normalization in R, do you have any options other than a regex for stripping html tags?
And thanks for pointing out rm_angle. It pretty much does exactly what I want. Smart of you to call it rm_angle and make it clear that it's just removing the angle brackets... so if someone REALLY wants to use regex to strip some html tags, they can use rm_angle as a hack.
do you have any options other than a regex for stripping html tags
Yes, the XML package has functionality to do this. While it's maybe less intuitive if you're used to regular expressions, the XML parser is pretty nice and a bit more consistent and flexible.
Same problem as above with a parser:
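library(RCurl)
library(XML)
URL <- "http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags"
html <- getURL(URL, followlocation = TRUE) # Download the page with RCurl, following redirects
doc3 <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE) # Store the HTML document as parsed XML
nodes <- getNodeSet(doc3, "//text()") # Select every text node, leaving the tags behind
nodes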
Here's me struggling through HTML parsing: http://www.talkstats.com/showthread.php/26153-Still-trying-to-learn-to-scrape?highlight=still+learning+to+scrape
Hmmm, I'll have to try that. Is the XML parser "vectorized" or would I have to loop over the set of strings I want to normalize?
I think sapply would work. Not truly vectorized but not a loop either. I think there are examples where I use sapply in that link.
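For anyone landing here later, a rough sketch of the sapply idea applied to a character vector of HTML fragments might look like the following. The helper name strip_html_xml and the sample strings are made up for illustration; only htmlParse, xpathSApply, xmlValue, and free come from the XML package. It uses vapply rather than sapply so the result is always a character vector, which is a design choice of this sketch and not something from the thread.

```r
library(XML)

# Hypothetical helper: parse one HTML fragment and return its text content.
strip_html_xml <- function(x) {
  doc <- htmlParse(x, asText = TRUE)                  # parse the string itself, not a file or URL
  txt <- xpathSApply(doc, "//text()", xmlValue)       # pull the value of every text node
  free(doc)                                           # release the internal document
  paste(unlist(txt), collapse = "")
}

# Made-up sample input
html_strings <- c("<p>Hello <b>world</b></p>",
                  "<p>Fish &amp; chips</p>")

# Apply the helper to each element; not truly vectorized, but no explicit loop
cleaned <- vapply(html_strings, strip_html_xml, character(1), USE.NAMES = FALSE)
cleaned
```

One thing the parser gives you that a plain regex does not is entity decoding, so &amp; comes back as &. Note that "//text()" also picks up the contents of script and style elements if the input has any.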
Cool, thank you!
I often use gsub('<[^>]*>', '', a_long_string_with_html) to remove html tags. Might be nice to have that (or a similar regex) in this package.
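As a small illustration of the gsub approach above: gsub is already vectorized over its input, so it works on a whole character vector at once. The wrapper name strip_tags and the sample strings below are invented for this sketch. The last line shows the kind of input where the regex goes wrong, which is the usual argument for a parser.

```r
# Hypothetical wrapper around the regex from the post above
strip_tags <- function(x) gsub("<[^>]*>", "", x)

# Made-up sample input
html <- c("<p>Hello <b>world</b></p>",
          "no tags here",
          "<a href='x'>link</a> text")

stripped <- strip_tags(html)
stripped
# [1] "Hello world"  "no tags here" "link text"

# Known weak spot: a '>' inside an attribute value ends the match early,
# so part of the tag is left behind.
broken <- strip_tags("<img alt='a > b'>")
broken
# [1] " b'>"
```

Entities such as &amp; are left untouched by this approach, since the regex only removes what sits between angle brackets.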