trinker / qdapRegex

qdapRegex is a collection of regular expression tools associated with the qdap package that may be useful outside of the context of discourse analysis.

rm_html tag #3

Closed zachmayer closed 10 years ago

zachmayer commented 10 years ago

I often use gsub('<[^>]*>', '', a_long_string_with_html) to remove HTML tags. It might be nice to have that (or a similar regex) in this package.
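For reference, here is a minimal base R sketch of that approach. The helper name `strip_tags` and the sample strings are made up for illustration and are not part of qdapRegex.

```r
# Strip anything that looks like an HTML tag, using base R only.
# `strip_tags` is a hypothetical helper name, not a qdapRegex function.
strip_tags <- function(x) {
  # "<", then any run of characters that are not ">", then ">"
  gsub("<[^>]*>", "", x)
}

# Illustrative input; gsub() is vectorized, so no loop is needed.
html_strings <- c(
  "<p>Hello, <b>world</b>!</p>",
  "No tags here",
  "<a href=\"http://example.com\">a link</a>"
)

strip_tags(html_strings)
# [1] "Hello, world!" "No tags here"  "a link"
```

Note that the pattern knows nothing about HTML itself: in a string such as "1 < 2 and 3 > 2", everything from the `<` to the `>` is removed as well.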

trinker commented 10 years ago

Yeah, I thought about a specific rm_html_tag but fear the scorn of those who shun regex use on HTML: http://stackoverflow.com/a/1732454/1000343

I think qdapRegex's rm_angle is similar enough that you'll find it useful for such needs without provoking the rebuke of HTML parsing purists:

x <- paste(readLines("http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags")[100:1000], collapse=" ")

rm_angle(x)

If this does suit your needs, you could close the issue. If not, could you explain more as to why rm_angle will not work?
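A smaller sketch that does not depend on fetching a web page, assuming qdapRegex is installed; the sample strings are invented for illustration:

```r
library(qdapRegex)

# Illustrative input: one string with markup, one without.
x <- c("<p>Some <em>marked up</em> text</p>", "plain text")

# Remove the angle brackets together with whatever they enclose.
rm_angle(x)

# Return the bracketed pieces instead of removing them.
rm_angle(x, extract = TRUE)
```

The `extract = TRUE` form can be handy for checking what would be removed before committing to it.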

zachmayer commented 10 years ago

Haha, true. But if you're looking for a quick-and-dirty solution for string normalization in R, do you have any options other than a regex for stripping html tags?

And thanks for pointing out rm_angle. It pretty much does exactly what I want. Smart of you to call it rm_angle and make it clear that it's just removing the angle brackets... so if someone REALLY wants to use regex to strip some html tags, they can use rm_angle as a hack.

trinker commented 10 years ago

do you have any options other than a regex for stripping html tags

Yes, the XML package has functionality to do this. While it may be less intuitive if you're used to regular expressions, the XML parser is pretty nice and a bit more consistent and flexible.

Same problem as above with a parser:

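library(RCurl)
library(XML)

URL <- "http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags"
html  <- getURL(URL, followlocation = TRUE)  # Fetch the raw HTML, following any redirects
doc3  <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)  # Store the HTML document as parsed XML
nodes <- getNodeSet(doc3, "//text()")  # Select every text node in the document
nodes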

Here's me struggling through HTML parsing: http://www.talkstats.com/showthread.php/26153-Still-trying-to-learn-to-scrape?highlight=still+learning+to+scrape

zachmayer commented 10 years ago

Hmmm, I'll have to try that. Is the XML parser "vectorized" or would I have to loop over the set of strings I want to normalize?

trinker commented 10 years ago

I think sapply would work. It's not truly vectorized, but you don't have to write an explicit loop either. I think there are examples where I use sapply in that link.
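A rough sketch of what that could look like, assuming the XML package is installed. `html_to_text` is a hypothetical helper written for this example, not a function from XML or qdapRegex, and the input strings are made up.

```r
library(XML)

# `html_to_text` is a hypothetical helper: parse one HTML string and
# glue its text nodes back together.
html_to_text <- function(html) {
  doc <- htmlParse(html, asText = TRUE)                  # parse the string, not a file or URL
  text_nodes <- xpathSApply(doc, "//text()", xmlValue)   # values of all text nodes
  paste(unlist(text_nodes), collapse = "")
}

# Illustrative input: a character vector of HTML fragments.
html_strings <- c(
  "<p>Hello, <b>world</b>!</p>",
  "<div><a href=\"http://example.com\">a link</a> and more</div>"
)

# sapply() calls the parser once per string; USE.NAMES = FALSE keeps
# the result an unnamed character vector.
sapply(html_strings, html_to_text, USE.NAMES = FALSE)
```

Each element is parsed separately, so this will be slower than a single gsub call on long vectors; the trade-off is that the parser, not a regex, decides what counts as a tag.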

zachmayer commented 10 years ago

Cool, thank you!