Closed trinker closed 10 years ago
The problem lies in the qdapDictionaries synonyms frame that was used, `qdapDictionaries::key.syn`, in `syn`. It will require a re-scrape and formatting.
Here's the result for the word `outstanding` in the Reverso Online Dictionary:
http://dictionary.reverso.net/english-synonyms/outstanding
and the source for scraping purposes:
view-source:http://dictionary.reverso.net/english-synonyms/outstanding
Here's the scraping script I used previously; it likely needs to be modified to eliminate the antonym tag:
    library(RCurl)
    library(XML)
    library(parallel)
    library(qdap)

    load("C:/Users/trinker/Dropbox/Public/LIST.RData")  # the seed list
    head(LIST)

    # Parsing and counting functions:
    term.count <- qdap:::term.count

    # Scraping function:
    FUN <- function(x) {
        url1 <- "http://dictionary.reverso.net/english-synonyms/"
        url2 <- x
        doc <- htmlTreeParse(paste0(url1, url2), useInternalNodes = TRUE)

        # Bail out if the first direction='' text node does not match the requested word
        ncontent2 <- getNodeSet(doc, "//span[@direction='']//text()")[[1]]
        if (xmlToList(ncontent2) != x) {
            return("***XX")
        }

        # Text nodes of the target spans, minus the ellipsis text
        content <- getNodeSet(doc, "//span[@direction='target']//text()")
        ncontent <- getNodeSet(doc, "//span[@class='ellipsis_text']//text()")
        content <- content[!unlist(content) %in% unlist(ncontent)]
        # A subsetted node set is never NULL, so test for emptiness instead
        if (length(content) == 0) return(NA)

        x <- lapply(content, function(x) Trim(xmlToList(x)))
        x <- x[!sapply(x, function(y) y == "")]

        # Comma-to-word ratio per node; nodes with fewer than 3 words are always kept
        words <- unlist(lapply(x, function(x) length(unlist(strsplit(x, "\\s+")))))
        commas <- sapply(x, function(x) term.count(x, ","), USE.NAMES = FALSE)
        ctw <- commas / words
        ctw[words < 3] <- 1
        if (sum(ctw > .25) == 0) return("***XX")

        y <- x[ctw > .25]
        if (length(y) == 1 && y[[1]] == "") return("***XX")

        # Number each kept node and join them with " @@@@ "
        paste(paste("[", seq_along(y), "]", y, sep = ""), collapse = " @@@@ ")
    }

    # Parallel processing of the scrape
    cl <- makeCluster(mc <- getOption("cl.cores", detectCores()))
    clusterExport(cl = cl, varlist = c("LIST", "Trim", "FUN", "term.count",
        "htmlTreeParse", "getNodeSet", "xmlToList"), envir = environment())
    L1 <- parLapply(cl, LIST, function(x) {
        Sys.sleep(.75)  # pause between requests
        try(FUN(x))
    })
    stopCluster(cl)  # stop the cluster
    names(L1) <- LIST
The `Antonyms` label doesn't appear to be a tag but a header:
...d="ID0ETD" style="color:#0;" direction="">Antonyms<span...
This makes parsing more difficult. One possibility is to split the page source right away on `Antonyms<span`, take the first piece, and parse that.
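The split idea could be sketched roughly as follows, in base R only. `drop_antonyms` is a hypothetical helper (not part of qdap or the script above), and the `page` string is a made-up fragment that only imitates the shape of the Reverso markup; the real page source may differ.

```r
# Hypothetical helper: keep only the page source that precedes the first
# literal occurrence of the marker (assumed here to be "Antonyms<span").
drop_antonyms <- function(html, marker = "Antonyms<span") {
  # regexpr() with fixed = TRUE gives the position of the first literal match, or -1
  pos <- regexpr(marker, html, fixed = TRUE)
  if (pos[1] == -1) return(html)  # no antonym header found: keep everything
  substr(html, 1, pos[1] - 1)
}

# Made-up fragment for illustration only (not copied from the real page)
page <- paste0(
  '<span direction="target">exceptional, excellent</span>',
  '<span style="color:#0;" direction="">Antonyms<span></span></span>',
  '<span direction="target">ordinary, dull</span>'
)

synonym_part <- drop_antonyms(page)
```

If this approach holds up against the real source, the truncated string could presumably be handed to `htmlTreeParse(synonym_part, asText = TRUE, useInternalNodes = TRUE)` so that the existing XPath queries only ever see the synonym portion. That would mean downloading the page as text first rather than parsing straight from the URL.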
can give some examples of what I meant - the last result for each word seems to be the antonyms instead of synonyms.