trinker / qdap

Quantitative Discourse Analysis Package: Bridging the gap between qualitative data and quantitative analysis
http://cran.us.r-project.org/web/packages/qdap/index.html

`syn` gives antonyms as well. #190

Closed trinker closed 10 years ago

trinker commented 10 years ago

I have a question about the function "syn": it sometimes returns antonyms of a word as the last element in the result list, and there is no way to tell when that happens without looking at the results. I am trying to automate a mapping between two lists of words using synonyms, and this might introduce bias. Is there a way to get around this? -Jingjing Zou-

syn(c("outstanding", "memorable", "hilarious", "relish", "excellent", 
   "fantastic", "brisk", "perfectly", "offbeat"))   

The call above gives some examples of what I mean: the last result for each word seems to be antonyms instead of synonyms.

trinker commented 10 years ago

The problem lies in the qdapDictionaries synonyms frame that syn uses: qdapDictionaries::key.syn. It will require a re-scrape and reformatting.

Here's the result for the word `outstanding` in the Reverso Online Dictionary:

http://dictionary.reverso.net/english-synonyms/outstanding

and the source for scraping purposes:

view-source:http://dictionary.reverso.net/english-synonyms/outstanding

trinker commented 10 years ago

Here's the scraping script I used previously; it likely needs to be modified to eliminate the antonym tag:

library(RCurl)
library(XML)
library(parallel)
library(qdap)
load("C:/Users/trinker/Dropbox/Public/LIST.RData") #the seed list
head(LIST)

#Parsing and counting functions:
term.count <- qdap:::term.count

#Scraping function:
FUN <- function(x){

    url1 <- "http://dictionary.reverso.net/english-synonyms/"
    url2 <- x
    doc <- htmlTreeParse(paste0(url1, url2), useInternalNodes = TRUE)
    ncontent2 <- getNodeSet(doc, "//span[@direction='']//text()")[[1]]
    if(xmlToList(ncontent2) != x) {
        return("***XX")
    }

    content <- getNodeSet(doc, "//span[@direction='target']//text()")
    ncontent <- getNodeSet(doc, "//span[@class='ellipsis_text']//text()")
    content <- content[!unlist(content) %in% unlist(ncontent)]

    if (length(content) == 0) return(NA) #an empty match is a zero-length list, not NULL

    x <- lapply(content, function(x) Trim(xmlToList(x)))
    x <- x[!sapply(x, function(y) y=="")]
    words <- unlist(lapply(x, function(x) length(unlist(strsplit(x, "\\s+")))))
    commas <- sapply(x, function(x) term.count(x, ","), USE.NAMES=FALSE)
    ctw <- commas/words
    ctw[words < 3] <- 1
    if (sum(ctw > .25) == 0) return("***XX")

    y <- x[ctw > .25]
    if (length(y) == 1 && y[[1]] == "") return("***XX")

    paste(paste("[", seq_len(length(y)), "]", y, sep = "") , collapse = " @@@@ ")
}

#parallel processing the scrape
cl <- makeCluster(mc <- getOption("cl.cores", detectCores()))
clusterExport(cl=cl, varlist=c("LIST", "Trim", "FUN", "term.count", "htmlTreeParse",
    "getNodeSet", "xmlToList"), envir=environment())

L1 <- parLapply(cl, LIST, function(x) {
    Sys.sleep(.75)
    try(FUN(x))
})

stopCluster(cl) #stop the cluster

names(L1) <- LIST

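
For reference, each successful element of L1 is a single string in which numbered entries are joined by " @@@@ ". A minimal sketch of unpacking one such string in base R follows; the sample string and the names `scraped`, `senses`, and `words` are made up for illustration and are not taken from the real scrape.

```r
# Hypothetical example of one string in the format FUN() builds:
# "[1]<comma-separated words> @@@@ [2]<comma-separated words>"
scraped <- "[1]exceptional, great, superb @@@@ [2]due, ongoing, unpaid"

# Split on the literal separator (fixed = TRUE: not a regular expression).
senses <- strsplit(scraped, " @@@@ ", fixed = TRUE)[[1]]

# Drop the leading "[n]" counter from each entry.
senses <- sub("^\\[[0-9]+\\]", "", senses)

# Split each entry on commas and trim surrounding whitespace.
words <- lapply(strsplit(senses, ",", fixed = TRUE), trimws)
```

Elements that came back as "***XX", NA, or a try-error would need to be filtered out before this step.
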
trinker commented 10 years ago

The antonym marker doesn't appear to be a tag but a header:

...d="ID0ETD" style="color:#0;" direction="">Antonyms<span...

This makes parsing more difficult. One possibility is to split the raw page source on Antonyms<span right away, then take the first piece and parse only that.
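
A rough sketch of that split in base R, assuming the page source is already available as one character string. The `html` value below is a simplified, invented stand-in for the real markup, and `keep_before_antonyms` is a hypothetical helper name, not part of qdap or the script above.

```r
# Invented, simplified stand-in for the raw page source; the real markup differs.
html <- paste0(
  '<span direction="target">exceptional, great, superb</span>',
  '<span id="ID0ETD" style="color:#0;" direction="">Antonyms',
  '<span direction="target">ordinary, dull</span></span>'
)

# Split on the literal header text and keep only what precedes it.
# fixed = TRUE treats the marker as a plain string, not a regular expression.
# If the marker is absent, the whole input is returned unchanged.
keep_before_antonyms <- function(html, marker = "Antonyms<span") {
  strsplit(html, marker, fixed = TRUE)[[1]][1]
}

synonym_part <- keep_before_antonyms(html)
```

The piece that is kept ends with an unclosed span, so it would still need a lenient parser; passing it to htmlTreeParse with asText = TRUE is one option, though that has not been tried against the real pages here.
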