ropensci / ozunconf18

repository for the rOpenSci ozunconference 2018

`syn`: a package for generating synonyms and antonyms 📘 🔡 #9

Open njtierney opened 5 years ago

njtierney commented 5 years ago

What it says on the tin!

I've had this idea for a little while, mainly to stop me from going to Google to look for synonyms. I haven't made any progress, but a stub of a package is here: https://github.com/njtierney/syn

The goal of syn is to provide two main functions: `syn()`, to generate synonyms, and `ant()`, to generate antonyms.

There are other packages that do this, but they usually do this in the context of other text-related work.

In terms of applications, I would use this all the time to output a set of (syn/ant)onyms for words in the terminal, but I imagine it could also be useful for the type of text analysis where you might want to search for similar words? I have 0 experience with text analysis, so perhaps there are better tools for that already.

ekothe commented 5 years ago

@njtierney So you're thinking of something where you'd provide a word and the package would report the syn or ant based on some pre-specified dictionary?

So ant("good") would return [1] bad [2] wicked?

njtierney commented 5 years ago

Yup! Exactly that! I think that the trick is finding a good quality open source thesaurus that can be downloaded or provided with the package. This would mean that we avoid internet API calls so it would be fast, and not require an API key or internet.

But yes, I imagine it would be something like this:

syn("good")
[1] great fantastic excellent happy

Lingtax commented 5 years ago

Wouldn’t it be preferable to return a vector? That might make it easier on possible secondary arguments (e.g. return a variable number of values, and/or select return values based on default order/randomly). Just a thought.
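As a rough sketch of that suggestion, the snippet below returns a plain character vector and adds two secondary arguments. The tiny `toy_thesaurus` list and the `n`, `random`, and `thesaurus` arguments are made up for illustration and are not part of the syn stub; a real package would presumably ship a proper open thesaurus as package data.

```r
# Toy stand-in for a bundled thesaurus (hypothetical data, not a real dataset).
toy_thesaurus <- list(
  good = c("great", "fantastic", "excellent", "happy"),
  bad  = c("poor", "awful", "terrible")
)

# Hypothetical signature: `n` caps the number of results,
# `random` shuffles the results instead of keeping the stored order.
syn <- function(word, n = Inf, random = FALSE, thesaurus = toy_thesaurus) {
  matches <- thesaurus[[tolower(word)]]
  # Unknown words give an empty character vector rather than an error.
  if (is.null(matches)) return(character(0))
  if (random) matches <- sample(matches)
  matches[seq_len(min(n, length(matches)))]
}

syn("good")
#> [1] "great"     "fantastic" "excellent" "happy"
syn("good", n = 2)
#> [1] "great"     "fantastic"
syn("good", n = 2, random = TRUE)  # two synonyms in random order
syn("hufflepuff")
#> character(0)
```

Returning a character vector means the result can be subset, sampled, or piped into other functions without any parsing.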

RPanczak commented 5 years ago

Really cool idea. In the long run, that could be a useful thing for editing longer prose inside markdown, perhaps?

Are you aware of any publicly available data that could be used for that? Or an API?

markdly commented 5 years ago

Nice idea. Would something like the Wiktionary Thesaurus be suitable as a data source?

A while back I had some mixed success downloading quotes for word lists from the Quotations Wiktionary. I imagine this might be similar to accessing the thesaurus information.

markdly commented 5 years ago

FWIW, here's the old code I used for downloading quotes, in case it's useful:

####
# Wiktionary quotes
####
# Description: Obtain phrases from wiktionary for given words.
# References:
# https://en.wiktionary.org/wiki/Wiktionary:Quotations
# https://en.wiktionary.org/wiki/Wiktionary:Entry_layout#Example_sentences

library(httr)
library(stringr)

wiki_quote <- function(some_word) {
  # Fetch the raw wikitext of the wiktionary entry for `some_word`
  response <- GET(paste0("https://en.wiktionary.org/w/index.php?title=", some_word, "&action=raw"))
  some_text <- content(response, "text", encoding = "UTF-8")  # e.g. text content of a wiktionary page
  # Nothing to search if the page came back empty
  if (length(some_text) == 0 || is.na(some_text) || nchar(some_text) == 0) return(NA)

  some_pattern <- paste0("#:[^\n]+?'''", some_word, "'''.+?\n")  # e.g. a wiktionary quote "#: There was a dark storm brewing.\n"
  raw_match <- regexpr(some_pattern, some_text)  # regexpr() finds the first match only
  if (raw_match == -1) return(NA)

  matched_substrings <- regmatches(some_text, raw_match)
  lapply(matched_substrings, tidy_quote)
}

tidy_quote <- function(quote) {
  temp <- str_replace(quote, "\\{\\{ux\\|en\\|",  "")
  temp <- str_replace(temp, "\\}\\}", "")
  temp <- str_replace(temp, "#:", "")
  trimws(temp) 
}

wiki_quote("storm")
#> [[1]]
#> [1] "''The proposed reforms have led to a political '''storm'''.''"
wiki_quote("sunshine")
#> [[1]]
#> [1] "We were warmed by the bright '''sunshine'''."
wiki_quote("hufflepuff")  # nonsense word - should return NA
#> [1] NA

Created on 2018-11-09 by the reprex package (v0.2.0).
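
If the Wiktionary route were pursued for the thesaurus data, a similar regex approach might pull synonyms and antonyms out of the raw wikitext. The sketch below runs on a hard-coded sample string rather than a live page, and it assumes entries mark these relations with inline templates of the form `{{syn|en|...}}` and `{{ant|en|...}}`. Real pages vary and would likely need more handling. `extract_terms()` is a hypothetical helper, not an existing function.

```r
# Hard-coded sample standing in for raw wikitext
# (an assumption for illustration, not fetched from a live page).
sample_wikitext <- paste(
  "# Acting in the interest of what is beneficial or moral.",
  "#: {{syn|en|great|excellent}}",
  "#: {{ant|en|bad|evil}}",
  sep = "\n"
)

# Hypothetical helper: return the terms listed in the first
# {{syn|en|...}} or {{ant|en|...}} template found in `wikitext`.
extract_terms <- function(wikitext, template = c("syn", "ant")) {
  template <- match.arg(template)
  pattern <- paste0("\\{\\{", template, "\\|en\\|[^}]*\\}\\}")
  hit <- regmatches(wikitext, regexpr(pattern, wikitext))
  if (length(hit) == 0) return(character(0))
  # Strip the template wrapper, then split the remaining terms on "|".
  inner <- sub(paste0("^\\{\\{", template, "\\|en\\|"), "", hit)
  inner <- sub("\\}\\}$", "", inner)
  strsplit(inner, "|", fixed = TRUE)[[1]]
}

extract_terms(sample_wikitext, "syn")
#> [1] "great"     "excellent"
extract_terms(sample_wikitext, "ant")
#> [1] "bad"  "evil"
```

To keep the package free of internet calls, as discussed earlier in the thread, this kind of parsing would probably run once over a downloaded dump to build a lookup table, rather than at query time.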