ropensci / textworkshop18


Wish list #6

Open patperry opened 6 years ago

patperry commented 6 years ago

What software (function, library, program) doesn't exist yet but you wish did? What could you do with this? Why do current offerings not meet your needs?

trinker commented 6 years ago

This is more task-based thinking around this question than a list of specific tools or libraries. Just throwing raw thoughts out there on this one for later. I'm likely wrong on many of these and would like to hear many times "You are wrong, such and such a tool does this really well..."

  1. Text visualization [systematic; not one-offs, real thought has gone into it]. It'd be nice to see plotting methods for particular objects, similar to pandas' out-of-the-box plotting
  2. Domain-specific dictionary store and interface: [A few R examples: termco, liwcalike, quanteda.dictionaries, lexicon, misinfo, but more systematic and better tested]
  3. Likewise, an interface to domain-specific corpora [with demographics/metadata: e.g., ages, location, gender, sex, education level, employment, etc.]
  4. Text growth [learning in a domain or improvement in writing]; i.e., given 2 texts can we show improvement in knowledge or in writing ability
  5. Automatic topic category assignment for topic modeling [from what I've seen, topics still have to be human generated]
  6. Sarcasm detection
  7. Inappropriate language and/or fake news language detection [suicide, depression, racism, lies, vulgarity, deceitfulness]; currently, I see mostly dictionary lookups of "Bad Words"
  8. Tools for parsing predictable text structures/scripts/templates: e.g., resumes, schedules, transcripts, newspapers, course syllabi
  9. Auto-encoding [fix -> to plain text]; maybe there's something that does this already but wouldn't it be nice if there was an EASY button
  10. Expert rules regex modeling [you've gotta prime the machine learning pump somehow; giving users "what we think" allows for Cunningham's law to be invoked and a chance to get coded data]
  11. Sentiment training/testing corpus across many domains
  12. Parallel operations (maybe they exist, but they aren't well advertised and, from what I see, not OS-independent)
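
To make item 10 a bit more concrete, here is a minimal sketch of what "expert rules regex modeling" could look like. It is in Python using only the standard `re` module. The rule set, the label names, and the `weak_label` helper are all hypothetical and made up for illustration; none of them come from an existing package. The idea is only that a handful of inspectable rules produces a first pass of coded data that others can correct.

```python
import re

# Hypothetical expert rules as (label, pattern) pairs. These are illustrative
# guesses, not a vetted dictionary; they are meant to be easy to read and correct.
RULES = [
    ("complaint", re.compile(r"\b(refund|broken|never arrived)\b", re.IGNORECASE)),
    ("praise", re.compile(r"\b(love|great|excellent)\b", re.IGNORECASE)),
]


def weak_label(text, rules=RULES):
    """Return the label of every rule whose pattern matches the text."""
    return [label for label, pattern in rules if pattern.search(text)]


texts = [
    "The package never arrived and I want a refund.",
    "Great service, I love it!",
    "Where is the store located?",
]

# Texts that no rule matches stay uncoded and can be routed to a human coder.
for text in texts:
    labels = weak_label(text)
    print(labels or ["<uncoded>"], text)
```

The labels that come out of this are only a starting point: disagreements with the rules are exactly the corrections that would turn into training data for a model later.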