simongray closed 8 years ago
issue #4 is continued here
As mentioned in issue #17, TokensRegex and RegexNER are a good starting point for finding alternative entities.
One source of entities is this huge reddit dataset available on BigQuery: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05
Reddit post: https://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/
For example, the top 1000 list of subreddits by subscribers might be a starting point for entities, as subreddits are often used as entities on reddit.
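A rough sketch of how subreddit mentions could be pulled out of comment text to build such an entity list. This is plain Python `re`, not RegexNER or TokensRegex, and the name rule (2 to 21 letters, digits, or underscores) is an assumption rather than something checked against reddit. `count_subreddit_mentions` and `sample_comments` are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Assumption: subreddit names are 2-21 characters of letters, digits, or underscores.
# The lookbehind keeps the pattern from firing inside ordinary words such as "bar/baz".
SUBREDDIT_RE = re.compile(r"(?<![A-Za-z0-9_])/?r/([A-Za-z0-9_]{2,21})\b")


def count_subreddit_mentions(comments):
    """Count how often each subreddit is mentioned across an iterable of comment strings."""
    counts = Counter()
    for text in comments:
        for name in SUBREDDIT_RE.findall(text):
            # Lowercase so that /r/AskScience and r/askscience are counted together.
            counts[name.lower()] += 1
    return counts


# Made-up comments, standing in for rows exported from the BigQuery table.
sample_comments = [
    "This belongs in /r/AskScience, not here.",
    "See r/askscience and /r/programming for more.",
    "A URL like example.com/foo/bar should not match.",
]
mention_counts = count_subreddit_mentions(sample_comments)
```

The resulting counts could then be compared against the top-1000 list to decide which names are worth treating as entities.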
It seems that RegexNER is too simple for my use case (e.g. to match "/r/
Using TokensRegexNER to get standard RegexNER with added actual regex: http://slidegur.com/doc/263860/tokensregexoverview
Seems like an interesting range of entities.
Unfortunately, most of the documentation available is just javadoc stuff.
The best quick ref: http://nlp.stanford.edu/software/tokensregex.html#TokensRegexPatterns
Root of javadoc: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/package-summary.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/CoreMapExpressionExtractor.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/SequenceMatchRules.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/types/Expressions.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/TokenSequencePattern.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/Env.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/TokensRegexAnnotator.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/TokensRegexNERAnnotator.html
https://github.com/stanfordnlp/CoreNLP/tree/master/doc/tokensregex/examples
https://github.com/stanfordnlp/CoreNLP/tree/master/src/edu/stanford/nlp/ling/tokensregex/demo
https://github.com/stanfordnlp/CoreNLP/tree/master/src/edu/stanford/nlp/ling/tokensregex/demo/rules
http://nlp.stanford.edu/software/tokensregex.shtml#TokensRegexPatterns
FOR FINDING WORDS THAT END WITH SOMETHING: http://www.litscape.com/word_tools/ends_with.php
Maybe also take a look at this: http://nlp.stanford.edu/software/patternslearning.html
I can use BigQuery to pull a batch of headlines and then use something like the important ngrams method from NLTK to extract entity candidates.
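To make the n-gram idea concrete, here is a minimal standard-library sketch that ranks bigrams in a batch of headlines by pointwise mutual information (PMI). It is a stand-in for whatever NLTK method ends up being used, not NLTK itself. The tokenizer, the `min_count` threshold, and the sample headlines are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(headline):
    """Very rough tokenizer: lowercase runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", headline.lower())


def score_bigrams(headlines, min_count=2):
    """Return ((word1, word2), pmi) pairs, highest PMI first.

    Bigrams seen fewer than min_count times are dropped, since PMI
    overrates pairs that occur only once.
    """
    unigrams = Counter()
    bigrams = Counter()
    for headline in headlines:
        tokens = tokenize(headline)
        unigrams.update(tokens)
        # Bigrams never cross headline boundaries.
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up headlines, standing in for a batch pulled from BigQuery.
sample_headlines = [
    "New York mayor visits Elon Musk factory",
    "Elon Musk announces new rocket",
    "Snow hits New York again",
    "Mayor of New York responds",
]
ranked = score_bigrams(sample_headlines)
candidates = [pair for pair, _ in ranked]
```

On this toy batch only `("elon", "musk")` and `("new", "york")` pass the count threshold. On real data the threshold and the tokenizer would both need tuning.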
I'll be using the more general approach with subjects as described in #23 instead of the focus on named entities.
That could be used as a complement or alternative to regular named entities.