simongray / StatementAnnotator

Custom annotator for Stanford CoreNLP that annotates sentences with the underlying statements contained within them.

What are various alternative entities or topics? #15

Closed simongray closed 8 years ago

simongray commented 8 years ago

That could be used as a complement or alternative to regular named entities.

simongray commented 8 years ago

Issue #4 is continued here.

simongray commented 8 years ago

As mentioned in issue #17, TokensRegex and RegexNER are a good starting point for finding alternative entities.

simongray commented 8 years ago

One source of entities is this huge reddit dataset available on BigQuery: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2015_05

Reddit post: https://www.reddit.com/r/bigquery/comments/3cej2b/17_billion_reddit_comments_loaded_on_bigquery/

For example, the top 1000 list of subreddits by subscribers might be a starting point for entities, as subreddits are often used as entities on reddit.

simongray commented 8 years ago

It seems that RegexNER is too simple for my use case (e.g. to match "/r/" for subreddit refs). After looking at the code, it seems like it only supports exact string matches and some basic logic rules.
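To make the "/r/" case concrete, here is a minimal sketch of the kind of match I mean, written with plain Python `re` rather than TokensRegex syntax. The function name `find_subreddit_refs` is made up for illustration, and the assumption that subreddit names are 2 to 21 letters, digits, or underscores is mine, not something taken from reddit's documentation.

```python
import re

# Assumed shape of a subreddit reference: an optional leading slash, "r/",
# then a name of 2-21 letters, digits or underscores (assumption, see above).
# The lookbehind skips matches glued to a preceding word character or slash,
# so references embedded in full URLs are deliberately ignored.
SUBREDDIT_REF = re.compile(r"(?<![\w/])/?r/([A-Za-z0-9_]{2,21})")


def find_subreddit_refs(text):
    """Return the subreddit names referenced in text, in order of appearance."""
    return SUBREDDIT_REF.findall(text)


if __name__ == "__main__":
    print(find_subreddit_refs("I saw it on /r/worldnews and r/bigquery"))
```

This only illustrates the pattern itself; an exact-string matcher cannot express it, which is why something regex-capable is needed.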

simongray commented 8 years ago

TokensRegexNER gives standard RegexNER behaviour with actual regex support added: http://slidegur.com/doc/263860/tokensregexoverview

simongray commented 8 years ago

Fortune 500 maybe?

Seems like an interesting range of entities.

simongray commented 8 years ago

TokensRegex links

Unfortunately, most of the documentation available is just javadoc stuff.

The best quick ref: http://nlp.stanford.edu/software/tokensregex.html#TokensRegexPatterns

Root of javadoc: http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/package-summary.html

Explaining syntax

http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/CoreMapExpressionExtractor.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/SequenceMatchRules.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/types/Expressions.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/TokenSequencePattern.html

Explaining Env config

http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/Env.html

Annotators

http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/TokensRegexAnnotator.html
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/TokensRegexNERAnnotator.html

Code examples

https://github.com/stanfordnlp/CoreNLP/tree/master/doc/tokensregex/examples
https://github.com/stanfordnlp/CoreNLP/tree/master/src/edu/stanford/nlp/ling/tokensregex/demo
https://github.com/stanfordnlp/CoreNLP/tree/master/src/edu/stanford/nlp/ling/tokensregex/demo/rules

Other

http://nlp.stanford.edu/software/tokensregex.shtml#TokensRegexPatterns

For finding words that end with a given suffix: http://www.litscape.com/word_tools/ends_with.php

simongray commented 8 years ago

Maybe also take a look at this: http://nlp.stanford.edu/software/patternslearning.html

simongray commented 8 years ago

Using /r/worldnews headlines (or some other news source)

I can use the BigQuery stuff to get a batch of headlines and then use something like the important ngrams method from NLTK to extract entities.
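A rough sketch of the idea, using only the standard library as a stand-in for whatever NLTK method ends up being used: count n-grams across a batch of headlines and keep the ones that recur. The headlines, the stopword list, and the names `tokenize`, `candidate_ngrams` and `top_ngrams` are all made up for illustration; real input would come from the BigQuery export.

```python
import re
from collections import Counter

# Small illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "to", "for", "and", "with", "as", "at", "by", "is"}
)

# Made-up headlines standing in for a batch pulled from /r/worldnews.
HEADLINES = [
    "European Union agrees on new climate targets",
    "Climate targets divide the European Union",
    "Talks on climate targets stall",
]


def tokenize(headline):
    """Lowercase the headline and split it into word-like tokens."""
    return re.findall(r"[A-Za-z0-9']+", headline.lower())


def candidate_ngrams(tokens, n):
    """Yield n-grams that neither start nor end with a stopword."""
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
            continue
        yield gram


def top_ngrams(headlines, n=2, min_count=2):
    """Return (ngram, count) pairs seen at least min_count times, most frequent first."""
    counts = Counter()
    for headline in headlines:
        counts.update(candidate_ngrams(tokenize(headline), n))
    return [(" ".join(gram), c) for gram, c in counts.most_common() if c >= min_count]


if __name__ == "__main__":
    for ngram, count in top_ngrams(HEADLINES):
        print(count, ngram)
```

Raw frequency is the crudest possible notion of "important"; an association measure over the same counts would likely rank candidates better, but the pipeline shape stays the same.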

simongray commented 8 years ago

I'll be using the more general approach with subjects as described in #23 instead of the focus on named entities.