mediacloud / cliff-annotator

A lightweight server to allow HTTP requests to the Stanford Named Entity Recognizer and a heavily modified CLAVIN geoparser.
https://cliff.mediacloud.org
Apache License 2.0
119 stars, 35 forks

Case insensitivity? #27

krdyke closed this issue 10 years ago

krdyke commented 10 years ago

Hi there!

We're planning to utilize CLIFF as part of a broader project on the history of hip hop in the Twin Cities. The idea is to feed lyrics into the parser and see what sort of geographical rhyming is happening. Not quite the use case you envisioned, I imagine, but that's the beauty of FOSS, right?

Anyways, based on the lyrics we've collected/seen, many sources do not capitalize place names. From my testing it seems that CLIFF's text parser is case sensitive, and I'm wondering if there's a fairly painless way to make it case insensitive?

If you could at least point me to the direction in the code, I can take a crack at it.

Thanks!

kanarinka commented 10 years ago

First of all, that is completely awesome as a use case of CLIFF!!

Secondly, you are right that the parser is case sensitive. This comes from the underlying Stanford CoreNLP parser, which uses the case of the text as a signal when extracting entities. You would want to download a caseless model for Stanford NER, which you can find here: http://nlp.stanford.edu/software/CRF-NER.shtml

And then integrate that into the CLAVIN technology that underlies CLIFF -- https://github.com/Berico-Technologies/CLAVIN

You might also try posting on CLAVIN's GitHub account to see if anyone has integrated a caseless version of the parser; maybe you could just use their code. There would be a wide variety of applications, like parsing Twitter and text messages, for example.

Let me know if we can help you further - would love to see the final result.

Catherine

krdyke commented 10 years ago

Thanks for the tips! We'll keep you apprised of how things progress. For now I'll close this issue. Thanks again!

charlieg commented 10 years ago

Indeed, if you are using the CLAVIN-NERD distribution (https://github.com/Berico-Technologies/CLAVIN-NERD) in CLIFF, you can load a caseless model for Stanford NER as Catherine mentioned. The "regular" version of CLAVIN, however, uses Apache OpenNLP for named entity recognition, and I'm not aware of any caseless models for OpenNLP.

kanarinka commented 10 years ago

Hey Charlie --

I've been meaning to contact you to let you know that Rahul and I wrote a paper about CLIFF-CLAVIN that was just accepted to a workshop at KDD about news knowledge discovery - http://ailab.ijs.si/~blazf/NewsKDD2014/

I'm attaching the paper here for your reference (Can I attach things in GitHub? Going to give it a shot). I tried emailing your bericotechnologies account, but it bounced.


krdyke commented 10 years ago

Thanks Charlie, I'll swap out CLAVIN for CLAVIN-NERD. That explains some things. I had implemented the caseless Stanford NER on the CLIFF side of things without messing with CLAVIN, and my results were, to say the least, interesting. Thanks again!

charlieg commented 10 years ago

Catherine, I just responded to you via email at your ikatun.org address. Please let me know if you don't receive it!

rahulbot commented 10 years ago

Short story - CLIFF is using Stanford-NER and it's not hard to drop in a different model.

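Details: CLIFF uses Stanford-NER, not Apache OpenNLP. However, we could easily be using a case-sensitive NER model. ParseManager.java#L232 (https://github.com/c4fcm/CLIFF/blob/master/src/main/java/org/mediameter/cliff/ParseManager.java#L232) is where it loads the model, but of course it just reads that choice from the config file. The README (https://github.com/c4fcm/CLIFF/blob/3135633059a78f9eb4bd0f06549f63a06458e143/README.md#nermodeltouse) explains how that works and which model we're using. To add a new model, you just have to add a case to this switch statement (https://github.com/c4fcm/CLIFF/blob/c52140218a25cc3bee992690d1d6fd5cba836776/src/main/java/org/mediameter/cliff/extractor/StanfordNamedEntityExtractor.java#L58) and edit the config file.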
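
For readers following along, the "add a case to the switch statement and edit the config file" step might look roughly like the sketch below. This is not CLIFF's actual code: the class name `NerModelSelector`, the enum constants, and the config key `ner.modelToUse` (guessed from the README anchor) are all assumptions made for illustration, and the model file names follow Stanford's naming as best recalled, so check them against the files you actually download.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical sketch: every name here is illustrative, not taken from CLIFF.
class NerModelSelector {

    /** Models selectable from the config file; the CASELESS entry is the new one. */
    enum Model { ENGLISH_ALL_3CLASS, ENGLISH_CONLL_4CLASS, ENGLISH_ALL_3CLASS_CASELESS }

    /** Maps a model choice to the serialized classifier file to load (names assumed). */
    static String modelFileFor(Model model) {
        switch (model) {
            case ENGLISH_ALL_3CLASS:
                return "english.all.3class.distsim.crf.ser.gz";
            case ENGLISH_CONLL_4CLASS:
                return "english.conll.4class.distsim.crf.ser.gz";
            case ENGLISH_ALL_3CLASS_CASELESS: // the added case for a caseless model
                return "english.all.3class.caseless.distsim.crf.ser.gz";
            default:
                throw new IllegalArgumentException("Unknown model: " + model);
        }
    }

    /** Reads the (assumed) config key and resolves it to a model file name. */
    static String modelFileFromConfig(Properties config) {
        String name = config.getProperty("ner.modelToUse", "ENGLISH_ALL_3CLASS");
        // Model.valueOf throws IllegalArgumentException for names not in the enum.
        return modelFileFor(Model.valueOf(name.trim()));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real config file, with the new model selected.
        Properties config = new Properties();
        config.load(new StringReader("ner.modelToUse=ENGLISH_ALL_3CLASS_CASELESS\n"));
        System.out.println(modelFileFromConfig(config));
    }
}
```

Keeping the choice in the config file means switching between the cased and caseless models needs no recompile, only the one new `case` the first time a model is introduced.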

krdyke commented 10 years ago

Interesting. That was what I did in the first place (see it here https://github.com/SemanticArchives/CLIFF/blob/d75ed0eb7e8e8cc5ad6a16761458a8ea09219113/src/main/java/org/mediameter/cliff/extractor/StanfordNamedEntityExtractor.java#L58 on our fork).

It seemed that I was getting odd results, but I think I'll do more extensive testing (I only used a couple test strings).
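
One way to make that testing more systematic is to run each test sentence through the extractor twice, once as written and once lowercased, and report the place names that disappear. The sketch below assumes the extractor can be wrapped as a function from text to a set of names; `CaseSensitivityCheck` and its methods are hypothetical, and the toy gazetteer lookup only stands in for a real call into CLIFF or Stanford NER.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical harness: names are illustrative, not part of CLIFF or CLAVIN.
class CaseSensitivityCheck {

    /** For each sentence, collects entities found in the original text but missed once it is lowercased. */
    static Map<String, Set<String>> lostWhenLowercased(
            Iterable<String> sentences, Function<String, Set<String>> extractor) {
        Map<String, Set<String>> lost = new LinkedHashMap<>();
        for (String sentence : sentences) {
            Set<String> expected = lowercaseAll(extractor.apply(sentence));
            Set<String> actual = lowercaseAll(extractor.apply(sentence.toLowerCase(Locale.ROOT)));
            Set<String> missing = new LinkedHashSet<>(expected);
            missing.removeAll(actual);
            if (!missing.isEmpty()) {
                lost.put(sentence, missing);
            }
        }
        return lost;
    }

    /** Normalizes names so results from the two runs can be compared. */
    static Set<String> lowercaseAll(Set<String> names) {
        Set<String> out = new LinkedHashSet<>();
        for (String name : names) {
            out.add(name.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Toy stand-in for a case-sensitive extractor; replace with a real NER call.
    static final List<String> TOY_GAZETTEER = Arrays.asList("Minneapolis", "St. Paul");

    static Set<String> toyCaseSensitiveExtractor(String text) {
        Set<String> found = new LinkedHashSet<>();
        for (String place : TOY_GAZETTEER) {
            if (text.contains(place)) { // exact-case match only
                found.add(place);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("Born in Minneapolis", "no places here");
        System.out.println(lostWhenLowercased(sentences,
                CaseSensitivityCheck::toyCaseSensitiveExtractor));
    }
}
```

With a caseless model plugged in as the extractor, the returned map should come back empty or close to it; anything left in it points at sentences worth inspecting by hand.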
