spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

Better Replacement Of Subsets Of Text #609

Open AlexanderKidd opened 5 years ago

AlexanderKidd commented 5 years ago

First off, really glad that this library is around!

I had to write some ugly regex/manual token splitting in order to strip the periods from acronyms/abbreviations in an article of text:

 // strip hyphens between stacked abbreviations (e.g. "Brig.-Gen.") so they split apart
 nlpText = nlp(scrapedText.replace(/\.-/g, '. '));
 abbrList = nlpText.match('(#Acronym|#Abbreviation)').text();

 abbrList.split(' ').forEach(function(token) {
   if (token) {
     // note: the periods inside token act as regex wildcards here,
     // since token isn't escaped before the pattern is compiled
     var re = new RegExp('\\s' + token + '\\s', 'g');
     var replace = token.replace(/\./g, '');
     scrapedText = scrapedText.replace(re, ' ' + replace + ' ');
   }
 });

Some issues I had:

  1. The first line handles an edge case: two abbreviations stuck together with a hyphen (e.g., Brig.-Gen. Irvin McDowell).

  2. The next line deals with accessing the acronym/abbreviation set and its functionality. It might not be a bad idea to expose the abbreviation.js portion of the script (I know it is module.exported, but for us client-side-only people some sort of access function or global would help 😄).

  3. More to the point: even when getting the abbreviations wasn't an issue, punctuation was. It would help to be able to iterate through a list of abbreviations/acronyms by calling nlpText.abbreviations() and then strip their punctuation the way acronyms().stripPeriods() does, or to have a normalization option for all abbreviations with punctuation (e.g., remove the periods from G., Maj., Dr., U.K.). The acronyms function didn't seem to strip periods from single letters, so it only half-solved what I wanted to accomplish.

  4. Lastly, it may seem like a match/replace would have solved this right off the bat. replace() did not seem to want to replace periods, though: a plain period appears to be treated as a wildcard, like in regex, and I couldn't get an escaped \. to fix it. match() seems to collect all pattern matches in the corpus, but replace() behaved oddly with punctuation. (I've sketched a standalone workaround just below this list.)
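That workaround looks roughly like this. It escapes regex metacharacters before compiling the pattern, so the periods in tokens like Dr. match literally; the only compromise call is the match that produces the list (and I believe .out('array') returns one string per matched phrase, though that's from memory):

 // standalone sketch: strip periods from abbreviations that compromise finds,
 // doing the text surgery in plain JS instead of through replace()
 var doc = nlp(scrapedText);
 var abbrevs = doc.match('(#Acronym|#Abbreviation)').out('array');

 abbrevs.forEach(function(raw) {
   var abbr = raw.trim();
   if (!abbr) return;
   // escape regex metacharacters so "Dr." matches a literal period
   var escaped = abbr.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
   var stripped = abbr.replace(/\./g, '');
   scrapedText = scrapedText.replace(new RegExp(escaped, 'g'), stripped);
 });

It's still string surgery, though, which is exactly the kind of round-tripping I'd love to avoid.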

I was hoping you either knew of a cleaner solution, or that this might help in tweaking some of the functionality of this awesome library. Thanks for creating it!

spencermountain commented 5 years ago

hi Alexander, thank you for this. Your timing is great.

yeah, replace() is borked. I made some bad data-model decisions a few years ago, and this is the reason for the big refactor that's taking place now.

I say your timing is good, because grabbing abbreviations from the lexicon (which can be configured) has been on and off my todo list. I'm firmly putting it back on now, based on what you've said.

I also fully agree with abbreviations().normalize() - love this idea. That's something that should be trivial to do.
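I'm picturing usage like this - strawman only, the normalize() part doesn't exist yet, and the exact names may shift in the refactor:

 // hypothetical api - a strawman, nothing here is promised
 let doc = nlp('Brig.-Gen. Irvin McDowell studied in the U.K.')
 doc.abbreviations().normalize()
 doc.text()
 // -> 'Brig-Gen Irvin McDowell studied in the UK'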

ah, I hate that you've had to flatten it into a text and then call .split(' '). You're not supposed to, but if you wanted, you could do something like this:

https://runkit.com/spencermountain/5d405f2a2c229e0013680b07 That just loops through the internal compromise Term objects, which will, at least, not require re-parsing the document.
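in case that runkit link ever dies, the gist is roughly this - I'm writing it from memory against the current internals, so the property names may be slightly off:

 // rough reconstruction from memory - internal shapes vary between versions
 let doc = nlp(scrapedText)
 doc.list.forEach((phrase) => {
   phrase.terms.forEach((term) => {
     // term.tags is keyed by tag name, e.g. { Abbreviation: true }
     if (term.tags.Abbreviation || term.tags.Acronym) {
       term.text = term.text.replace(/\./g, '')
     }
   })
 })
 doc.out('text')

mutating terms directly like this skips the usual re-tagging, so treat it as a hack, not an api.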

spencermountain commented 5 years ago

ugh. and I was having such a good day before seeing this! ;)

no but really, these are the kinds of problems I need to see. There's the compromise API and its functions, and there are features that people want at the periphery - and there's no clean way to extend it another yard. Then the issue becomes getting the data out of compromise, doing things yourself, and getting it back in. That stuff sucks.

okay, just thinking aloud. Thanks.

AlexanderKidd commented 5 years ago

The plugins idea seems like it will work well. I think the current design follows SOLID (especially the open/closed principle: open to extension, closed to modification).

Plugins act as a sort of import function to ingest new words and keep them in the global nlp instance (rough sketch below). I see Issue #505 would like synonyms, so eventually people will probably want full dictionary/thesaurus support. Maybe importing from a file, or referencing one? But I totally get it: you would start building a framework on top of an existing library, and that can get messy (inversion of control and all that).
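Paraphrasing the extension docs from memory (so treat the exact hook and method names as approximate), the ingest-new-words shape I mean is something like:

 // paraphrased from memory - the exact extension hook may differ by version
 const nlp = require('compromise')

 nlp.extend((Doc, world) => {
   // teach the shared lexicon some new vocabulary
   world.addWords({
     factoid: 'Noun',
     mcdowell: 'LastName',
   })
 })

 nlp('McDowell posted a factoid.').match('#LastName').text()
 // -> 'McDowell'

A dictionary/thesaurus plugin would presumably just be a much bigger addWords() payload, plus whatever tagging rules synonyms would need.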

I have been using it for my fact-checking project FactoidL, by the way: https://github.com/AlexanderKidd/FactoidL

One step at a time I suppose. Thank you.