people() seems to get confused by commas

sandro-pasquali commented 5 months ago

Great library!

Loving the various entity extraction utilities. They work great. One I use is people(). However, it seems to be unable to separate a list of comma-separated names into individual names, at least in this case.

This is what I'm seeing [ NodeJs 22, OSX, "compromise": "^14.13.0" ]:

import Nlp from 'compromise';

const text = `The NAACP’s founding members included white progressives Mary White Ovington, Henry Moskowitz, William English Walling and Oswald Garrison Villard, along with such African Americans as W.E.B. Du Bois, Ida B. Wells, Archibald Grimke and Mary Church Terrell.`;

const processed = Nlp(text);
console.log(processed.people().out('array'));

// [
//     'Mary White Ovington, Henry Moskowitz, William English Walling',
//     'Oswald Garrison Villard,',
//     'Ida B. Wells, Archibald Grimke',
//     'Mary Church Terrell.'
// ]

As a side note, you can also see it isn't catching W.E.B Du Bois but that seems a complex pattern, and prob best here would be to add to the custom lexicon I'm guessing.

Thanks again for compromise!

spencermountain commented 5 months ago

hey Sandro - good catch! Happy to fix this for the next release. cheers

spencermountain commented 3 months ago

got it fixed in 14.14.0! cheers

sandro-pasquali commented 3 months ago

I'd like to tell and celebrate the value of the work you do to build and maintain this excellent library. So I'll share a concrete example of the positive impact your diligence makes.

This is the test output I was seeing which prompted my original question:

  people: [
    'Mary White Ovington Henry Moskowitz William English Walling',
    'Oswald Garrison Villard',
    'Ida B Wells Archibald Grimke',
    'Mary Church Terrell'
  ],

Then you released 14.14.0, and I updated to that version. I did nothing else.

This is now the test output:

  people: [
    'Mary White Ovington',
    'Henry Moskowitz',
    'William English Walling',
    'Oswald Garrison Villard',
    'Ida B Wells',
    'Archibald Grimke',
    'Mary Church Terrell'
  ],

Happy start to the day. Thank you.

priley86 commented 2 months ago

hey @spencermountain - in relationship to this issue, wanted to quickly ask if you can think of any easy methodology or option to remove periods from the .people(), .places(), .organizations() extractors? I'm noting you have .clauses(), but was hoping for some insight on how you'd approach if needed.

I'm noting that periods always seem to be included w/ these extractors when they fall at the end of a sentence. ex:

      const prompt =
        'Hello my name is John Doe. My email is john@gmail.com. I live in New York. Jane Smith also works at my company as the chief operating officer and lives in New Jersey. Our company is Smith & Doe LLC.';

       const processed = Nlp(prompt);
      console.log('people:', processed.people().out());
      // ['John Doe.', 'Jane Smith', 'John Doe', 'Jane Smith']

(also depicted in your documentation: https://observablehq.com/@spencermountain/topics-named-entity-recognition)

Whereas, i'd love the option to make the output be:

      // ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith']

So that my tokenization utilities applied afterward can correctly identify the same token & correctly locate offsets w/o periods (and better align entities in ML models).

Full context here, i'm experimenting w/ this NER along with a few others and seeing some inconsistency w/ some including and others not including the period. Currently using the nlp.js builtin-compromise here for extracting entities: https://github.com/axa-group/nlp.js/blob/master/packages/builtin-compromise/src/builtin-compromise.js#L106-L108

const { dockStart } = require('@nlpjs/basic');
const { BuiltinCompromise } = require('@nlpjs/builtin-compromise');
const dock = await dockStart({
    settings: {
      nlp: {
        forceNER: true,
        languages: ['en'],
      },
    },
    use: ['Basic', 'LangEn'],
  });

  // Register Builtins
  const ner = dock.get('ner');
  ner.container.register('extract-builtin-??', new BuiltinMicrosoft(), true);

  const builtinCompromise = new BuiltinCompromise({
    enable: [
      'hashtags',
      'person',
      'place',
      'organization',
      'email',
      'phonenumber',
      'date',
      'url',
      'number',
      'dimension',
    ],
  });
  ner.container.register('extract-builtin-??', builtinCompromise, true);

  const manager = dock.get('nlp');

  const response: INlpJsEntityResponse = await manager.process(text);

  const entities = response.entities.map((entity) => ({
    text: entity.utteranceText,
    type: entity.entity.split('_')[0], // e.g., 'person' from 'person_1' (ignoring the type count provided from nlp.js)
    startOffset: entity.start,
    // noting that the end offset is exclusive of the last character w/ nlp.js plugins
    endOffset: entity.end + 1,
    confidenceScore: entity.accuracy,
  }));

Any thoughts or suggestions?

priley86 commented 2 months ago

It would seem we need some additional grammatical handling here for .sentences() and .clauses() as well ;-)

https://www.quora.com/When-ending-a-sentence-with-an-abbreviation-that-ends-with-a-period-do-you-place-an-additional-period#:~:text=Contemporary%20style%20is%20to%20use,used%20and%20are%20not%20wrong.

spencermountain commented 2 months ago

hey Patrick, you can print the matches off with any text options you'd like. There are some janky default choices about when to include sentence-end punctuation in the text output, which you can always override with config.

I would do something like this:

const prompt =  'Hello my name is John Doe. My email is john@gmail.com. I live in New York. Jane Smith also works at my company as the chief operating officer and lives in New Jersey. Our company is Smith & Doe LLC.';

let opts = {trim:true, keepPunct:false}
const processed = nlp(prompt).people();

processed.forEach(person => {
  console.log(person.text(opts))
})

cheers

priley86 commented 2 months ago

hey @spencermountain ! Really appreciate your work on this library and this response.

I understand that maybe this could be used in the internals of nlp.js 's built-in here: https://github.com/axa-group/nlp.js/blob/master/packages/builtin-compromise/src/builtin-compromise.js#L106-L108

For now though, i did some post-processing here using your .sentences() helper:

// Returns an array of sentence objects **that are statements** with start and end offsets
export function promptStatementSentences(
  text: string
): ICompromiseSentenceMetaExtended[] {
  const processed = compromise(text);
  const sentences = processed.sentences().json();
  let startingAt = 0;

  return sentences.map((sM: ICompromiseSentenceMeta) => {
    const sentenceStart = text.indexOf(sM.text, startingAt);
    const sentenceEnd = sentenceStart + sM.text.length;
    startingAt = sentenceEnd;

    return {
      ...sM,
      start: sentenceStart,
      end: sentenceEnd,
    };
  });
}

/**
 * Applies some post processing to nlp.js entities response to format in a consistent manner with
 * other entity recognition clients.
 * @param text original prompt text
 * @param entities raw nlp.js entities response
 */
export function processNlpjsEntitiesResponse(
  text: string,
  rawEntities: INlpJsEntityResponse['entities']
) {
  const entities: Entity[] = [];

  const sentences = promptStatementSentences(text);

  for (const entity of rawEntities) {
    if (entity.utteranceText.endsWith('.')) {
      const sentence = sentences.find(
        (s) => s.start <= entity.start && s.end >= entity.end
      );
      if (sentence && sentence.text.endsWith(entity.utteranceText)) {
        // check if this entity is at the end of a sentence and the period should be omitted
        entity.utteranceText = entity.utteranceText.slice(0, -1);
      }
    }
    entities.push({
      text: entity.utteranceText,
      type: entity.entity.split('_')[0], // e.g., 'person' from 'person_1' (ignoring the type count provided from nlp.js)
      startOffset: entity.start,
      endOffset: entity.start + entity.utteranceText.length,
      confidenceScore: entity.accuracy,
    });
  }

  return entities;
}

...
// code mentioned above
  const ner = dock.get('ner');
  // ner.container.register('extract-builtin-??', new BuiltinDefault(), true);
  ner.container.register('extract-builtin-??', new BuiltinMicrosoft(), true);

  const builtinCompromise = new BuiltinCompromise({
    enable: [
      'hashtags',
      'person',
      'place',
      'organization',
      'email',
      'phonenumber',
      'date',
      'url',
      'number',
      'dimension',
    ],
  });

  const response: INlpJsEntityResponse = await manager.process(text);

  //apply post-processing
  return processNlpjsEntitiesResponse(text, response.entities);

This seems to work pretty well when entities are at the end of the sentence (as long as they are not abbreviated).

One other challenge I'm definitely noting here though with this approach is the .sentences() and the .clauses() helpers and grammatical handling with respect to closing sentences and abbreviations.

https://www.quora.com/When-ending-a-sentence-with-an-abbreviation-that-ends-with-a-period-do-you-place-an-additional-period#:~:text=Contemporary%20style%20is%20to%20use,used%20and%20are%20not%20wrong

Some examples they don't seem to properly note the end of the clause/sentence when a single period is used (to complete the acronym and the sentence):

a.m. — “Her surgery is scheduled for Wednesday at 10:30 a.m.” p.m. — “Please be home for dinner by 6:15 p.m.” U.S. — “I grew up in various countries, but I have spent most of my life in the U.S.” etc. (U.S. style) — “Over the years I have had different pets: a dog, cats, chameleon, turtle, bunny, hermit crab, fish, etc.” Calif. — “Shaun used to live in Los Angeles, Calif.”

spencermountain / compromise

people() seems to get confused by commas #1111