Closed sandro-pasquali closed 3 months ago
hey Sandro - good catch! Happy to fix this for the next release. cheers
got it fixed in 14.14.0! cheers
I'd like to acknowledge and celebrate the value of the work you do to build and maintain this excellent library, so I'll share a concrete example of the positive impact your diligence has.
This is the test output I was seeing which prompted my original question:
people: [
'Mary White Ovington Henry Moskowitz William English Walling',
'Oswald Garrison Villard',
'Ida B Wells Archibald Grimke',
'Mary Church Terrell'
],
Then you released 14.14.0, and I updated to that version. I did nothing else.
This is now the test output:
people: [
'Mary White Ovington',
'Henry Moskowitz',
'William English Walling',
'Oswald Garrison Villard',
'Ida B Wells',
'Archibald Grimke',
'Mary Church Terrell'
],
Happy start to the day. Thank you.
hey @spencermountain - in relation to this issue, wanted to quickly ask if you can think of any easy methodology or option to remove periods from the .people(), .places(), .organizations() extractors? I'm noting you have .clauses(), but was hoping for some insight on how you'd approach this if needed.
I'm noting that periods always seem to be included w/ these extractors when they fall at the end of a sentence. ex:
const prompt =
'Hello my name is John Doe. My email is john@gmail.com. I live in New York. Jane Smith also works at my company as the chief operating officer and lives in New Jersey. Our company is Smith & Doe LLC.';
const processed = Nlp(prompt);
console.log('people:', processed.people().out());
// ['John Doe.', 'Jane Smith', 'John Doe', 'Jane Smith']
(also depicted in your documentation: https://observablehq.com/@spencermountain/topics-named-entity-recognition)
Whereas I'd love the option to make the output:
// ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith']
That way, my tokenization utilities applied afterward can correctly identify the same token & correctly locate offsets w/o periods (and better align entities in ML models).
Full context here: I'm experimenting w/ this NER along with a few others and seeing some inconsistency, w/ some including and others not including the period. Currently using the nlp.js builtin-compromise here for extracting entities: https://github.com/axa-group/nlp.js/blob/master/packages/builtin-compromise/src/builtin-compromise.js#L106-L108
const { dockStart } = require('@nlpjs/basic');
const { BuiltinCompromise } = require('@nlpjs/builtin-compromise');
const dock = await dockStart({
settings: {
nlp: {
forceNER: true,
languages: ['en'],
},
},
use: ['Basic', 'LangEn'],
});
// Register Builtins
const ner = dock.get('ner');
ner.container.register('extract-builtin-??', new BuiltinMicrosoft(), true);
const builtinCompromise = new BuiltinCompromise({
enable: [
'hashtags',
'person',
'place',
'organization',
'email',
'phonenumber',
'date',
'url',
'number',
'dimension',
],
});
ner.container.register('extract-builtin-??', builtinCompromise, true);
const manager = dock.get('nlp');
const response: INlpJsEntityResponse = await manager.process(text);
const entities = response.entities.map((entity) => ({
text: entity.utteranceText,
type: entity.entity.split('_')[0], // e.g., 'person' from 'person_1' (ignoring the type count provided from nlp.js)
startOffset: entity.start,
// nlp.js plugins report `end` as the index of the last character (inclusive), so add 1 for an exclusive end offset
endOffset: entity.end + 1,
confidenceScore: entity.accuracy,
}));
Any thoughts or suggestions?
It would seem we need some additional grammatical handling here for .sentences() and .clauses() as well ;-)
hey Patrick, you can print the matches off with any text options you'd like. There are some janky default choices about when to include sentence-end punctuation in the text output, which you can always override with config.
I would do something like this:
const prompt = 'Hello my name is John Doe. My email is john@gmail.com. I live in New York. Jane Smith also works at my company as the chief operating officer and lives in New Jersey. Our company is Smith & Doe LLC.';
let opts = {trim:true, keepPunct:false}
const processed = nlp(prompt).people();
processed.forEach(person => {
console.log(person.text(opts))
})
cheers
hey @spencermountain ! Really appreciate your work on this library and this response.
I understand that maybe this could be used in the internals of nlp.js's built-in here: https://github.com/axa-group/nlp.js/blob/master/packages/builtin-compromise/src/builtin-compromise.js#L106-L108
For now though, I did some post-processing here using your .sentences() helper:
// Returns an array of sentence objects **that are statements** with start and end offsets
export function promptStatementSentences(
text: string
): ICompromiseSentenceMetaExtended[] {
const processed = compromise(text);
const sentences = processed.sentences().json();
let startingAt = 0;
return sentences.map((sM: ICompromiseSentenceMeta) => {
const sentenceStart = text.indexOf(sM.text, startingAt);
const sentenceEnd = sentenceStart + sM.text.length;
startingAt = sentenceEnd;
return {
...sM,
start: sentenceStart,
end: sentenceEnd,
};
});
}
/**
* Applies some post processing to nlp.js entities response to format in a consistent manner with
* other entity recognition clients.
* @param text original prompt text
* @param entities raw nlp.js entities response
*/
export function processNlpjsEntitiesResponse(
text: string,
rawEntities: INlpJsEntityResponse['entities']
) {
const entities: Entity[] = [];
const sentences = promptStatementSentences(text);
for (const entity of rawEntities) {
if (entity.utteranceText.endsWith('.')) {
const sentence = sentences.find(
(s) => s.start <= entity.start && s.end >= entity.end
);
if (sentence && sentence.text.endsWith(entity.utteranceText)) {
// check if this entity is at the end of a sentence and the period should be omitted
entity.utteranceText = entity.utteranceText.slice(0, -1);
}
}
entities.push({
text: entity.utteranceText,
type: entity.entity.split('_')[0], // e.g., 'person' from 'person_1' (ignoring the type count provided from nlp.js)
startOffset: entity.start,
endOffset: entity.start + entity.utteranceText.length,
confidenceScore: entity.accuracy,
});
}
return entities;
}
...
// code mentioned above
const ner = dock.get('ner');
// ner.container.register('extract-builtin-??', new BuiltinDefault(), true);
ner.container.register('extract-builtin-??', new BuiltinMicrosoft(), true);
const builtinCompromise = new BuiltinCompromise({
enable: [
'hashtags',
'person',
'place',
'organization',
'email',
'phonenumber',
'date',
'url',
'number',
'dimension',
],
});
const response: INlpJsEntityResponse = await manager.process(text);
//apply post-processing
return processNlpjsEntitiesResponse(text, response.entities);
This seems to work pretty well when entities are at the end of the sentence (as long as they are not abbreviated).
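The abbreviation caveat could be handled with a small guard before the period is sliced off. The following is only a minimal sketch: `trimSentencePeriod` and the `KEEP_PERIOD` list are hypothetical names invented for illustration, and the list itself is an assumption that would need to be extended for real data.

```typescript
// Hypothetical, illustrative list of abbreviations whose final period
// belongs to the token itself. Not taken from compromise or nlp.js.
const KEEP_PERIOD = new Set(['calif.', 'u.s.', 'inc.', 'jr.', 'sr.']);

// Drops a trailing period from an entity string unless the last word
// is a known abbreviation, in which case the text is returned unchanged.
function trimSentencePeriod(entityText: string): string {
  if (!entityText.endsWith('.')) {
    return entityText;
  }
  const lastWord = entityText.split(/\s+/).pop() || '';
  if (KEEP_PERIOD.has(lastWord.toLowerCase())) {
    return entityText;
  }
  return entityText.slice(0, -1);
}

console.log(trimSentencePeriod('John Doe.')); // 'John Doe'
console.log(trimSentencePeriod('Los Angeles, Calif.')); // 'Los Angeles, Calif.'
```

Because the end offset is derived from the trimmed text length, a guard like this would keep offsets consistent with whichever form of the entity is emitted.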
One other challenge I'm definitely noting with this approach is how the .sentences() and .clauses() helpers handle sentence endings and abbreviations.
In some examples they don't seem to detect the end of the clause/sentence when a single period both completes the abbreviation and ends the sentence:
a.m.: “Her surgery is scheduled for Wednesday at 10:30 a.m.”
p.m.: “Please be home for dinner by 6:15 p.m.”
U.S.: “I grew up in various countries, but I have spent most of my life in the U.S.”
etc. (U.S. style): “Over the years I have had different pets: a dog, cats, chameleon, turtle, bunny, hermit crab, fish, etc.”
Calif.: “Shaun used to live in Los Angeles, Calif.”
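One possible heuristic for these cases, sketched below, is to treat a period after a known abbreviation as a sentence end only when it is followed by the end of the text or by a capitalized word. This is not how compromise works internally; `splitSentences` and the `ABBREVIATIONS` list are hypothetical names used for illustration only.

```typescript
// Hypothetical, illustrative abbreviation list. Not taken from compromise.
const ABBREVIATIONS = new Set(['a.m.', 'p.m.', 'u.s.', 'etc.', 'calif.']);

// Splits text into sentences. A period that closes a known abbreviation
// counts as a sentence end only at the end of the text or before a
// capital letter or opening quote.
function splitSentences(text: string): string[] {
  const sentences: string[] = [];
  // Candidate boundary: punctuation, optional closing quotes, then space or end.
  const candidate = /[.!?]+["'”’)]*(\s+|$)/g;
  let start = 0;
  let match: RegExpExecArray | null;
  while ((match = candidate.exec(text)) !== null) {
    const end = match.index + match[0].replace(/\s+$/, '').length;
    const upToHere = text.slice(start, end);
    const lastToken = upToHere.split(/\s+/).pop() || '';
    const bare = lastToken.replace(/["'”’)]+$/, '').toLowerCase();
    const atEnd = candidate.lastIndex >= text.length;
    const next = text.charAt(candidate.lastIndex);
    const nextStartsSentence = atEnd || /[A-Z"“]/.test(next);
    if (ABBREVIATIONS.has(bare) && !nextStartsSentence) {
      continue; // abbreviation in mid-sentence, keep scanning
    }
    sentences.push(upToHere.trim());
    start = candidate.lastIndex;
  }
  const rest = text.slice(start).trim();
  if (rest) {
    sentences.push(rest);
  }
  return sentences;
}

const demo = splitSentences(
  'Please be home for dinner by 6:15 p.m. Her surgery is at 10:30 a.m. on Wednesday. I live in the U.S.'
);
console.log(demo.length); // 3
```

The heuristic has an obvious limitation: an abbreviation followed by a capitalized word in mid-sentence (for example "U.S. Army") would be split incorrectly, so it is a starting point rather than a complete solution.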
Great library!
Loving the various entity extraction utilities. They work great. One I use is people(). However, it seems to be unable to separate a list of comma-separated names into individual names, at least in this case. This is what I'm seeing [ NodeJs 22, OSX, "compromise": "^14.13.0" ]:
As a side note, you can also see it isn't catching W.E.B Du Bois, but that seems a complex pattern, and prob best here would be to add it to the custom lexicon, I'm guessing. Thanks again for compromise!