Open MarketingPip opened 1 year ago
@spencermountain - proposed solution to this:
Have separate list of words for things that are
When the lexicon is merged, add a rule such as #UPPERCASE_ONLY.
This will help with tagging things like the "US" in "He lives in the US" as a place,
or words such as "House" when used as a proper noun.
If you want to do matches based on case, I recommend using the @methods, like:
nlp('House of Commons').match('(house && @isTitleCase) of commons')
nlp('in the US').match('(us && @isUpperCase)')
I know that's a little awkward. I'm not keen to make a change to the lexicon, as it would be a breaking change.
@spencermountain - not ideal. This should be done for the whole project. Even if it makes a breaking change, it will be well worth it.
This code isn't complete, but it gives a rough idea of what I am saying we should do.
// Normalise a lexicon entry (a single tag or an array of tags) to an array.
function toArray(tags) {
  return Array.isArray(tags) ? tags : [tags];
}

function mergeLexiconLists(list1, list2) {
  const mergedLexicon = { ...list1 };
  for (const word of Object.keys(list2)) {
    const lowercaseWord = word.toLowerCase();
    if (Object.prototype.hasOwnProperty.call(mergedLexicon, lowercaseWord)) {
      const newTags = toArray(list2[word]);
      const mergedCategories = new Set([
        ...toArray(mergedLexicon[lowercaseWord]),
        ...newTags,
      ]);
      // Record the casing of the list2 key as a suffixed tag.
      // Note: an all-caps key such as "US" also passes the title-case check,
      // so it receives both suffixes.
      if (isFirstLetterUpperCase(word)) {
        for (const tag of newTags) {
          mergedCategories.add(`${tag}_titleCase`);
        }
      }
      if (isAllCaps(word)) {
        for (const tag of newTags) {
          mergedCategories.add(`${tag}_UPPERCASE`);
        }
      }
      mergedLexicon[lowercaseWord] = Array.from(mergedCategories);
    } else {
      // No existing entry: store the tags under the lowercased key as-is.
      mergedLexicon[lowercaseWord] = list2[word];
    }
  }
  return mergedLexicon;
}
// True when the string contains at least one cased letter and no lowercase letters.
function isAllCaps(str) {
  return str === str.toUpperCase() && str !== str.toLowerCase();
}

// True when the first character is an uppercase letter (digits and symbols do not count).
function isFirstLetterUpperCase(str) {
  const first = str.charAt(0);
  return first === first.toUpperCase() && first !== first.toLowerCase();
}
const lexicon1 = {
  apple: 'Fruit',
  a: 'Fruit',
  house: ['Verb'],
  us: ['#Verb'],
  world: ['noun'],
};
const lexicon2 = {
  amazing: ['#Test'],
  Apple: ['Noun'],
  House: ['#Noun'],
  US: ['#Place'],
  hello: ['#Tests'],
};
const mergedLexicon = mergeLexiconLists(lexicon1, lexicon2);
console.log(mergedLexicon);
If you want to hack on that / end up hacking on it, send me a copy back haha!
But this will substantively help Compromise.js tag words better, while keeping the data the EXACT same size (aside from 3 tags). Think how much this will help the rule set and lots more.
We will have to re-tag words (see their meaning when used as title case / upper case). Plus, this will help SO much with acronyms and MUCH more.
Before dismissing this HUGELY needed feature, think of the enhancements it will bring.
Plus, think how useful #Place_titleCase would be for other rules, etc.
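To make the "useful for other rules" point concrete, here is a minimal sketch of how a rule could read the suffixed tags back for a given token. Everything in it is an assumption for illustration: `casingOf`, `splitTag` and `resolveTags` are hypothetical helpers, not part of Compromise, and the `_titleCase` / `_UPPERCASE` suffixes simply follow the naming used in this thread.

```javascript
// Hypothetical helper: classify the casing of a single token.
function casingOf(token) {
  const hasLetters = token.toUpperCase() !== token.toLowerCase();
  if (!hasLetters) return 'none';
  if (token.length > 1 && token === token.toUpperCase()) return 'UPPERCASE';
  const first = token.charAt(0);
  return first !== first.toLowerCase() ? 'titleCase' : 'lower';
}

// Suffixes assumed from this thread's proposal.
const CASE_SUFFIXES = ['_titleCase', '_UPPERCASE'];

// Split '#Place_UPPERCASE' into { base: '#Place', casing: 'UPPERCASE' }.
function splitTag(tag) {
  for (const suffix of CASE_SUFFIXES) {
    if (tag.endsWith(suffix)) {
      return { base: tag.slice(0, -suffix.length), casing: suffix.slice(1) };
    }
  }
  return { base: tag, casing: null };
}

// Return the tags that apply to this token's casing.
// If a tag has a cased variant, its plain form is treated as case-restricted.
function resolveTags(tags, token) {
  const parsed = tags.map(splitTag);
  const casing = casingOf(token);
  const cased = parsed.filter((t) => t.casing === casing).map((t) => t.base);
  if (cased.length > 0) return [...new Set(cased)];
  const restricted = new Set(parsed.filter((t) => t.casing).map((t) => t.base));
  return parsed
    .filter((t) => !t.casing && !restricted.has(t.base))
    .map((t) => t.base);
}

// Assumed merged entry for "us" (a '#Verb' entry merged with a 'US: #Place' entry).
const usEntry = ['#Verb', '#Place', '#Place_titleCase', '#Place_UPPERCASE'];
console.log(resolveTags(usEntry, 'US')); // [ '#Place' ]
console.log(resolveTags(usEntry, 'us')); // [ '#Verb' ]
```

The design choice here is that a plain tag with a cased sibling only applies when the casing matches, so lowercase "us" no longer picks up #Place. Sentence-initial capitals would still need separate handling.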
@spencermountain - see this! Page 133 (PDF) - here
Explains those POS rules I referenced earlier.
As well, all the data / rules can be found here.
That PDF might change your mind about doing something like this for the old issue / feature request I made here.
Taken from PDF source.
Two additional lexicons exist - one for texts in all uppercase (lexicon cap), and
one for texts in all lowercase (lexicon lower).
I would think this would ease some pain, instead of writing rules based on context to match...
And it would solve some old issues, and more than likely some currently persisting ones like this one.
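As a rough illustration of the quoted idea of extra lexicons for all-uppercase and all-lowercase text, the sketch below picks a lexicon from the casing of the whole input. This is only an assumption about how such a selection could work, not how the quoted source or Compromise does it; `pickLexicon` and the sample lexicon contents are made up.

```javascript
// Hypothetical: choose a lexicon based on the casing of the whole text.
function pickLexicon(text, lexicons) {
  // Keep letters only, so digits and punctuation do not affect the decision.
  const letters = text.replace(/[^\p{L}]/gu, '');
  const upper = letters.toUpperCase();
  const lower = letters.toLowerCase();
  // No cased letters at all (empty, or a caseless script): use the default.
  if (upper === lower) return lexicons.standard;
  if (letters === upper) return lexicons.cap;
  if (letters === lower) return lexicons.lower;
  return lexicons.standard;
}

// Made-up sample data, for illustration only.
const lexicons = {
  standard: { US: 'Place', us: 'Pronoun' },
  cap: { US: ['Place', 'Pronoun'] }, // casing carries no signal here
  lower: { us: ['Pronoun', 'Place'] }, // likewise
};

console.log(pickLexicon('He lives in the US', lexicons) === lexicons.standard); // true
console.log(pickLexicon('HE LIVES IN THE US', lexicons) === lexicons.cap); // true
console.log(pickLexicon('he lives in the us', lexicons) === lexicons.lower); // true
```

The point of the split is that case is only trusted as evidence when the text is mixed-case to begin with.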
ps; enjoy your weekend. 🥂
Hoping this gets done, but it will be a big enough task.
Would be nice to have support added for this:
Some words like "house" can represent a different meaning when title-cased.
Example:
In "House of Commons", house is used as a proper noun, whereas in "This is my house and my family's ancestral home" it is used as a noun.
This would help improve the part-of-speech tagger big time. It would also be useful for things like country codes, where US would currently be detected in "then there was two of us" with the current tagger.
I think this would be way easier than having regex plugins do matches for things like this.
Ideally, we would then re-tag all the title-cased words in the current dataset.
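A minimal sketch of the request above, assuming a small case-sensitive list that is checked before the normal lowercased lexicon. The `lookupTag` helper, both lists, and the tag names are hypothetical and only illustrate the lookup order; this is not Compromise's current behaviour.

```javascript
// Hypothetical: check an exact-case list first, then fall back to the
// regular lexicon, which is keyed by lowercase words.
function lookupTag(word, caseSensitiveList, lexicon) {
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  if (has(caseSensitiveList, word)) return caseSensitiveList[word];
  const lower = word.toLowerCase();
  return has(lexicon, lower) ? lexicon[lower] : null;
}

// Illustrative data only.
const caseSensitiveList = { US: 'Place', House: 'ProperNoun' };
const lowercaseLexicon = { us: 'Pronoun', house: 'Noun' };

console.log(lookupTag('US', caseSensitiveList, lowercaseLexicon)); // Place
console.log(lookupTag('us', caseSensitiveList, lowercaseLexicon)); // Pronoun
console.log(lookupTag('House', caseSensitiveList, lowercaseLexicon)); // ProperNoun
console.log(lookupTag('house', caseSensitiveList, lowercaseLexicon)); // Noun
```

One caveat with this sketch: a sentence-initial "House" is title-cased for a different reason, so a real tagger would probably need to treat the first word of a sentence separately.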