pelias / parser

natural language classification engine for geocoding
https://parser.demo.geocode.earth
MIT License

pluralize words in place dictionary #119

Closed missinglink closed 4 years ago

missinglink commented 4 years ago

This PR adds a function which optionally pluralizes a dictionary of words. This is useful for cases like Foo Hotels and Homes, where the terms would otherwise not be classified as 'place'.
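As a rough illustration of the idea (not the code from this PR), the sketch below expands a word list with plural forms. `naivePlural` and `withPlurals` are hypothetical names, and `naivePlural` is only a tiny stand-in for a real pluralizer such as the npm `pluralize` package.

```javascript
// Hypothetical stand-in for a real pluralizer such as the npm `pluralize` package.
// It only handles a few regular English endings.
function naivePlural(word) {
  if (/(s|x|z|ch|sh)$/i.test(word)) return word + 'es';
  if (/[^aeiou]y$/i.test(word)) return word.slice(0, -1) + 'ies';
  return word + 's';
}

// Return a new Set containing every original word plus its plural form.
function withPlurals(words, pluralizeFn = naivePlural) {
  const out = new Set(words);
  for (const word of words) {
    out.add(pluralizeFn(word));
  }
  return out;
}

const places = withPlurals(['hotel', 'home', 'church', 'bakery']);
console.log([...places]);
```

Passing the pluralizer in as a parameter keeps the expansion step independent of whichever library ends up being chosen.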

The library I chose is https://github.com/plurals/pluralize, mainly because it seems quite popular; I'm open to alternatives. One likely problem with this library is that it probably only implements English word rules.

@Joxit maybe we can add French and others too?

missinglink commented 4 years ago

How do pluralizers work for languages with gender-based rules? Do they require the full dictionary with corresponding genders to be accurate?

Joxit commented 4 years ago

I think, for French, I should add all plurals by hand... This lang is so... :sweat:

You are right, this is only for English (or languages with English-like grammar). I tried it for the place bureau, whose plural is bureaux (/eau$/ => 'eaux'). But for places like bureau de change, the plural goes on bureau, so the result should be bureaux de change (=> currency exchange).

var pluralize = require("pluralize")
pluralize.addPluralRule(/eau$/i, 'eaux')
pluralize('bureau') // bureaux OK
pluralize('bureau de change') // bureau de changes NOT OK

I also tried pluralize-fr and french-words, but they fail on bureaux de change: they pluralize each word (=> bureaux des changes)...
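One possible workaround for phrases like bureau de change, sketched here under the assumption that only the first word (the head noun) takes the plural, is to pluralize that word and leave the rest untouched. `pluralizeFrenchWord` and `pluralizeHeadNoun` are hypothetical helpers with a deliberately tiny rule set, not part of any of the libraries mentioned.

```javascript
// Hypothetical single-word rules; a real implementation could instead delegate
// to pluralize() after registering rules with pluralize.addPluralRule().
function pluralizeFrenchWord(word) {
  if (/eau$/i.test(word)) return word + 'x'; // bureau => bureaux
  if (/[sxz]$/i.test(word)) return word;     // already ends in s, x or z
  return word + 's';
}

// Pluralize only the first token of a phrase and keep the remainder as-is.
function pluralizeHeadNoun(phrase, pluralizeWord = pluralizeFrenchWord) {
  const [head, ...rest] = phrase.split(' ');
  return [pluralizeWord(head), ...rest].join(' ');
}

console.log(pluralizeHeadNoun('bureau de change')); // bureaux de change
console.log(pluralizeHeadNoun('bureau'));           // bureaux
```

This heuristic will be wrong for compounds where more than one word takes the plural, so it would still need a review of the generated terms.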

IMO the safest way is to add all plurals to the pelias dictionary place_names.txt, or to a new file place_names.plural.txt (at least for French). I know we do not have enough knowledge of all languages to cover the world... :disappointed:
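A minimal sketch of the hand-maintained approach, assuming both files hold one term per line and that lines starting with `#` are comments (the actual dictionary format is not confirmed here). The file contents are inlined as strings to keep the example self-contained, and `parseDictionary` is a hypothetical helper.

```javascript
// Parse a newline-delimited dictionary, skipping blank lines and '#' comments
// (assumed format).
function parseDictionary(text) {
  return text
    .split('\n')
    .map((line) => line.trim().toLowerCase())
    .filter((line) => line.length > 0 && !line.startsWith('#'));
}

// Stand-ins for the contents of place_names.txt and place_names.plural.txt.
const singularText = 'bureau\nbureau de change\n';
const pluralText = '# hand-written plurals\nbureaux\nbureaux de change\n';

// Merge both lists into a single lookup set.
const frenchPlaces = new Set([
  ...parseDictionary(singularText),
  ...parseDictionary(pluralText),
]);

console.log(frenchPlaces.has('bureaux de change')); // true
```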

So maybe we can use this lib for English (after a review of the generated places?) but not for all languages :confused:

missinglink commented 4 years ago

I wrote the inverse of this a few years ago to singularize words using English grammar rules: https://github.com/pelias/analysis/blob/master/test/tokenizer/singular.js

I can probably port these tests across and invert them.

missinglink commented 4 years ago

Added some tests via https://github.com/pelias/parser/pull/119/commits/111bee75caef77f1ff92bc8f956b9d23a31f8b10 including tests to ensure that the plurals are not generated for non-English tokens.
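The behaviour those tests check can be illustrated roughly as follows. This is a sketch under assumptions, not the code from the linked commit: `buildPlaceIndex`, `PLURALIZABLE_LANGS` and `appendS` are hypothetical names, and the pluralizer is a trivial stand-in.

```javascript
// Languages for which automatic pluralization is considered safe
// (assumption: English only).
const PLURALIZABLE_LANGS = new Set(['en']);

// `dictionaries` maps a language code to its list of terms; `pluralizeFn` is any
// single-word pluralizer, for example the function exported by npm `pluralize`.
function buildPlaceIndex(dictionaries, pluralizeFn) {
  const index = new Set();
  for (const [lang, words] of Object.entries(dictionaries)) {
    for (const word of words) {
      index.add(word);
      if (PLURALIZABLE_LANGS.has(lang)) {
        index.add(pluralizeFn(word));
      }
    }
  }
  return index;
}

// Trivial stand-in pluralizer, used only to keep the sketch self-contained.
const appendS = (word) => word + 's';

const placeIndex = buildPlaceIndex({ en: ['hotel'], fr: ['bureau'] }, appendS);
console.log(placeIndex.has('hotels'));  // true
console.log(placeIndex.has('bureaus')); // false
```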

The npm pluralize library is actually pretty good for English. Unfortunately there are some ambiguous words such as "staff", which can pluralize to "staff" (for staff members) or "staves" (for several walking/fighting sticks).

I'm happy to merge this as-is and work on adding other languages in subsequent PRs.
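For ambiguous words like "staff", one option (an assumption, not something implemented in this PR) is a small override table that returns every accepted plural, so the dictionary can include all of them. `AMBIGUOUS_PLURALS` and `pluralForms` are hypothetical names.

```javascript
// Hypothetical override table for words with more than one valid plural.
const AMBIGUOUS_PLURALS = {
  staff: ['staff', 'staves'],
};

// Return all plural forms for a word: the overrides when present, otherwise
// the single result of the supplied pluralizer.
function pluralForms(word, pluralizeFn) {
  const key = word.toLowerCase();
  if (Object.prototype.hasOwnProperty.call(AMBIGUOUS_PLURALS, key)) {
    return AMBIGUOUS_PLURALS[key];
  }
  return [pluralizeFn(word)];
}

// Trivial stand-in pluralizer for the example.
const addS = (word) => word + 's';

console.log(pluralForms('staff', addS)); // [ 'staff', 'staves' ]
console.log(pluralForms('hotel', addS)); // [ 'hotels' ]
```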