spencermountain / compromise

modest natural-language processing
http://compromise.cool
MIT License

is_plural again #38

Closed redaktor closed 9 years ago

redaktor commented 9 years ago

Hm - the last commit does not work properly because

in pluralize_rules we have rules for both singular to plural AND plural to plural, while in singularize_rules it is only plural to singular

(???)

In general I am working on a factory method called "dictionary", based on the "words" and "rules", which our database can auto-translate into several languages, covering the n-grams, metrics, etc.

spencermountain commented 9 years ago

yeah, i agree. 'towns'.pluralize() should be 'towns'. gimme a sec.

redaktor commented 9 years ago

yeah, maybe I got it already - we are doing the same now ;) In general it must be [delete/rollback my commit] - I am deeply sorry. singularize_rules SHOULD cover 'plural to singular' AND 'singular to singular'.
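A minimal sketch of what "covering both directions" could look like, so that 'towns'.pluralize() stays 'towns' and singularizing a singular is a no-op. The rule tables and the `applyRules` helper below are hypothetical and heavily reduced; they are not the actual pluralize_rules / singularize_rules from the repo.

```javascript
// Hypothetical toy rule tables; the first matching rule wins.
var pluralizeRules = [
  [/([^s])s$/i, '$1s'],        // plural -> plural: "towns" stays "towns"
  [/([^aeiou])y$/i, '$1ies'],  // singular -> plural: "city" -> "cities"
  [/$/, 's']                   // singular -> plural: default, append "s"
];

var singularizeRules = [
  [/(ss|us|is)$/i, '$1'],      // singular -> singular: "bus" stays "bus"
  [/([^aeiou])ies$/i, '$1y'],  // plural -> singular: "cities" -> "city"
  [/([^s])s$/i, '$1']          // plural -> singular: "towns" -> "town"
];

// Apply the first rule whose pattern matches; return the word unchanged otherwise.
function applyRules(word, rules) {
  for (var i = 0; i < rules.length; i++) {
    if (rules[i][0].test(word)) {
      return word.replace(rules[i][0], rules[i][1]);
    }
  }
  return word;
}

function pluralize(word) { return applyRules(word, pluralizeRules); }
function singularize(word) { return applyRules(word, singularizeRules); }
```

With these toy tables, pluralize and singularize are both idempotent for the covered cases; words like 'bus' are still handled wrongly by the first pluralize rule, which is exactly the kind of case the real rule lists have to deal with.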

spencermountain commented 9 years ago

;) no problem. can you check out the current version? I've added some towns.pluralize()==towns tests and they look so-far-so-good

redaktor commented 9 years ago

:+1: will check. If you've got an additional minute: This is a WIP proposal of the "dictionary" I mentioned above. Basically this could be a helper for development. It is not complete. Please read the TODO comments and see the generator functions at the end. If I add 'rules' the same way as 'words', then logic and (language- or context-specific) data are fully separated for development. It is a simple generator function where we could generate the data module part with... But now it is easy for our "machines" ;) What do you think?

redaktor commented 9 years ago

@spencermountain Yep, works. I need to correct the dictionary for the nouns a bit, and now there are some dups already covered by rule:

[
    ['move', 'moves'],
    ['photo', 'photos'],
    ['video', 'videos'],
    ['rodeo', 'rodeos'],
    ['stomach', 'stomachs'],
    ['shoe', 'shoes'],
    ['epoch', 'epochs'],
    ['zero', 'zeros'],
    ['avocado', 'avocados'],
    ['halo', 'halos'],
    ['tornado', 'tornados'],
    ['tuxedo', 'tuxedos'],
    ['sombrero', 'sombreros']
];
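Such duplicates could be found mechanically. A rough sketch, assuming a rule-based pluralizer is available as a function; `pluralizeByRule` here is a hypothetical stand-in that only appends "s", not the library's real rule set.

```javascript
// Hypothetical stand-in for the rule-based pluralizer.
function pluralizeByRule(word) {
  return word + 's';
}

// Keep only the pairs whose plural the rules do NOT already produce.
function dropCoveredByRule(pairs, pluralize) {
  return pairs.filter(function (pair) {
    return pluralize(pair[0]) !== pair[1];
  });
}

var pairs = [
  ['photo', 'photos'],
  ['child', 'children'],
  ['shoe', 'shoes']
];
var remaining = dropCoveredByRule(pairs, pluralizeByRule);
```

Here only the 'child' pair would remain in the dictionary, since the other two are reproduced by the stand-in rule.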

The dictionary also adds some more irregular plurals not covered by rule and compresses them by replacing the plural by singular.slice(0,-2) ...
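The compression step could be sketched roughly like this, assuming the markers '=' (the whole singular) and '_' (the singular minus its last two characters) used in the dictionary; the function names `compress` and `expand` are made up for illustration, and the extra '$' shorthand for 'es' is left out.

```javascript
// Encode a plural relative to its singular (hypothetical helper).
function compress(singular, plural) {
  var stem = singular.slice(0, -2);
  if (plural.indexOf(singular) === 0) {
    // Plural extends the singular: "child" / "children" -> "=ren"
    return '=' + plural.slice(singular.length);
  }
  if (stem && plural.indexOf(stem) === 0) {
    // Plural shares everything but the last two letters: "phenomenon" / "phenomena" -> "_a"
    return '_' + plural.slice(stem.length);
  }
  // Nothing useful in common: store the plural as-is ("person" / "people")
  return plural;
}

// Decode the shorthand back into the full plural.
function expand(singular, code) {
  return code.replace('=', singular).replace('_', singular.slice(0, -2));
}
```

A round trip such as expand('leaf', compress('leaf', 'leaves')) should give back the original plural, as long as neither word contains '=' or '_' itself.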

redaktor commented 9 years ago

updated dictionary, haven't looked at what could go to rules (the other way around), checking now ... of course the dictionary generators could become gruntified themselves. As said, it is work in progress. The replace function will get a shorter syntax; for now, e.g., irregular plurals look like

/* singular nouns having irregular plurals */
var lang = 'en'; // language code as a string; a bare `en` would be an undefined identifier
var noun_irregulars = (function() {
  var zip = [ [ 'child', '=ren' ],
  [ 'person', 'people' ],
  [ 'leaf', '_av$' ],
  [ 'database', '=s' ],
  [ 'quiz', '=z$' ],
  [ 'goose', 'ge$e' ],
  [ 'phenomenon', '_a' ],
  [ 'barracks', '=' ],
  [ 'deer', '=' ],
  [ 'syllabus', '_i' ],
  [ 'index', '_ic$' ],
  [ 'appendix', '_ic$' ],
  [ 'criterion', '_a' ],
  [ 'i', '_we' ],
  [ 'man', '_en' ],
  [ 'she', 'they' ],
  [ 'he', '_t=y' ],
  [ 'myself', 'ourselv$' ],
  [ 'yourself', '_lv$' ],
  [ 'himself', 'themselv$' ],
  [ 'herself', 'themselv$' ],
  [ 'themself', '_lv$' ],
  [ 'mine', 'ours' ],
  [ 'hers', 't_irs' ],
  [ 'his', 't_eirs' ],
  [ 'its', 'the_rs' ],
  [ 'theirs', '=' ],
  [ 'sex', '=e_' ],
  [ 'narrative', '=s' ],
  [ 'addendum', '_a' ],
  [ 'alga', '=e' ],
  [ 'alumna', '=e' ],
  [ 'alumnus', '_i' ],
  [ 'bacillus', '_i' ],
  [ 'beau', '=x' ],
  [ 'cactus', '=$' ],
  [ 'château', '=x' ],
  [ 'corpus', '_ora' ],
  [ 'curriculum', '_a' ],
  [ 'die', '_ice' ],
  [ 'echo', '=$' ],
  [ 'embargo', '=$' ],
  [ 'foot', 'feet' ],
  [ 'formula', '=s' ],
  [ 'genus', '_era' ],
  [ 'graffito', '_ti' ],
  [ 'hippopotamus', '_i' ],
  [ 'larva', '=e' ],
  [ 'libretto', '_ti' ],
  [ 'loaf', '_av$' ],
  [ 'matrix', '_ic$' ],
  [ 'memorandum', '_a' ],
  [ 'mosquito', '=$' ],
  [ 'opus', '_era' ],
  [ 'ovum', '_a' ],
  [ 'ox', '_=en' ],
  [ 'radius', '=$' ],
  [ 'referendum', '_a' ],
  [ 'tableau', '=x' ],
  [ 'that', '_ose' ],
  [ 'that', '_$$' ],
  [ 'thief', '_ev$' ],
  [ 'this', '_$e' ],
  [ 'tooth', 'teeth' ],
  [ 'vita', '=e' ] ]; 

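  // Expand the shorthand: '=' is the full singular,
  // '_' is the singular minus its last two characters, '$' is 'es'.
  var main = zip.map(function (arr) {
    arr[1] = arr[1]
      .replace('=', arr[0])
      .replace('_', arr[0].slice(0, -2))
      .replace(/\$/g, 'es');
    return arr;
  });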
  if (typeof module !== "undefined" && module.exports) module.exports = main;

  return main;
})();

spencermountain commented 9 years ago

oh, this idea for pulling these out into a file is good. They can go in the lexicon. Good one!

redaktor commented 9 years ago

cool. Please note that the $ replacer in the zip function must be escaped in regexes, as in the zip.map function above. It might be better to use another small special character. Note to me: think before you code ;)
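One way this could look, as a sketch only: pick a placeholder with no special meaning in regexes or replacement strings and expand it with split/join, so nothing needs escaping. The '~' character and the `expandEntry` name are assumptions for illustration, not something settled in this thread.

```javascript
// Hypothetical variant of the expansion step using '~' instead of '$' for 'es'.
function expandEntry(singular, code) {
  return code
    .replace('=', singular)               // '=' -> the full singular
    .replace('_', singular.slice(0, -2))  // '_' -> singular minus its last two characters
    .split('~').join('es');               // every '~' -> 'es', no regex involved
}

var examples = [
  ['leaf', '_av~'],
  ['quiz', '=z~'],
  ['goose', 'ge~e']
].map(function (arr) {
  return [arr[0], expandEntry(arr[0], arr[1])];
});
```

The split/join form replaces every occurrence, like the /g flag does, while keeping the placeholder out of regex syntax entirely.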