olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
Support more languages #16

Closed gmarty closed 10 years ago

gmarty commented 11 years ago

I know the web is English-centric, but it would be nice to support more languages through a plugin system. Fortunately, there are several tools out there for tokenization and stemming that can be used:

I don't really have the time to work on this feature, but I'd like to see lunr.js going this way.

olivernn commented 11 years ago

Sorry for the late reply.

Support for languages other than English is definitely something that I want lunr.js to be able to do. Currently the stemmer and stop word filter are quite easily replaceable. They are simply functions that are added into a pipeline.

To get an index with an empty pipeline, skip the convenience wrapper and construct the index directly:

var idx = new lunr.Index

idx.field('title')

idx.pipeline.add(customStemmer)

You can also remove existing pipeline functions:

idx.pipeline.remove(lunr.stemmer)

I wasn't aware of those libraries. I'll definitely take a look at them and see how their stemmers etc. compare. I'm positive that improvements can be made to the ones currently included in lunr.js.
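To make the idea of "simply functions that are added into a pipeline" concrete, here is a minimal sketch. It assumes a pipeline function receives a token and returns either a (possibly changed) token or `undefined` to drop it. The names `customStemmer`, `dropShortTokens` and `runPipeline` are hypothetical, and `runPipeline` is only a simplified stand-in for illustration, not lunr's own code.

```javascript
// Hypothetical stand-in for the `customStemmer` mentioned above: it only
// strips a trailing plural "s". Real stemmers are far more involved.
function customStemmer(token) {
  var endsInSingleS = token.slice(-1) === 's' && token.slice(-2) !== 'ss';
  return token.length > 3 && endsInSingleS ? token.slice(0, -1) : token;
}

// Hypothetical filter: returning undefined removes the token (assumed contract).
function dropShortTokens(token) {
  return token.length < 2 ? undefined : token;
}

// Simplified model of running tokens through a list of functions.
// This is an illustration only, NOT lunr's implementation.
function runPipeline(fns, tokens) {
  var out = [];
  tokens.forEach(function (token, i) {
    for (var j = 0; j < fns.length && token !== undefined; j++) {
      token = fns[j](token, i, tokens);
    }
    if (token !== undefined) out.push(token);
  });
  return out;
}

var result = runPipeline([dropShortTokens, customStemmer],
                         ['a', 'cats', 'glass', 'dog']);
console.log(result);
```

With a real index, the same functions would be attached through `idx.pipeline.add(...)` as shown above.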

ssured commented 11 years ago

This is some code I used with the snowball stemmer library available at https://github.com/fortnightlabs/snowball-js/blob/master/stemmer/lib/Snowball.js. I prefer to work with ISO 639-1 two-letter language codes, so first we need a conversion table:

var isoToSnowball = {
  da:'Danish',
  nl:'Dutch',
  en:'English',
  fi:'Finnish',
  fr:'French',
  de:'German',
  hu:'Hungarian',
  it:'Italian',
  nn:'Norwegian',
  pt:'Portuguese',
  ro:'Romanian',
  ru:'Russian',
  es:'Spanish',
  sv:'Swedish',
  tr:'Turkish'
};

Now we can use the following code to stem in any of the above languages

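// `lang` is an ISO 639-1 code from the table above; `names` is an array of tokens
var stemmer = new Snowball(isoToSnowball[lang]);

names = names.map(function(token) {
  stemmer.setCurrent(token);
  if (stemmer.stem()) {
    return stemmer.getCurrent();
  } else {
    // keep the original token when stem() returns false
    return token;
  }
});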
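The mapping above could also be packaged as a per-token function, which is the shape a pipeline expects. This is a sketch under assumptions: `makeStemmingFunction` is a hypothetical helper, and `FakeStemmer` is a toy object that only mimics the `setCurrent` / `stem` / `getCurrent` methods used above so that the example runs without the Snowball library.

```javascript
// Wraps any object exposing setCurrent/stem/getCurrent (as the Snowball
// instance above does) into a function that stems a single token.
function makeStemmingFunction(stemmer) {
  return function (token) {
    stemmer.setCurrent(token);
    return stemmer.stem() ? stemmer.getCurrent() : token;
  };
}

// Toy stand-in for demonstration only: strips a trailing "ing".
function FakeStemmer() { this.current = ''; }
FakeStemmer.prototype.setCurrent = function (word) { this.current = word; };
FakeStemmer.prototype.stem = function () {
  if (/ing$/.test(this.current)) {
    this.current = this.current.slice(0, -3);
    return true;
  }
  return false;
};
FakeStemmer.prototype.getCurrent = function () { return this.current; };

var stem = makeStemmingFunction(new FakeStemmer());
var stemmed = ['walking', 'talk'].map(stem);
console.log(stemmed);
```

With the real library, the wrapper would presumably be used as `idx.pipeline.add(makeStemmingFunction(new Snowball(isoToSnowball[lang])))`.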

Another necessary step is stop word removal. Below is the stop word data that Snowball provides:

// BSD Licensed: http://snowball.tartarus.org/license.php
// downloaded from http://snowball.tartarus.org/algorithms/[danish,dutch,english,...]/stop.txt
// To get the data: copy the following command in the console and hit enter. Manually copy pasting is easy
// document.getElementsByTagName('pre')[0].innerText.split("\n").map(function(line){return line.split('|')[0].replace(/^\s+|\s+$/g, '')}).filter(function(word){return word!=''}).reduce(function(set,words){return set.concat(words.split(/\s+/g));},[]).join(' ')

var isoSnowballStopwords = {
  da:"og i jeg det at en den til er som på de med han af for ikke der var mig sig men et har om vi min havde ham hun nu over da fra du ud sin dem os op man hans hvor eller hvad skal selv her alle vil blev kunne ind når være dog noget ville jo deres efter ned skulle denne end dette mit også under have dig anden hende mine alt meget sit sine vor mod disse hvis din nogle hos blive mange ad bliver hendes været thi jer sådan".split(" ").map(replaceDiacritics),
  nl:"de en van ik te dat die in een hij het niet zijn is was op aan met als voor had er maar om hem dan zou of wat mijn men dit zo door over ze zich bij ook tot je mij uit der daar haar naar heb hoe heeft hebben deze u want nog zal me zij nu ge geen omdat iets worden toch al waren veel meer doen toen moet ben zonder kan hun dus alles onder ja eens hier wie werd altijd doch wordt wezen kunnen ons zelf tegen na reeds wil kon niets uw iemand geweest andere".split(" ").map(replaceDiacritics),
  en:"i me my myself we our ours ourselves you your yours yourself yourselves he him his himself she her hers herself it its itself they them their theirs themselves what which who whom this that these those am is are was were be been being have has had having do does did doing would should could ought i'm you're he's she's it's we're they're i've you've we've they've i'd you'd he'd she'd we'd they'd i'll you'll he'll she'll we'll they'll isn't aren't wasn't weren't hasn't haven't hadn't doesn't don't didn't won't wouldn't shan't shouldn't can't cannot couldn't mustn't let's that's who's what's here's there's when's where's why's how's a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very".split(" ").map(replaceDiacritics),
  fi:"olla olen olet on olemme olette ovat ole oli olisi olisit olisin olisimme olisitte olisivat olit olin olimme olitte olivat ollut olleet en et ei emme ette eivät minä minun minut minua minussa minusta minuun minulla minulta minulle sinä sinun sinut sinua sinussa sinusta sinuun sinulla sinulta sinulle hän hänen hänet häntä hänessä hänestä häneen hänellä häneltä hänelle me meidän meidät meitä meissä meistä meihin meillä meiltä meille te teidän teidät teitä teissä teistä teihin teillä teiltä teille he heidän heidät heitä heissä heistä heihin heillä heiltä heille tämä tämän tätä tässä tästä tähän tällä tältä tälle tänä täksi tuo tuon tuota tuossa tuosta tuohon tuolla tuolta tuolle tuona tuoksi se sen sitä siinä siitä siihen sillä siltä sille sinä siksi nämä näiden näitä näissä näistä näihin näillä näiltä näille näinä näiksi nuo noiden noita noissa noista noihin noilla noilta noille noina noiksi ne niiden niitä niissä niistä niihin niillä niiltä niille niinä niiksi kuka kenen kenet ketä kenessä kenestä keneen kenellä keneltä kenelle kenenä keneksi ketkä keiden ketkä keitä keissä keistä keihin keillä keiltä keille keinä keiksi mikä minkä minkä mitä missä mistä mihin millä miltä mille minä miksi mitkä joka jonka jota jossa josta johon jolla jolta jolle jona joksi jotka joiden joita joissa joista joihin joilla joilta joille joina joiksi että ja jos koska kuin mutta niin sekä sillä tai vaan vai vaikka kanssa mukaan noin poikki yli kun niin nyt itse".split(" ").map(replaceDiacritics),
  fr:"au aux avec ce ces dans de des du elle en et eux il je la le leur lui ma mais me même mes moi mon ne nos notre nous on ou par pas pour qu que qui sa se ses son sur ta te tes toi ton tu un une vos votre vous c d j l à m n s t y été étée étées étés étant suis es est sommes êtes sont serai seras sera serons serez seront serais serait serions seriez seraient étais était étions étiez étaient fus fut fûmes fûtes furent sois soit soyons soyez soient fusse fusses fût fussions fussiez fussent ayant eu eue eues eus ai as avons avez ont aurai auras aura aurons aurez auront aurais aurait aurions auriez auraient avais avait avions aviez avaient eut eûmes eûtes eurent aie aies ait ayons ayez aient eusse eusses eût eussions eussiez eussent ceci celà cet cette ici ils les leurs quel quels quelle quelles sans soi".split(" ").map(replaceDiacritics),
  de:"aber alle allem allen aller alles als also am an ander andere anderem anderen anderer anderes anderm andern anderr anders auch auf aus bei bin bis bist da damit dann der den des dem die das daß derselbe derselben denselben desselben demselben dieselbe dieselben dasselbe dazu dein deine deinem deinen deiner deines denn derer dessen dich dir du dies diese diesem diesen dieser dieses doch dort durch ein eine einem einen einer eines einig einige einigem einigen einiger einiges einmal er ihn ihm es etwas euer eure eurem euren eurer eures für gegen gewesen hab habe haben hat hatte hatten hier hin hinter ich mich mir ihr ihre ihrem ihren ihrer ihres euch im in indem ins ist jede jedem jeden jeder jedes jene jenem jenen jener jenes jetzt kann kein keine keinem keinen keiner keines können könnte machen man manche manchem manchen mancher manches mein meine meinem meinen meiner meines mit muss musste nach nicht nichts noch nun nur ob oder ohne sehr sein seine seinem seinen seiner seines selbst sich sie ihnen sind so solche solchem solchen solcher solches soll sollte sondern sonst über um und uns unse unsem unsen unser unses unter viel vom von vor während war waren warst was weg weil weiter welche welchem welchen welcher welches wenn werde werden wie wieder will wir wird wirst wo wollen wollte würde würden zu zum zur zwar zwischen".split(" ").map(replaceDiacritics),
  hu:"a ahogy ahol aki akik akkor alatt által általában amely amelyek amelyekben amelyeket amelyet amelynek ami amit amolyan amíg amikor át abban ahhoz annak arra arról az azok azon azt azzal azért aztán azután azonban bár be belül benne cikk cikkek cikkeket csak de e eddig egész egy egyes egyetlen egyéb egyik egyre ekkor el elég ellen elõ elõször elõtt elsõ én éppen ebben ehhez emilyen ennek erre ez ezt ezek ezen ezzel ezért és fel felé hanem hiszen hogy hogyan igen így illetve ill. ill ilyen ilyenkor ison ismét itt jó jól jobban kell kellett keresztül keressünk ki kívül között közül legalább lehet lehetett legyen lenne lenni lesz lett maga magát majd majd már más másik meg még mellett mert mely melyek mi mit míg miért milyen mikor minden mindent mindenki mindig mint mintha mivel most nagy nagyobb nagyon ne néha nekem neki nem néhány nélkül nincs olyan ott össze õ õk õket pedig persze rá s saját sem semmi sok sokat sokkal számára szemben szerint szinte talán tehát teljes tovább továbbá több úgy ugyanis új újabb újra után utána utolsó vagy vagyis valaki valami valamint való vagyok van vannak volt voltam voltak voltunk vissza vele viszont volna".split(" ").map(replaceDiacritics),
  it:"ad al allo ai agli all agl alla alle con col coi da dal dallo dai dagli dall dagl dalla dalle di del dello dei degli dell degl della delle in nel nello nei negli nell negl nella nelle su sul sullo sui sugli sull sugl sulla sulle per tra contro io tu lui lei noi voi loro mio mia miei mie tuo tua tuoi tue suo sua suoi sue nostro nostra nostri nostre vostro vostra vostri vostre mi ti ci vi lo la li le gli ne il un uno una ma ed se perché anche come dov dove che chi cui non più quale quanto quanti quanta quante quello quelli quella quelle questo questi questa queste si tutto tutti a c e i l o ho hai ha abbiamo avete hanno abbia abbiate abbiano avrò avrai avrà avremo avrete avranno avrei avresti avrebbe avremmo avreste avrebbero avevo avevi aveva avevamo avevate avevano ebbi avesti ebbe avemmo aveste ebbero avessi avesse avessimo avessero avendo avuto avuta avuti avute sono sei è siamo siete sia siate siano sarò sarai sarà saremo sarete saranno sarei saresti sarebbe saremmo sareste sarebbero ero eri era eravamo eravate erano fui fosti fu fummo foste furono fossi fosse fossimo fossero essendo faccio fai facciamo fanno faccia facciate facciano farò farai farà faremo farete faranno farei faresti farebbe faremmo fareste farebbero facevo facevi faceva facevamo facevate facevano feci facesti fece facemmo faceste fecero facessi facesse facessimo facessero facendo sto stai sta stiamo stanno stia stiate stiano starò starai starà staremo starete staranno starei staresti starebbe staremmo stareste starebbero stavo stavi stava stavamo stavate stavano stetti stesti stette stemmo steste stettero stessi stesse stessimo stessero stando".split(" ").map(replaceDiacritics),
  nn:"og i jeg det at en et den til er som på de med han av ikke ikkje der så var meg seg men ett har om vi min mitt ha hadde hun nå over da ved fra du ut sin dem oss opp man kan hans hvor eller hva skal selv sjøl her alle vil bli ble blei blitt kunne inn når være kom noen noe ville dere som deres kun ja etter ned skulle denne for deg si sine sitt mot å meget hvorfor dette disse uten hvordan ingen din ditt blir samme hvilken hvilke sånn inni mellom vår hver hvem vors hvis både bare enn fordi før mange også slik vært være båe begge siden dykk dykkar dei deira deires deim di då eg ein eit eitt elles honom hjå ho hoe henne hennar hennes hoss hossen ikkje ingi inkje korleis korso kva kvar kvarhelst kven kvi kvifor me medan mi mine mykje no nokon noka nokor noko nokre si sia sidan so somt somme um upp vere vore verte vort varte vart".split(" ").map(replaceDiacritics),
  pt:"de a o que e do da em um para com não uma os no se na por mais as dos como mas ao ele das à seu sua ou quando muito nos já eu também só pelo pela até isso ela entre depois sem mesmo aos seus quem nas me esse eles você essa num nem suas meu às minha numa pelos elas qual nós lhe deles essas esses pelas este dele tu te vocês vos lhes meus minhas teu tua teus tuas nosso nossa nossos nossas dela delas esta estes estas aquele aquela aqueles aquelas isto aquilo estou está estamos estão estive esteve estivemos estiveram estava estávamos estavam estivera estivéramos esteja estejamos estejam estivesse estivéssemos estivessem estiver estivermos estiverem hei há havemos hão houve houvemos houveram houvera houvéramos haja hajamos hajam houvesse houvéssemos houvessem houver houvermos houverem houverei houverá houveremos houverão houveria houveríamos houveriam sou somos são era éramos eram fui foi fomos foram fora fôramos seja sejamos sejam fosse fôssemos fossem for formos forem serei será seremos serão seria seríamos seriam tenho tem temos tém tinha tínhamos tinham tive teve tivemos tiveram tivera tivéramos tenha tenhamos tenham tivesse tivéssemos tivessem tiver tivermos tiverem terei terá teremos terão teria teríamos teriam".split(" ").map(replaceDiacritics),
  ro:[], // not available
  ru:[], // unknown encoding
  es:"de la que el en y a los del se las por un para con no una su al lo como más pero sus le ya o este sí porque esta entre cuando muy sin sobre también me hasta hay donde quien desde todo nos durante todos uno les ni contra otros ese eso ante ellos e esto mí antes algunos qué unos yo otro otras otra él tanto esa estos mucho quienes nada muchos cual poco ella estar estas algunas algo nosotros mi mis tú te ti tu tus ellas nosotras vosotros vosotras os mío mía míos mías tuyo tuya tuyos tuyas suyo suya suyos suyas nuestro nuestra nuestros nuestras vuestro vuestra vuestros vuestras esos esas estoy estás está estamos estáis están esté estés estemos estéis estén estaré estarás estará estaremos estaréis estarán estaría estarías estaríamos estaríais estarían estaba estabas estábamos estabais estaban estuve estuviste estuvo estuvimos estuvisteis estuvieron estuviera estuvieras estuviéramos estuvierais estuvieran estuviese estuvieses estuviésemos estuvieseis estuviesen estando estado estada estados estadas estad he has ha hemos habéis han haya hayas hayamos hayáis hayan habré habrás habrá habremos habréis habrán habría habrías habríamos habríais habrían había habías habíamos habíais habían hube hubiste hubo hubimos hubisteis hubieron hubiera hubieras hubiéramos hubierais hubieran hubiese hubieses hubiésemos hubieseis hubiesen habiendo habido habida habidos habidas soy eres es somos sois son sea seas seamos seáis sean seré serás será seremos seréis serán sería serías seríamos seríais serían era eras éramos erais eran fui fuiste fue fuimos fuisteis fueron fuera fueras fuéramos fuerais fueran fuese fueses fuésemos fueseis fuesen siendo sido tengo tienes tiene tenemos tenéis tienen tenga tengas tengamos tengáis tengan tendré tendrás tendrá tendremos tendréis tendrán tendría tendrías tendríamos tendríais tendrían tenía tenías teníamos teníais tenían tuve tuviste tuvo tuvimos tuvisteis tuvieron tuviera tuvieras tuviéramos tuvierais tuvieran tuviese tuvieses tuviésemos tuvieseis tuviesen teniendo tenido tenida tenidos tenidas tened".split(" ").map(replaceDiacritics),
  sv:"och det att i en jag hon som han på den med var sig för så till är men ett om hade de av icke mig du henne då sin nu har inte hans honom skulle hennes där min man ej vid kunde något från ut när efter upp vi dem vara vad över än dig kan sina här ha mot alla under någon eller allt mycket sedan ju denna själv detta åt utan varit hur ingen mitt ni bli blev oss din dessa några deras blir mina samma vilken er sådan vår blivit dess inom mellan sådant varför varje vilka ditt vem vilket sitta sådana vart dina vars vårt våra ert era vilkas".split(" ").map(replaceDiacritics),
  tr:[]  // not available
};

Where replaceDiacritics is provided by https://github.com/yvg/js-replace-diacritics/pull/1. Please note my pull request is included, as unicode 0130 is the way OSX writes accents when using the <opt> key.
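For readability, the console one-liner in the comments above can be unrolled, and the resulting word list turned into a filter function. This is a sketch under assumptions: `parseStopList` and `makeStopWordFilter` are hypothetical names, the `sample` text is an invented excerpt in the `word | comment` layout that the one-liner implies, and returning `undefined` to drop a token is an assumed pipeline convention.

```javascript
// Same steps as the console one-liner: cut each line at "|", trim it,
// skip empty lines, then split lines that hold several words.
function parseStopList(text) {
  return text.split('\n')
    .map(function (line) { return line.split('|')[0].trim(); })
    .filter(function (entry) { return entry !== ''; })
    .reduce(function (words, entry) {
      return words.concat(entry.split(/\s+/));
    }, []);
}

// Builds a lookup table once so that each token check is a single property read.
function makeStopWordFilter(words) {
  var lookup = Object.create(null);
  words.forEach(function (word) { lookup[word] = true; });
  return function (token) {
    return lookup[token] ? undefined : token;
  };
}

// Invented excerpt, for illustration only.
var sample = [
  ' | A stop word list (hypothetical excerpt)',
  'og           | and',
  'i            | in',
  '',
  'jeg det      | two words on one line'
].join('\n');

var words = parseStopList(sample);
var filter = makeStopWordFilter(words);
console.log(words, filter('og'), filter('hund'));
```

Because the lists above are passed through `replaceDiacritics`, tokens would presumably need the same folding before they are looked up, e.g. `filter(replaceDiacritics(token))`.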

olivernn commented 11 years ago

Glad that there are good resources for so many languages!

Will indexes normally contain many languages? I'm not sure they will. I think in the interests of keeping lunr more compact, languages should probably be some kind of plugin or extension. You could then build a language specific version of lunr. I think English should still be the default, but it should be easy to swap English out for whatever language you want to use.

Thoughts?

ssured commented 11 years ago

I was surprised about the availability of so many resources as well. My setup uses multiple languages in the search, but all have their own index. Supporting multiple languages thus seems necessary to me. Thinking of a setup this comes to mind:

My preference would be to split lunr.en.js out of lunr.js, because the English stemmer and stop words are unnecessary for any install that targets a different language.

olivernn commented 11 years ago

I've put together a really early draft of how a language adapter might look. The first adapter is for Russian, https://github.com/olivernn/lunr.ru.js (with an example use case at https://github.com/olivernn/lunr.ru.js/blob/master/index.html).

This makes use of a stemmer from https://github.com/NaturalNode/natural (which is a really great resource thanks @gmarty for pointing me in that direction).

There are some changes required in lunr for this to work. Specifically, trimming tokens needs to be taken out of the tokenizer and put into a separate pipeline function. For English, a simple regex can trim non-word characters from the beginning and end of a token, but the \W character class is not Unicode-aware. This needs looking at, because I'd like to be able to trim leading and trailing punctuation from tokens in all languages. Any suggestions would be greatly appreciated!
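One possible approach, in JavaScript engines that support Unicode property escapes (an ES2018 feature that postdates this discussion), is to trim everything that is neither a letter nor a number. This is only a sketch, not what lunr does; `unicodeTrimmer` is a hypothetical name, and dropping emptied tokens by returning `undefined` is an assumed pipeline convention.

```javascript
// Matches runs of characters that are neither letters (\p{L}) nor numbers
// (\p{N}) at the start or end of a string. Requires the "u" flag.
var EDGE_PUNCTUATION = /^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu;

// Trims leading and trailing punctuation in any script; a token that
// consisted only of punctuation is dropped.
function unicodeTrimmer(token) {
  var trimmed = token.replace(EDGE_PUNCTUATION, '');
  return trimmed === '' ? undefined : trimmed;
}

var trimmedTokens = ['«привет!»', "l'été,", '...', 'größe'].map(function (token) {
  return unicodeTrimmer(token);
});
console.log(trimmedTokens);
```

Punctuation inside a token, such as the apostrophe in "l'été", is left alone because the pattern is anchored to the edges.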

Language adapters will be a 'plugin', so you would use them like so:

var idx = lunr(function () {
  this.use(lunr.ru)
  this.field('title')
})

It'd be great to get some more feedback on this. I can only speak English, so I'll find it hard to assess how well these different language adapters work. I'll eventually add as many adapters as there are good stemmers and stop word lists, but if there is a language you'd like to try lunr with, please let me know and I'll focus on that first.
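As a rough idea of what such a plugin could look like from the inside, here is a sketch that swaps pipeline functions using only the `pipeline.add` and `pipeline.remove` calls shown earlier in the thread. It assumes that `use` hands the index to the plugin as its first argument, which is an assumption and not confirmed here. `makeLanguagePlugin` is a hypothetical helper, and `FakePipeline` plus the two no-op stemmers exist only so the example runs without lunr.

```javascript
// Returns a plugin function that removes the default pipeline functions
// and adds the language-specific ones.
function makeLanguagePlugin(toRemove, toAdd) {
  return function (idx) {
    toRemove.forEach(function (fn) { idx.pipeline.remove(fn); });
    toAdd.forEach(function (fn) { idx.pipeline.add(fn); });
  };
}

// Minimal stand-in for an index pipeline, for demonstration only.
function FakePipeline(fns) { this.fns = fns.slice(); }
FakePipeline.prototype.add = function (fn) { this.fns.push(fn); };
FakePipeline.prototype.remove = function (fn) {
  var position = this.fns.indexOf(fn);
  if (position !== -1) this.fns.splice(position, 1);
};

// Placeholder functions standing in for real stemmers.
function englishStemmer(token) { return token; }
function frenchStemmer(token) { return token; }

var fakeIdx = { pipeline: new FakePipeline([englishStemmer]) };
makeLanguagePlugin([englishStemmer], [frenchStemmer])(fakeIdx);
console.log(fakeIdx.pipeline.fns.map(function (fn) { return fn.name; }));
```

Keeping the removal and addition lists as parameters means the same helper could serve any language for which a stemmer and stop word filter are available.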

severinh commented 11 years ago

First, let me thank you for creating lunr.js! It's a perfect fit for adding full-text search capabilities to a statically generated site I'm currently building.

I've created a German language extension for lunr.js. You can find it here: https://github.com/severinh/lunr.de.js I'll adapt it as soon as there is an official API for language extensions.

olivernn commented 11 years ago

@severinh awesome work on the German language extension! It will need a couple of small changes in how it integrates, but it looks like excellent work, thanks!

I'll push the changes required for lunr in a branch here so you can get a head start on integrating your extension.

olivernn commented 11 years ago

The i18n branch has the changes to support multiple languages.

shredding commented 10 years ago

I want to use lunr for a German index, but am unsure about the status of this ticket. Any recommendations?

Thierry36tribus commented 10 years ago

Thanks for the great lunr! I want to use lunr for a French index. Is there a place where I can find an up-to-date "how-to"? Thanks

olivernn commented 10 years ago

@Thierry36tribus I've just pushed a French language extension for lunr. You will need to build and use the version of lunr that is in the i18n branch.

To use the plugin:

lunr(function () {
  this.use(lunr.fr)
  this.field('whatever')
})

I'm keen to get some feedback on this, both the French language extension (I don't speak French so can't comment on how good it is) and using lunr with a language other than English. Once everything is as it should be the i18n branch will become version 0.5.0.

As for "how-to" docs etc, I'm working on much better documentation, including some higher level "guide" like docs that will hopefully go some way to help people get up to speed with the concepts as well as the API.

Thierry36tribus commented 10 years ago

Thanks a lot! I'll try it and give you some feedback.

Thierry36tribus commented 10 years ago

Sorry for the delay. I did a quick test with lunr.js from the i18n branch and lunr.fr.js, and got an error: the function use is undefined on Index. It exists in lib/index but not in lunr.js, so I suppose I need to rebuild lunr. I'll try again as soon as possible...

olivernn commented 10 years ago

Yeah sorry, that branch doesn't have a built version of the code. You can build the non-minified version without any of the build dependencies by running make lunr.js.

Thierry36tribus commented 10 years ago

Thanks.

I now get this error when searching:

Uncaught TypeError: Cannot read property 'tf' of undefined lunri18n.js:1120
lunr.Index.documentVector lunri18n.js:1120
(anonymous function) lunri18n.js:1092
lunr.SortedSet.map lunri18n.js:588
lunr.Index.search lunri18n.js:1091
(anonymous function)

My code:

var index = lunr(function () {
    this.use(lunr.fr)

    this.field('title')
    this.field('subTitle')
    this.field('description')
    this.ref('id')
})

olivernn commented 10 years ago

Hmm, what were you searching for? Difficult to work out what went wrong without some more info or a test case.

I've just pushed an example for the French language plugin, and my quick local testing doesn't produce any errors.

olivernn commented 10 years ago

The latest release includes support for indexing languages other than English.

Currently there are language adaptors for German, French and Russian.

To use a language extension, set up your index like so (this example uses the French language extension):

var idx = lunr(function () {
    this.field('title')

    this.use(lunr.fr)
})

The index idx will now have a French language specific stemmer and stop word filter.

See the implementations of the German, French and Russian extensions for details on how to write extensions in other languages.

If you create a language extension, please let me know, or add it to this issue. At some point I will be updating the lunr website and will link to them from there. Also, if you are interested in maintaining either the French or Russian extensions, please let me know; my ability to maintain them is hampered by only speaking English!

MihaiValentin commented 10 years ago

Here is a list of many languages for Lunr (German, French, Spanish, Italian, Dutch, Danish, Portuguese, Finnish, Romanian, Hungarian, Russian, Norwegian):

https://github.com/MihaiValentin/lunr-languages

olivernn commented 10 years ago

@MihaiValentin that is awesome work, thanks a bunch for providing these. I'll be sure to feature your project when I eventually get round to fixing up the lunr.js site!

jlambe commented 9 years ago

Hi @olivernn. I found your project and was wondering whether the latest release supports other languages, or whether we still need to fetch the i18n branch. Also, I can't find a way to use the stopWord filter function from the docs. Do you have an example of how to add stop words before indexing content? Thanks