inveniosoftware / invenio

Invenio digital library framework
https://invenio.readthedocs.io
MIT License

search misses accented chars #455

Closed romanchyla closed 9 years ago

romanchyla commented 10 years ago

Originally on 2011-01-24

In CDS, searching for:

physik fuer ingenieure

physik fur ingenieure

Physik für Ingenieure

does not work; what does work is this:

physik f\"ur ingenieure

which is very weird.

tiborsimko commented 10 years ago

Originally on 2011-01-24

1) When I search for physik fur ingenieure and for physik f\"ur ingenieure on CDS, I seem to be getting the same 14 hits. Can you please make the example more concrete, so that we can see which record was supposed to be found but was not? Something like recid:124 AND physik, which should have found record 124 but did not, for example. In any case, I do not seem to be able to reproduce this problem.

2) Bug reports concerning CDS only, i.e. the CERN instance of Invenio only, are better submitted to the dedicated CERN Savannah support tracker at https://savannah.cern.ch/support/?group=cdsware.

romanchyla commented 10 years ago

Originally on 2011-01-25

Oops, sorry, my bad. I checked again and the queries are fine, except for this version (which the German user tried first):

Physik fuer Ingenieure and recid:112675

The other two are fine. Though interestingly, there are two groups:

Physik f\"ur --> 974 hits

Physik fur --> 974 hits

Physik fuer --> 510 hits

ps: thank you for the link

tiborsimko commented 10 years ago

Originally on 2011-02-04

OK, so the problem is that this record contains the word für, and that it can be found via fur, but not via fuer.

This is actually how CDS behaves by design: many years ago, in a common discussion with the library on how to index accented letters, it was decided to simply strip Latin-1 accents. Hence für is indexed as fur only, and fuer does not find it.
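The effect of accent stripping at indexing time can be illustrated with a minimal sketch. This is not the actual bibindex code; `strip_accents` and `index_terms` are hypothetical helpers, and Unicode NFKD decomposition is assumed here as a stand-in for whatever Latin-1 accent table CDS really uses.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks.

    Assumption: this approximates "strip Latin-1 accents"; the real
    indexer may use a different mapping.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(text):
    """Hypothetical indexing step: split on whitespace, strip accents, lowercase."""
    return [strip_accents(word).lower() for word in text.split()]


indexed = set(index_terms("Physik für Ingenieure"))
print("fur" in indexed)   # the stripped form is in the index
print("fuer" in indexed)  # the "ue" transliteration is not
```

Under this scheme only one form per word reaches the index, so a query for fuer has nothing to match against unless the record happens to contain that spelling literally.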

I agree that we may want to alter this behaviour...

invenio-developers commented 10 years ago

Originally by arwagner on 2013-11-26

I may add that there is another "accented character" of this type entirely missing from the list, namely the ß (written as ss or sz) in German (except in Switzerland, which I think has abandoned this character).

Especially in names it would be great if one could capture this as well.

jirikuncar commented 9 years ago

@kaplun has there been any progress with custom tokenizers for bibindex supporting the mentioned characters?

tiborsimko commented 9 years ago

I'm not sure whether multiple alternative transliterations for a term are possible with recent INSPIRE improvements... "für -> fur, fuer" is the main issue at hand here. In theory, people could write their own tokenisers to achieve this. In practice, we can muse whether Invenio should do this by default...
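As a rough illustration of what such a custom tokeniser might do, here is a minimal sketch that indexes every alternative transliteration of a term. It does not use the bibindex tokeniser API; `GERMAN_VARIANTS`, `transliteration_variants` and `tokenize` are hypothetical names, and the mapping covers only the German characters mentioned in this thread.

```python
from itertools import product

# Assumed mapping: each special character and its accepted transliterations.
GERMAN_VARIANTS = {
    "ä": ("a", "ae"),
    "ö": ("o", "oe"),
    "ü": ("u", "ue"),
    "ß": ("ss", "sz"),
}


def transliteration_variants(word):
    """Return every ASCII spelling of a word under GERMAN_VARIANTS."""
    word = word.lower()
    # Characters without an entry keep a single choice: themselves.
    choices = [GERMAN_VARIANTS.get(ch, (ch,)) for ch in word]
    return {"".join(combo) for combo in product(*choices)}


def tokenize(text):
    """Hypothetical tokeniser: index all variants of every word."""
    terms = set()
    for word in text.split():
        terms |= transliteration_variants(word)
    return terms


print(sorted(tokenize("Physik für Ingenieure")))
print(sorted(transliteration_variants("Straße")))
```

With both fur and fuer in the index, either query spelling would find the record. The cost is that the number of variants multiplies with each special character in a word, which is one reason to hesitate before making this the default.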

tiborsimko commented 9 years ago

There is no PR for this issue, hence closing it as per the legacy code base freeze; it is addressed differently in the master code base.