trep / opentrep

Open Travel Request Parser
https://trep.github.io/opentrep
GNU Lesser General Public License v2.1

Airport/city codes should take the precedence over alternate names #3

Closed da115115 closed 10 years ago

da115115 commented 11 years ago

When searching for KHI, the city of Jakarta, Indonesia (ID), is returned, whereas KHI is the code of Karachi, Pakistan (PK).

The cause is that:

  1. The Hakka Chinese translation of Jakarta is "Ngâ-kâ-tha̍t Sú-tû Thi̍t-khî", including the "khî" keyword, which is therefore part of the indexing keywords for Jakarta city.
  2. The respective PageRank values of KHI and JKT/CGK are 8% and 36.5%. Hence, KHI will match almost exactly (99.99%) with both Karachi and Jakarta. With the PageRank values, Jakarta comes out with an overall matching weight of ~36% (compared to the ~8% of Karachi).
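The arithmetic in item 2 can be sketched as below. This is only an illustration: it assumes the overall weight is the product of the text-matching percentage and the PageRank percentage, which is consistent with the figures quoted above but is not taken from the OpenTREP code. The function name `overallWeight` is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: combines a text-matching percentage with a PageRank
// percentage by multiplication. The product rule is an assumption that is
// consistent with the figures quoted in this issue.
double overallWeight(double matchingPct, double pageRankPct) {
  assert(matchingPct >= 0.0 && matchingPct <= 100.0);
  assert(pageRankPct >= 0.0 && pageRankPct <= 100.0);
  return matchingPct * pageRankPct / 100.0;
}
```

Under that assumption, `overallWeight(99.99, 8.0)` stays just below 8% for Karachi, while `overallWeight(99.99, 36.5)` is roughly 36.5% for Jakarta, so Jakarta is ranked first.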

da115115 commented 11 years ago

When a 3-letter or 4-letter code is entered and it matches an IATA, ICAO or FAA code, the corresponding POR entries should be selected. One way to do that is to set the final matching weight (after application of PageRank) to 100%.
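A minimal sketch of that override, for illustration only. The names `CodeSet_T`, `toUpper` and `finalWeight` are hypothetical, and the fallback formula (matching percentage times PageRank percentage) is an assumption rather than the actual OpenTREP computation.

```cpp
#include <algorithm>
#include <cctype>
#include <set>
#include <string>

// Hypothetical stand-in for the IATA/ICAO/FAA codes attached to one POR entry.
typedef std::set<std::string> CodeSet_T;

// Upper-cases a query so that "khi" and "KHI" compare equal.
std::string toUpper(std::string s) {
  std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
    return static_cast<char>(std::toupper(c));
  });
  return s;
}

// Final weight (in percent) of one candidate POR entry for a given query.
// If the query is 3 or 4 characters long and equals one of the entry's codes,
// the weight is forced to 100%. Otherwise the assumed default applies.
double finalWeight(const std::string& query, const CodeSet_T& porCodes,
                   double matchingPct, double pageRankPct) {
  const bool isCodeLength = (query.size() == 3 || query.size() == 4);
  if (isCodeLength && porCodes.count(toUpper(query)) > 0) {
    return 100.0;
  }
  return matchingPct * pageRankPct / 100.0;
}
```

With such a rule, a query for KHI would give Karachi a weight of 100%, while Jakarta, which only matches through an alternate name, would keep its PageRank-based weight.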

da115115 commented 11 years ago

Though this works quite nicely, there are still some imperfections, some of which may be easily fixed. More specifically, the http://search-travel.org/search/?q=sez+airport request yields "Mahé (SEZ) -- Vadsø Airport (VDS)", because {'sez', 'airport'} matches better than {'sez airport'}: 'sez' now matches with 100% and 'airport' matches with 2.45%. Several options in order to improve the algorithm:

da115115 commented 10 years ago

When indexing a term, a weight may be specified; see the documentation of the Xapian::TermGenerator::index_text() function for more details. By default, a weight of 1 is applied. For IATA/ICAO/FAA codes, a weight of 2 may be applied. Note that, for now, the Place source code implements the indexing generically, through a single list of terms stored in Place. We could therefore have a map of term lists, where the key is the weight and the value is the list of terms:

  // Terms grouped by indexing weight: key = weight, value = set of terms
  typedef std::map<Weight_T, StringSet_T> TermSetMap_T;
  TermSetMap_T _termSetMap;

Then, in the Xapian indexing code, the terms would be indexed according to their weight.
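As a rough sketch of that indexing loop: the underlying types chosen for `Weight_T` and `StringSet_T`, the `indexTerms` function and the `RecordingIndexer` stand-in are all assumptions made for illustration. The loop only relies on an `index_text(text, weight)` member; Xapian::TermGenerator::index_text() has that shape, its second parameter being the within-document frequency increment (default 1).

```cpp
#include <map>
#include <set>
#include <string>

typedef unsigned int Weight_T;              // assumed underlying type
typedef std::set<std::string> StringSet_T;  // assumed underlying type
typedef std::map<Weight_T, StringSet_T> TermSetMap_T;

// Indexes every term with the weight it is stored under. Indexer_T is any
// type exposing index_text(text, weight).
template <typename Indexer_T>
void indexTerms(Indexer_T& indexer, const TermSetMap_T& termSetMap) {
  for (TermSetMap_T::const_iterator itWeight = termSetMap.begin();
       itWeight != termSetMap.end(); ++itWeight) {
    const Weight_T weight = itWeight->first;
    const StringSet_T& terms = itWeight->second;
    for (StringSet_T::const_iterator itTerm = terms.begin();
         itTerm != terms.end(); ++itTerm) {
      indexer.index_text(*itTerm, weight);
    }
  }
}

// Stand-in indexer (not part of Xapian) that records the accumulated weight
// per term, so that the loop can be exercised without a Xapian database.
struct RecordingIndexer {
  std::map<std::string, Weight_T> weightByTerm;
  void index_text(const std::string& text, Weight_T weight) {
    weightByTerm[text] += weight;
  }
};
```

Filling the map with "khi" under weight 2 and "karachi" under weight 1, then calling `indexTerms()` on a `RecordingIndexer`, records those two weights against the respective terms.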

da115115 commented 10 years ago

Implemented, as for issue #5 (Indexing travel types with PageRank weights).