laws-africa / peachjam

Project Peach Jam
https://agp.africanlii.org
GNU General Public License v3.0

Main issue for search improvements #1161

Open longhotsummer opened 1 year ago

longhotsummer commented 1 year ago

Meta issue for discussing search improvements.

Please add examples of searches where users didn't get what they had hoped for. Include:

longhotsummer commented 1 year ago

From a quality standpoint, in some locations OCR is not happening before upload or is poor. This means citations aren't extracted well (so citation relationships are lost) and search performs poorly.

longhotsummer commented 1 year ago

The full title of the document is "Constitutive Act of the African Union", so the "AU" term throws it, because it's not in the title.

Note that searching for "constitutive act" does show it as the first hit for the AU, but it is pushed down the results by all the SA constitution results. This may be improved by #1073

longhotsummer commented 1 year ago

Searching with both the title and the citation doesn't work well, because no field contains both of these values. eg. airports company act 44 of 1993

Fixed in: https://github.com/laws-africa/peachjam/pull/1228
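One generic way to handle this class of query (not necessarily what the PR above does) is an Elasticsearch multi_match query of type cross_fields, which treats several fields as one combined field so that each term only has to appear in one of them. A minimal sketch; the field names "title" and "citation" and the helper name are assumptions for illustration, not peachjam's actual mapping:

```python
def title_and_citation_query(text, fields=("title", "citation")):
    """Build a query where every term must appear in at least one of the fields.

    The field names are hypothetical; substitute the real index mapping.
    """
    return {
        "query": {
            "multi_match": {
                "query": text,
                # cross_fields analyses the query once and looks for each
                # term across all listed fields, as if they were one field.
                "type": "cross_fields",
                "fields": list(fields),
                "operator": "and",
            }
        }
    }


query = title_and_citation_query("airports company act 44 of 1993")
```

Note that cross_fields only groups fields that share the same analyzer, so this works best when title and citation are analysed the same way.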

longhotsummer commented 1 year ago

A user emailed just with the term: "evidence law uganda". On ulii.org, searching for "evidence law" has really poor results. Searching for "evidence act" or just "evidence" has better legislation results.

longhotsummer commented 1 year ago

Some thoughts:

longhotsummer commented 1 year ago

Changing to OR rather than AND, and using minimum_should_match=70%, fixes these searches:
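For reference, a minimal sketch of what that kind of query looks like as an Elasticsearch match clause. The field name "content" and the helper names are assumptions, and the example query string is taken from earlier in this thread purely for illustration:

```python
def build_match_query(text, field="content", min_match="70%"):
    """OR the terms together, but require most of them to match.

    The field name is hypothetical; substitute the real index mapping.
    """
    return {
        "query": {
            "match": {
                field: {
                    "query": text,
                    "operator": "or",
                    "minimum_should_match": min_match,
                }
            }
        }
    }


def required_terms(n_terms, percent=70):
    """Number of terms that must match; ES rounds the percentage down."""
    return (n_terms * percent) // 100


query = build_match_query("evidence law uganda")
```

With a three-term query, 70% works out to at least two matching terms, so one unhelpful term no longer excludes a document.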

longhotsummer commented 1 year ago

When searching for legislation, if a TOC title is a strong match, show the TOC breakdown and/or show the TOC breakdown for general legislation text matches.

For example, "small-scale mining licenses" is a TOC entry in https://sierralii.gov.sl/akn/sl/act/2023/16/eng@2023-05-12 and should be very prominent for https://sierralii.gov.sl/search/?q=small-scale+mining+licence&doc_type=Legislation

longhotsummer commented 11 months ago

The pagerank weightings aren't currently very useful (on lawlibrary at least). For example, a search for "children act" puts the amendment above the principal act with almost identical scores.

Looking into the pagerank part of the score:

Principal act:

{
  "value": 49.992573,
  "description": "Saturation function on the _feature field for the ranking feature, computed as w * S / (S + k) from:",
  "details": [
    {
      "value": 50,
      "description": "w, weight of this function",
      "details": []
    },
    {
      "value": 1.671724e-7,
      "description": "k, pivot feature value that would give a score contribution equal to w/2",
      "details": []
    },
    {
      "value": 0.0011253357,
      "description": "S, feature value",
      "details": []
    }
  ]
}

Amendment act:

{
  "value": 49.66029,
  "description": "Saturation function on the _feature field for the ranking feature, computed as w * S / (S + k) from:",
  "details": [
    {
      "value": 50,
      "description": "w, weight of this function",
      "details": []
    },
    {
      "value": 1.671724e-7,
      "description": "k, pivot feature value that would give a score contribution equal to w/2",
      "details": []
    },
    {
      "value": 0.000024437904,
      "description": "S, feature value",
      "details": []
    }
  ]
}

You can see that the total contribution to the scores is almost identical, even though the pagerank value is very different (0.0011 vs 0.000024).

This is because the k value chosen automatically by ES is very small. This seems to be because so many entries have a value of zero. If we calculate the geometric mean without the zeros, it's about 100 times bigger (0.0000219612244), and that difference is meaningful: the pagerank boosts become 49 and 26 respectively.

So an option is to calculate the geometric mean, store that in pj_settings, and then inject it into the query.
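A rough sketch of that option, reproducing the numbers above. The function names are made up for illustration, and the field name "ranking" is an assumption based on the explain output; only the rank_feature query shape with a saturation pivot is standard Elasticsearch:

```python
import math


def geometric_mean_nonzero(values):
    """Geometric mean of the strictly positive values, ignoring zeros."""
    positives = [v for v in values if v > 0]
    if not positives:
        raise ValueError("no positive values to average")
    return math.exp(sum(math.log(v) for v in positives) / len(positives))


def saturation(s, pivot, weight=50.0):
    """The score contribution from the explain output: w * S / (S + k)."""
    return weight * s / (s + pivot)


def rank_feature_clause(pivot, field="ranking", boost=50):
    """rank_feature clause with an explicit pivot instead of the ES default."""
    return {
        "rank_feature": {
            "field": field,  # assumed field name
            "saturation": {"pivot": pivot},
            "boost": boost,
        }
    }


ES_PIVOT = 1.671724e-7           # k chosen automatically by ES (from explain)
NONZERO_PIVOT = 0.0000219612244  # geometric mean without the zeros
PRINCIPAL = 0.0011253357         # pagerank of the principal act
AMENDMENT = 0.000024437904       # pagerank of the amendment act

for name, s in [("principal", PRINCIPAL), ("amendment", AMENDMENT)]:
    print(name, saturation(s, ES_PIVOT), saturation(s, NONZERO_PIVOT))
```

In practice the pivot would be computed from all the stored pagerank values with geometric_mean_nonzero, saved as a setting, and passed to rank_feature_clause when the search query is built.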

longhotsummer commented 11 months ago

https://github.com/laws-africa/peachjam/pull/1647 boosts titles more, and prevents double-scoring for documents without a citation. This makes a search for "protection of information act" have POPIA at the top, as it should be.

longhotsummer commented 10 months ago

See https://github.com/laws-africa/peachjam/issues/1655 for a case where missing a space in the title of a document results in poor search results.

longhotsummer commented 8 months ago

Highlighting on exact phrases can be misleading, eg "arms and ammunition" only highlights "arms" and "ammunition". Can we adjust highlighting to be a bit clearer?

longhotsummer commented 1 month ago

THIS IS FIXED: the Swahili search index needed to be rebuilt.

Searching for "constitution" or "the constitution" on tanzlii doesn't do a good enough job of matching on the Swahili constitution:

Katiba ya Jamhuri ya Muungano wa Tanzania, ya Mwaka

https://tanzlii.org/akn/tz/act/1977/1/swa@2002-07-31

longhotsummer commented 3 weeks ago

On new.kenyalaw.org:

"extension of probation for 6 months"

Their old Google-based search does a better job than peachjam.

longhotsummer commented 3 weeks ago

"illegally obtained evidence" on new.kenyalaw.org does a much better job with quotes than without.

longhotsummer commented 2 weeks ago

Trying to find the "maputo protocol" on africanlii.org is really difficult! https://africanlii.org/search/?q=protocol+to+the+african+charter+on+human+and+people%27s+rights+of+women+in+africa -- even just "protocol to the african charter" has really poor results.

longhotsummer commented 2 weeks ago

Searching for R vs Jean on seylii (link) gives good matches for the first two results, but the case R v Jean (which should be a good match) is at number 6.

Can we do a better job knowing that v and vs are synonyms, and that R is important? Is ES ignoring it because it's a single character?

Similarly, searching for R v Jean seems to push the real R v Jean quite low (link).

"R v Jean" in quotes has much stricter results.
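On treating v and vs as synonyms: a minimal sketch of Elasticsearch index settings using a synonym_graph token filter in a search-time analyzer. The filter and analyzer names are made up, and which fields would use the analyzer is left open; this is an illustration, not peachjam's current configuration:

```python
# Hypothetical analysis settings; the names are placeholders.
CASE_NAME_ANALYSIS = {
    "analysis": {
        "filter": {
            "case_name_synonyms": {
                "type": "synonym_graph",
                # Comma-separated terms on one line are treated as equivalent.
                "synonyms": ["v, vs, versus"],
            }
        },
        "analyzer": {
            "case_name_search": {
                "type": "custom",
                "tokenizer": "standard",
                # Lowercase first so "V" and "Vs" also hit the synonym rule.
                "filter": ["lowercase", "case_name_synonyms"],
            }
        },
    }
}
```

The analyzer would then be set as the search_analyzer on the relevant text fields, so that a query for "R vs Jean" also matches documents titled "R v Jean".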

longhotsummer commented 2 weeks ago

Handling common synonyms may be very useful:

longhotsummer commented 1 week ago

https://seylii.org/search/?q=Government+of+Seychelles+v+Chang-Tave+%26+Ors

The better match is the second one, with the words in the exact order. Odd that the first match has them in a different order.

longhotsummer commented 1 week ago

If there are a small number of terms, eg "restraint of trade", then short fields like title should contain all of them; otherwise "trading" gets boosted incorrectly.
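A rough sketch of that idea: require every term in the title for short queries, and fall back to a looser match for longer ones. The three-term threshold, the field name "title" and the 70% fallback are assumptions chosen for illustration:

```python
def title_match_clause(text, title_field="title", max_strict_terms=3):
    """Build the title clause, requiring all terms when the query is short.

    Counting whitespace-separated words only approximates what the
    analyzer will produce, but is good enough to pick a strategy.
    """
    n_terms = len(text.split())
    if n_terms <= max_strict_terms:
        # Short query: the title must contain every term.
        options = {"query": text, "operator": "and"}
    else:
        # Longer query: allow some terms to be missing.
        options = {
            "query": text,
            "operator": "or",
            "minimum_should_match": "70%",
        }
    return {"match": {title_field: options}}


short = title_match_clause("restraint of trade")
longer = title_match_clause("protocol to the african charter on human rights")
```

With this, a title that only matches "trading" would no longer earn the title boost for "restraint of trade", because the title also has to contain "restraint".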