Open longhotsummer opened 1 year ago
From a quality standpoint, in some locations OCR is not happening before upload or is poor. This means citations aren't extracted well (so citation relationships are lost) and search performs poorly.
The full title of the document is "Constitutive Act of the African Union", so the "AU" term throws it, because it's not in the title.
Note that searching for "constitutive act" does show it as the first hit for the AU, but it is pushed down the results by all the SA constitution results. This may be improved by #1073
Searching with both the title and the citation doesn't work well, because no field contains both of these values. eg. airports company act 44 of 1993
A user emailed just with the term: "evidence law uganda". On ulii.org, searching for "evidence law" has really poor results. Searching for "evidence act" or just "evidence" has better legislation results.
Some thoughts:
Use minimum_should_match on phrase fields to allow OR matching while still favouring documents that match most of the terms https://www.elastic.co/guide/en/elasticsearch/guide/current/match-multi-word.html -- NB this could make a huge difference
Index full-text fields into multiple fields that use stemming and shingles, and combine those to boost better results - https://www.elastic.co/guide/en/elasticsearch/guide/current/most-fields.html
Shingles help to find words close together https://www.elastic.co/guide/en/elasticsearch/guide/current/_closer_is_better.html
see https://medium.com/unacademy-engineering/making-search-relevant-using-elasticsearch-a7b546c9a72a
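The thoughts above could be combined roughly as follows. This is a minimal sketch, not peachjam's actual configuration: the field name `content`, the sub-field names `stemmed` and `shingles`, the analyzer and filter names, and the boost on the shingles field are all assumptions made for illustration. The 70% default is the value tried further down in this thread.

```python
# Sketch only: every field, analyzer and filter name below is hypothetical.


def build_index_body():
    """Settings and mappings that index one text field three different ways."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Emit two-word shingles only, so adjacent words score higher.
                    "bigram_shingles": {
                        "type": "shingle",
                        "min_shingle_size": 2,
                        "max_shingle_size": 2,
                        "output_unigrams": False,
                    }
                },
                "analyzer": {
                    "shingle_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "bigram_shingles"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "content": {
                    "type": "text",
                    "analyzer": "standard",
                    "fields": {
                        # Stemmed variant, using the built-in english analyzer.
                        "stemmed": {"type": "text", "analyzer": "english"},
                        # Shingled variant, rewarding words that appear together.
                        "shingles": {"type": "text", "analyzer": "shingle_analyzer"},
                    },
                }
            }
        },
    }


def build_query(text, minimum_should_match="70%"):
    """OR query across all variants, requiring a share of the terms to match."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "most_fields",
                "fields": ["content", "content.stemmed", "content.shingles^2"],
                "operator": "or",
                "minimum_should_match": minimum_should_match,
            }
        }
    }
```

With `most_fields`, a document that matches on the plain, stemmed and shingled variants accumulates score from each, so close and exact matches should float above documents that only share a stem.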
changing to OR rather than AND, and using minimum_should_match=70%
fixes these searches:
When searching for legislation, if a TOC title is a strong match, show the TOC breakdown and/or show the TOC breakdown for general legislation text matches.
For example, "small-scale mining licenses" is a TOC entry in https://sierralii.gov.sl/akn/sl/act/2023/16/eng@2023-05-12 and should be very prominent for https://sierralii.gov.sl/search/?q=small-scale+mining+licence&doc_type=Legislation
The pagerank weightings aren't currently very useful (on lawlibrary at least). For example, a search for "children act" puts the amendment above the principal act with almost identical scores.
Looking into the pagerank part of the score:
Principal act:
{
"value": 49.992573,
"description": "Saturation function on the _feature field for the ranking feature, computed as w * S / (S + k) from:",
"details": [
{
"value": 50,
"description": "w, weight of this function",
"details": []
},
{
"value": 1.671724e-7,
"description": "k, pivot feature value that would give a score contribution equal to w/2",
"details": []
},
{
"value": 0.0011253357,
"description": "S, feature value",
"details": []
}
]
},
Amendment act:
{
"value": 49.66029,
"description": "Saturation function on the _feature field for the ranking feature, computed as w * S / (S + k) from:",
"details": [
{
"value": 50,
"description": "w, weight of this function",
"details": []
},
{
"value": 1.671724e-7,
"description": "k, pivot feature value that would give a score contribution equal to w/2",
"details": []
},
{
"value": 0.000024437904,
"description": "S, feature value",
"details": []
}
]
}
You can see that the total contribution to the scores is almost identical, even though the pagerank value is very different (0.0011 vs 0.000024).
This is because the k value chosen automatically by ES is very small. This seems to be because so many entries have a value of zero. If we calculate the geometric mean without the zeros, it's about 100 times bigger (0.0000219612244), and that difference is meaningful: the pagerank boosts become 49 and 26 respectively.
So an option is to calculate the geometric mean and store that in pj_settings, and then inject that into the query.
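The arithmetic above can be checked with a short script. This is a sketch under assumptions: the helper names and the `pagerank` field name are hypothetical, and the way the pivot would be stored in and read from pj_settings is not shown. The saturation formula is the one in the explain output, w * S / (S + k), and the `rank_feature` clause passes the pivot explicitly instead of letting ES pick it.

```python
import math


def saturation(weight, feature, pivot):
    """Score contribution of a rank feature: w * S / (S + k)."""
    return weight * feature / (feature + pivot)


def geometric_mean_nonzero(values):
    """Geometric mean of the strictly positive values, ignoring zeros."""
    positives = [v for v in values if v > 0]
    if not positives:
        raise ValueError("need at least one positive value")
    return math.exp(sum(math.log(v) for v in positives) / len(positives))


def pagerank_clause(pivot, weight=50, field="pagerank"):
    """rank_feature clause with an explicit pivot (field name is an assumption)."""
    return {
        "rank_feature": {
            "field": field,
            "boost": weight,
            "saturation": {"pivot": pivot},
        }
    }


# Values taken from the explain output and the comment above.
PRINCIPAL = 0.0011253357
AMENDMENT = 0.000024437904
ES_PIVOT = 1.671724e-7
NONZERO_PIVOT = 0.0000219612244

if __name__ == "__main__":
    for label, pivot in (("automatic", ES_PIVOT), ("non-zero mean", NONZERO_PIVOT)):
        principal = saturation(50, PRINCIPAL, pivot)
        amendment = saturation(50, AMENDMENT, pivot)
        print(f"{label}: principal={principal:.2f} amendment={amendment:.2f}")
```

With the automatic pivot both acts score just under 50; with the non-zero geometric mean as the pivot, the principal act scores about 49 and the amendment about 26, matching the figures quoted above.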
https://github.com/laws-africa/peachjam/pull/1647 boosts titles more, and prevents double-scoring for documents without a citation. This makes a search for "protection of information act" have POPIA at the top, as it should be.
See https://github.com/laws-africa/peachjam/issues/1655 for a case where missing a space in the title of a document results in poor search results.
Highlighting on exact phrases can be misleading, eg "arms and ammunition" only highlights "arms" and "ammunition". Can we adjust highlighting to be a bit clearer?
THIS IS FIXED: the Swahili search index needed to be rebuilt.
Searching for "constitution" or "the constitution" on tanzlii doesn't do a good enough job of matching on the Swahili constitution:
Katiba ya Jamhuri ya Muungano wa Tanzania, ya Mwaka
On new.kenyalaw.org, for "extension of probation for 6 months", their old Google-based search does a better job than peachjam.
"illegally obtained evidence" on new.kenyalaw.org does a much better job with quotes than without.
Trying to find the "maputo protocol" on africanlii.org is really difficult! https://africanlii.org/search/?q=protocol+to+the+african+charter+on+human+and+people%27s+rights+of+women+in+africa -- even just "protocol to the african charter" has really poor results.
Searching for R vs Jean on seylii (link) has a good match on the first two, but the case R v Jean (which should be a good match) is at number 6.
Can we do a better job knowing that v and vs are synonyms, and that R is important? Is ES ignoring it because it's a single character?
Similarly, R v Jean seems to push the real R v Jean quite low (link).
"R v Jean" in quotes has much stricter results.
Handling common synonyms may be very useful:
R - republic
Anor - another
Ors - others
v and vs
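One way to express these synonyms is a search-time synonym filter. The sketch below is only an illustration: the filter name `legal_synonyms` and the analyzer name `legal_search` are hypothetical, and whether each pair should be fully equivalent or one-directional is an open choice that would need testing against real queries.

```python
# Sketch only: filter and analyzer names are hypothetical.


def build_synonym_analysis():
    """Analysis settings with a synonym filter for common case-name terms."""
    synonyms = [
        "v, vs",
        "r, republic",
        "anor, another",
        "ors, others",
    ]
    return {
        "analysis": {
            "filter": {
                # synonym_graph is intended for use at search time.
                "legal_synonyms": {"type": "synonym_graph", "synonyms": synonyms}
            },
            "analyzer": {
                "legal_search": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # Lowercase first so that "R" and "Ors" match the rules above.
                    "filter": ["lowercase", "legal_synonyms"],
                }
            },
        }
    }
```

An analyzer like this would typically be set as the `search_analyzer` on the title field, so existing documents would not need to be reindexed just to try it out.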
https://seylii.org/search/?q=Government+of+Seychelles+v+Chang-Tave+%26+Ors
The better match is the second one, with the words in the exact order. Odd that the first match has them in a different order.
If there are a small number of terms, eg "restraint of trade", then short fields like title should contain all of them; otherwise "trading" gets boosted incorrectly.
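A rough sketch of that rule, assuming a `title` field and a cut-off of three terms (both hypothetical): short queries require every term in the title, while longer ones fall back to OR with a minimum share of matching terms.

```python
# Sketch only: the field name, cut-off and fallback threshold are assumptions.


def title_clause(query, max_strict_terms=3, field="title"):
    """Build a match clause on the title that is strict for short queries."""
    terms = query.split()
    if len(terms) <= max_strict_terms:
        # Few terms: the title must contain all of them.
        match = {"query": query, "operator": "and"}
    else:
        # Many terms: tolerate some missing words.
        match = {"query": query, "operator": "or", "minimum_should_match": "70%"}
    return {"match": {field: match}}
```

For "restraint of trade" this produces an AND clause, so a title that only contains "trading" no longer earns the title boost on its own.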
Meta issue for discussing search improvements.
Please add examples of searches where users didn't get what they had hoped for. Include: