mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org

Restore the ability to see request list and search results beyond the first 500 #2137

Open crowbot opened 9 years ago

crowbot commented 9 years ago

We have found that xapian queries with high offsets are extremely slow to execute, and so have restricted the number of search and list results that can be returned (https://github.com/mysociety/alaveteli/commit/bf66cd1d1d4faa249c692f13543e43f2bd6b0c03, https://github.com/mysociety/alaveteli/commit/a37e9f21f00af03d271cb40de7d849cb8941bc02, https://github.com/mysociety/alaveteli/commit/a1655ba57f94f2a0058dc03cf1da67b7a763eb49). It would be better to find some way to be able to execute these high offset searches in a reasonable amount of time.

garethrees commented 9 years ago

This seems to be an issue with either acts_as_xapian or Xapian itself, rather than something we're doing with the results afterwards:

Benchmark.bmbm do |x|
  x.report("offset-0") {
    result = ActsAsXapian::Search.new([InfoRequestEvent], 'information',
        :offset => 0,
        :limit => 25,
        :sort_by_prefix => 'created_at',
        :sort_by_ascending => true,
        :collapse_by_prefix => 'request_collapse'
    )
  }
  x.report("offset-500") {
    result = ActsAsXapian::Search.new([InfoRequestEvent], 'information',
        :offset => 500,
        :limit => 25,
        :sort_by_prefix => 'created_at',
        :sort_by_ascending => true,
        :collapse_by_prefix => 'request_collapse'
    )
  }
end

# Rehearsal ----------------------------------------------
# offset-0     4.700000   0.020000   4.720000 (  4.724534)
# offset-500  36.230000   0.160000  36.390000 ( 36.513045)
# ------------------------------------ total: 41.110000sec
# 
#                  user     system      total        real
# offset-0     5.150000   0.020000   5.170000 (  5.176799)
# offset-500  36.710000   0.130000  36.840000 ( 36.946061)
garethrees commented 9 years ago

There are a few comments in the code about object initialization, but the objects that get initialized seem to be similar for both offsets:

# Need to quit rails console after each run since previous objects will still exist
def count_xapian_objects
  xapian_objects = {}

  ObjectSpace.each_object do |o|
    if o.class.to_s.downcase.include?('xapian')
      if xapian_objects.key?(o.class.to_s)
        xapian_objects[o.class.to_s] += 1
      else
        xapian_objects[o.class.to_s] = 1
      end
    end
  end

  xapian_objects
end

$xapian_objects = count_xapian_objects
pp $xapian_objects

# offset: 0
{
 "ActsAsXapian::Search"=>1,
 "Xapian::Database"=>1,
 "Xapian::DateValueRangeProcessor"=>1,
 "Xapian::Enquire"=>1
 "Xapian::MSet"=>1,
 "Xapian::NumberValueRangeProcessor"=>2,
 "Xapian::Query"=>5,
 "Xapian::QueryParser"=>1,
 "Xapian::SimpleStopper"=>1,
 "Xapian::Stem"=>1,
 "Xapian::StringValueRangeProcessor"=>2,
}

# offset: 500
{
 "ActsAsXapian::Search"=>1
 "Xapian::Database"=>1,
 "Xapian::DateValueRangeProcessor"=>1,
 "Xapian::Enquire"=>1,
 "Xapian::MSet"=>1,
 "Xapian::NumberValueRangeProcessor"=>2,
 "Xapian::Query"=>5,
 "Xapian::QueryParser"=>1,
 "Xapian::SimpleStopper"=>1,
 "Xapian::Stem"=>1,
 "Xapian::StringValueRangeProcessor"=>2,
}
garethrees commented 9 years ago

I stripped the code down as much as possible to perform a search. Looks like the addition of the collapse_key is what really slows it down.

collapse_key 3 is collapse by request.

collapse_key 1 is collapse by created_at and doesn't exhibit anywhere near the same performance impact.
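
For reference, the slowdown can be reproduced at the acts_as_xapian level just by toggling the collapse option (a sketch, not the stripped-down script referred to above; timings will vary with the dataset):

require 'benchmark'

# Sketch: the same deep-offset query with and without collapsing.
# Assumes a Rails console in an Alaveteli checkout with a populated index.
Benchmark.bm(15) do |x|
  x.report('collapsed') {
    ActsAsXapian::Search.new([InfoRequestEvent], 'information',
        :offset => 500,
        :limit => 25,
        :sort_by_prefix => 'created_at',
        :sort_by_ascending => true,
        :collapse_by_prefix => 'request_collapse'
    )
  }

  x.report('uncollapsed') {
    ActsAsXapian::Search.new([InfoRequestEvent], 'information',
        :offset => 500,
        :limit => 25,
        :sort_by_prefix => 'created_at',
        :sort_by_ascending => true
    )
  }
end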

crowbot commented 9 years ago

\o/

garethrees commented 9 years ago

Ah, so we are collapsing by the request's url_title. Not sure why…

  1. this is slow
  2. it doesn't use the request's id
garethrees commented 9 years ago

Tried sorting by created_at_numeric instead of plain created_at, but still had the same sort of issue.

- $enquire.sort_by_value_then_relevance!(0, sort_by_ascending)
+ $enquire.sort_by_value_then_relevance!(1, sort_by_ascending)
garethrees commented 9 years ago

Doesn't look like the higher offset is doing lots more in terms of collapsing documents:

# COLLAPSE_COUNT
search.matches.matches.map(&:collapse_count)

# offset-0
# => [2, 2, 2, 2, 2, 1, 0, 0, 1, 1, 2, 2, 0, 3, 0, 2, 3, 1, 0, 2, 1, 1, 1, 3, 1]

# offset-500
# => [5, 0, 7, 2, 0, 0, 3, 3, 0, 4, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 3, 0]
garethrees commented 9 years ago
                  user     system      total        real
offset-0      4.420000   0.050000   4.470000 (  4.539808)
offset-25     5.210000   0.060000   5.270000 (  5.268381)
offset-50     6.410000   0.050000   6.460000 (  6.472786)
offset-75     7.840000   0.050000   7.890000 (  7.903333)
offset-100    8.910000   0.040000   8.950000 (  8.966373)
offset-125   10.130000   0.060000  10.190000 ( 10.216313)
offset-150   11.620000   0.070000  11.690000 ( 11.700913)
offset-175   11.930000   0.030000  11.960000 ( 11.974550)
offset-200   14.030000   0.040000  14.070000 ( 14.082121)
offset-225   16.540000   0.060000  16.600000 ( 16.654570)
offset-250   16.870000   0.050000  16.920000 ( 16.940712)
offset-275   18.940000   0.050000  18.990000 ( 19.007111)
offset-300   20.060000   0.080000  20.140000 ( 20.165549)
offset-325   25.150000   0.110000  25.260000 ( 25.512738)
offset-350   24.380000   0.090000  24.470000 ( 24.573881)
offset-375   26.730000   0.100000  26.830000 ( 26.866396)
offset-400   27.110000   0.070000  27.180000 ( 27.209018)
offset-425   31.450000   0.060000  31.510000 ( 31.937909)
offset-450   32.230000   0.070000  32.300000 ( 32.366776)
offset-475   33.540000   0.060000  33.600000 ( 33.661678)
offset-500   36.070000   0.090000  36.160000 ( 36.221812)
offset-525   38.320000   0.100000  38.420000 ( 38.547484)
offset-550   40.370000   0.150000  40.520000 ( 40.658174)
offset-575   40.800000   0.120000  40.920000 ( 40.979741)
offset-600   41.700000   0.060000  41.760000 ( 41.811444)

screen shot 2015-06-09 at 09 23 05

garethrees commented 9 years ago

It looks like we're correctly removing documents from the index when we need to.

garethrees commented 9 years ago
delve -t MInfoRequestEvent -d -1 $XAPIAN_DB_PATH

[EDIT]: Eyeballing this, everything looks fine. I don't think the issue is with indexing of duplicate InfoRequestEvent records, but with indexing multiple events related to one `InfoRequest`.

garethrees commented 9 years ago

InfoRequestEvent has the following terms that don't appear to be searchable from the advanced search:

[ :latest_variety,         'K', "latest_variety" ],
[ :latest_status,          'L', "latest_status" ],
[ :waiting_classification, 'W', "waiting_classification" ],
garethrees commented 9 years ago

We should be indexing the :terms list as boolean terms, rather than text terms (i.e. those included in the :texts list).

There's no use of add_boolean_term in acts_as_xapian.rb. We do seem to add terms as boolean prefixes though, so I don't know if that implies add_boolean_term. I'm guessing not, since in that case we're only telling the query parser about the available prefixes, not actually indexing the terms as boolean.

[EDIT:] Looking at the term list for an InfoRequestEvent, it does look like there are prefixed terms:

Term List for record #82314:
Bd_speers
C
Fmid_staffordshire_nhs_foundation_trust
IInfoRequestEvent-72030
Kfollowup_sent
Lsuccessful
MInfoRequestEvent
Raudit_of_accounts_2
Swaiting_response
T
Vsent
Wno
# …

I think we must just be joining the prefix and the data and dumping it into the regular add_term method, which is probably what xapian does under the hood for add_boolean_term.
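
For what it's worth, the distinction looks something like this with the raw Ruby bindings (a sketch; the method names are from the Xapian bindings, not the actual acts_as_xapian indexing code):

require 'xapian'

doc = Xapian::Document.new

# What we appear to do: join the prefix and the value and index it through
# the regular add_term call (so it contributes to the wdf).
doc.add_term('K' + 'followup_sent')

# An explicitly boolean term: indexed with a wdf of 0, useful for filtering
# without affecting relevance weighting.
doc.add_boolean_term('L' + 'successful')

# Registering a boolean prefix on the query parser only changes how queries
# like latest_status:successful get parsed; it says nothing about how the
# terms were indexed.
query_parser = Xapian::QueryParser.new
query_parser.add_boolean_prefix('latest_status', 'L')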

garethrees commented 9 years ago

Without other terms in a query, a xapian.ValueRangeProcessor can cause a value operation to be performed across the whole database, which means loading all the values in a given slot. On a small database, this isn’t a problem, but for a large one it can have performance implications: you may end up with very slow queries.

_– http://getting-started-with-xapian.readthedocs.org/en/latest/howtos/range_queries.html#performance-limitations_

garethrees commented 9 years ago

However, if the collapsing eliminates a lot of documents then the collapsed search will typically take rather longer than the uncollapsed search because the matcher has to consider many more potential matches.

http://getting-started-with-xapian.readthedocs.org/en/latest/howtos/collapsing.html#performance

  • [x] Figure out how to print the number of collapsed documents at each offset
Benchmark.bm(12) do |x|
  (1..20).each do |page|
    offset = (page - 1) * 25
    # declare search outside the report block so it's visible afterwards
    search = nil

    x.report("offset-#{offset}") {
      search = ActsAsXapian::Search.new([InfoRequestEvent], 'information',
          :offset => offset,
          :limit => 25,
          :sort_by_prefix => 'created_at',
          :sort_by_ascending => true,
          :collapse_by_prefix => 'request_collapse'
      )
    }

    est = search.matches.matches_estimated
    uncol = search.matches.get_uncollapsed_matches_estimated
    diff = uncol - est
    puts "Collapsed Matches: #{ est }\t Uncollapsed Matches: #{ uncol }\t Diff: #{ diff }"
  end
end

# =>                   user     system      total        real
# => offset-0      4.940000   0.030000   4.970000 (  4.985439)
# => Collapsed Matches: 260380        Uncollapsed Matches: 748443     Diff: 488063
# => offset-25     5.890000   0.040000   5.930000 (  5.947694)
# => Collapsed Matches: 260922        Uncollapsed Matches: 748443     Diff: 487521
# => offset-50     6.260000   0.060000   6.320000 (  6.322102)
# => Collapsed Matches: 260950        Uncollapsed Matches: 748443     Diff: 487493
# => offset-75     7.560000   0.020000   7.580000 (  7.589553)
# => Collapsed Matches: 260960        Uncollapsed Matches: 748443     Diff: 487483
# => offset-100    8.810000   0.020000   8.830000 (  8.852613)
# => Collapsed Matches: 260966        Uncollapsed Matches: 748443     Diff: 487477
# => offset-125    9.610000   0.040000   9.650000 (  9.662449)
# => Collapsed Matches: 260968        Uncollapsed Matches: 748443     Diff: 487475
# => offset-150   11.420000   0.020000  11.440000 ( 11.463205)
# => Collapsed Matches: 260979        Uncollapsed Matches: 748443     Diff: 487464
# => offset-175   12.990000   0.050000  13.040000 ( 13.071234)
# => Collapsed Matches: 260988        Uncollapsed Matches: 748443     Diff: 487455
# => offset-200   14.150000   0.070000  14.220000 ( 14.252923)
# => Collapsed Matches: 260989        Uncollapsed Matches: 748443     Diff: 487454
# => offset-225   18.060000   0.050000  18.110000 ( 18.346813)
# => Collapsed Matches: 260989        Uncollapsed Matches: 748443     Diff: 487454
# => offset-250   18.610000   0.110000  18.720000 ( 18.899493)
# => Collapsed Matches: 260991        Uncollapsed Matches: 748443     Diff: 487452
# => offset-275   20.140000   0.090000  20.230000 ( 20.310030)
# => Collapsed Matches: 260992        Uncollapsed Matches: 748446     Diff: 487454
# => offset-300   21.380000   0.080000  21.460000 ( 21.520247)
# => Collapsed Matches: 260993        Uncollapsed Matches: 748446     Diff: 487453
# => offset-325   23.450000   0.070000  23.520000 ( 23.612115)
# => Collapsed Matches: 260993        Uncollapsed Matches: 748446     Diff: 487453
# => offset-350   26.210000   0.120000  26.330000 ( 26.466379)
# => Collapsed Matches: 260993        Uncollapsed Matches: 748446     Diff: 487453
# => offset-375   29.500000   0.090000  29.590000 ( 30.029582)
# => Collapsed Matches: 260993        Uncollapsed Matches: 748446     Diff: 487453
# => offset-400   29.850000   0.140000  29.990000 ( 30.189773)
# => Collapsed Matches: 260993        Uncollapsed Matches: 748446     Diff: 487453
# => offset-425   32.120000   0.080000  32.200000 ( 32.396199)
# => Collapsed Matches: 260994        Uncollapsed Matches: 748446     Diff: 487452
# => offset-450   34.630000   0.170000  34.800000 ( 35.186630)
# => Collapsed Matches: 260994        Uncollapsed Matches: 748446     Diff: 487452
# => offset-475   35.910000   0.260000  36.170000 ( 36.494328)
# => Collapsed Matches: 260994        Uncollapsed Matches: 748446     Diff: 487452
garethrees commented 9 years ago

The number of indexed InfoRequestEvents that share a duplicate value 3 (request_collapse, aka info_request.url_name) isn't actually as big as I thought. There are at most 9 documents with the same request_collapse value. I guess that's still quite a bit to look through, but as shown in the previous comment, the diff between uncollapsed matches and collapsed matches seems consistent across offsets.

$ delve -t MInfoRequestEvent -V3 -1 $XAPIAN_DB | cut -d ':' -f 2 | sort | uniq -c | sort -r | head -n 20
      9 zil_lanes_and_speed_limits_2
      9 zenna_atkins
      9 youth_workers_in_the_council
      9 youth_service_commissioning_budg
      9 youth_offending_teams_9
      9 your_trusts_patient_transport_se_19
      9 your_funding_of_unlicensed_chari
      9 your_choice_of_photographs_for_m
      9 your_boards_handling_of_the_wilt
      9 young_workers
      9 yot_board_minutes_6
      9 yot_board_minutes_4
      9 yos_social_workers_employer_lond
      9 yellow_card_reports_for_vaccines
      9 x_v_middle_sussex_citizens_advic_2
      9 wsr_freehold
      9 wscc_clawback_attempts_of_indire
      9 written_responses_to_childcare_c
      9 wragge_co_andrea_hill_tendering
      9 works_affecting_my_family_life_e

Similar story with value 4 (request_title_collapse):

$ delve -t MInfoRequestEvent -V4 -1 $XAPIAN_DB | cut -d ':' -f 2 | sort | uniq -c | sort -r | head -n 20
      9 zenna_atkins
      9 youth_workers_in_the_council
      9 youth_service_commissioning_budg
      9 your_choice_of_photographs_for_m
      9 your_boards_handling_of_the_wilt
      9 young_workers
      9 yos_social_workers_employer_lond
      9 wychwood_festival_licensing
      9 wsr_freehold
      9 wscc_clawback_attempts_of_indire
      9 written_responses_to_childcare_c
      9 wragge_co_andrea_hill_tendering
      9 would_you_please_provide_informa
      9 world_cup_bid
      9 works_affecting_my_family_life_e
      9 work_programme_training_allowanc
      9 work_programme_subcontractors_ar
      9 work_programme_referrals_from_jo
      9 work_programme_memos
      9 working_with_police_investigatin
garethrees commented 9 years ago

This is definitely something to do with collapsing results. We don't collapse any other models, so it's a bit difficult to tell whether it's the indexing of InfoRequestEvent or collapsing in general.

Benchmark.bm(12) do |x|
  (1..20).each do |page|
    offset = (page - 1) * 25
    x.report("offset-#{offset}") {
      search = ActsAsXapian::Search.new([PublicBody], 'council',
          :offset => offset,
          :limit => 25,
          :sort_by_prefix => 'created_at',
          :sort_by_ascending => true
      )
    }
  end
end
                  user     system      total        real
offset-0      0.080000   0.010000   0.090000 (  0.084110)
offset-25     0.080000   0.000000   0.080000 (  0.070196)
offset-50     0.070000   0.010000   0.080000 (  0.081525)
offset-75     0.080000   0.000000   0.080000 (  0.081872)
offset-100    0.070000   0.000000   0.070000 (  0.075652)
offset-125    0.060000   0.010000   0.070000 (  0.072631)
offset-150    0.070000   0.000000   0.070000 (  0.071983)
offset-175    0.070000   0.010000   0.080000 (  0.071903)
offset-200    0.070000   0.000000   0.070000 (  0.071999)
offset-225    0.070000   0.000000   0.070000 (  0.072633)
offset-250    0.060000   0.000000   0.060000 (  0.068001)
offset-275    0.080000   0.000000   0.080000 (  0.070095)
offset-300    0.060000   0.000000   0.060000 (  0.068544)
offset-325    0.060000   0.000000   0.060000 (  0.064197)
offset-350    0.070000   0.010000   0.080000 (  0.071372)
offset-375    0.070000   0.000000   0.070000 (  0.071740)
offset-400    0.070000   0.000000   0.070000 (  0.074969)
offset-425    0.060000   0.010000   0.070000 (  0.073323)
offset-450    0.080000   0.000000   0.080000 (  0.074181)
offset-475    0.070000   0.010000   0.080000 (  0.072749)

screen shot 2015-06-09 at 22 16 02

garethrees commented 9 years ago

New development! It looks like sort_by_value_then_relevance might be causing the problem https://gist.github.com/garethrees/e040ce55a51a9106c249#file-simple_xapian_search-rb-L40

Well, it's the combination of collapse and sort that is bad: either on its own is fine, but both together exhibit the slowdown at high offsets.
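
A sketch of isolating the three cases through the options already used in this thread (timings depend on the index; this just shows the shape of the test):

require 'benchmark'

cases = {
  'sort only'     => { :sort_by_prefix => 'created_at', :sort_by_ascending => true },
  'collapse only' => { :collapse_by_prefix => 'request_collapse' },
  'both'          => { :sort_by_prefix => 'created_at', :sort_by_ascending => true,
                       :collapse_by_prefix => 'request_collapse' }
}

Benchmark.bm(15) do |x|
  cases.each do |label, opts|
    x.report(label) {
      ActsAsXapian::Search.new([InfoRequestEvent], 'information',
          { :offset => 500, :limit => 25 }.merge(opts))
    }
  end
end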

garethrees commented 9 years ago

Ticket raised in upstream Xapian http://trac.xapian.org/ticket/682

garethrees commented 9 years ago

RE: indexing InfoRequestEvent rather than InfoRequest https://github.com/mysociety/alaveteli/commit/15dc1f17ba6d2e08e38aff1a4293574247d62759

Not sure whether this could have been solved with:

class IncomingMessage < ActiveRecord::Base
-  belongs_to :info_request
+  belongs_to :info_request, :touch => true
end

First search indexing commit that I could find: https://github.com/mysociety/alaveteli/commit/8d7934d991e93097c981676630ac63a4465b96b5

garethrees commented 9 years ago

I don't think we can reasonably make this work without the fix in Xapian core itself.

We'd have to fetch all results (~585k for the word "information", when searching requests only) in order to sort them all before paginating, which would take literally hours.

garethrees commented 9 years ago

Ticket raised in upstream Xapian http://trac.xapian.org/ticket/682

The gist of this is that they want to drop in a heap library/template, but need to find a suitably licensed one. They've set the milestone as 1.3.x.

1.3.4 is likely to be out in August 2015

http://trac.xapian.org/wiki/RoadMap#DevelopmentReleases1

The gotcha for us is that 1.3.x is a development release (1.even_number.x is stable, 1.odd_number.x is development), so we need to consider whether we're willing to run an "unstable" release until 1.4.x comes out. They don't have a date for that yet, but they released 1.2.0 after 1.1.5, so hopefully we'll see the stable release before the end of 2015.

garethrees commented 9 years ago

Moving back to contender until we check progress on the upstream fix again.

stevenday commented 7 years ago

Just to update: Xapian is now at 1.4.3, but the bug is still unfixed https://lists.xapian.org/pipermail/xapian-discuss/2017-January/009491.html

equivalentideas commented 7 years ago

Useful suggestion from @handelaar downstream on Right To Know:

It strikes me that this fix was rather WDTK-xapian-index-size specific and perhaps MAX_RESULTS belongs in a config setting with default=500, instead of being hard-coded?

mysociety/alaveteli@a1655ba
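
Something along these lines, perhaps (a hypothetical sketch – the max_search_results key doesn't exist; the value is currently hard-coded in the controllers):

# Hypothetical: read the cap from per-site configuration rather than
# hard-coding it. The key name is invented for illustration, e.g. in
# config/general.yml:
#   MAX_SEARCH_RESULTS: 500

MAX_RESULTS = AlaveteliConfiguration.max_search_results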

garethrees commented 7 years ago

Just to make ourselves feel a little better here, GitHub limit the search results to 100 pages:

https://github.com/search?p=1&q=test&type=Repositories&utf8=%E2%9C%93 – 2,094,620 results; 100 pages (via https://github.com/emadehsan/thal#conclusion).

garethrees commented 6 years ago

News! Olly (one of the Xapian developers) contacted me yesterday saying that he thinks he's fixed the problem.

I can't replicate the issue on my dev database – only the prod DB – so I couldn't provide him with something to test against. He's going to ping me once the code gets merged so that we can test it.

as things are, it'd be in 1.5.0, but fixing it in 1.4.x is probably possible with a bit of work - not sure if this is actually a problem for you, or it was just something you noticed

I've made him aware of this issue and the problem it causes, so I'm hopeful we might see a 1.4.x release. I'm not sure of the ETA for either though.

garethrees commented 6 years ago

I'm sure we could create some sample data with a bit of effort, but I'd need to get all this back into my head.

I think we'd "just" need to create lots of info requests with lots of searchable associations containing the term we want to search for (e.g. comments, incoming messages, outgoing messages, all containing "Foo"), so that lots of indexed events belonging to a single request are found and need collapsing.

I've done something like this for sample pro data, so I'm sure we could use FactoryGirl to do the same in this case.
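
Something like this, roughly (factory names and attributes are guesses – adapt to whatever is actually defined under spec/factories):

# Rough sketch: lots of requests, each with several searchable events all
# matching the same term, so the search has plenty of documents to collapse.
50.times do
  request = FactoryGirl.create(:info_request)

  5.times do
    FactoryGirl.create(:comment, :info_request => request,
                       :body => 'Foo foo foo')
  end
end

# Then rebuild the search index, e.g.
#   bundle exec rake xapian:rebuild_index models="InfoRequestEvent"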

garethrees commented 6 years ago
[21:09:37]  <olly>  I've now merged the work that I think should address this: https://trac.xapian.org/ticket/682#comment:4
[21:10:18]  <olly>  if you can easily test with xapian git master, it'd be good to check if that actually solves this
[21:11:57]  <olly>  FWIW, i would expect it to slow down as you page on, but not in such a major way

Our next step here is to figure out how we can test this out.

I think we'd need to build a version of xapian-full against the commit referenced as the fix.

Then we'd need to install the new xapian-full – either via a pre-release gem, or by just cloning the repo.

If we can copy the live db, or operate on a backup, that would be best.

Finally, we'd need to run my script against the target Xapian DB, ensuring we're using the updated Xapian libs.
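
Roughly along these lines (a sketch only – the mirror URL, configure flags and prefix are assumptions, and hooking the result up to xapian-full/acts_as_xapian is a separate job):

# Sketch: build xapian git master plus the Ruby bindings into a local prefix.
$ git clone https://github.com/xapian/xapian.git
$ cd xapian
$ ./bootstrap
$ ./configure --prefix=$HOME/xapian-master --with-ruby
$ make && make install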

ghost commented 5 years ago

Feel like this should be a high priority bug to fix.

garethrees commented 4 years ago

Possibly interesting article on pagination performance https://blog.jooq.org/2016/08/10/why-most-programmers-get-pagination-wrong/ (SQL-based, not Xapian, but interesting here nonetheless).

garethrees commented 4 years ago

Might be able to do this now we have https://github.com/mysociety/alaveteli/pull/5720 merged.

RichardTaylor commented 3 years ago

+1 Someone looking to download some data from WhatDoTheyKnow to do some machine learning experiments came across this issue and found it annoying.

garethrees commented 3 years ago

We've recently updated the xapian package and ruby bindings (https://github.com/mysociety/alaveteli/pull/6118) so I thought I'd take a look at this again.

Running the original test again, it looks like the response time is still significantly slower with high offsets. It is better than the previous run, but we did move to a new machine in ~late 2017 since the original test in 2015.

Benchmark.bmbm do |x|
  x.report("offset-0") {
    result = ActsAsXapian::Search.new([InfoRequestEvent], 'information',
      :offset => 0,
      :limit => 25,
      :sort_by_prefix => 'created_at',
      :sort_by_ascending => true,
      :collapse_by_prefix => 'request_collapse'
    )
  }
  x.report("offset-500") {
    result = ActsAsXapian::Search.new([InfoRequestEvent], 'information',
      :offset => 500,
      :limit => 25,
      :sort_by_prefix => 'created_at',
      :sort_by_ascending => true,
      :collapse_by_prefix => 'request_collapse'
    )
  }
end

# Rehearsal ----------------------------------------------
# offset-0     6.068000   0.176000   6.244000 (  6.255512)
# offset-500  22.596000   0.140000  22.736000 ( 22.793988)
# ------------------------------------ total: 28.980000sec
# 
#                  user     system      total        real
# offset-0     6.196000   0.052000   6.248000 (  6.272437)
# offset-500  23.192000   0.068000  23.260000 ( 23.312819)

I also made a small tweak to the original benchmark script to enable/disable the problematic operator through the environment and ran it.
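
The toggle is roughly this shape (a sketch based on the gist line quoted earlier and the environment variable name below):

# Only apply the problematic sort when explicitly asked to via the
# environment, so the two runs can be compared.
if ENV['SORT_BY_VALUE_THEN_RELEVANCE']
  $enquire.sort_by_value_then_relevance!(0, sort_by_ascending)
end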

$ bundle exec rails runner /home/gareth/simple_xapian_search.rb
Rehearsal ----------------------------------------------
offset-0     0.488000   0.008000   0.496000 (  0.495934)
offset-500   0.480000   0.008000   0.488000 (  0.489646)
------------------------------------- total: 0.984000sec

                 user     system      total        real
offset-0     0.552000   0.000000   0.552000 (  0.554979)
offset-500   0.592000   0.000000   0.592000 (  0.596174)

$ SORT_BY_VALUE_THEN_RELEVANCE=1 bundle exec rails runner /home/gareth/simple_xapian_search.rb
Rehearsal ----------------------------------------------
offset-0     1.360000   0.064000   1.424000 (  1.426801)
offset-500   4.160000   0.160000   4.320000 (  4.330838)
------------------------------------- total: 5.744000sec

                 user     system      total        real
offset-0     1.308000   0.060000   1.368000 (  1.367158)
offset-500   4.188000   0.144000   4.332000 (  4.331909)

It looks like the previous performance was 5x slower, which is pretty close to today's test (4x slower).

I think the next step is to feed this back upstream (apologising for the huuuuugely lengthy feedback cycle) and see what they say.

RichardTaylor commented 2 years ago

I'm here as a WhatDoTheyKnow user has written to the team reporting this bug/issue.

Is the main issue here server resources when the site is being scraped by search engine crawlers and the like? If it is, could paging beyond the first 500 results be restored for logged-in users?

garethrees commented 2 years ago

It's a performance issue with Xapian, the underlying search engine.

RichardTaylor commented 2 years ago

It's a performance issue (https://github.com/mysociety/alaveteli/issues/2137#issuecomment-885670199) with Xapian, the underlying search engine.

Yes, but my question was: is the performance issue so bad that we can't make the feature available to logged-in users (who might be expected to use it rarely), even if we can't make it available to non-logged-in users such as search engine crawlers, who might stress the site/service by making lots of requests?

RichardTaylor commented 2 years ago

The user who prompted this discussion today reported:

We've got a workaround - restricting searches by month/quarter, which prevents generating too many results.

garethrees commented 2 years ago

is the performance issue so bad

Yeah, with every page deeper into the results the response time increases, until it times out at the web server level.

FOIMonkey commented 2 years ago

A user has raised this issue, having got the "Sorry, we couldn't find that page" page when trying to access page 120 of the search results.

FOIMonkey commented 1 year ago

+1 A user has written about being unable to access search results beyond page 20, when the system said there should be 167 pages of results.

garethrees commented 1 year ago

Google has a similar issue.

FozzieHi commented 1 year ago

Given that this has been happening in some form or another since 2015, would it be a good idea to include some form of information on the 404 page when trying to navigate to a page after 500 results, to reduce confusion and support tickets?

Something like

Due to technical limitations, we cannot currently display this page.

Either that, or artificially limit the page navigator at the bottom to stop at 500 results? Is there any benefit in saying an authority has 200 pages if the results are limited to 500, for example?

This is of course assuming this 500 results limitation is in Alaveteli and not configured per instance (I have checked on multiple different instances, and the behaviour remains the same).

I'm happy to work on a PR for this, if others agree. Although I have no experience with Ruby at all, so it might take me a while!

garethrees commented 1 year ago

Hey @FozzieHi, that's a very kind offer! ❤️

This is of course assuming this 500 results limitation is in Alaveteli and not configured per instance

Yeah, that's hard-coded, for example in https://github.com/mysociety/alaveteli/blob/0.43.0.0/app/controllers/request_controller.rb#L25

Either that, or artificially limit the page navigator at the bottom to stop at 500 results? Is there any benefit of saying an authority has 200 pages if the results are limited past 500 results, for example?

It's been a while since I've looked at the specifics of this, but I think this already happens when searching requests. For example, a search for "information" scoped to requests estimates about 2000000 results, but only shows 20 pages.

Screenshot 2023-07-12 at 09 31 48 · Screenshot 2023-07-12 at 09 31 57

It looks like that limiting of pages doesn't happen when you scope the search to authorities or users.

Screenshot 2023-07-12 at 09 33 35 · Screenshot 2023-07-12 at 09 33 40

In those latter cases, navigating past page 20 causes the 404.

would it be a good idea to include some form of information on the 404 page when trying to navigate to a page after 500 results

If it's easy, then I wouldn't be averse to that. I think we could tune the message to be more like:

Due to technical limitations we cannot currently display past page 20 of search results. Try refining the search terms to limit the results.

What might be easier – and prevent the majority of users even hitting that 404 – would be to replicate the pagination behaviour from requests to the authorities and users scoped search. I don't imagine many will be manipulating the query params to paginate, so just limiting the number of page links displayed will keep most people from hitting a page past the limit.
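
A rough sketch of that capping (variable names are hypothetical; the real controller code differs in the details):

# Never advertise more results to the paginator than we can actually serve,
# so the page links stop at the last reachable page.
shown_total = [xapian_search.matches_estimated, MAX_RESULTS].min

results = WillPaginate::Collection.create(page, per_page, shown_total) do |pager|
  pager.replace(xapian_search.results)
end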

Would be nice to include a message about this somewhere – either at the end of the results on page 20, or under the pagination nav if we're artificially limiting the pages?

Screenshot 2023-07-12 at 09 39 44