mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org
Other
389 stars 196 forks source link

Uncaught duplicate requests #4160

Closed kingqueen3065 closed 11 months ago

kingqueen3065 commented 7 years ago

This request: https://www.whatdotheyknow.com/request/use_of_public_cloud_4 and this one: https://www.whatdotheyknow.com/request/use_of_public_cloud_3 appear to be identical. They are from the same user, to the same authority, with the same subject line and with (so far as I can determine including via a file compare utility) exactly the same content (sent 14 minutes apart.) I think that technical measures to prevent such may have failed.

garethrees commented 7 years ago

Might be to do with differing newlines which get stripped – we might do the validation before this point?

lizconlan commented 7 years ago

The InfoRequest.titles are identical, and the OutgoingMessage.body appear identical.

However, the SQL query is not matching the body text so InfoRequest.find_existing is returning nil (and allowing the duplicate)

Postgres is whitespace sensitive and is not matching the body text content

lizconlan commented 7 years ago

Possible (horrible) solution:

  def self.find_existing(title, public_body_id, body)
    conditions = { :title => title,
                   :public_body_id => public_body_id }

    matching = InfoRequest.
                 where(conditions).
                    first

    if matching
      ogm_ids = matching.outgoing_messages.pluck(:id)
      matching_ids = matching.outgoing_messages.
        where(
          "regexp_replace(outgoing_messages.body,
          '[[:space:]]', '', 'g') = regexp_replace(?, '[[:space:]]', '', 'g')",
          body).pluck(:id)
      !(ogm_ids & matching_ids).empty?
    else
      nil
    end
lizconlan commented 7 years ago

no, wait - it's treating e.g. "\r" as the literal value \r (a backslash followed by an r)

lizconlan commented 7 years ago

Might be to do with differing newlines which get stripped

They only get stripped if they're leading or trailing space

garethrees commented 7 years ago

So AIUI, the text submitted through params contains windows newlines, but we then save as unix newlines so the two don't match?

lizconlan commented 7 years ago

So AIUI, the text submitted through params contains windows newlines, but we then save as unix newlines so the two don't match?

We're saving it as submitted - the request mentioned returns this (excerpt) "Dear Department for Business, Energy and Industrial Strategy,\r\n\ ..."

I think what's happening is this... https://stackoverflow.com/a/36029083

"The backslash has no special meaning in SQL, so '\n' is a backslash followed by the character n"

In other words, Postgres does not see e.g. \n as a newline character but just as 2 chars of text so "\n" won't match unless we escape it.

lizconlan commented 7 years ago

This might be specific to how the WDTK database is configured as this is the behaviour we're seeing:

message = OutgoingMessage.find(676453)
OutgoingMessage.where(body: message.body)
=> #<ActiveRecord::Relation []>

Testing the returned message body with .include?('\n') returns false which suggests that we're storing literal/non-escaped text and the comparison with "\n" is failing.

garethrees commented 7 years ago

Okay cool – Its not worth spending a load of time on this sprint as mentioned in sprint planning, so lets circle back on this if we have time at the end of the sprint.

RichardTaylor commented 6 years ago

All,

I recall three user-support cases this week which, as an aside to the main issue being drawn to our attention, involved duplicate requests which weren't caught by the system;

There were six duplicate requests made in June 2018 at https://www.whatdotheyknow.com/admin/users/160072

I can't quickly find the other examples. The fact they've come to our attention this week is almost certainly a coincidence.

There is a reputational risk if we allow people to use the system to send duplicate messages.

RichardTaylor commented 5 years ago

If this is fixed but not deployed this is probably a redundant comment but we've had our attention drawn to the following pair of duplicates made on 2 November 2018

https://www.whatdotheyknow.com/request/joint_response_unit https://www.whatdotheyknow.com/request/joint_response_unit_2

garethrees commented 5 years ago

This would have been deployed a while ago, so reopening. The recent examples may be a slightly different variation on the original problem.

RichardTaylor commented 5 years ago

We still get support requests relating to duplicate requests. 20 November 2018 15:51:48 - https://www.whatdotheyknow.com/admin/requests/533931 20 November 2018 15:52:44 https://www.whatdotheyknow.com/admin/requests/533930

The user in question in this case thought a bug in the system had somehow caused, rather than merely failed to prevent, this duplicate request.

RichardTaylor commented 5 years ago

This bug is still causing problems for users, administrators, and public bodies.

The WDTK user https://www.whatdotheyknow.com/admin/users/172755 sent what looks like the same request at both 18 December 2018 23:59:40 and 18 December 2018 23:59:12

The public body responded to only one, and the user - apparantly unaware of the duplicate - emailed them directly in a thread the WDTK support team were copied into - chasing up a lack of a response - given there was no response on one of the threads. Both the user and the public body was confused.

RichardTaylor commented 5 years ago

Another case prompting support mail. User https://www.whatdotheyknow.com/admin/users/156812 made what looks like the same request at: 16 April 2018 15:35:44 and 16 April 2018 15:35:16 and 16 April 2018 15:35:28

I suspect even exact duplicates might not be, or might not have been, properly matched but in cases of non-exact matches perhaps a rate limit on repeated requests to the same body might help address this issue and have wider benefits too?

gbp commented 5 years ago

I've been using the following query to detect duplicate requests, it returns all the requests listed in this thread. In total it returns 4820 rows.

SELECT DISTINCT ON (ir1.id) ir1.id, ir1.title, ir1.user_id, ir1.public_body_id, ir1.created_at

FROM outgoing_messages AS og1

INNER JOIN outgoing_messages AS og2 ON regexp_replace(og1.body, '[[:space:]]', '', 'g') = regexp_replace(og2.body, '[[:space:]]', '', 'g')
INNER JOIN info_requests AS ir1 ON ir1.id = og1.info_request_id
INNER JOIN info_requests AS ir2 ON ir2.id = og2.info_request_id

WHERE og1.message_type = 'initial_request' AND og2.message_type = 'initial_request'
  AND og1.status = 'sent' AND og2.status = 'sent'
  AND ir1.public_body_id = ir2.public_body_id AND ir1.title = ir2.title AND ir1.user_id = ir2.user_id
  AND ir1.id != ir2.id

ORDER BY ir1.id ASC, created_at ASC;

There doesn't seem an obvious start date to this issue although a possible increase around Aug 2017 screen shot 2019-02-11 at 16 08 40

RichardTaylor commented 5 years ago

At https://www.whatdotheyknow.com/admin/users/175817 we have duplicates with created date times of: 10 February 2019 16:30:14 9 February 2019 21:01:20 9 February 2019 20:03:04

This may be a different issue to the duplicates created within a few seconds of each other.

The user has stated:

I think the duplicates were caused when I received your initial acknowledgment, and I was not certain that I had actually entered my post on your site, and I clicked on the acknowledgment three times

gbp commented 5 years ago

The user has stated:

I think the duplicates were caused when I received your initial acknowledgment, and I was not certain that I had actually entered my post on your site, and I clicked on the acknowledgment three times

I looked at the logs for these duplicate requests earlier today and this certainly is what happened. I am working on a way of expiring the confirmations links emails to prevent this.

RichardTaylor commented 5 years ago

Another pair of duplicate requests https://www.whatdotheyknow.com/admin/requests/555412 https://www.whatdotheyknow.com/admin/requests/555382

RichardTaylor commented 4 years ago

WhatDoTheyKnow requests 617757, 617752, 617775 and 617751 appear to be a set of uncaught duplicates.

gbp commented 4 years ago

WhatDoTheyKnow requests 617757, 617752, 617775 and 617751 appear to be a set of uncaught duplicates.

Have remove the Post Redirect responsible for creating these dups.

RichardTaylor commented 4 years ago

A WhatDoTheyKnow user has just written to apologise for 624387, 624396, 624403. I've apologised in return for the system allowing the duplicates :-)

MattK1234 commented 4 years ago

A user at https://www.whatdotheyknow.com/admin/users/198867 emailed in to the support team stating every time he logged in a new request was sent. This led to about 5 requests before it was reported.

Upon investigation, it seems the user was clicking the post redirect for email confirmation each time they wanted to access WDTK, and every time it was firing off a new request.

I was a bit slow in creating this entry, so we have already deleted the duplicate requests. Apologies if this hinders investigation into what has occurred.

I come across various duplicates when browsing the site, I'll try to remember to add them to this log when I see them.

There's a bit of a reputational risk here if we're bombarding authorities with duplicates, plus we're confusing users and potentially increasing the number of unanswered requests on the site (assuming the authority reply to one email and dump the rest).

MattK1234 commented 4 years ago

The requests at 667916 and 667917 look identical to me, and were submitted within one second of each other.

Also, requests 667901, and 667902 are identical, submitted within 2 minutes of one another.

MattK1234 commented 4 years ago

Noticed whilst browsing the site just now requests 678826 and 678827 look identical, submitted within 10 seconds of one another.

MattK1234 commented 4 years ago

The following are all duplicate requests on WDTK of request 682411:

RichardTaylor commented 3 years ago

Requests 700699, 700702, 700701, 700700, and 696427 on WhatDoTheyKnow appear to be duplicates of each other.

RichardTaylor commented 3 years ago

Requests 703435 and 703433 on WhatDoTheyKnow are duplicates.

Duplicate requests confuse users Duplicate requests often result in manual work by administrators Duplicate requests risk harming our and our requester's reputation Duplicate requests may be rejected (though perhaps the first one should get a response)

RichardTaylor commented 3 years ago

703436, 703473, 703600, 703604, 703605 on WhatDoTheyKnow are duplicates

mdeuk commented 3 years ago

Requests 707565 and 707566 on WDTK are duplicates - the cause, in this instance, is the administrator 'clicking' the confirmation link twice (computer glitch).

Having request confirmation links expire once used (as per #5949) would be a useful step in preventing this issue.

RichardTaylor commented 3 years ago

721611 and 719433 are duplicates, suspect the non-expiring confirmation links (#5949) may be the underlying issue here.

RichardTaylor commented 3 years ago

Set of uncaught duplicates on WDTK user account 221720. Without this issue there would presumably only have been one request, and volunteer admin time wouldn't have needed to be spent dealing with all of them.

RichardTaylor commented 3 years ago

Requests 722651 and 722652 on WDTK appear to be duplicates which should have been caught.

RichardTaylor commented 3 years ago

Four duplicates on WDTK user account 218176 came to our attention today - causing complexity and admin work.

RichardTaylor commented 3 years ago

+1 this bug is causing a fair amount of admin work

RichardTaylor commented 3 years ago

+1 for this being a bug which causes problems for users, admins and public bodies.

RichardTaylor commented 3 years ago

+1 What look like duplicates at WDTK requests 739179, 739178, 739177.

This came to attention as a response ended up in the holding pen - I don't know if the holding pen case related to the requests being duplicates.

RichardTaylor commented 3 years ago

+1

"Requests not appearing immediately on request list page" #1168 causes users to try and make their request again. This combined with the issue of "Uncaught duplicate requests" #4160 often results in a mess of duplicate requests which administrators are called on to deal with via eg. making some of the threads requester_only and marking them withdrawn to keep things tidy.

Sometimes there are subsequent issues if correspondence continues on a requester_only thread which then needs to be made public.

As noted above this issue causes problems for users, public bodies and site administrators.

RichardTaylor commented 3 years ago

Sometimes duplicate requests have the same body but different subject/titles.

I imagine these occur when users think they have failed to make a request (due to #1168), and submit their request again, perhaps copy and pasting the text but manually writing a new subject/title.

gbp commented 3 years ago

More duplicates: 773467 773474 773475 773498

MattK1234 commented 3 years ago

Request 781105 is an exact duplicate of 781104 (WDTK).

Submitted within 2 minutes of one another.

RichardTaylor commented 3 years ago

This is still causing problems for users, public bodies and site administrators.

On WhatDoTheyKnow today we've had the following reported as duplicates:

1791307 1791316 1791314 1791313 1791312 1791311 1791308

gbp commented 3 years ago

This query to detect users with possible duplicate requests

SELECT DISTINCT u.id AS user_id

FROM users AS u
INNER JOIN info_requests r1 ON r1.user_id = u.id
INNER JOIN info_requests r2 ON r2.user_id = u.id
INNER JOIN post_redirects p ON p.user_id = u.id

WHERE

r1.title = r2.title AND
r1.info_request_batch_id IS NULL AND
r2.info_request_batch_id IS NULL AND
r1.id != r2.id AND r1.public_body_id = r2.public_body_id AND
p.uri = '/new';

Is returning 179 rows

gbp commented 3 years ago

It seems a large number of duplicates are due leading/trailing whitespace in the request title stored in the post redirect. This doesn't match what is stored in the database due to it being stripped.

gbp commented 3 years ago

After further testing it seems like whitespace in the title is the root cause of all the duplicates I can see.

This query returns 102 users (with an active post redirect records) which have accidentally duplicate requests (where the public body, title and body all match).

SELECT DISTINCT
u.id AS user_id,
r1.title AS request_title

FROM users AS u

INNER JOIN info_requests r1 ON r1.user_id = u.id
INNER JOIN outgoing_messages m1 ON m1.info_request_id = r1.id AND m1.message_type = 'initial_request'
INNER JOIN info_requests r2 ON r2.user_id = u.id
INNER JOIN outgoing_messages m2 ON m2.info_request_id = r2.id AND m2.message_type = 'initial_request'
INNER JOIN post_redirects p ON p.user_id = u.id

WHERE
BTRIM(SUBSTRING(p.post_params_yaml FROM 'title: ([^\n]*).*'), E'\'') LIKE '%' || r1.title || '%' AND
r1.title = r2.title AND
m1.body = m2.body AND
r1.info_request_batch_id IS NULL AND
r2.info_request_batch_id IS NULL AND
r1.id < r2.id AND
r1.public_body_id = r2.public_body_id AND
p.uri = '/new';
gbp commented 3 years ago

No doubt there will be still reports of duplicates created before today (I've just deployed), but there shouldn't be any more created.

RichardTaylor commented 2 years ago

A pair of duplicate requests pre-dating this fix were drawn to the attention of WDTK admins today:

https://www.whatdotheyknow.com/request/bounty_72 https://www.whatdotheyknow.com/request/bounty_48

Does this fix address duplicates by batch requesters?

garethrees commented 2 years ago

Does this fix address duplicates by batch requesters?

I think this is a separate issue https://github.com/mysociety/alaveteli/issues/6696

RichardTaylor commented 2 years ago

WhatDoTheyKnow requests 832411, 832399, 832374 appear to be duplicates. Could unusual characters in the subjects perhaps have prevented them from being caught?

RichardTaylor commented 2 years ago

WhatDoTheyKnow requests 836205, 836200, 836197 and 836190 created on 23 February 2022 in quick succession appear to be duplicates.