Closed kingqueen3065 closed 11 months ago
Might be to do with differing newlines which get stripped – we might do the validation before this point?
The InfoRequest.title
s are identical, and the OutgoingMessage.body
appear identical.
However, the SQL query is not matching the body text so InfoRequest.find_existing
is returning nil
(and allowing the duplicate)
Postgres is whitespace sensitive and is not matching the body text content
Possible (horrible) solution:
def self.find_existing(title, public_body_id, body)
conditions = { :title => title,
:public_body_id => public_body_id }
matching = InfoRequest.
where(conditions).
first
if matching
ogm_ids = matching.outgoing_messages.pluck(:id)
matching_ids = matching.outgoing_messages.
where(
"regexp_replace(outgoing_messages.body,
'[[:space:]]', '', 'g') = regexp_replace(?, '[[:space:]]', '', 'g')",
body).pluck(:id)
!(ogm_ids & matching_ids).empty?
else
nil
end
no, wait - it's treating e.g. "\r" as the literal value \r
(a backslash followed by an r)
Might be to do with differing newlines which get stripped
They only get stripped if they're leading or trailing space
So AIUI, the text submitted through params contains windows newlines, but we then save as unix newlines so the two don't match?
So AIUI, the text submitted through params contains windows newlines, but we then save as unix newlines so the two don't match?
We're saving it as submitted - the request mentioned returns this (excerpt) "Dear Department for Business, Energy and Industrial Strategy,\r\n\ ..."
I think what's happening is this... https://stackoverflow.com/a/36029083
"The backslash has no special meaning in SQL, so '\n' is a backslash followed by the character n"
In other words, Postgres does not see e.g. \n
as a newline character but just as 2 chars of text so "\n" won't match unless we escape it.
This might be specific to how the WDTK database is configured as this is the behaviour we're seeing:
message = OutgoingMessage.find(676453)
OutgoingMessage.where(body: message.body)
=> #<ActiveRecord::Relation []>
Testing the returned message body with .include?('\n')
returns false which suggests that we're storing literal/non-escaped text and the comparison with "\n"
is failing.
Okay cool – Its not worth spending a load of time on this sprint as mentioned in sprint planning, so lets circle back on this if we have time at the end of the sprint.
All,
I recall three user-support cases this week which, as an aside to the main issue being drawn to our attention, involved duplicate requests which weren't caught by the system;
There were six duplicate requests made in June 2018 at https://www.whatdotheyknow.com/admin/users/160072
I can't quickly find the other examples. The fact they've come to our attention this week is almost certainly a coincidence.
There is a reputational risk if we allow people to use the system to send duplicate messages.
If this is fixed but not deployed this is probably a redundant comment but we've had our attention drawn to the following pair of duplicates made on 2 November 2018
https://www.whatdotheyknow.com/request/joint_response_unit https://www.whatdotheyknow.com/request/joint_response_unit_2
This would have been deployed a while ago, so reopening. The recent examples may be a slightly different variation on the original problem.
We still get support requests relating to duplicate requests. 20 November 2018 15:51:48 - https://www.whatdotheyknow.com/admin/requests/533931 20 November 2018 15:52:44 https://www.whatdotheyknow.com/admin/requests/533930
The user in question in this case thought a bug in the system had somehow caused, rather than merely failed to prevent, this duplicate request.
This bug is still causing problems for users, administrators, and public bodies.
The WDTK user https://www.whatdotheyknow.com/admin/users/172755 sent what looks like the same request at both 18 December 2018 23:59:40 and 18 December 2018 23:59:12
The public body responded to only one, and the user - apparantly unaware of the duplicate - emailed them directly in a thread the WDTK support team were copied into - chasing up a lack of a response - given there was no response on one of the threads. Both the user and the public body was confused.
Another case prompting support mail. User https://www.whatdotheyknow.com/admin/users/156812 made what looks like the same request at: 16 April 2018 15:35:44 and 16 April 2018 15:35:16 and 16 April 2018 15:35:28
I suspect even exact duplicates might not be, or might not have been, properly matched but in cases of non-exact matches perhaps a rate limit on repeated requests to the same body might help address this issue and have wider benefits too?
I've been using the following query to detect duplicate requests, it returns all the requests listed in this thread. In total it returns 4820 rows.
SELECT DISTINCT ON (ir1.id) ir1.id, ir1.title, ir1.user_id, ir1.public_body_id, ir1.created_at
FROM outgoing_messages AS og1
INNER JOIN outgoing_messages AS og2 ON regexp_replace(og1.body, '[[:space:]]', '', 'g') = regexp_replace(og2.body, '[[:space:]]', '', 'g')
INNER JOIN info_requests AS ir1 ON ir1.id = og1.info_request_id
INNER JOIN info_requests AS ir2 ON ir2.id = og2.info_request_id
WHERE og1.message_type = 'initial_request' AND og2.message_type = 'initial_request'
AND og1.status = 'sent' AND og2.status = 'sent'
AND ir1.public_body_id = ir2.public_body_id AND ir1.title = ir2.title AND ir1.user_id = ir2.user_id
AND ir1.id != ir2.id
ORDER BY ir1.id ASC, created_at ASC;
There doesn't seem an obvious start date to this issue although a possible increase around Aug 2017
At https://www.whatdotheyknow.com/admin/users/175817 we have duplicates with created date times of: 10 February 2019 16:30:14 9 February 2019 21:01:20 9 February 2019 20:03:04
This may be a different issue to the duplicates created within a few seconds of each other.
The user has stated:
I think the duplicates were caused when I received your initial acknowledgment, and I was not certain that I had actually entered my post on your site, and I clicked on the acknowledgment three times
The user has stated:
I think the duplicates were caused when I received your initial acknowledgment, and I was not certain that I had actually entered my post on your site, and I clicked on the acknowledgment three times
I looked at the logs for these duplicate requests earlier today and this certainly is what happened. I am working on a way of expiring the confirmations links emails to prevent this.
Another pair of duplicate requests https://www.whatdotheyknow.com/admin/requests/555412 https://www.whatdotheyknow.com/admin/requests/555382
WhatDoTheyKnow requests 617757, 617752, 617775 and 617751 appear to be a set of uncaught duplicates.
WhatDoTheyKnow requests 617757, 617752, 617775 and 617751 appear to be a set of uncaught duplicates.
Have remove the Post Redirect responsible for creating these dups.
A WhatDoTheyKnow user has just written to apologise for 624387, 624396, 624403. I've apologised in return for the system allowing the duplicates :-)
A user at https://www.whatdotheyknow.com/admin/users/198867 emailed in to the support team stating every time he logged in a new request was sent. This led to about 5 requests before it was reported.
Upon investigation, it seems the user was clicking the post redirect for email confirmation each time they wanted to access WDTK, and every time it was firing off a new request.
I was a bit slow in creating this entry, so we have already deleted the duplicate requests. Apologies if this hinders investigation into what has occurred.
I come across various duplicates when browsing the site, I'll try to remember to add them to this log when I see them.
There's a bit of a reputational risk here if we're bombarding authorities with duplicates, plus we're confusing users and potentially increasing the number of unanswered requests on the site (assuming the authority reply to one email and dump the rest).
The requests at 667916 and 667917 look identical to me, and were submitted within one second of each other.
Also, requests 667901, and 667902 are identical, submitted within 2 minutes of one another.
Noticed whilst browsing the site just now requests 678826 and 678827 look identical, submitted within 10 seconds of one another.
The following are all duplicate requests on WDTK of request 682411:
Requests 700699, 700702, 700701, 700700, and 696427 on WhatDoTheyKnow appear to be duplicates of each other.
Requests 703435 and 703433 on WhatDoTheyKnow are duplicates.
Duplicate requests confuse users Duplicate requests often result in manual work by administrators Duplicate requests risk harming our and our requester's reputation Duplicate requests may be rejected (though perhaps the first one should get a response)
703436, 703473, 703600, 703604, 703605 on WhatDoTheyKnow are duplicates
Requests 707565 and 707566 on WDTK are duplicates - the cause, in this instance, is the administrator 'clicking' the confirmation link twice (computer glitch).
Having request confirmation links expire once used (as per #5949) would be a useful step in preventing this issue.
721611 and 719433 are duplicates, suspect the non-expiring confirmation links (#5949) may be the underlying issue here.
Set of uncaught duplicates on WDTK user account 221720. Without this issue there would presumably only have been one request, and volunteer admin time wouldn't have needed to be spent dealing with all of them.
Requests 722651 and 722652 on WDTK appear to be duplicates which should have been caught.
Four duplicates on WDTK user account 218176 came to our attention today - causing complexity and admin work.
+1 this bug is causing a fair amount of admin work
+1 for this being a bug which causes problems for users, admins and public bodies.
+1 What look like duplicates at WDTK requests 739179, 739178, 739177.
This came to attention as a response ended up in the holding pen - I don't know if the holding pen case related to the requests being duplicates.
+1
"Requests not appearing immediately on request list page" #1168 causes users to try and make their request again. This combined with the issue of "Uncaught duplicate requests" #4160 often results in a mess of duplicate requests which administrators are called on to deal with via eg. making some of the threads requester_only and marking them withdrawn to keep things tidy.
Sometimes there are subsequent issues if correspondence continues on a requester_only thread which then needs to be made public.
As noted above this issue causes problems for users, public bodies and site administrators.
Sometimes duplicate requests have the same body but different subject/titles.
I imagine these occur when users think they have failed to make a request (due to #1168), and submit their request again, perhaps copy and pasting the text but manually writing a new subject/title.
More duplicates: 773467 773474 773475 773498
Request 781105 is an exact duplicate of 781104 (WDTK).
Submitted within 2 minutes of one another.
This is still causing problems for users, public bodies and site administrators.
On WhatDoTheyKnow today we've had the following reported as duplicates:
1791307 1791316 1791314 1791313 1791312 1791311 1791308
This query to detect users with possible duplicate requests
SELECT DISTINCT u.id AS user_id
FROM users AS u
INNER JOIN info_requests r1 ON r1.user_id = u.id
INNER JOIN info_requests r2 ON r2.user_id = u.id
INNER JOIN post_redirects p ON p.user_id = u.id
WHERE
r1.title = r2.title AND
r1.info_request_batch_id IS NULL AND
r2.info_request_batch_id IS NULL AND
r1.id != r2.id AND r1.public_body_id = r2.public_body_id AND
p.uri = '/new';
Is returning 179 rows
It seems a large number of duplicates are due leading/trailing whitespace in the request title stored in the post redirect. This doesn't match what is stored in the database due to it being stripped.
After further testing it seems like whitespace in the title is the root cause of all the duplicates I can see.
This query returns 102 users (with an active post redirect records) which have accidentally duplicate requests (where the public body, title and body all match).
SELECT DISTINCT
u.id AS user_id,
r1.title AS request_title
FROM users AS u
INNER JOIN info_requests r1 ON r1.user_id = u.id
INNER JOIN outgoing_messages m1 ON m1.info_request_id = r1.id AND m1.message_type = 'initial_request'
INNER JOIN info_requests r2 ON r2.user_id = u.id
INNER JOIN outgoing_messages m2 ON m2.info_request_id = r2.id AND m2.message_type = 'initial_request'
INNER JOIN post_redirects p ON p.user_id = u.id
WHERE
BTRIM(SUBSTRING(p.post_params_yaml FROM 'title: ([^\n]*).*'), E'\'') LIKE '%' || r1.title || '%' AND
r1.title = r2.title AND
m1.body = m2.body AND
r1.info_request_batch_id IS NULL AND
r2.info_request_batch_id IS NULL AND
r1.id < r2.id AND
r1.public_body_id = r2.public_body_id AND
p.uri = '/new';
No doubt there will be still reports of duplicates created before today (I've just deployed), but there shouldn't be any more created.
A pair of duplicate requests pre-dating this fix were drawn to the attention of WDTK admins today:
https://www.whatdotheyknow.com/request/bounty_72 https://www.whatdotheyknow.com/request/bounty_48
Does this fix address duplicates by batch requesters?
Does this fix address duplicates by batch requesters?
I think this is a separate issue https://github.com/mysociety/alaveteli/issues/6696
WhatDoTheyKnow requests 832411, 832399, 832374 appear to be duplicates. Could unusual characters in the subjects perhaps have prevented them from being caught?
WhatDoTheyKnow requests 836205, 836200, 836197 and 836190 created on 23 February 2022 in quick succession appear to be duplicates.
This request: https://www.whatdotheyknow.com/request/use_of_public_cloud_4 and this one: https://www.whatdotheyknow.com/request/use_of_public_cloud_3 appear to be identical. They are from the same user, to the same authority, with the same subject line and with (so far as I can determine including via a file compare utility) exactly the same content (sent 14 minutes apart.) I think that technical measures to prevent such may have failed.