mysociety / alaveteli

Provide a Freedom of Information request system for your jurisdiction
https://alaveteli.org

Bug with standard email redaction #6400

Open RichardTaylor opened 3 years ago

RichardTaylor commented 3 years ago

Email addresses in request correspondence should be replaced with, e.g.:

[email address]

(The email addresses that requests are sent to are treated specially and identified as the public body's request address.)

At the moment on WhatDoTheyKnow.com, even on non /cy/ (Welsh) threads, we are in some circumstances seeing email addresses replaced with

[cyfeiriad ebost]

which is the Welsh for email address.

No link to help text is offered either.

eg.

https://www.whatdotheyknow.com/request/notification_of_spa_requests#incoming-1760631
https://www.whatdotheyknow.com/request/humberside_catalytic_converter_t#incoming-1627530
https://www.whatdotheyknow.com/request/data_collecting_storing_sharing_6#incoming-1742773

Looking at the examples, this appears to be linked to the provision of links in reference style (https://github.com/mysociety/alaveteli/issues/4578).

This issue also appears to be affecting attachments. For example, search Google for

"[cyfeiriad ebost] " site:whatdotheyknow.com -attach

and

"[cyfeiriad ebost] " site:whatdotheyknow.com

for more examples

I suspect this is not just a WhatDoTheyKnow theme issue. Do move this issue if it is.

Note: Google hit counts suggest "cyfeiriad e-bost" is more commonly used than "cyfeiriad ebost".

garethrees commented 3 years ago

we are in some circumstances seeing email addresses replaced with [cyfeiriad ebost]

Taking a quick look at one of the requests, I can see that the Welsh phrase has been cached in the database:

im.cached_main_body_text_folded
#=> "...snip... Visible links\n 1. mailto:[cyfeiriad ebost]\n 2. http://www.gov.uk/dwp\n\n\n"

I assume the page has been visited via the CY locale first, and as such the Welsh replacement phrase was used when generating the cache.
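To illustrate the suspected mechanism, here is a minimal, self-contained sketch of a redaction cache that is filled on first view and never keyed by locale. Every name in it (`REPLACEMENTS`, `EMAIL_PATTERN`, `CachedMessage`) is a hypothetical stand-in and not Alaveteli's actual code; it only shows how a first visit in CY could leave the Welsh phrase in the cache for later EN visitors.

```ruby
# Hypothetical stand-ins for illustration; not Alaveteli's real classes.
REPLACEMENTS = {
  en: '[email address]',
  cy: '[cyfeiriad ebost]'
}.freeze

# Deliberately simple email pattern, good enough for the example.
EMAIL_PATTERN = /[\w.+-]+@[\w-]+(?:\.[\w-]+)+/

class CachedMessage
  def initialize(raw_body)
    @raw_body = raw_body
    @cache = nil
  end

  # Redacts on the first call and stores the result. The locale of that
  # first call decides which phrase ends up in the cache.
  def body(locale)
    @cache ||= @raw_body.gsub(EMAIL_PATTERN, REPLACEMENTS.fetch(locale))
  end

  def clear_cache!
    @cache = nil
  end
end

msg = CachedMessage.new('Visible links 1. mailto:foi@example.org')
welsh_first = msg.body(:cy)   # first visit happens in the CY locale
english_later = msg.body(:en) # later EN visit is served from the cache
# english_later => "Visible links 1. mailto:[cyfeiriad ebost]"
```

If this is what is happening, either keying the cache by locale or generating it in a fixed locale would avoid the mix-up; which of those fits Alaveteli better is an open question.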

It seems this is a very rare edge case…

SELECT COUNT(*) FROM incoming_messages WHERE cached_main_body_text_folded LIKE '%mailto:[cyfeiriad ebost]%';
 count
-------
   131 -- out of 1821178 incoming_messages total!

…and initially started happening in 2014 (though I don't know if something changed in 2014 that caused this).

SELECT id,created_at FROM incoming_messages WHERE cached_main_body_text_folded LIKE '%mailto:[cyfeiriad ebost]%' ORDER BY created_at ASC LIMIT 10;
   id    |         created_at
---------+----------------------------
  484508 | 2014-02-19 14:16:51.818949
  516847 | 2014-05-15 06:52:46.575731
  622506 | 2015-02-26 10:56:57.008951
  860266 | 2016-08-30 11:10:43.623673
  888699 | 2016-10-31 10:51:58.185605
  895408 | 2016-11-15 13:50:55.315092
 1035276 | 2017-09-11 10:44:35.511878
 1035281 | 2017-09-11 10:48:21.632624
 1049717 | 2017-10-09 09:29:49.574985
 1049749 | 2017-10-09 09:45:22.540962

No link to help text is offered either

It's not clickable because, when we visit the request in EN, the function at https://github.com/mysociety/alaveteli/blob/0.39.1.1/app/models/incoming_message.rb#L620 matches against the EN phrase ("email address"), which is not in the cached content. The replacement is clickable when the request is viewed in Welsh because the function then matches against the CY phrase, which does exist in the content.
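The mismatch can be sketched in a few lines. This is an assumption-laden illustration, not the code at the link above: `PHRASES`, `link_redactions` and the help path are all hypothetical names chosen for the example.

```ruby
# Hypothetical illustration of phrase matching by locale; not the real method.
PHRASES = {
  en: '[email address]',
  cy: '[cyfeiriad ebost]'
}.freeze

# Wraps the current locale's redaction phrase in a help link.
# A String pattern passed to gsub is matched literally, so the square
# brackets need no escaping.
def link_redactions(text, locale, help_url)
  phrase = PHRASES.fetch(locale)
  text.gsub(phrase) { |match| %(<a href="#{help_url}">#{match}</a>) }
end

cached = 'Visible links 1. mailto:[cyfeiriad ebost]'

in_english = link_redactions(cached, :en, '/help/example') # no match, unchanged
in_welsh   = link_redactions(cached, :cy, '/help/example') # phrase gets linked
```

Viewed in EN, the text comes back untouched because "[email address]" never occurs in the cached body, which matches the missing help link seen on the site.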


We could regenerate the caches for each of these incoming messages. I tested this theory on one of the other requests: after clearing the cache (msg.clear_in_database_caches!) and re-visiting the page in EN, it now has [email address] cached.
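A bulk version of that fix could look roughly like the sketch below. It assumes only the two names that appear in this thread, `cached_main_body_text_folded` and `clear_in_database_caches!`; the helper `clear_welsh_caches` and the `FakeMessage` struct are hypothetical, the latter standing in for the real model so the example runs outside a Rails console.

```ruby
WELSH_PHRASE = 'mailto:[cyfeiriad ebost]'

# Clears the cache of every message whose cached text contains the Welsh
# phrase and returns how many were cleared. Accepts any enumerable of
# objects that respond to the two methods named in this thread.
def clear_welsh_caches(messages)
  affected = messages.select do |msg|
    msg.cached_main_body_text_folded.to_s.include?(WELSH_PHRASE)
  end
  affected.each(&:clear_in_database_caches!)
  affected.size
end

# Stand-in for the real model, for illustration only.
FakeMessage = Struct.new(:cached_main_body_text_folded) do
  def clear_in_database_caches!
    self.cached_main_body_text_folded = nil
  end
end

messages = [
  FakeMessage.new("Visible links\n 1. mailto:[cyfeiriad ebost]\n"),
  FakeMessage.new("Visible links\n 1. mailto:[email address]\n")
]
cleared = clear_welsh_caches(messages)
```

On the live database the list of affected messages would presumably come from the same LIKE query used above rather than from an in-memory array, and each cache would then be rebuilt on the next page view.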