umbraco / Umbraco.Forms.Issues

Public issue tracker for Umbraco Forms

Blacklist email domains to prevent spam #1142

Open bjarnef opened 8 months ago

bjarnef commented 8 months ago

I have previously suggested a configuration option to allow blacklisting email domains to prevent spam: https://github.com/umbraco/Umbraco.Forms.Issues/issues/169

Even with reCAPTCHA v3 it seems possible for bots to bypass it, and sometimes a lot of spam entries are created.

For example, in a fairly new Umbraco project with reCAPTCHA v3:

[image]

For example, it would help a lot to be able to blacklist @raiz-pr.com.

I have also seen domains like @motorza.ru.

Maybe Forms could even add a dashboard that detects what looks like spam, e.g. more than 50 form entries from the same email address, with an option to add it to the blacklist.

On the other hand, there could also be a whitelist for common email domains like @gmail.com, @outlook.com, etc.

I know spam also comes via these (fictitious) emails, but I think it could at least help a bit, especially if not using reCAPTCHA or a honeypot.
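As a language-neutral illustration of the domain check suggested above, here is a minimal Python sketch. It is not Umbraco Forms code: the function names and the `BLOCKED` set are hypothetical, and it assumes blocklist entries are either full domains (`raiz-pr.com`) or bare suffixes (`ru`).

```python
def email_domain(address: str) -> str:
    """Return the lower-cased domain part of an address, or '' if there is no '@'."""
    _, sep, domain = address.strip().rpartition("@")
    return domain.lower() if sep else ""


def is_blocked(address: str, blocked: set[str]) -> bool:
    """True if the address's domain equals, or is a subdomain of, a blocked entry."""
    domain = email_domain(address)
    if not domain:
        return False
    labels = domain.split(".")
    # For "mail.motorza.ru" this checks "mail.motorza.ru", "motorza.ru", then "ru".
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))


# Hypothetical configuration, using the domains mentioned in this issue.
BLOCKED = {"raiz-pr.com", "motorza.ru"}
```

Matching on label boundaries rather than with a plain substring test means `notraiz-pr.com` would not be caught by the `raiz-pr.com` entry. Adding `"ru"` to the set would block the whole top-level domain.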

bjarnef commented 6 months ago

@AndyButland is this something you are considering implementing in Forms? E.g. with reCAPTCHA v3 we still see many spam entries through forms.

E.g. some smaller Danish companies know that they never expect to receive mail from .ru email domains, which are some of the typical ones used for spamming forms.

AndyButland commented 6 months ago

I think this will probably need to be custom code @bjarnef - at least for now. For one, Forms doesn't know which fields are the ones to check (i.e. your field alias is likely "email", but we can't know that for sure). There's a FormValidateNotification notification you could hook into, or if you just wanted to silently not record it, there's also a RecordCreatingNotification you could cancel. I think that should give you the hooks you need.

bjarnef commented 6 months ago

@AndyButland thanks. Is there anything specific going on between FormValidateNotification and RecordCreatingNotification that I should be aware of? I guess workflows are executed after both, as they execute based on data from the record.

c9mb commented 6 months ago

@bjarnef you'd definitely need to know not only your form structure, but also your audience. I also get a fair bit of leakage from reCAPTCHA3 but most of it comes from fake @gmail.com type addresses - not easy if the bots are managing to fool Google and using 'plausible' addresses. If I was to dig into the headers, I could probably spot suspect IP addresses, but that's a bit full-on. Interestingly, I used a big-name site the other day to get support for a well known commercial software product, that targets an area for SMB - and they refused to accept form submissions from an @gmail.com account... which I think a bit extreme.

bjarnef commented 6 months ago

@c9mb yes, we can of course not blacklist @gmail.com, @outlook.com, @hotmail.com, etc., and a whitelist would need to contain a lot of potential business email domains.

However, sometimes there are patterns, with the same bots and crawlers (and perhaps humans as well) spamming a form, e.g. from @motorza.ru or .ru in general, which a small local business in Denmark, for instance, would never expect to get mail from. So it depends on the customer, but often there are patterns. Furthermore, there is often spam from (fictitious) @gmail.com accounts as well: sometimes random addresses, other times the same address submits several entries, which humans typically don't do, at least not on the same day or within a few hours.
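The "same address submits several entries within a few hours" pattern could be detected along these lines. This is only a Python sketch under assumptions: entries are taken to be `(email, submitted_at)` tuples, and the threshold and window defaults are illustrative, not values anyone in this thread proposed.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def frequent_senders(entries, max_entries=3, window=timedelta(hours=6)):
    """Return the addresses with more than max_entries submissions inside any window.

    entries: iterable of (email, submitted_at) tuples (assumed shape).
    """
    by_sender = defaultdict(list)
    for email, submitted_at in entries:
        by_sender[email.strip().lower()].append(submitted_at)

    flagged = set()
    for email, times in by_sender.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Advance the left edge until the span fits inside the window.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 > max_entries:
                flagged.add(email)
                break
    return flagged
```

A sliding window is used instead of a per-day count so that a burst straddling midnight is still caught.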

If the form contains a message field, it often contains several links as well when submitted by bots.
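A link count on the message field is cheap to compute. The Python sketch below uses a deliberately loose pattern and a hypothetical limit of two links; both are assumptions to tune per form.

```python
import re

# Loose match for http(s) URLs and bare "www." links (assumed good enough for a heuristic).
LINK_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def count_links(message: str) -> int:
    """Count link-like tokens in a free-text message."""
    return len(LINK_PATTERN.findall(message))


def looks_like_link_spam(message: str, max_links: int = 2) -> bool:
    """True if the message contains more links than the allowed maximum."""
    return count_links(message) > max_links
```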

bjarnef commented 4 months ago

@AndyButland I wonder if there is something more to do about this? On an Umbraco Cloud project using Umbraco 12.3.9 and Forms 12.2.4 they have about 7K entries for a single form 😳🙈 Most of them are from @registry.godaddy

The Form doesn't use reCAPTCHA but the HoneyPot package

We could hook into the form events as you previously mentioned https://github.com/umbraco/Umbraco.Forms.Issues/issues/1142#issuecomment-1987891890

Is there a simple way to clean up in the database?

DELETE FROM [form records]
WHERE formId = [guid] AND email LIKE '%@registry.godaddy%'

Here email may depend on the field alias in the form.

AndyButland commented 4 months ago

In SQL you'll need something like this to remove all the submissions that you've identified as spam. There are a few related tables to consider.

Please make sure to test on a backup first as I've just written it now

DECLARE @formId uniqueidentifier
DECLARE @fieldAlias nvarchar(255)
DECLARE @value nvarchar(255)

SET @formId = '<your form guid>'
SET @fieldAlias = '<your email field alias>'
SET @value = '@registry.godaddy'

-- Get IDs of records to remove
SELECT r.Id
INTO #recordIds
FROM UFRecords r
INNER JOIN UFRecordFields rf ON rf.Record = r.Id
INNER JOIN UFRecordDataString rdf ON rdf.[Key] = rf.[Key]
WHERE r.Form = @formId
AND rf.Alias = @fieldAlias
AND rdf.Value LIKE '%' + @value + '%'

--Delete record from all tables
DELETE FROM UFRecordDataBit WHERE [Key] IN (
    SELECT [Key] FROM UFRecordFields WHERE Record IN (SELECT Id FROM #recordIds)
)
DELETE FROM UFRecordDataDateTime WHERE [Key] IN (
    SELECT [Key] FROM UFRecordFields WHERE Record IN (SELECT Id FROM #recordIds)
)
DELETE FROM UFRecordDataInteger WHERE [Key] IN (
    SELECT [Key] FROM UFRecordFields WHERE Record IN (SELECT Id FROM #recordIds)
)
DELETE FROM UFRecordDataLongString WHERE [Key] IN (
    SELECT [Key] FROM UFRecordFields WHERE Record IN (SELECT Id FROM #recordIds)
)
DELETE FROM UFRecordDataString WHERE [Key] IN (
    SELECT [Key] FROM UFRecordFields WHERE Record IN (SELECT Id FROM #recordIds)
)

DELETE FROM UFRecordFields WHERE Record IN (SELECT Id FROM #recordIds)

DELETE FROM UFRecords WHERE ID IN (SELECT Id FROM #recordIds)

-- Clean up
DROP TABLE #recordIds

bjarnef commented 4 months ago

@AndyButland when looking into this from FormValidateNotification, we can access the User-Agent string via notification.Context.Request.Headers["User-Agent"].

With RecordCreatingNotification each item in SavedEntities has an IP property. Does it know about the User-Agent at this stage? We could of course add a hidden field to the form with a magic string, but is there another way to pass in other information from the original request?

Often we could detect if the request is from a bot/crawler and minimize the amount of spam: https://stackoverflow.com/questions/544450/detecting-honest-web-crawlers
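A User-Agent check of this kind might look like the Python sketch below. The marker list is hypothetical and far from exhaustive, and it only catches clients that identify themselves honestly or send no User-Agent at all; a bot spoofing a browser string will pass.

```python
import re

# Hypothetical, non-exhaustive markers of self-identifying crawlers and scripted clients.
BOT_PATTERN = re.compile(
    r"bot|crawl|spider|slurp|curl|wget|python-requests|httpclient",
    re.IGNORECASE,
)


def is_probable_bot(user_agent) -> bool:
    """True for a missing or blank User-Agent, or one containing a known marker."""
    if not user_agent or not user_agent.strip():
        return True
    return BOT_PATTERN.search(user_agent) is not None
```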