internetstandards / Internet.nl

Internet standards compliance test suite
https://internet.nl

Change User-Agent to common crawler format #1224

Closed bwbroersma closed 8 months ago

bwbroersma commented 10 months ago

Currently internetnl/1.0 is used. This is not ideal: it is not a common format, and since Docker others can easily spin up their own instance, so the UA should at least contain the correct link for contacting the server or person doing the crawling.

As mentioned before in https://github.com/internetstandards/Internet.nl/issues/363#issuecomment-1860475407 and https://github.com/internetstandards/Internet.nl/issues/1042#issuecomment-1687697840, I would prefer to change this to a common bot User-Agent format, like those listed on MDN.

The more standardized and accepted User-Agent is Mozilla/5.0 (compatible; SoftwareName/0.1.2; +https://internet.nl/), where the last + part could point to the deployed instance. For a protected batch server another public page could be used, maybe with some #user-id-token included (I've seen monitoring systems that do this). The + part should be configurable, but could default to the current instance domain variable that is already used.

So I suggest for us: Mozilla/5.0 (compatible; internetnl/1.8.3; +https://internet.nl/about/). Ideally we would even set up a 'bot' page like http://www.google.com/bot.html.
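To make the "configurable + part" concrete, here is a minimal sketch of how such a string could be assembled. The function name `build_user_agent`, the `DEFAULT_CONTACT_URL` constant and the validation rules are hypothetical and not taken from the Internet.nl code base; only the resulting format comes from the suggestion above.

```python
from urllib.parse import urlsplit

# Assumption: the default contact page is the one suggested in this issue.
DEFAULT_CONTACT_URL = "https://internet.nl/about/"


def build_user_agent(version: str,
                     contact_url: str = DEFAULT_CONTACT_URL,
                     product: str = "internetnl") -> str:
    """Return 'Mozilla/5.0 (compatible; <product>/<version>; +<contact_url>)'."""
    parts = urlsplit(contact_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"contact_url must be an absolute http(s) URL: {contact_url!r}")
    # Parentheses would close or nest the comment, and CR/LF would break the
    # header line, so refuse them in every interpolated value.
    for value in (product, version, contact_url):
        if any(ch in value for ch in "()\r\n"):
            raise ValueError(f"unsafe character in User-Agent part: {value!r}")
    return f"Mozilla/5.0 (compatible; {product}/{version}; +{contact_url})"


if __name__ == "__main__":
    print(build_user_agent("1.8.3"))
```

A self-hosted instance could then pass its own public page as `contact_url`, so that the operator of that instance is the one who gets contacted.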


RFC 1945, section 10.5 (User-Agent), is not strict:

User-Agent     = "User-Agent" ":" 1*( product | comment )

3.7 Product Tokens defines:

product         = token ["/" product-version]
product-version = token

2.2 Basic Rules defines the comment as:

comment        = "(" *( ctext | comment ) ")"
ctext          = <any TEXT excluding "(" and ")">

A string of text is parsed as a single word if it is quoted using double-quote marks.


quoted-string  = ( <"> *(qdtext) <"> )

qdtext = <any CHAR except <"> and CTLs, but including LWS>
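To check a candidate string against the grammar quoted above, a small hand-written parser is enough. The sketch below is a simplified reading of it, not a full RFC 1945 implementation: it takes the `token` and `tspecials` definitions from RFC 1945 section 2.2, allows spaces and tabs between elements, and does not handle header folding (CR LF). The function name `is_valid_user_agent` is made up for this example.

```python
# tspecials from RFC 1945 section 2.2; none of these may appear in a token.
TSPECIALS = set('()<>@,;:\\"/[]?={} \t')


def _is_token_char(ch: str) -> bool:
    # token = 1*<any CHAR except CTLs or tspecials>
    return 32 < ord(ch) < 127 and ch not in TSPECIALS


def is_valid_user_agent(value: str) -> bool:
    """Check value against: 1*( product | comment ), separated by optional LWS."""
    n = len(value)

    def skip_ws(i: int) -> int:
        while i < n and value[i] in " \t":
            i += 1
        return i

    def read_token(i: int) -> int:
        # Returns the index after the token; equal to i if there is no token.
        j = i
        while j < n and _is_token_char(value[j]):
            j += 1
        return j

    def read_comment(i: int) -> int:
        # value[i] is "("; returns the index after the matching ")" or -1.
        depth = 0
        while i < n:
            ch = value[i]
            if ch == "(":
                depth += 1  # comments may nest
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    return i + 1
            elif (ord(ch) < 32 and ch != "\t") or ord(ch) == 127:
                return -1  # control characters are not allowed in ctext
            i += 1
        return -1  # unbalanced parentheses

    elements = 0
    pos = skip_ws(0)
    while pos < n:
        if value[pos] == "(":
            pos = read_comment(pos)
            if pos < 0:
                return False
        else:
            end = read_token(pos)
            if end == pos:
                return False  # neither a product nor a comment starts here
            pos = end
            if pos < n and value[pos] == "/":
                end = read_token(pos + 1)
                if end == pos + 1:
                    return False  # "/" must be followed by a product-version
                pos = end
        elements += 1
        pos = skip_ws(pos)
    return elements > 0
```

Under this reading both internetnl/1.0 and the proposed Mozilla/5.0 (compatible; internetnl/1.8.3; +https://internet.nl/about/) pass, because the URL sits inside a comment. A bare URL outside a comment does not pass, since ":" is a tspecial and cannot be part of a token.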

mdavids commented 10 months ago

I agree. Relevant reading, perhaps, here: https://en.wikipedia.org/wiki/User-Agent_header#User_agent_spoofing. Quoting from there: "This may mean that less-popular browsers are not sent complex content (even though they might be able to deal with it correctly) or, in extreme cases, refused all content."

bwbroersma commented 10 months ago

If ignoring this:

It's just these two locations: https://github.com/internetstandards/Internet.nl/blob/742676088ac86a4c6017491831ac14e981b26de5/checks/http_client.py#L62 https://github.com/internetstandards/Internet.nl/blob/742676088ac86a4c6017491831ac14e981b26de5/checks/tasks/tls_connection.py#L663
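With only two call sites, one shared value keeps them from drifting apart. The sketch below does not reproduce what those two files actually do; it assumes, purely for illustration, one place that builds a request object (here with the standard library's `urllib.request`) and one place that writes a raw HTTP request over an already opened connection. `USER_AGENT`, `make_request` and `raw_get_request` are hypothetical names.

```python
import urllib.request

# Assumption: the value proposed in this issue, defined once and imported
# by every place that sends an HTTP request.
USER_AGENT = "Mozilla/5.0 (compatible; internetnl/1.8.3; +https://internet.nl/about/)"


def make_request(url: str) -> urllib.request.Request:
    """Build a request object that carries the shared User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def raw_get_request(host: str, path: str = "/") -> bytes:
    """Build a raw HTTP/1.1 GET request, e.g. for a hand-rolled TLS connection."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        f"User-Agent: {USER_AGENT}",
        "Connection: close",
        "",  # the two empty entries produce the blank line that ends the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```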

Remaining questions are:

baknu commented 10 months ago

Latest RFC on User-Agent header: https://www.rfc-editor.org/rfc/rfc9110.html#name-user-agent

baknu commented 10 months ago

Question: What User-Agent header are other test tools using?

bwbroersma commented 10 months ago

| Tool | User-Agent |
| --- | --- |
| W3C Markup Validation Service | W3C_Validator/1.3 http://validator.w3.org/services (IPv6) and Validator.nu/LV http://validator.w3.org/services (IPv4) |
| W3C CSS Validation Service | Jigsaw/2.3.0 W3C_CSS_Validator_JFouffa/2.0 (See <http://validator.w3.org/services>) |
| SSL Labs - Test SSL | SSL Labs (https://www.ssllabs.com/about/assessment.html), plus query parameter: ?SSL_Labs_Renegotiation_Test=User_Agent_May_Not_Show |
| Security Headers | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36 SecurityHeaders, plus Referer: https://securityheaders.com/ |
| Hardenize | Hardenize (https://www.hardenize.com) and Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36 Hardenize |

baknu commented 10 months ago

Thanks! See also: https://udger.com/resources/ua-list/crawlers

mdavids commented 10 months ago

Oh, cool, we're on that list: https://udger.com/resources/ua-list/bot-detail?bot=internetnl#id131933

bwbroersma commented 9 months ago

A governmental agency has asked us to prioritize this issue. Currently the IPv4/IPv6 comparison fails because the User-Agent internetnl/1.0 results in a 401, which counts as a failure because of https://github.com/internetstandards/Internet.nl/issues/1226. Funny thing is, I always use Mozilla/5.0 when requests without a User-Agent are blocked, and this magic Mozilla/5.0 also works on this 'hardened' system.

bwbroersma commented 9 months ago

For the record: I'm proposing to put internetnl and the version string in the comment field only.

Decided with @baknu:

Note again, internet.nl does not always send a User-Agent, which is a separate bug: