internetstandards / Internet.nl

Internet standards compliance test suite
https://internet.nl

Update sectxt to 0.9.0 #1046

Open bwbroersma opened 1 year ago

bwbroersma commented 1 year ago

DigitalTrustCenter/sectxt released 0.9.0, which has quite a few parser improvements, especially for PGP.

The only change I'm not sure about is the stripping of the BOM (https://github.com/DigitalTrustCenter/sectxt/issues/57#issuecomment-1663592300). This is how I read RFC 9116, File Format Description and ABNF Grammar:

The file format of the "security.txt" file MUST be plain text (MIME type "text/plain") as defined in Section 4.1.3 of [RFC2046] and MUST be encoded using UTF-8 [RFC3629] in Net-Unicode form [RFC5198].

RFC 5198 states:

  2. Net-Unicode Definition

  The Network Unicode format (Net-Unicode) is defined as follows. Parts of this definition are deliberately informal, providing guidance for specific profiles or rules in the protocols that reference this one rather than firm rules that apply globally. …

    5. As suggested in Section 6 of RFC 3629, the Byte Order Mark ("BOM") signature MUST NOT appear at the beginning of these text strings.

Especially in combination with signing, maybe a :warning: warning or an :information_source: notice should be shown. Although the BOM sits outside the PGP block, a file with a BOM is no longer recognized as a PGP signed file by the file command on Linux.
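
To make the concern concrete, here is a minimal Python sketch of a BOM check on the raw bytes of a security.txt response. This is not how sectxt implements it; the helper names `has_utf8_bom` and `strip_utf8_bom` and the sample file content are made up for illustration.

```python
import codecs

# Hypothetical helpers for illustration; not part of sectxt or internet.nl.


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the raw bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


# Made-up sample: the start of a signed security.txt, with and without a BOM.
signed = (
    b"-----BEGIN PGP SIGNED MESSAGE-----\n"
    b"Hash: SHA256\n\n"
    b"Contact: mailto:security@example.com\n"
)
with_bom = codecs.BOM_UTF8 + signed

# With a BOM, the file no longer starts with the PGP armor header line.
starts_with_armor = with_bom.startswith(b"-----BEGIN PGP SIGNED MESSAGE-----")
```

Checking the bytes before decoding matters here: decoding with Python's `utf-8-sig` codec would silently drop the BOM, so there would be nothing left to warn about.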

mxsasha commented 1 year ago

I'll find out which new content labels we need.

mxsasha commented 9 months ago

https://github.com/DigitalTrustCenter/sectxt/issues/65 is a blocker for this

mxsasha commented 7 months ago

Content still needs to be checked: all labels in https://github.com/DigitalTrustCenter/sectxt/ readme need to be in our content too.

bwbroersma commented 7 months ago

Crappy one-liner check (formatted on 3 lines for readability :sweat_smile:):

$ diff \
   <(grep -oP '"\K[a-z0-9]+_[a-z0-9_]+(?=")' sectxt/sectxt/__init__.py | sort -u) \
   <(ls internet.nl_content/detail/tech/data/http-securitytxt/ | sed 's/_..\.md$//g' | sort -u)
1d0
< bom_in_file
5,6c4
< field_name
< invalid_cert
---
> expired
12c10
< invalid_uri_scheme
---
> location
26c24,25
< no_security_txt
---
> no_security_txt_404
> no_security_txt_other
31d29
< pgp_envelope
33a32,33
> requested-from
> retrieved-from
35a36
> utf8
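
For anyone who prefers it over the shell one-liner, the same comparison can be sketched in Python. This is only a rough equivalent under the same assumptions as the command above: labels are quoted lowercase identifiers containing at least one underscore, and content files are named `<label>_<two-letter language>.md`. The function names and the sample data are hypothetical.

```python
import re

# Same pattern as the grep above: a quoted identifier with at least one underscore.
# Labels without an underscore are never matched by this pattern.
LABEL_RE = re.compile(r'"([a-z0-9]+_[a-z0-9_]+)(?=")')


def labels_in_source(source: str) -> set[str]:
    """Collect label-like string literals from Python source text."""
    return set(LABEL_RE.findall(source))


def labels_in_content(filenames: list[str]) -> set[str]:
    """Strip the `_xx.md` language suffix from content file names."""
    return {re.sub(r"_..\.md$", "", name) for name in filenames}


def compare_labels(source: str, filenames: list[str]) -> tuple[list[str], list[str]]:
    """Return (labels only in the source, labels only in the content)."""
    code = labels_in_source(source)
    content = labels_in_content(filenames)
    return sorted(code - content), sorted(content - code)


# Made-up sample input, not the real sectxt source or content directory.
sample_source = (
    'a = {"code": "bom_in_file"}\n'
    'b = {"code": "invalid_cert"}\n'
    'c = {"code": "no_security_txt"}\n'
)
sample_files = [
    "invalid_cert_en.md",
    "invalid_cert_nl.md",
    "no_security_txt_404_en.md",
    "utf8_en.md",
]

only_in_source, only_in_content = compare_labels(sample_source, sample_files)
```

In practice the source text would be read from sectxt/sectxt/__init__.py and the file names from internet.nl_content/detail/tech/data/http-securitytxt/, as in the command above.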

At least these are definitely missing at the moment:

On manual inspection of sectxt, however, I see that invalid_uri_scheme and bom_in_file are in the SecurityTXT class, not in the Parser class that internet.nl uses. I don't see why bom_in_file is not checked in the Parser class. Created issue upstream:

bwbroersma commented 7 months ago

Upstream solved it in the 0.9.3 release.

bwbroersma commented 2 months ago

Although this is in milestone v1.9, it is already included and deployed in the 'batch' release v1.8.7.