msimerson / mail-dmarc

Mail::DMARC, a complete DMARC implementation in Perl

Invalid XML in generated reports #190

Open wolfgangkarall opened 3 years ago

wolfgangkarall commented 3 years ago

Describe the bug: The user-configured org_name (at least) is used as-is in XML and mail message bodies, but people tend to enter characters that are not suitable as-is in either.

Examples:

Message Body:

Submitted by Sueño Fueguino
Generated with Mail::DMARC 1.20141206

Corresponding XML:

<org_name>Sue�o Fueguino</org_name>

The same happens in a more recent version (and this time the message body is already showing signs of breakage, too):

Submitted by Gwt7 IIA - Ingeniería e Informática Asociada
Generated with Mail::DMARC 1.20180125

and XML:

<org_name>Gwt7 IIA - Ingenier�a e Inform�tica Asociada</org_name>

When trying to view this report in Firefox it complains:

XML Parsing Error: not well-formed
Location: file:///home/user/.cache/.fr-wGu1gl/report.xml
Line Number 5, Column 32:
        <org_name>Gwt7 IIA - Ingenier￿
---------------------------------------------^

Other XML parsers complain or fail as well.
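
For illustration, a minimal Python sketch of this failure mode. It is not Mail::DMARC code. It assumes the org_name reaches the report as raw Latin-1 bytes in a document that parsers read as UTF-8, which is consistent with the replacement characters shown above but is not confirmed by the reports themselves. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

ORG_NAME = "Sueño Fueguino"


def broken_report(org_name: str) -> bytes:
    """Assumed bug: Latin-1 bytes pasted into a document parsed as UTF-8."""
    return b"<org_name>" + org_name.encode("latin-1") + b"</org_name>"


def well_formed_report(org_name: str) -> bytes:
    """Let an XML library handle both the encoding and the escaping."""
    elem = ET.Element("org_name")
    elem.text = org_name
    return ET.tostring(elem, encoding="utf-8")


def parses(xml_bytes: bytes) -> bool:
    """Return True if the bytes are well-formed XML."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


print(parses(broken_report(ORG_NAME)))       # rejected: invalid UTF-8 byte sequence
print(parses(well_formed_report(ORG_NAME)))  # accepted
```

Serializing through an XML library also escapes characters such as `&` and `<`, which a user-entered org_name could contain as well.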

Note: I'm not an active user, but I am on the receiving end of the XML that gets sent by Mail::DMARC, and XML parsers cannot process it because of this. I haven't got a report showing this issue sent by the latest version, but by the looks of it this is still the case in the current code.

marcbradshaw commented 3 years ago

Note: The database schema (for MySQL at least) specifies 'CHARACTER SET ascii', so this will need to be updated to handle the storage of UTF-8 in reports. RFC 7489 specifies that domains must be converted to A-label form, but is ambiguous regarding the remaining data in the report. A quick fix may be to convert everything to ascii before saving the report, but this is likely to break (or at least not fix, because they are likely already broken) EAI addresses.
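
One possible reading of "convert everything to ascii" for free-text fields such as org_name is accent folding. The Python sketch below is an assumption about what such a conversion could look like, not the project's implementation; `fold_to_ascii` is a hypothetical helper.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Decompose accented characters, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(fold_to_ascii("Sueño Fueguino"))
print(fold_to_ascii("Gwt7 IIA - Ingeniería e Informática Asociada"))
```

Characters without an ASCII base letter in their decomposition are dropped entirely, so this approach is lossy for names written in non-Latin scripts.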

msimerson commented 1 month ago

> A quick fix may be to convert everything to ascii before saving the report

Sounds like the right choice, based on my read of RFC 8616.

> but this is likely to break (or at least not fix, because they are likely already broken) EAI addresses.

True, but will it matter? New reports will be saved in the converted A-label form, which should fix all future reports, and solve this issue, right?

RFC 8616, Section 6

DMARC and Internationalized Mail

   DMARC RFC7489 defines a policy language that domain owners can
   specify for the domain of the address in an RFC5322.From header
   field.

   Section 6.6.1 of RFC7489 specifies, somewhat imprecisely, how IDNs
   in the RFC5322.From address domain are to be handled.  That section
   is updated to say that all U-labels in the domain are converted to
   A-labels before further processing.  Section 7.1 of RFC7489 is
   similarly updated to say that all U-labels in domains being handled
   are converted to A-labels before further processing.

   DMARC policy records, described in Sections 6.3 and 7.1 of RFC7489,
   can contain email addresses in the "rua" and "ruf" tags.  Since a
   policy record can be used for both internationalized and conventional
   mail, those addresses still have to be conventional addresses, not
   internationalized addresses.
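
As an illustration of the U-label to A-label conversion the RFC text describes, here is a minimal Python sketch. It uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules, so treat it as an approximation of what a production implementation would do; `to_a_label` is a hypothetical helper and not part of Mail::DMARC.

```python
def to_a_label(domain: str) -> str:
    """Convert each U-label in a domain to its A-label (punycode) form."""
    return domain.encode("idna").decode("ascii")


print(to_a_label("bücher.example"))  # xn--bcher-kva.example
print(to_a_label("example.com"))     # already ASCII, unchanged
```

Only the domain is converted this way. The local part of an internationalized (EAI) address has no A-label equivalent, which is why the "rua" and "ruf" addresses must be conventional addresses.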

msimerson commented 1 month ago

Because the data columns that stored domains and author info were explicitly declared as ASCII, I think (based on limited testing) that MySQL would have converted any Unicode characters to a ? character. Near as I can tell, the original character data is lost. If I'm wrong and MySQL stored the character codes correctly, then the changes below will automatically do The Right Thing.
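
A small Python sketch of why data stored that way cannot be recovered. It only models the `?` substitution described above and does not reproduce MySQL's actual behavior, which may depend on the server's SQL mode; `store_in_ascii_column` is a hypothetical stand-in.

```python
def store_in_ascii_column(value: str) -> str:
    """Model a lossy ASCII conversion: each non-ASCII character becomes '?'."""
    return value.encode("ascii", errors="replace").decode("ascii")


print(store_in_ascii_column("Sueño Fueguino"))  # Sue?o Fueguino

# Two different inputs collapse to the same stored value,
# so the original cannot be reconstructed from the column alone.
print(store_in_ascii_column("Sueño") == store_in_ascii_column("Sueío"))  # True
```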

Now that MySQL 8 is the minimum supported version, changing the schema to enable UTF-8 chars is no longer messy and fraught with pitfalls.

The SQL code shown on the MySQL wiki page should do the needful.