python / cpython

The Python programming language
https://www.python.org

EmailMessage.get_filename() not unquoting url encodings #117710

Open vignesh-arivazhagan opened 5 months ago

vignesh-arivazhagan commented 5 months ago

Bug report

Bug description:

import requests
from email.message import EmailMessage

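def get_filename_from_url(url, url_response=None):
    # Fetch the URL unless the caller already supplied a response object.
    if url_response is None:
        url_response = requests.get(url)
        url_response.raise_for_status()

    content_disposition = url_response.headers.get("Content-Disposition")
    if content_disposition:
        # Let the email package parse the header and extract the filename parameter.
        email_message = EmailMessage()
        email_message["Content-Disposition"] = content_disposition
        return email_message.get_filename()
    return None  # the response has no Content-Disposition header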

url = "https://www.gsi.gov.in/webcenter/ShowProperty;jsessionid=yv2xehEKwHR0ZHf64V2sMbrFzRSeSGCvcxDVr9F4_rbXBVtcgKbl!1598077039!1556223610?nodeId=%2FUCM%2FDCPORT1GSIGOVI063041%2F%2FidcPrimaryFile&revision=latestreleased"
print(get_filename_from_url(url))

Output:

annoncement_of%20computer%20application%20_rti_er%20_16122014.pdf

If I use

from urllib.parse import unquote
unquote(email_message.get_filename())

I get the unquoted output:

annoncement_of computer application _rti_er _16122014.pdf

Why does EmailMessage.get_filename() use a different unquote function?

CPython versions tested on:

CPython main branch

Operating systems tested on:

Windows

medmunds commented 2 weeks ago

Short answer: that server is wrong. %-encoded strings are not allowed in the HTTP Content-Disposition header.

Longer answer:

Although Python's email package is meant to be able to parse HTTP headers as well as email headers, the content you ask it to parse has to (more-or-less) follow the relevant specs.

The URL in your example returns a Content-Disposition header that improperly uses (part of) a %-encoded URI as a filename. Here's the raw header in the server's response:

Content-Disposition: inline;filename=annoncement_of%20computer%20application%20_rti_er%20_16122014.pdf;

Nothing in the specs allows % encoding there. RFC 6266 specifies the HTTP Content-Disposition header. In section 4.1, the 'filename-parm' value is ultimately a 'token' or a 'quoted-string'. Those are defined by RFC 2616 section 2.2 (skip down to the top of page 17). Nothing there has anything to do with RFC 3986 style % encoding. (A MIME header 'quoted-string' is just a value in "double quotes"; it is unrelated to the quote() function in urllib.parse.)
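
To make the distinction concrete, here is a small sketch that needs no network access. The helper name filename_from_header and the sample filenames are made up for illustration. It compares the 'token' and 'quoted-string' forms discussed above with the separate extended 'filename*' parameter, which RFC 6266 also defines (using RFC 5987 encoding) and which the server in this issue does not send. As far as I can tell, that extended form is the only one in which the email package decodes % sequences.

```python
from email.message import EmailMessage


def filename_from_header(content_disposition):
    """Parse a Content-Disposition value and return its filename, if any.

    Hypothetical helper for illustration; not part of the email package.
    """
    msg = EmailMessage()
    msg["Content-Disposition"] = content_disposition
    return msg.get_filename()


# 'token' form: the % characters are ordinary token characters, kept verbatim.
token_form = filename_from_header("inline;filename=report%20final.pdf")

# 'quoted-string' form: the surrounding double quotes are removed, nothing else.
quoted_form = filename_from_header('inline; filename="report final.pdf"')

# Extended 'filename*' form: charset, empty language, then a %-encoded value.
extended_form = filename_from_header("inline; filename*=UTF-8''report%20final.pdf")

print(repr(token_form))
print(repr(quoted_form))
print(repr(extended_form))
```

So a server that wants to send an encoded name has a standard way to do it, but it has to use the 'filename*' parameter rather than plain 'filename'.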

Either there's a bug in www.gsi.gov.in's server software, or (more likely) someone uploaded a file with %20's already in the name.
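
If you have to cope with this particular server anyway, one option is to apply the decoding yourself, as in the unquote() call from the original report. The sketch below is only a workaround under the assumption that the server always sends URL-encoded names; lenient_filename is a hypothetical helper name, and the approach would mangle a legitimate filename that contains a literal % followed by two hex digits.

```python
from email.message import EmailMessage
from urllib.parse import unquote


def lenient_filename(content_disposition):
    """Return the filename, percent-decoding it as a non-standard fallback.

    Hypothetical helper for illustration; not part of the email package.
    """
    msg = EmailMessage()
    msg["Content-Disposition"] = content_disposition
    filename = msg.get_filename()
    if filename is None:
        return None
    # Assumption: this server URL-encodes its filenames. A compliant server
    # may send a literal '%', which this step would wrongly decode.
    return unquote(filename)


# The raw header value quoted earlier in this comment.
header = "inline;filename=annoncement_of%20computer%20application%20_rti_er%20_16122014.pdf;"
print(lenient_filename(header))
```

Keeping the decoding in your own code, rather than in get_filename(), means the non-standard behavior applies only to servers you know need it.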

(Suggest closing this issue as "not planned.")