tobinus / python-podgen

Generating podcasts with Python should be easy!
https://podgen.readthedocs.org

hashtag char in podcast episode filename prevents correct guess of mimetype #56

Closed lgaggini closed 7 years ago

lgaggini commented 7 years ago

Hi and cheers for this nice library. :)

I spotted a slightly odd situation: if a hashtag char is present in the filename of a podcast episode, the get_type(self, url) function from Media fails because of urlparse. By default, urlparse treats the substring after the hashtag as the fragment and not as part of the path:

In [3]: urlparse('https://mixedeuphoria.net/lgaggini_-_mixed_euphoria_#001.mp3')
Out[3]: ParseResult(scheme='https', netloc='mixedeuphoria.net', path='/lgaggini_-_mixed_euphoria_', params='', query='', fragment='001.mp3')

For a podcast use case I think it's better to treat the substring after the hashtag as part of the path by setting allow_fragments=False (the implicit default is True):

In [4]: urlparse('https://mixedeuphoria.net/lgaggini_-_mixed_euphoria_#001.mp3', allow_fragments=False)
Out[4]: ParseResult(scheme='https', netloc='mixedeuphoria.net', path='/lgaggini_-_mixed_euphoria_#001.mp3', params='', query='', fragment='')

tobinus commented 7 years ago

Thank you for taking the time to report this! And thanks for the kind words! It is great to see others using this library :) (That reminds me, I should get around to making a proper 1.0 release. It won't happen until I'm done with my examinations in May, though…)

I've actually encountered this issue before. The thing is, the hash mark has a special meaning in a URI. It is used to tell the client which part of the document to scroll to (at least in HTML). I recommend taking a look at the Wikipedia article on fragment identifiers.

As the article says, "The fragment identifier functions differently than the rest of the URI: namely, its processing is exclusively client-side with no participation from the web server". What the web server sees is just the part before the hash mark, specifically https://mixedeuphoria.net/lgaggini_-_mixed_euphoria_, since the client holds on to the fragment and processes it itself. Thus, I think the current behaviour is actually the correct one, since the function handles the hash mark the same way any podcast client does (or should do). (This is not obvious, nor something everyone knows, so please don't feel stupid :P As I've said, I've made the same mistake myself.)
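
To illustrate the point above, here is a minimal sketch (not podgen's actual code; the helper name path_extension is made up for this example). It shows that the file extension ends up in the fragment, so anything that inspects only the path finds no extension to work with:

```python
from posixpath import splitext
from urllib.parse import urlparse


def path_extension(url):
    """Return the extension of the URL's path component, ignoring any fragment.

    Hypothetical helper for illustration; not part of podgen.
    """
    return splitext(urlparse(url).path)[1]


url = 'https://mixedeuphoria.net/lgaggini_-_mixed_euphoria_#001.mp3'
parts = urlparse(url)

# Everything after the hash mark is treated as the fragment ...
print(parts.fragment)  # 001.mp3
# ... so the path, which is what gets requested from the server, has no extension.
print(parts.path)  # /lgaggini_-_mixed_euphoria_
print(repr(path_extension(url)))  # ''
```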

The solution is to escape the hash mark by percent-encoding it. In Python 3, you'd do that by using urllib.parse.quote, like this:

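import urllib.parse
from podgen import Media

base = 'https://mixedeuphoria.net/'
filename = 'lgaggini_-_mixed_euphoria_#001.mp3'
# Escape only the filename, so "#" becomes "%23" and stays part of the path
url = base + urllib.parse.quote(filename)
m = Media(url, ...)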

In Python 2, the same function is found at urllib.quote.
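
If the same script has to run on both versions, one possible approach (a sketch, not something podgen requires) is to try the Python 3 import first and fall back to the Python 2 location. The snippet also checks that, once escaped, the extension stays in the path and the fragment is empty:

```python
try:
    # Python 3
    from urllib.parse import quote, urlparse
except ImportError:
    # Python 2
    from urllib import quote
    from urlparse import urlparse

base = 'https://mixedeuphoria.net/'
filename = 'lgaggini_-_mixed_euphoria_#001.mp3'

# Escape only the filename; the base URL is left untouched.
escaped_url = base + quote(filename)
print(escaped_url)
# https://mixedeuphoria.net/lgaggini_-_mixed_euphoria_%23001.mp3

parts = urlparse(escaped_url)
print(parts.path)  # /lgaggini_-_mixed_euphoria_%23001.mp3
print(repr(parts.fragment))  # ''
```

Note that quote is applied to the filename only. Passing the whole URL through quote would also escape the ":" after "https", which is not what you want here.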

When I get time next week, I will describe this in the manual. It is very likely that other people will encounter this issue since it is very common to use the hash mark in episode titles, so I think it is worth pointing out. I hope I could be of help :smile:

lgaggini commented 7 years ago

Yes, I agree, the best way to handle this is to apply escaping and leave urlparse working the default way. :+1: