su1s / e2m3u2bouquet

Enigma2 IPTV m3u parser and bouquet creator
GNU General Public License v3.0

File name normalization #111

Closed pepsik-kiev closed 4 years ago

pepsik-kiev commented 4 years ago

The function get_safe_filename(filename) does not work with non-Latin characters. It is built on unicodedata.normalize, which is only applicable to European languages based on the Latin alphabet. Moreover, you then apply re.sub with the pattern '[^ a-z0-9-_]'. This is not necessary, because unicodedata.normalize has already done this; more precisely, it has already "cleaned" the passed string of all invalid characters. Why do you limit the file name to only letters and numbers? Is that required by a standard somewhere?

For example, try to parse this playlist:

#EXTINF:-1 group-title="Детские" tvg-name="В гостях у сказки" tvg-logo="http://192.168.2.50:8081/stat/picons/N076JCPKNaCl1T8ANnskFElL4t4uE8.png",В гостях у сказки
#EXTINF:-1 group-title="Детские" tvg-name="Мульт" tvg-logo="http://static.acestream.net/sites/acestream/img/ACE-logo.png",Мульт
#EXTINF:-1 group-title="Познавательные" tvg-name="Охотник и рыболов HD" tvg-logo="http://static.acestream.net/sites/acestream/img/ACE-logo.png",Охотник и рыболов HD

Then look at how you create bouquets. Does any standard stipulate that the group-title tag, which you use to create the userbouquet name, cannot contain non-Latin letters? Or does any standard say that a file name cannot be non-Latin? Where in Linux-based systems is there such a limitation?

Try this in the Python console:

>>> import unicodedata
>>> filename = 'этоимяфайла'
>>> unicodedata.normalize('NFKD', unicode(filename, 'utf_8')).encode('ASCII', 'ignore')
''
>>> filename = 'thisisfilename'
>>> unicodedata.normalize('NFKD', unicode(filename, 'utf_8')).encode('ASCII', 'ignore')
'thisisfilename'
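For readers on Python 3, where unicode() no longer exists, the console session above can be approximated as follows. This is only a sketch: ascii_fold is a hypothetical name, not a function from e2m3u2bouquet. It shows that the characters disappear at the ASCII encode step, so accented Latin letters survive as their base letter while Cyrillic text is dropped entirely.

```python
import unicodedata

def ascii_fold(name):
    """Hypothetical helper: NFKD-normalize, then drop anything without an ASCII form."""
    decomposed = unicodedata.normalize('NFKD', name)
    # 'ignore' silently discards every character that cannot be encoded as ASCII
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(repr(ascii_fold('thisisfilename')))  # 'thisisfilename'
print(repr(ascii_fold('café')))            # 'cafe'
print(repr(ascii_fold('этоимяфайла')))     # ''
```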

DougMac commented 4 years ago

unicodedata.normalize does not do the same thing as re.sub with the pattern '[^ a-z0-9-_]'.

See the following:

Test

filename = "filename/\?%*:|\"<>. "
Return value when re.sub('[^ a-z0-9-_]') is executed: 'filename__'
Return value when re.sub('[^ a-z0-9-_]') is not executed: 'filename_\\?%*:|"<>._'

Without the regex replacement the filename would obviously be invalid, which is why it is in place.
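The test can be reproduced roughly as below in Python 3. This is a sketch rather than the project's actual code: the function names are hypothetical, and replacing spaces and slashes with underscores is an assumption inferred from the two return values shown above. The regex pattern is copied from this thread.

```python
import re
import unicodedata

def fold_only(name):
    # Assumed pre-processing (inferred from the outputs above): ' ' and '/' become '_'
    name = name.replace(' ', '_').replace('/', '_')
    # Normalization plus ASCII encoding leaves ASCII punctuation untouched
    return unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')

def fold_and_filter(name):
    # The regex is what removes punctuation such as \ ? % * : | " < > .
    return re.sub('[^ a-z0-9-_]', '', fold_only(name).lower())

raw = 'filename/\\?%*:|"<>. '
print(fold_only(raw))        # filename_\?%*:|"<>._
print(fold_and_filter(raw))  # filename__
```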

Thanks for highlighting that our script fails when the name is not all ASCII. We'll potentially look at fixing this in a future version.

pepsik-kiev commented 4 years ago

unicodedata.normalize does not do the same

OK, but then do it like this:

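import re
import unicodedata
# Fold to ASCII, dropping characters that have no ASCII form
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
# Remove everything except word characters, whitespace and hyphens
value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
# Collapse runs of hyphens and whitespace into a single hyphen
value = unicode(re.sub('[-\s]+', '-', value))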

This is the "gold" standard for the Latin alphabet =). Look at Django's slugify() ...

re.sub('[-\s]+', '-', fname.decode('utf-8').translate({ord(c): None for c in '\/:%}{]["^$#@*,!?&|><+='})).strip().lower()[:255]
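A Python 3 reading of the one-liner above might look like the sketch below. The helper name unicode_safe_filename and the max_len parameter are hypothetical. The set of stripped characters is taken from the expression above, and in Python 3 no decode step is needed because str is already Unicode.

```python
import re

# Characters removed outright; the set is copied from the expression above
FORBIDDEN = '\\/:%}{]["^$#@*,!?&|><+='
STRIP_TABLE = {ord(c): None for c in FORBIDDEN}

def unicode_safe_filename(name, max_len=255):
    """Hypothetical helper: keep non-Latin letters, drop only the listed characters."""
    cleaned = name.translate(STRIP_TABLE)
    # Collapse runs of hyphens and whitespace into a single hyphen
    cleaned = re.sub(r'[-\s]+', '-', cleaned)
    return cleaned.strip().lower()[:max_len]

print(unicode_safe_filename('В гостях у сказки'))     # в-гостях-у-сказки
print(unicode_safe_filename('Охотник и рыболов HD'))  # охотник-и-рыболов-hd
```

Note that the slice counts characters, not bytes. Cyrillic letters take two bytes each in UTF-8, so a file system with a 255-byte name limit may call for a smaller max_len.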

I have already fixed this in your source code (with a character set of your choice), along with many other things that were needed in my opinion. Everything works just fine. I also tested it with your front-end, and everything works without errors. By the way, the front-end code is of very high quality. The only thing I would add to the front-end is a check-box indicating that the link to the EPG should be taken from the 'url-tvg=' tag in the #EXTM3U header of the playlist, if present. Thanks for your development.
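The url-tvg suggestion could be sketched like this in Python 3. The function name extract_url_tvg is hypothetical and the URL is a placeholder. The sketch also assumes the attribute value is wrapped in double quotes, which not every playlist guarantees.

```python
import re

def extract_url_tvg(header_line):
    """Hypothetical helper: return the url-tvg value from an #EXTM3U header, or None."""
    if not header_line.lstrip().startswith('#EXTM3U'):
        return None
    # Assumes a double-quoted attribute value
    match = re.search(r'url-tvg="([^"]*)"', header_line)
    return match.group(1) if match else None

print(extract_url_tvg('#EXTM3U url-tvg="http://example.invalid/epg.xml.gz"'))
print(extract_url_tvg('#EXTM3U'))
```

A front-end check-box would then only need to decide whether this value, when present, overrides the EPG URL configured by hand.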

DougMac commented 4 years ago

Closed. May consider for future release.