pepsik-kiev closed 4 years ago
unicodedata.normalize does not do the same as re.sub('[^ a-z0-9-_]').
See the following:
Test
filename = "filename/\?%*:|\"<>. "
Return value with re.sub('[^ a-z0-9-_]') applied: 'filename__'
Return value without re.sub('[^ a-z0-9-_]'): 'filename_\\?%*:|"<>._'
Obviously, without the regex replacement the filename would be invalid, which is why it is in place.
Thanks for highlighting that the script fails when the name is not all ASCII. We'll potentially look at fixing that in a future version.
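To make the difference concrete, here is a minimal Python 3 sketch (the thread itself uses Python 2). It is not the project's get_safe_filename, and the names `ascii_only` and `whitelisted` are only illustrative; it just shows that NFKD normalisation plus ASCII encoding leaves ASCII punctuation untouched, while a whitelist regex replaces it.

```python
import re
import unicodedata

name = 'filename/\\?%*:|"<>. '

# NFKD + ASCII encoding only drops characters that have no ASCII form.
# Pure-ASCII input, including '/', '?' and '*', passes through unchanged.
ascii_only = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')

# A whitelist regex replaces every character outside the allowed set.
# (Illustrative pattern; the project's own pattern and post-processing may differ.)
whitelisted = re.sub(r'[^a-z0-9_-]', '_', ascii_only.lower())

print(repr(ascii_only))
print(repr(whitelisted))
```

The first value still contains the characters that are invalid in file names; only the second is limited to the whitelist.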
unicodedata.normalize does not do the same
OK, but then do it like this:
import re
import unicodedata
# Python 2: `value` is expected to be a unicode string
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
value = unicode(re.sub(r'[^\w\s-]', '', value).strip().lower())
value = unicode(re.sub(r'[-\s]+', '-', value))
This is the "gold" standard for the Latin alphabet =) . Look at Django slugify() ...
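For reference, a Python 3 sketch of the same idea, loosely modelled on the approach Django's slugify() takes. The function name `slugify_sketch` is hypothetical, and the `allow_unicode` switch is included here as an assumption about how non-Latin names could be kept rather than dropped.

```python
import re
import unicodedata

def slugify_sketch(value, allow_unicode=False):
    """Illustrative slug helper; not the project's or Django's actual code."""
    value = str(value)
    if allow_unicode:
        # Keep non-ASCII letters, only unify compatibility forms.
        value = unicodedata.normalize('NFKC', value)
    else:
        # Decompose accents, then drop everything that is not ASCII.
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    # Remove anything that is not a word character, whitespace or hyphen.
    value = re.sub(r'[^\w\s-]', '', value.lower())
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r'[-\s]+', '-', value).strip('-_')

print(slugify_sketch('Hello, World!'))
print(slugify_sketch('Охотник и рыболов HD', allow_unicode=True))
```

With allow_unicode left at False, a purely Cyrillic name such as 'Мульт' comes back as an empty string, which is the problem raised in this issue.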
re.sub('[-\s]+', '-', fname.decode('utf-8').translate({ord(c): None for c in '\/:%}{]["^$#@*,!?&|><+='})).strip().lower()[:255]
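The one-liner above is Python 2. A rough Python 3 equivalent of the blacklist approach could look like the sketch below; the helper name `safe_name` and the `max_len` parameter are hypothetical, and the forbidden character set is copied from the comment. Unlike the one-liner, this sketch strips surrounding whitespace before collapsing runs, so leading or trailing spaces do not turn into hyphens.

```python
import re

# Characters to delete, copied from the one-liner in the comment.
FORBIDDEN = '\\/:%}{]["^$#@*,!?&|><+='
_DELETE_TABLE = str.maketrans('', '', FORBIDDEN)

def safe_name(fname, max_len=255):
    """Illustrative blacklist-based cleaner that keeps non-Latin letters."""
    cleaned = fname.translate(_DELETE_TABLE).strip()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    cleaned = re.sub(r'[-\s]+', '-', cleaned)
    return cleaned.lower()[:max_len]

print(safe_name('В гостях у сказки'))
print(safe_name('a/b: c?'))
```

Note that the slice counts characters, not bytes, so a long Cyrillic name encoded as UTF-8 can still exceed a 255-byte file name limit.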
I have already fixed this in my copy of your source code (the character set is up to you), along with several other things that I think were needed. Everything works just fine. I also tested it through your front-end, and everything works without errors. By the way, the front-end code is written to a very high standard. The only thing I would add to the front-end is a check-box indicating that the EPG link should be taken from the 'url-tvg=' tag in the #EXTM3U playlist header, if present. Thanks for your development.
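As a rough idea of what reading that header could involve, here is a small Python 3 sketch. The function name `epg_url_from_header` and the example URL are made up for illustration, and it assumes the attribute value is written in double quotes on the first line of the playlist.

```python
import re

def epg_url_from_header(playlist_text):
    """Return the url-tvg value from the #EXTM3U header line, or None."""
    # Drop a possible UTF-8 byte order mark before looking at the first line.
    lines = playlist_text.lstrip('\ufeff').splitlines()
    if not lines or not lines[0].startswith('#EXTM3U'):
        return None
    match = re.search(r'url-tvg="([^"]*)"', lines[0])
    return match.group(1) if match else None

sample = '#EXTM3U url-tvg="http://example.com/epg.xml.gz"\n#EXTINF:-1,Channel\n'
print(epg_url_from_header(sample))
```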
Closed. May consider for future release.
def get_safe_filename(filename): it does not work with non-Latin characters. It is built on unicodedata.normalize, which is only applicable to European languages based on the Latin alphabet. Moreover, you then also run re.sub('[^ a-z0-9-_]'), which is not necessary, because unicodedata.normalize has already done this; more precisely, it has "cleaned" the passed string of all invalid characters. Why do you limit the file name to only letters and numbers? Is it somewhere in the standard? For example, try to parse this playlist:
#EXTINF:-1 group-title="Детские" tvg-name="В гостях у сказки" tvg-logo="http://192.168.2.50:8081/stat/picons/N076JCPKNaCl1T8ANnskFElL4t4uE8.png",В гостях у сказки
#EXTINF:-1 group-title="Детские" tvg-name="Мульт" tvg-logo="http://static.acestream.net/sites/acestream/img/ACE-logo.png",Мульт
#EXTINF:-1 group-title="Познавательные" tvg-name="Охотник и рыболов HD" tvg-logo="http://static.acestream.net/sites/acestream/img/ACE-logo.png",Охотник и рыболов HD
And look at how you create bouquets. Is it stipulated somewhere in the standard that the group-title tag, which you use to create the userbouquet name, cannot contain non-Latin letters? Or that a file name cannot be non-Latin? Where in Linux-based systems is there such a limitation? Try this in the Python console:
import unicodedata
filename = 'этоимяфайла'
unicodedata.normalize('NFKD', unicode(filename, 'utf_8')).encode('ASCII', 'ignore')
''
filename = 'thisisfilename'
unicodedata.normalize('NFKD', unicode(filename, 'utf_8')).encode('ASCII', 'ignore')
'thisisfilename'
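The same experiment in Python 3, where str is already Unicode and the unicode() call is not needed. The helper name `to_ascii` is only illustrative, and the 'café' case is added to show that accented Latin letters survive as their base letter while Cyrillic has no ASCII form at all.

```python
import unicodedata

def to_ascii(text):
    """NFKD-normalise and drop every character with no ASCII form."""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(repr(to_ascii('этоимяфайла')))     # ''
print(repr(to_ascii('thisisfilename')))  # 'thisisfilename'
print(repr(to_ascii('café')))            # 'cafe'
```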