file name encoding issues?

abubelinha commented 2 years ago

I am having trouble with this file (shared with me by somebody else), which contains accented characters in its title: https://drive.google.com/file/d/1yG2NeFXK0sPLL6QseY0LFfYt7j-a7RiJ

In the browser, the file name looks normal (accents and tildes visible):

preparación_rafia_viña.jpg

But when processing with Python, those characters are replaced by escaped character codes.

preparacio\u0301n_rafia_vin\u0303a.jpg

This doesn't happen to me with other files using similar characters. In fact, if I make a copy of the file and rename it using the Drive web interface, the problem disappears.

idFile = '1yG2NeFXK0sPLL6QseY0LFfYt7j-a7RiJ' # problematic file
idFile2 = '14qCAT6I5FuDbpH_1I7yv_EpVIIcSwPgJ' # copied file, and then renamed using Drive web interface
urlPrefix = 'https://drive.google.com/file/d/'
file = drive.CreateFile({'id': idFile})
file2 = drive.CreateFile({'id': idFile2})
print('title: %s \n url: %s%s' % (file['title'],urlPrefix,idFile)) # THIS LINE produces odd output
print('title: %s \n url: %s%s' % (file2['title'],urlPrefix,idFile2))

OUTPUT:

title: preparacio\u0301n_rafia_vin\u0303a.jpg
 url: https://drive.google.com/file/d/1yG2NeFXK0sPLL6QseY0LFfYt7j-a7RiJ
title: preparación_rafia_viña 2.jpg
 url: https://drive.google.com/file/d/14qCAT6I5FuDbpH_1I7yv_EpVIIcSwPgJ

How can I solve this and make my print output be the same for both files? Well, actually the problem is more than printing.

This issue prevents me from merging two datasets of filenames (one coming from PyDrive, with the odd characters, and the other from local disk) because the names look different to Python, so merging on the filename column of a Pandas dataframe does not work.

Thanks

shcheklein commented 2 years ago

Those two files have similar names, but they are written differently. Try copy-pasting the ó from both (in the browser) into this tool https://onlineunicodetools.com/convert-unicode-to-utf16 to see that in the first file we have:

0x006f 0x0301 pair

and in the second:

0x00f3 (a single precomposed code point; a 000a shown next to it is only a line feed, not part of the letter)

They might be the same symbol, or they might not. Usually, to deal with this kind of situation you need to normalize the names, using something like: https://stackoverflow.com/questions/16467479/normalizing-unicode
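The same check can be done directly in Python instead of the online tool. This is a minimal sketch: the two string literals are typed in by hand to mimic the two titles from this issue (they are not fetched from Drive), and `code_points` is a hypothetical helper name.

```python
import unicodedata

# Hand-written stand-ins for the two titles (assumption: not read from Drive).
decomposed = "preparacio\u0301n_rafia_vin\u0303a.jpg"   # o + combining acute, n + combining tilde
precomposed = "preparaci\u00f3n_rafia_vi\u00f1a.jpg"    # single code points for ó and ñ


def code_points(text):
    """Return the non-ASCII code points of text as (hex, Unicode name) pairs."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "?"))
        for ch in text
        if ord(ch) > 127
    ]


print(code_points(decomposed))
print(code_points(precomposed))

# The strings differ as written, but compare equal once both are in NFC form.
print(decomposed == precomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

The first string is two code points longer than the second, which is why an exact string comparison fails even though both render the same.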

As for the \u0301 symbols: I believe this is related to the local system setup, Python version, terminal, etc. There is nothing particularly wrong with it; it just means the character can't be printed. In my browser and in my CLI both file names are totally fine and look the same.

abubelinha commented 2 years ago

Thanks for your answer, @shcheklein. I lost two days struggling with this. Actually, I didn't detect the problem when printing, but when trying to join datasets using the filename as a unique joining key (it turned out not to be "unique").

So I guess in this case, the fact that my CLI couldn't print one of them was pretty useful. Otherwise I wouldn't have noticed the difference between the strings and would have thought something was wrong in my pandas merge query.

Just for my own reference, I leave here the code which solved my trouble (normalizing unicode names as you suggested):

import unicodedata
print(unicodedata.normalize('NFC',file['title']) ) 
print(unicodedata.normalize('NFC',file2['title']) )
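For the merge itself, the idea is to normalize the key on both sides before joining. The sketch below uses plain dictionaries rather than PyDrive or pandas objects; the names `nfc`, `drive_files`, `local_files` and the placeholder values are all hypothetical, chosen only to illustrate the join.

```python
import unicodedata


def nfc(name):
    """Return name in NFC form so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", name)


# Hypothetical data: a decomposed title (as returned for the problematic file)
# and a precomposed name (as it might appear in the other dataset).
drive_files = {"preparacio\u0301n_rafia_vin\u0303a.jpg": "drive-id-1"}
local_files = {"preparaci\u00f3n_rafia_vi\u00f1a.jpg": "local/path/1"}

# Joining on the raw names finds nothing in common.
raw_matches = set(drive_files) & set(local_files)
print(raw_matches)

# Joining on the normalized names pairs the two records.
drive_by_key = {nfc(name): value for name, value in drive_files.items()}
local_by_key = {nfc(name): value for name, value in local_files.items()}
merged = {
    key: (drive_by_key[key], local_by_key[key])
    for key in drive_by_key.keys() & local_by_key.keys()
}
print(merged)
```

With pandas, the same normalization can be applied to a whole column with `Series.str.normalize("NFC")` on both dataframes before calling `merge`.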

iterative / PyDrive2

file name encoding issues? #172