Closed abubelinha closed 2 years ago
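Those two files have names that look the same, but they are encoded differently. Try copying the ó from both (in the browser) and pasting it into this tool https://onlineunicodetools.com/convert-unicode-to-utf16 to see that in the first file we have the pair
0x006f 0x0301
(the letter o followed by a combining acute accent), and in the second:
0x000a 0x00f3
(0x000a is a line feed, most likely picked up while copying; the letter itself is the single precomposed code point 0x00f3).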
They might be the same symbol, or they might not. Usually, to deal with this kind of situation you need to normalize the names, using something like: https://stackoverflow.com/questions/16467479/normalizing-unicode
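As a minimal sketch of what normalization does in this case (the two string literals below are illustrative stand-ins for the two spellings of ó, not the actual file titles):

```python
import unicodedata

# Illustrative stand-ins for the two spellings of "ó" (not the real file titles).
decomposed = "o\u0301"   # U+006F LATIN SMALL LETTER O + U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00f3"   # U+00F3 LATIN SMALL LETTER O WITH ACUTE

# The strings render the same but are different code point sequences.
same_before = decomposed == precomposed

# NFC composes base letter + combining mark into the single precomposed code point.
same_after = (
    unicodedata.normalize("NFC", decomposed)
    == unicodedata.normalize("NFC", precomposed)
)

print(same_before, same_after)  # False True
```

NFD would work as well, as long as both sides are normalized to the same form before they are compared.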
As for these \u0301 escapes: I believe this is related to the local system setup (Python version, terminal, etc.). There is nothing particularly wrong with it; it just means that the character can't be printed there. In my browser and in my CLI both file names display fine and look the same.
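When a terminal cannot display the characters, the code points behind a name can also be inspected directly in Python instead of relying on how it prints. A minimal sketch, where `describe` is a hypothetical helper and `title` is an illustrative fragment rather than the real file name:

```python
import unicodedata

def describe(text):
    """Return a (code point, Unicode name) tuple for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

title = "o\u0301"  # hypothetical fragment of a file name (decomposed ó)

# ascii() escapes every non-ASCII character, regardless of the terminal encoding.
print(ascii(title))

for code, name in describe(title):
    print(code, name)
```

Two names that look identical on screen but produce different lists here are different strings as far as Python is concerned.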
Thanks for your answer, @shcheklein. I lost two days struggling with this. Actually, I didn't detect the problem when printing, but when trying to join datasets using the filename as a unique joining key (it turned out not to be "unique").
So I guess in this case the fact that my CLI couldn't print one of them was pretty useful. Otherwise I wouldn't have noticed the difference between the strings, and I would have thought something was wrong in my pandas merge query.
Just for my own reference, I leave here the code which solved my problem (normalizing the Unicode names as you suggested):
import unicodedata
# Normalize both titles to NFC so that equivalent names compare equal
print(unicodedata.normalize('NFC', file['title']))
print(unicodedata.normalize('NFC', file2['title']))
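For the join itself, one option is to normalize the key on both sides before matching. A minimal sketch using only the standard library; the sample records, the `id` and `size` fields, and the `nfc` helper are all hypothetical:

```python
import unicodedata

def nfc(name):
    """Normalize a file name to NFC so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", name)

# Hypothetical records: one list as listed from Drive, one from the local disk.
drive_files = [{"title": "informacio\u0301n.txt", "id": "abc"}]   # decomposed ó
local_files = [{"title": "informaci\u00f3n.txt", "size": 123}]    # precomposed ó

# Index the local records by normalized name.
local_by_name = {nfc(f["title"]): f for f in local_files}

# Inner join on the normalized name; the merged record keeps the NFC title.
merged = [
    {**f, **local_by_name[nfc(f["title"])], "title": nfc(f["title"])}
    for f in drive_files
    if nfc(f["title"]) in local_by_name
]
print(merged)
```

With pandas, the same idea would be to normalize the key column in both dataframes before merging, for example with `Series.str.normalize('NFC')`.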
I am having trouble with this file (shared with me by somebody else), which contains accented characters in its title: https://drive.google.com/file/d/1yG2NeFXK0sPLL6QseY0LFfYt7j-a7RiJ
In the browser, the file name looks normal (accents and tildes visible):
But when processing with Python, those characters are replaced by escaped character codes.
This doesn't happen with other files that use similar characters. In fact, if I make a copy of the file and rename it using the Drive web interface, the problem disappears.
OUTPUT:
How can I solve this and make my print output be the same for both files? Well, actually the problem is more than printing.
This issue prevents me from merging two datasets of filenames (one coming from PyDrive, with the odd characters, and the other from local disk) because their names look different to Python, so merging on the Pandas dataframe column filename is not working. Thanks