johnwmillr / LyricsGenius

Download song lyrics and metadata from Genius.com 🎢🎀
http://www.johnwmillr.com/scraping-genius-lyrics/
MIT License
898 stars 159 forks

Fixed issue with printing unicode #126

Closed DarrelDonald closed 3 years ago

DarrelDonald commented 4 years ago

Whenever Unicode characters needed to be printed, an error was raised. I encoded the strings in the print statements and it resolved the issue.

johnwmillr commented 4 years ago

Hi Darrel,

Can you post some examples of songs that were giving you errors before this change?

Thanks, John

DarrelDonald commented 4 years ago

"'Till I Collapse" by Eminem was the only one I encountered before modifying the code. I was trying to download all of Eminem's songs. I think there were a lot because I had it print a message in the console every time it would happen at first, but I couldn't see exactly which songs were doing it.

johnwmillr commented 4 years ago

I can't recreate this issue with the latest version of the package (1.8.2). Can you test your search with the latest version of the package? Or provide example code that produces the error?

John

DarrelDonald commented 4 years ago

python3 -m lyricsgenius song "'Till I Collapse" "Eminem" --save
Searching for "'Till I Collapse" by Eminem...
Done.
Traceback (most recent call last):
  File "\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "\lib\site-packages\lyricsgenius\__main__.py", line 56, in <module>
    main()
  File "\lib\site-packages\lyricsgenius\__main__.py", line 43, in main
    print("Saving lyrics to '{s}'...".format(s=song.title))
  File "\lib\encodings\cp437.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2019' in position 18: character maps to <undefined>

johnwmillr commented 4 years ago

Hi @DarrelDonald, sorry for the delay. Are you still running into this issue? What version of Python and OS are you using?

DarrelDonald commented 4 years ago

I haven't used it since then. I was using Python 3.5 and my operating system was Windows 10.

allerter commented 4 years ago

@johnwmillr, this happened to me too recently. Try this:

song = genius.search_song('60 days sober and cool', artist='Noah Cyrus')
song.to_text('song.txt')

The problem is with the \u2005 character, which is one of the Unicode space characters. The difference from Darrel's issue is that his Unicode character was in the song's title, whereas mine was in the lyrics. If this issue doesn't have to do with my Python environment, I guess we should encode all text as 'utf8' when printing or saving. There is already an issue for this problem at #138.
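
The two failures described in this thread can be reproduced without a Windows console. This is a rough sketch: the title string is reconstructed from the traceback earlier in the thread (U+2019 at position 18 of the message), the lyric line is made up, and `try_encode` is a hypothetical helper, not part of lyricsgenius.

```python
import os
import tempfile

# Assumed inputs: title inferred from the traceback, lyric line invented.
message = "Saving lyrics to '{s}'...".format(s="\u2019Till I Collapse")
lyrics = "sample\u2005lyric line"


def try_encode(text, encoding):
    """Return None on success, or the UnicodeEncodeError that was raised."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError as err:
        return err
    return None


# cp437 is the console code page named in the traceback.
console_error = try_encode(message, "cp437")
utf8_error = try_encode(message, "utf-8")

# Saving with an explicit encoding avoids relying on the platform default.
path = os.path.join(tempfile.mkdtemp(), "song.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(lyrics)
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()
```

Here the cp437 attempt fails while the UTF-8 attempt and the file round trip succeed, which matches the idea of passing an explicit encoding when saving.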

johnwmillr commented 4 years ago

@Allerter, does the snippet you shared produce an error for you? When running in my environment, the song saves without issue.

allerter commented 4 years ago

@johnwmillr, it does. I guess it might have to do with my environment. But either way, the problem in this issue is probably a real one. It's because the Windows console uses a legacy code page (cp437 in the traceback above) while Python 3 strings are Unicode. So printing a string that contains characters outside that code page results in an error in Python 3.5 and lower. This was solved in 3.6, where Python writes to the Windows console through its Unicode API instead of the code page (PEP 528). If someone with 3.5 or lower wanted to be able to print Unicode, they would have to do this (from a solution on SO):

chcp 65001
set PYTHONIOENCODING=utf-8

johnwmillr commented 4 years ago

Thanks for the explanation, @Allerter. Do you know of a package-wide approach that would address the <= 3.5 issue and be more robust than adding .encode('utf8') wherever we print or save text?

allerter commented 4 years ago

@johnwmillr, unfortunately, I don't know of a package-wide way to achieve this. We could probably set the PYTHONIOENCODING environment variable to utf-8. That would solve the issue of printing Unicode characters. If we only set it once when the Genius class is instantiated, we would have to rely on the user not changing this later on. So I don't think that's a good idea. Looking at this question on StackOverflow, I think this might be the way to go:

allerter commented 3 years ago

Seems like unicodedata.normalize can solve the issue of printing Unicode to output in Python <=3.5.
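
A quick check of what unicodedata.normalize does to the two characters mentioned in this thread. This is only an illustration: the NFKC form and the cp437 target are assumptions, and `encodable` is a hypothetical helper.

```python
import unicodedata


def encodable(text, encoding="cp437"):
    """Report whether text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# U+2005 (four-per-em space) has a compatibility mapping to a plain space.
space = unicodedata.normalize("NFKC", "a\u2005b")

# U+2019 (right single quotation mark) has no decomposition, so it survives.
quote = unicodedata.normalize("NFKC", "\u2019Till")

space_ok = encodable(space)
quote_ok = encodable(quote)
```

Under these assumptions, normalization helps with the space character but leaves the quotation mark unencodable in cp437, so it may not cover every case.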

allerter commented 3 years ago

:sweat_smile: I think I forgot to squash the merge. As unicodedata.normalize turned out not to work, I added the safe_unicode function in utils.py and used it wherever the package prints something that might lead to a UnicodeEncodeError. What do you think about this solution, @johnwmillr? Also, all the open() calls that save lyrics will now have encoding='utf8', which will solve saving lyrics that contain Unicode characters (#138). This kind of removes the need for the binary_encoding parameter if it was only meant for the Unicode issue.
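
For readers who want a feel for the idea, here is a minimal sketch of a helper in the same spirit. It is not the safe_unicode code from this PR; the name `safe_unicode_sketch`, its `encoding` parameter, and the choice of the "replace" error handler are all assumptions.

```python
import sys


def safe_unicode_sketch(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by '?', so that printing it should not raise.

    Hypothetical helper; defaults to the encoding of sys.stdout.
    """
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)


# Simulate a cp437 console and a UTF-8 console.
on_cp437 = safe_unicode_sketch("\u2019Till I Collapse", "cp437")
on_utf8 = safe_unicode_sketch("\u2019Till I Collapse", "utf-8")
```

The trade-off in this sketch is that unsupported characters are lost in the printed message, while the saved file can still keep them through encoding='utf8'.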

allerter commented 3 years ago

Another solution would be to use the logging module from Python's standard library.
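
How logging would help is not spelled out here. One possible reading is that all messages go through a single handler, so the stream and its error policy are configured in one place. A rough sketch under that assumption follows; the logger name, `make_logger`, and the in-memory cp437 stream standing in for a console are all hypothetical.

```python
import io
import logging


def make_logger(stream, name="lyricsgenius_sketch"):
    """Build a logger that writes bare messages to the given text stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


# Stand-in for a cp437 console that escapes unsupported characters.
raw = io.BytesIO()
console = io.TextIOWrapper(raw, encoding="cp437", errors="backslashreplace")

log = make_logger(console)
log.info("Saving lyrics to '%s'...", "\u2019Till I Collapse")
console.flush()

output = raw.getvalue().decode("cp437")
```

With this setup the unsupported character is written as an escape sequence instead of raising, and the policy lives in the stream configuration rather than at each call site.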

allerter commented 3 years ago

I resolved the conflicts, but there was a green "Commit merge" button, and since I wasn't sure whether it would update the package or the PR, I didn't submit it.

johnwmillr commented 3 years ago

The PR looks good! Thank you @DarrelDonald and @Allerter for your work on this. Merging now.