johnwmillr / LyricsGenius

Download song lyrics and metadata from Genius.com 🎶🎤
http://www.johnwmillr.com/scraping-genius-lyrics/
MIT License
898 stars, 159 forks

Timeouts at seemingly random moments #121

Closed Arsanian closed 3 years ago

Arsanian commented 4 years ago

I'm trying to download a large number of lyrics for a university project. I have one file per genre, each listing 50 artists whose lyrics I want to download in full.

So I wrote a Python script that scans the folder and reads the lists one by one, trying to download the lyrics for every artist in them.

Sometimes the following happens:

    Timeout raised and caught: HTTPSConnectionPool(host='api.genius.com', port=443): Read timed out. (read timeout=5)
    Traceback (most recent call last):
      File "lyricsapi.py", line 54, in <module>
        artist = api.search_artist(a.strip(), max_songs=max_songs, sort="title")
      File "/home/duke/anaconda3/envs/dynamusic/lib/python3.7/site-packages/lyricsgenius/api.py", line 356, in search_artist
        song = Song(info, lyrics)
      File "/home/duke/anaconda3/envs/dynamusic/lib/python3.7/site-packages/lyricsgenius/song.py", line 26, in __init__
        self._body = json_dict['song'] if 'song' in json_dict else json_dict
    TypeError: argument of type 'NoneType' is not iterable

This error happens unpredictably, sometimes after 50 songs, sometimes after 600. Earlier today it happened after downloading 113 songs by Eminem, but on the next try it managed to download all 490 of his songs, only to fail a few songs into the next artist in line.

This also happened when I ran the script on my server, which has a separate internet connection.

Version info

mxdillon commented 4 years ago

I'm facing the same issue

GiorgioGhisotti commented 4 years ago

A workaround for this is to use a try...except block and place the request in a while loop

artists = []
while True:
    try:
        artists.append(genius.search_artist(artist, max_songs=10000))
        break
    except Exception:
        # Retry on any error; unlike a bare except, Ctrl+C still stops the loop.
        pass

This will simply retry the call until it works. I successfully used this to scrape the full discography of 50 artists and I didn't run into any further problems.
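A bounded variant of the same idea may be safer for long runs, since it gives up after a fixed number of attempts and waits a little longer between each one. The sketch below is only an illustration: call_with_retries and its parameters (max_attempts, base_delay, retry_on, sleep) are hypothetical names, not part of LyricsGenius.

```python
import time


def call_with_retries(func, *args, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying when it raises one of retry_on.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between
    attempts and re-raises the last exception once max_attempts is reached.
    This is a hypothetical helper, not a LyricsGenius function.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** (attempt - 1))


# Possible usage with the call from the workaround above:
# artists.append(call_with_retries(genius.search_artist, artist, max_songs=10000))
```

Passing the exception types explicitly through retry_on keeps unrelated bugs (a typo, a missing key) from being silently retried.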

dmlunde commented 4 years ago

@Arsanian how did you manage to narrow down the Eminem number of songs to 490?

danielhorizon commented 4 years ago

I've tried the above and am still getting a timeout..

"HTTPSConnectionPool(host='api.genius.com', port=443): Read timed out. (read timeout=5)"

Any suggestions? I've tried using a timeout as well (for 60 seconds) and tried the while() and a try/catch.

ArinkB commented 3 years ago

I am also having this same issue. My loop pulls lyrics based on the artist name and song title, then appends them to a list. I have a try and except, and the error still pops up. I also have time.sleep(15) just in case. The code can run anywhere from 30 minutes to 5 hours before failing, so it requires a lot of monitoring.

allerter commented 3 years ago

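@ArinkB, could you please provide the following info so we can re-create and debug your issue:

  • the version of LyricsGenius
  • your traceback
  • a minimal working script so that we can re-create the error.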

ArinkB commented 3 years ago

@ArinkB, could you please provide the following info so we can re-create and debug your issue:

  • the version of LyricsGenius
  • your traceback
  • a minimal working script so that we can re-create the error.

Sure, the dataframe: [image]

lyrics = []

def get_lyrics(): #no arguments needed
    while len(lyrics) != len(end_df): 
        genius = lyricsgenius.Genius("API KEY") # call to lyricsgenius
        for track in end_df.values: 
            song = genius.search_song(track[2], track[0])
            try:    
                lyrics.append(song.lyrics) 
            except:
                lyrics.append(np.NAN) 
        time.sleep(40)

The error:

    D:\Anaconda\lib\site-packages\lyricsgenius\api\base.py in _make_request(self, path, method, params, public_api, **kwargs)
         58         except Timeout as e:
         59             error = "Request timed out:\n{e}".format(e=e)
    ---> 60             raise Timeout(error)
         61         except HTTPError as e:
         62             error = str(e)

    Timeout: Request timed out: HTTPSConnectionPool(host='api.genius.com', port=443): Read timed out. (read timeout=5)

allerter commented 3 years ago

@ArinkB, thanks for providing the information. Although the timeout is probably a valid issue, I don't think it's the main problem with your script. I tested your script on Spotify's Viral 50 songs, and here are a couple of things that you could improve:

import lyricsgenius
import numpy as np
from requests.exceptions import Timeout

lyrics = []

def get_lyrics():
    # while len(lyrics) != len(end_df): #1
    genius = lyricsgenius.Genius(token)
    genius.timeout = 15
    genius.sleep_time = 40  # 2
    # or: Genius(token, timeout=15, sleep_time=40)
    for track in end_df.values:
        retries = 0
        while retries < 3:
            try:
                song = genius.search_song(track[2], track[0])
            except Timeout:
                retries += 1
                continue
            if song is not None:
                lyrics.append(song.lyrics)
            else:
                lyrics.append(np.nan)
            break
        else:
            # All three attempts timed out: record a gap so that
            # lyrics stays aligned with the rows of end_df.
            lyrics.append(np.nan)

  1. The outer while loop can run forever, since some songs can't be found, and there's no need for it in the first place.
  2. With the genius.sleep_time attribute, there's no need for time.sleep(40) anymore. Also, I don't think the API requires a 40-second sleep. When I tested your script, I removed the time.sleep(40) line and everything worked fine.

Now your script will search for the songs and, in case of a timeout, retry the search up to three times before moving on to the next song (this should probably be a feature, @johnwmillr).

ArinkB commented 3 years ago

@Allerter Thank you! I appreciate your help and insight. It has been pulling for 3 hours now and no issues so far.

NIkitabala commented 3 years ago

@ArinkB Hi, can you show me exactly how you use your script? I'm trying to use this solution, but I'm still getting an error.

ArinkB commented 3 years ago

@NIkitabala Sure, this is the notebook I used it in. I modified it slightly because my original project plan didn't work out at the time: https://github.com/ArinkB/Predicting-Song-Skips/blob/master/1_Data%20Acquisition.ipynb

allerter commented 3 years ago

Based on this comment that I posted on #168, I think these random timeout errors will be solved by #162. We'll see.