melaniewalsh / Intro-Cultural-Analytics

Introduction to Cultural Analytics & Python, course website and online textbook powered by Jupyter Book
https://melaniewalsh.github.io/Intro-Cultural-Analytics
GNU General Public License v3.0

Issue on page /04-Data-Collection/08-Collect-Genius-Lyrics.html #36

Open adamlporter opened 1 year ago

adamlporter commented 1 year ago

When I tried to work through this page, I got an error when trying to execute

artist = LyricsGenius.search_artist("Missy Elliott", max_songs=6)

The error is

HTTPError: 403 Client Error: Forbidden for url: https://genius.com/api/search/multi?q=Missy+Elliott

Apparently, genius.com has changed one (or more) of its settings, so that LyricsGenius no longer works. See:

- https://stackoverflow.com/questions/72078610/getting-lyrics-from-genius-api-gives-error
- https://github.com/johnwmillr/LyricsGenius/issues/190
- https://github.com/johnwmillr/LyricsGenius/issues/220

The conclusion from these discussions is (unhappily) not to use LyricsGenius.
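For reference, the same 403 can be reproduced without LyricsGenius by calling the search endpoint from the error message directly (a minimal sketch; the exact response may vary over time and by network):

import requests

# Hit the same public search endpoint that LyricsGenius uses under the hood.
# If genius.com is blocking the request, this prints 403 / Forbidden.
response = requests.get(
    "https://genius.com/api/search/multi",
    params={"q": "Missy Elliott"},
)
print(response.status_code, response.reason)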

adamlporter commented 1 year ago

The procedures clean_up() and get_all_songs_from_album() still work. I rewrote Melanie Walsh's download_album_lyrics() procedure to work without accessing LyricsGenius:

import re
import requests
from pathlib import Path
from bs4 import BeautifulSoup

def download_album_lyrics(artist, album_name):
    clean_songs = get_all_songs_from_album(artist, album_name)

    # Genius page URLs use hyphens in place of spaces
    artist = artist.replace(" ", "-")
    album_name = album_name.replace(' ', '-')
    folder = f"{artist}_{album_name}"

    for song in clean_songs:
        song_title = re.sub(r"[^\w\s]", '', song)  # get rid of punctuation
        song_title = song_title.replace(' ', '-')
        try:
            url = f"https://genius.com/{artist}-{song_title}-lyrics"
            response = requests.get(url)
            if response.status_code == 200:
                # create the output folder the first time a song is saved
                Path(folder).mkdir(parents=True, exist_ok=True)
                html = response.text
                document = BeautifulSoup(html, "html.parser")
                # Genius has used both a plain "lyrics" class and "Lyrics__Root..." classes
                div = document.find("div", class_=re.compile("^lyrics$|Lyrics__Root"))
                try:
                    lyrics = div.get_text("\n")
                    filen = f"{folder}/{song_title}.txt"
                    with open(filen, 'w') as file:
                        file.write(lyrics)
                    print(f"saving {filen}")
                except AttributeError:
                    print(f"No lyrics found for {song_title}")

            else:
                print(f"problem getting lyrics for {artist} - {song_title}")
                print(f"error code was {response.status_code}")
        except FileNotFoundError:
            print(f"{url} is not found")

I have tested this and it works -- sort of. I was able to download the lyrics for three albums, but then requests.get(url) started throwing FileNotFoundErrors.

I suspect genius.com is tracking IP addresses and starts blacklisting them if they make too many requests (either in total or within a certain period of time). Interestingly, even after download_album_lyrics() stops working, get_all_songs_from_album() continues to work.
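If the block really is rate-based, one possible mitigation (only a sketch -- I have not verified Genius's actual limits, and the delay value is a guess) is to pause between page requests, e.g. by swapping requests.get(url) for something like:

import time
import requests

def polite_get(url, delay=5):
    # Sleep a few seconds before each request to reduce the chance of
    # tripping a rate limit; 5 seconds is an arbitrary, untested value.
    time.sleep(delay)
    return requests.get(url)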

adamlporter commented 1 year ago

It might be possible to replace genius.com with lyrics.com. The latter site has a simpler HTML structure, which makes it possible to extract the lyrics text without using a regular expression. (This may be similar to the structure genius.com used when Melanie first wrote the textbook.) For example:

import requests
from bs4 import BeautifulSoup

# On lyrics.com, the full lyric text sits inside a single <pre> tag
response = requests.get("https://www.lyrics.com/lyric/8237688")
html = response.text
document = BeautifulSoup(html, "html.parser")
print(document.find('pre').text)