johnwmillr / LyricsGenius

Download song lyrics and metadata from Genius.com 🎶🎤
http://www.johnwmillr.com/scraping-genius-lyrics/
MIT License

Remove the hyperlink text from lyrics scraper #218

Open bdubs1991 opened 2 years ago

bdubs1991 commented 2 years ago

When you use your package to scrape lyrics, it includes the text of the hyperlinks at the end of the lyrics (see attached screenshot). For a reproducible example, I have included the code from a Jupyter notebook below. This text can be removed with the regex I have created below. I am agnostic as to whether this should be applied to all lyrics or only when remove_section_headers=True is set.

Potential Solution: hyperlinks_removed = re.sub(r"[0-9]+EmbedShare URLCopyEmbedCopy",'',lyrics)
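As a minimal sketch of how that substitution might be applied, the helper below wraps the regex in a function. The name strip_embed_footer and the sample string are hypothetical, made up for illustration; they are not part of lyricsgenius.

```python
import re

# Pattern from the proposed solution: a share count followed by the widget text.
EMBED_FOOTER = re.compile(r"[0-9]+EmbedShare URLCopyEmbedCopy")


def strip_embed_footer(lyrics: str) -> str:
    """Return the lyrics with the embed/share widget text removed."""
    return EMBED_FOOTER.sub("", lyrics)


# Hypothetical sample string, not real scraper output.
sample = "Last line of the song123EmbedShare URLCopyEmbedCopy"
print(strip_embed_footer(sample))
```

Note that this pattern requires at least one digit, and it would also consume digits that happen to end the final lyric line.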

Example for reproduction

    import lyricsgenius as lg
    import genius_token as gt  # local module that holds the token

    genius = lg.Genius(
        gt.token,  # Client access token from Genius Client API page
        skip_non_songs=True,
        excluded_terms=["(Remix)", "(Live)"],
        remove_section_headers=True,
    )

    songs = genius.search_artist('Kanye-west', max_songs=1, sort='popularity').songs
    s = [song.lyrics for song in songs]

    # Show the last 30 characters, where the hyperlink text appears
    print(s[0][-30:])

Vuizur commented 2 years ago

Thanks for the regex, I had exactly the same problem.

I slightly modified the regex to hyperlinks_removed = re.sub(r"[0-9]*URLCopyEmbedCopy",'',lyrics) because the other one failed for songs that had zero shares.

wistephens commented 2 years ago

Yeah. I'm hitting this as well. I'll add the regex to my code as a short-term fix.

emorevival commented 2 years ago

I had to slightly modify Vuizur's solution because it was only getting the URLCopy part

(JavaScript):

    let re = /[0-9].*URLCopyEmbedCopy/
    lyrics.match(re)
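For Python users, one way to cover both cases raised in this thread (songs with a share count and songs with none) might look like the sketch below. It assumes the footer always has the shape of an optional number followed by "EmbedShare URLCopyEmbedCopy" at the very end of the lyrics, which is inferred from the comments here rather than confirmed against Genius pages. The names FOOTER and clean_lyrics and the sample strings are hypothetical.

```python
import re

# Assumed footer shape (inferred from this thread, not verified):
# an optional share count, then the widget text, at the end of the string.
FOOTER = re.compile(r"[0-9]*EmbedShare URLCopyEmbedCopy\s*$")


def clean_lyrics(lyrics: str) -> str:
    """Strip the trailing embed/share footer if it is present."""
    return FOOTER.sub("", lyrics)


# Hypothetical samples: one with a share count, one without.
with_count = "Final line42EmbedShare URLCopyEmbedCopy"
without_count = "Final lineEmbedShare URLCopyEmbedCopy"

print(clean_lyrics(with_count))
print(clean_lyrics(without_count))
```

Anchoring the pattern with `$` keeps it from touching matching text earlier in the lyrics, at the cost of missing the footer if anything other than whitespace follows it.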

jules-party commented 2 years ago

Thanks! I was wondering why my song lyrics included this!