johnwmillr / LyricsGenius

Download song lyrics and metadata from Genius.com 🎶🎤
http://www.johnwmillr.com/scraping-genius-lyrics/
MIT License
878 stars, 159 forks

fix ads and other unwanted content in lyrics #272

Open · xathon opened 2 months ago

xathon commented 2 months ago

This PR fixes the embed string at the end of the lyrics, and also removes the ticket ads and contributor counts that have shown up in my scrapes.

vinchilive commented 2 months ago

But it also removes the newline where the ad block was:

[Screenshot 2024-05-10 at 19.20.27]

@xathon so maybe instead of decompose we can use replaceWith('\n') for ads?
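
For anyone following along, here is a minimal sketch of the difference between the two approaches. The markup, the "ad" class name, and the extract() helper are made up for illustration and are not the actual Genius page structure or the code in this PR. Note that replace_with is the current BeautifulSoup spelling; replaceWith is the older alias.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an ad <div> sits between two lyric sections
# and is the only thing separating them.
HTML = (
    '<div id="lyrics">[Verse 1]<br/>First line'
    '<div class="ad">Buy tickets</div>'
    '[Chorus]<br/>Second line</div>'
)


def extract(html: str, keep_break: bool) -> str:
    """Return the lyrics text with ad blocks stripped (illustrative helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Turn explicit line breaks into newline characters.
    for br in soup.find_all("br"):
        br.replace_with("\n")
    for ad in soup.find_all("div", class_="ad"):
        if keep_break:
            # Swap the ad for a newline so the neighbouring lines stay apart.
            ad.replace_with("\n")
        else:
            # Remove the ad entirely; the neighbouring lines get glued together.
            ad.decompose()
    return soup.find("div", id="lyrics").get_text()


print(repr(extract(HTML, keep_break=False)))
print(repr(extract(HTML, keep_break=True)))
```

With decompose() the text around the ad runs together ("First line[Chorus]"), while replacing the ad with a newline keeps the section marker on its own line.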

xathon commented 2 months ago

Ah, good catch. I was using the function to remove the markers, so I didn't see that. I'll be back home on Monday and can put that in then.

vinchilive commented 2 months ago

It's not perfect either, so I ended up using lyrics = re.sub(r"(?<!\n)\n(\[)", r"\n\n\1", lyrics) to add missing newlines before [blocks] :)
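
A small sketch of what that substitution does, using made-up sample lyrics and a hypothetical wrapper name (add_section_breaks). The lookbehind (?<!\n) only matches a newline that is not already preceded by another newline, so sections that already have a blank line before them are left alone.

```python
import re


def add_section_breaks(lyrics: str) -> str:
    """Insert a blank line before any [section] header that lacks one."""
    # A single newline followed by "[" becomes two newlines followed by "[".
    return re.sub(r"(?<!\n)\n(\[)", r"\n\n\1", lyrics)


# Made-up sample: [Chorus] is missing its blank line, [Outro] already has one.
sample = "[Verse 1]\nFirst line\n[Chorus]\nSecond line\n\n[Outro]\nLast line"
fixed = add_section_breaks(sample)
print(fixed)
```

Two caveats, as far as I can tell: the pattern needs at least one newline before the bracket, so it will not repair a header that has been glued directly onto the previous line, and it will also add a blank line before any ordinary lyric line that happens to start with "[".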