johnwmillr / LyricsGenius

Download song lyrics and metadata from Genius.com 🎶🎤
http://www.johnwmillr.com/scraping-genius-lyrics/
MIT License
878 stars, 159 forks

Regex for excluded_terms #261

Closed: yorkshirelandscape closed 9 months ago

yorkshirelandscape commented 10 months ago

Hiya. Thanks for coming up with this package. I'm populating the brain of a chatbot with lyrics and it's been indispensable. I'm having a hard time, however, using excluded_terms properly. I can't guarantee that I get all of an artist's lyrics in one pass; more songs may be added later, for example. So I store every title I've pulled in a JSON file, and when I run another search on the same artist, I look the artist up in lyrics.json and add any song titles I've already pulled to excluded_terms.

The problem is punctuation. Any punctuation at all in a title breaks the comparison, and the song gets pulled again. I've tried using re.escape(), but it seems like excluded_terms is not expecting \\-style escapes. I've even tried manually escaping the regex special characters, but it comes out the same way. Can you recommend a method for adding whole song titles with punctuation to excluded_terms? Thanks for thinking about it either way.

Here's the bit of code I'm using (along with a couple of commented-out alternatives):

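import json
import os
import re  # only needed for the commented-out escaping alternative below

# `args` and `genius` are defined earlier in the script (not shown here).
exc_terms = [ "acoustic", "remix", "mix$", "live$", "live at", 
              "live in", "demo$", "version", "DVD", "edit$",
              "booklet", "album", "live from", "extended" ]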

# Grab existing song titles to add to exclusion list
# pattern = re.compile( '([\[\]$&+,:;=?@#\'<>.^*()%!-])' )
if os.path.isfile( 'lyrics.json' ):
    with open( 'lyrics.json', 'r' ) as file:
        artists = json.load( file )
    titles = []
    found_artist = next( ( artist for artist in artists if artist[ "artist" ] == args.artist ), None )
    if found_artist:
        for s in found_artist[ "songs" ]:
            # titles.append( re.sub( pattern, r'\\\1', s["title"] ) )
            titles.append( s['title'] )
        exc_terms.extend( titles )
        print( titles )
genius.excluded_terms = exc_terms

yorkshirelandscape commented 9 months ago

I figured it out. It had nothing to do with the regex. lyricsgenius internally calls clean_str on each song title before checking it against excluded_terms, so I had to do the same to my excluded titles.
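
For anyone who lands here later, a minimal sketch of that fix follows. It does not import lyricsgenius. Both normalize_title and is_excluded are hypothetical stand-ins, written on the assumption that clean_str strips punctuation, trims, and lowercases the title, and that the excluded terms are joined into one case-insensitive regex searched against the cleaned title. Check the library source for the exact behaviour. The titles are made-up examples.

```python
import re
import string
import unicodedata


def normalize_title(title):
    """Hypothetical stand-in for clean_str: strip ASCII punctuation, trim, lowercase."""
    stripped = title.translate(str.maketrans("", "", string.punctuation))
    return unicodedata.normalize("NFKC", stripped.strip().lower())


def is_excluded(title, excluded_terms):
    """Assumed check: search the normalized title for any excluded term (as regex)."""
    pattern = re.compile("|".join(f"({term})" for term in excluded_terms), re.IGNORECASE)
    return bool(pattern.search(normalize_title(title)))


base_terms = ["acoustic", "remix", "live$"]
stored_titles = ["Don't Stop (Believin')", "What's Up?"]  # e.g. loaded from lyrics.json

# Raw titles keep their punctuation, so they never match the cleaned title.
raw_terms = base_terms + stored_titles
# Normalizing stored titles the same way makes the comparison line up.
clean_terms = base_terms + [normalize_title(t) for t in stored_titles]

print(is_excluded("What's Up?", raw_terms))
print(is_excluded("What's Up?", clean_terms))
```

With the real package, the equivalent change would be to pass each stored title through the library's own clean_str before appending it to excluded_terms, so both sides of the comparison are normalized identically. Note that this stand-in only removes ASCII punctuation; curly quotes and similar characters would need extra handling.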