Open dsm-72 opened 1 year ago
Could you provide links to a few songs that have this garbage data?
I'm also experiencing similar behavior. Retrieving lyrics from Phoebe Bridgers, each song begins with "
This appears to be highly related to #237 and #215.
Package version: 3.0.1 OS: Windows 11 Python: 3.11.0
from lyricsgenius import Genius

# GENIUS_ACCESS_TOKEN holds your Genius API access token
genius = Genius(GENIUS_ACCESS_TOKEN, remove_section_headers=True, skip_non_songs=True,
                excluded_terms=["(Remix)", "(Live)", "(Version)", "(Voice)"])
artist = genius.search_artist("Phoebe Bridgers", max_songs=10)
artist.save_lyrics("phoebe_1st_10", extension="txt", verbose=True, overwrite=True)
@LukeMoraglia
If you don't mind using some crass helper functions for scrubbing the strings....
import re, string, operator, math
# STRING STUFF
def remove_punctuation(s):
    no_punc = str.maketrans('', '', string.punctuation)
    return s.translate(no_punc)

def remove_extra_spaces(s):
    return ' '.join(s.split())

def remove_apostrophe(s):
    return s.replace('’', '')

def replace_apostrophe(s):
    return s.replace('’', "'")

def remove_zero_width_space(s):
    return s.replace('\u200b', '')

def remove_right_to_left_mark(s):
    return s.replace('\u200f', '')

def scrub_string(s):
    r'''
    Removes characters this helper treats as unwanted
    (an opinionated choice), namely:
    - zero width spaces   '\u200b' ---> ''
    - right-to-left marks '\u200f' ---> ''
    - apostrophes         '’'      ---> ''
    - extra spaces        '  '     ---> ' '
    '''
    s = remove_zero_width_space(s)
    s = remove_right_to_left_mark(s)
    s = remove_apostrophe(s)
    s = remove_extra_spaces(s)
    return s

def replace_br(s):
    s = s.replace('<br/>', '\n')
    return s

def keep_until(s, substr, case_insensitive=False):
    # Return the part of s that precedes the first occurrence of substr
    if case_insensitive:
        try:
            index = s.lower().index(substr.lower())
            return s[:index]
        except ValueError:
            # NOTE: substr not found, so keep s unchanged
            return s
    # Case-sensitive: split once and take the part before substr
    return s.split(substr, 1)[0]

def until_embded(s, case_insensitive=False, use_regex=True):
    if use_regex:
        # Raw string so \d and \b reach the regex engine as written
        pattern = r"Embed\d+\b"
        # Strings are immutable, so return the result of the substitution
        return re.sub(pattern, '', s, flags=re.IGNORECASE)
    s = keep_until(s, 'Embed', case_insensitive=case_insensitive)
    # NOTE: could be Embed1, Embed27, etc
    # Strip trailing digits; the `s and` guard avoids an IndexError on ''
    while s and s[-1].isnumeric():
        s = s[:-1]
    return s

def remove_see_live_ad(s, include_word_boundaries=True):
    pattern = r"\bSee .+ Live\b" if include_word_boundaries else r"See .+ Live"
    ads = re.findall(pattern, s, flags=re.IGNORECASE)
    for ad in ads:
        # Remove the matched ad text, not the whole string
        s = s.replace(ad, '')
    return s

def remove_square_brackets(s):
    # Removes bracketed tags made of letters, digits, or underscores,
    # e.g. "[Chorus]"; re.sub deletes the whole match, brackets included
    pattern = r"\[[A-Za-z0-9_]+\]"
    return re.sub(pattern, '', s)
Then chain as needed
def clean_line(s):
    s = remove_extra_spaces(s)
    s = remove_see_live_ad(s)
    # ....
    s = replace_br(s)
    return s
# lyrics is a list of strings ['These are some lyrics', 'then some more', ....]
clean_lyrics = list(map(clean_line, lyrics))
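For anyone who wants to try the chaining idea end to end, here is a minimal, self-contained sketch. The sample lines are made up to resemble the artifacts mentioned in this thread, and `strip_format_chars` and `CLEANERS` are illustrative names, not part of lyricsgenius. Note that it swaps the per-character `replace` calls for a check on the Unicode `Cf` (format) category, which covers both `\u200b` and `\u200f`. That is an assumption about what you want removed, since `Cf` also includes the zero-width joiner used in some emoji sequences.

```python
import re
import unicodedata
from functools import reduce


def strip_format_chars(s):
    # Drop every character in Unicode category "Cf" (invisible format
    # characters such as U+200B and U+200F).
    return ''.join(ch for ch in s if unicodedata.category(ch) != 'Cf')


def remove_see_live_ad(s):
    # Non-greedy match so only the shortest "See ... Live" span is removed.
    return re.sub(r"\bSee .+? Live\b", '', s, flags=re.IGNORECASE)


def remove_extra_spaces(s):
    return ' '.join(s.split())


# Order matters: collapse whitespace last, after other removals leave gaps.
CLEANERS = [strip_format_chars, remove_see_live_ad, remove_extra_spaces]


def clean_line(s):
    # Apply each cleaner in turn, feeding its result into the next one.
    return reduce(lambda acc, fn: fn(acc), CLEANERS, s)


# Hypothetical sample lines, not real Genius output.
lyrics = ['These\u200b are  some lyrics',
          'See Some Artist Live then some more\u200f']
clean_lyrics = [clean_line(line) for line in lyrics]
print(clean_lyrics)
```

Keeping the cleaners in a list makes it easy to reorder them or drop one without touching `clean_line` itself.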
@dsm-72 This is what I was thinking I would probably end up doing. Thanks so much for sharing your solution!
Describe the bug Write a clear and concise description of what the bug is.
Lyrics objects often need to be thoroughly scrubbed for:
s
'<Title> Lyrics'
'[', ']', 'translations' amongst characters
"See <artist> Live" ads
Expected behavior Write a clear and concise description of what you expected to happen. That the lyrics returned would be cleaner...
To Reproduce Describe the steps required to reproduce the behavior.
for song in artist.songs: song.lyrics...
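As a rough illustration of post-processing each `song.lyrics` string in that loop, here is a sketch under two assumptions drawn from the reports in this thread, not from the library's documentation: the `<Title> Lyrics` header sits on the first line, and the tail is the word `Embed` with optional digits on either side. `clean_song_lyrics` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for real song objects.

```python
import re
from types import SimpleNamespace


def clean_song_lyrics(text):
    lines = text.splitlines()
    # Assumption: a "<Title> Lyrics" header, if present, is the first line.
    if lines and lines[0].rstrip().endswith('Lyrics'):
        lines = lines[1:]
    body = '\n'.join(lines)
    # Assumption: the text ends with "Embed", optionally wrapped in digits.
    body = re.sub(r"\d*Embed\d*\s*$", '', body)
    return body.strip()


# Stand-ins for artist.songs; the lyrics text is invented for this example.
songs = [SimpleNamespace(lyrics="Some Title Lyrics\nFirst line\nLast line12Embed")]
for song in songs:
    print(clean_song_lyrics(song.lyrics))
```

One caveat with this approach: a song whose real last line ends in a number would lose that number, so it may be worth spot-checking the output.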
Include the error message associated with the bug.
Version info
import lyricsgenius; print(lyricsgenius.__version__)
Additional context Add any other context about the problem here.