johnwmillr / LyricsGenius

Download song lyrics and metadata from Genius.com 🎶🎤
http://www.johnwmillr.com/scraping-genius-lyrics/
MIT License
878 stars 159 forks

Lyrics returned is buggy and includes ads? #254

Open dsm-72 opened 1 year ago

dsm-72 commented 1 year ago

Describe the bug

Lyrics objects often need to be thoroughly scrubbed for:

Expected behavior

The lyrics returned would be cleaner.

To Reproduce

  1. `genius.artist(...)`
  2. `for song in artist.songs: song.lyrics...`

allerter commented 1 year ago

Could you provide links to a few songs that have this garbage data?

LukeMoraglia commented 1 year ago

I'm also experiencing similar behavior. When retrieving lyrics for Phoebe Bridgers, each song begins with " Lyrics", lines such as "See Phoebe Bridgers LiveGet tickets as low as $66You might also like" appear throughout the lyrics, and the lyrics typically end with a number followed by the word "Embed". I also see the occasional "TranslationsPortuguêsItaliano" or other languages.

This appears to be closely related to #237 and #215.

Package version: 3.0.1
OS: Windows 11
Python: 3.11.0

```python
from lyricsgenius import Genius

# GENIUS_ACCESS_TOKEN is your Genius API access token
genius = Genius(
    GENIUS_ACCESS_TOKEN,
    remove_section_headers=True,
    skip_non_songs=True,
    excluded_terms=["(Remix)", "(Live)", "(Version)", "(Voice)"],
)
artist = genius.search_artist("Phoebe Bridgers", max_songs=10)
artist.save_lyrics("phoebe_1st_10", extension="txt", verbose=True, overwrite=True)
```

dsm-72 commented 1 year ago

@LukeMoraglia

If you don't mind using some crass helper functions for scrubbing the strings....

```python
import re
import string


# STRING STUFF
def remove_punctuation(s):
    no_punc = str.maketrans('', '', string.punctuation)
    return s.translate(no_punc)


def remove_extra_spaces(s):
    return ' '.join(s.split())


def remove_apostrophe(s):
    return s.replace('’', '')


def replace_apostrophe(s):
    return s.replace('’', "'")


def remove_zero_width_space(s):
    return s.replace('\u200b', '')


def remove_right_to_left_mark(s):
    return s.replace('\u200f', '')


def scrub_string(s):
    r'''
    Removes opinionated unwanted characters from string, namely:
    - zero width spaces    '\u200b' ---> ''
    - right-to-left marks  '\u200f' ---> ''
    - apostrophe           '’'      ---> ''
    - extra spaces         '   '    ---> ' '
    '''
    s = remove_zero_width_space(s)
    s = remove_right_to_left_mark(s)
    s = remove_apostrophe(s)
    s = remove_extra_spaces(s)
    return s


def replace_br(s):
    return s.replace('<br/>', '\n')


def keep_until(s, substr, case_insensitive=False):
    # Look for substr index and slice
    if case_insensitive:
        try:
            index = s.lower().index(substr.lower())
            return s[:index]
        except ValueError:
            # NOTE: index not found
            return s
    # Just split and take the first part
    return s.split(substr)[0]


def until_embed(s, case_insensitive=False, use_regex=True):
    if use_regex:
        # NOTE: lyrics typically end with "<number>Embed", e.g. "27Embed"
        return re.sub(r"\d*Embed\s*$", '', s, flags=re.IGNORECASE)
    s = keep_until(s, 'Embed', case_insensitive=case_insensitive)
    # Strip the number that preceded "Embed"; the `s and` guard avoids
    # an IndexError when the string becomes empty
    while s and s[-1].isnumeric():
        s = s[:-1]
    return s


def remove_see_live_ad(s, include_word_boundaries=True):
    # NOTE: the ad is often fused to the following text ("...LiveGet tickets..."),
    # in which case only include_word_boundaries=False will match
    pattern = r"\bSee .+? Live\b" if include_word_boundaries else r"See .+? Live"
    return re.sub(pattern, '', s, flags=re.IGNORECASE)


def remove_square_brackets(s):
    # NOTE: only matches single-token headers such as [Chorus], not [Verse 1]
    pattern = r"\[[A-Za-z0-9_]+\]"
    return re.sub(pattern, '', s)
```

Then chain as needed

```python
def clean_line(s):
    s = remove_extra_spaces(s)
    s = remove_see_live_ad(s)
    # ....
    s = replace_br(s)
    return s


# lyrics is a list of strings ['These are some lyrics', 'then some more', ....]
clean_lyrics = list(map(clean_line, lyrics))
```

LukeMoraglia commented 1 year ago

@dsm-72 This is what I was thinking I would probably end up doing. Thanks so much for sharing your solution!
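For anyone landing here later, the artifacts reported in this thread can also be stripped from a whole lyrics string in one pass. This is only a minimal sketch and not part of LyricsGenius: `scrub_lyrics` and the four pattern names are hypothetical, the patterns assume the residue looks exactly like the strings quoted in this thread (a `<title> Lyrics` prefix, the "See ... LiveGet tickets as low as $N" ad, "You might also like", and a trailing `<number>Embed`), and the sample text is made up.

```python
import re

# Assumed patterns, based only on the artifacts quoted in this thread.
HEADER_RE = re.compile(r"\A[^\n]*?Lyrics")  # "<title> Lyrics" prefix on the first line
AD_RE = re.compile(r"See [^\n]+? LiveGet tickets as low as \$\d+")
ALSO_LIKE_RE = re.compile(r"You might also like")
EMBED_RE = re.compile(r"\d*Embed\s*\Z")  # trailing "<number>Embed"


def scrub_lyrics(text):
    """Strip the page residue described in this thread from one lyrics string."""
    text = HEADER_RE.sub("", text, count=1)
    text = AD_RE.sub("", text)
    text = ALSO_LIKE_RE.sub("", text)
    text = EMBED_RE.sub("", text)
    # Drop lines left empty by the removals and trim stray whitespace.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Made-up sample that mimics the reported residue.
sample = (
    "Example Song Lyrics"
    "First line of the song\n"
    "See Phoebe Bridgers LiveGet tickets as low as $66You might also like"
    "Second line of the song\n"
    "Last line of the song12Embed"
)
print(scrub_lyrics(sample))
```

Caveats: a song whose real last line ends in digits, or whose first line contains the word "Lyrics", would be over-trimmed, and the "Translations..." prefix is not handled because its shape is not clear from the examples here.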