shaynak / taylor-swift-lyrics

Scraping all Taylor Swift lyrics!
MIT License
35 stars, 16 forks

Removal of special characters #2

Closed by jimmynotjames 5 months ago

jimmynotjames commented 6 months ago

I'm noticing the following in lyrics.json:

\u0435 is the Cyrillic small letter "е" (U+0435). Example: "I park\u0435d my car right between the Methodist". lyrics.json has 410 of these. I've been seeing these chars and filtering them on my end for over a year now, so it's not just this recent album release.

\u200b is a zero-width space, and it's weirdly hanging out in two song titles, which is messing with my data ingestion. The two instances in lyrics.json are "l\u200bong story short" and "r\u200bight where you left me". I saw that you used strip() to remove them from the edges, but these two have the char in the middle.

I actually have commits ready that I think fix the above things, along with an update for requirements.txt, if you want to give me permission to push PRs to your repo.

jimmynotjames commented 6 months ago

Oh, and I'm not actually sure how contributing to another GitHub repo works, so if I do have permissions and I missed something, let me know! I'm used to using GitHub Enterprise, and not sure about public GitHub repos.

shaynak commented 6 months ago

Hi! You're welcome to create a pull request on this repo - I'll approve it and merge it in after taking a look.

In order to create a pull request, you'll need to fork the repository, commit to your fork, and then create a pull request from the "Pull requests" page on your fork.

See this page for more info!

I had been handling special characters on the frontend of the app I ingest this data into, but it's definitely better to handle them on the backend :)
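A minimal sketch of what backend handling could look like, assuming the lyrics are Latin-script English so that every U+0435 is a stray lookalike for "e". `CLEANUP_TABLE` and `clean_text` are hypothetical names for illustration, not the code that was actually merged into this repo.

```python
# Map each unwanted code point to its replacement; None deletes the character.
CLEANUP_TABLE = str.maketrans({
    "\u0435": "e",   # Cyrillic small "е" (U+0435) -> Latin "e"
    "\u200b": None,  # zero-width space (U+200B) -> removed
})


def clean_text(text):
    """Normalize the special characters anywhere in the string, then trim."""
    # translate() handles characters in the middle of the string,
    # which strip() alone cannot reach.
    return text.translate(CLEANUP_TABLE).strip()


title = clean_text("l\u200bong story short")
line = clean_text("I park\u0435d my car right between the Methodist")
```

Using `str.translate` keeps all replacements in one table, so a newly discovered character only needs one extra entry. The blanket Cyrillic-to-Latin swap would corrupt any genuinely Cyrillic text, so it only makes sense under the assumption stated above.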

jimmynotjames commented 6 months ago

Ahhh, lol, interesting! Yes, I've been handling these special chars manually via search/replace in a text editor and via my 'front end', which is outputting this to a .tex file for LaTeX for typesetting.

shaynak commented 5 months ago

FYI, the lyrics should be updated now!