johnwmillr / LyricsGenius

Download song lyrics and metadata from Genius.com 🎶🎤
http://www.johnwmillr.com/scraping-genius-lyrics/
MIT License

Added get_referents() and get_song_annotations() functions #102

Closed (ludehon closed this pull request 5 years ago)

ludehon commented 5 years ago

get_song_annotations() can be used to get all of a song's annotations from a song ID. It returns a list of tuples: (fragment, list_of_annotations). Each fragment corresponds to a highlighted part of the song where people can write annotations. Resolves #100
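For illustration, a minimal sketch of what that return shape implies (the token placeholder, the song ID, and the variable names here are assumptions, not part of the PR):

from lyricsgenius import Genius

genius = Genius("access_token")                 # replace with a real Genius API token
annotations = genius.get_song_annotations(873)  # 873 is an arbitrary song ID

# Each element is a (fragment, list_of_annotations) tuple: the fragment is the
# highlighted part of the lyrics, and the list holds the annotations attached to it.
for fragment, notes in annotations:
    print(fragment)
    for note in notes:
        print("  -", note)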

ludehon commented 5 years ago

By the way, @johnwmillr, is it possible to permanently remove #101?

johnwmillr commented 5 years ago

Thanks for the new PR. I don't think I'm able to delete PRs, but this StackOverflow link says you can contact GitHub support and they'll delete the PR if it contains sensitive information.

ludehon commented 5 years ago

Hi again @johnwmillr, I was wondering how I could design some test functions. The problem is that the content I get from the Genius API is not fixed, so tests could start failing in the future. Someone suggested that I could use a locally stored result from the API and test my functions against that, but it implies storing data files just for testing purposes. What do you think?

johnwmillr commented 5 years ago

Thanks for continuing to work on this, @ludehon! Could you post some examples here of the queries you're sending to your get_referents and get_song_annotations functions?

Does the content from Genius change because users upload new annotations? If a list is returned for a particular song, would the first (or last?) element of the list always be the same?

If there is date metadata associated with annotations, can you sort by date, ensuring that the earliest post will always be first?

jessestuart commented 5 years ago

Hey all! Been interested recently in automating the construction of a corpus based on Genius data, came across this lovely project. This PR caught my eye, so I thought I'd chime in:

@johnwmillr: 1) I wouldn't hold my breath on getting a PR permanently deleted; GH's collaboration primitives are designed to be immutable, and with good reason. However, you can easily check the PR back out (it looks like the branch was deleted, but if you recreate it, it should still track), perform an interactive rebase to drop any commits with sensitive data, then force-push the branch, leaving no record of the mistaken commits on the GitHub PR page. Here's roughly how I'd go about it:

$ git clone https://github.com/johnwmillr/LyricsGenius.git && cd LyricsGenius
# Fetch a reference to the original branch, so we can wipe its history.
$ git fetch origin refs/pull/101/head:annotations
#                            ^ PR#    ^ Your original fork's branch name
$ git rebase -i HEAD~8
#               ^ This is simply saying that we're going to rewrite history
#               going back 8 commits, in this case. Adjust as you see fit...
#               doesn't much matter since you're wiping history anyway, so long
#               as you get everything.

Drop any/all commits you don't want public, then git push -f <your-remote> should do the trick. Let me know if you'd like clarification on any of those steps.

2) There are plenty of great tools for doing exactly what you're looking for; the problem of "not-fixed test data" is a tale as old as QA engineering :P A common way to simplify this is to take the variability of the API out of the equation, e.g. by making an initial request during testing, automatically recording the request params and response, serializing that to disk, and replaying it on subsequent test runs. That way, when a test fails, you can focus on whether there's a regression in your logic rather than wondering about API flakiness. I've used tools like this in a variety of languages, but not Python, although https://github.com/kevin1024/vcrpy seems to be in line with what you're looking for.
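As a rough sketch of that record-and-replay idea with vcrpy (the cassette path, token handling, and song ID below are assumptions, not something from this PR):

import os

import vcr
from lyricsgenius import Genius

# The first run performs a real request and records it into the cassette file;
# later runs replay the stored response, so the test no longer depends on what
# Genius happens to return that day. filter_headers keeps the API token out of
# the recorded file.
@vcr.use_cassette("tests/fixtures/song_873_annotations.yaml",
                  filter_headers=["Authorization"])
def test_get_song_annotations():
    genius = Genius(os.environ.get("GENIUS_ACCESS_TOKEN", "access_token"))
    annotations = genius.get_song_annotations(873)
    assert len(annotations) > 0

if __name__ == "__main__":
    test_get_song_annotations()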

johnwmillr commented 5 years ago

Wow, thank you for this, @jessestuart! These comments are a huge help. I'll try to get around to integrating all of this in the next few days.

ludehon commented 5 years ago

@johnwmillr Sorry for the late response, here is an example of the functions:

To access a song's annotations (you need the song's ID):

from lyricsgenius import Genius

song_id = 873

g = Genius("access_token")  # replace with a real Genius API access token
annotations = g.get_song_annotations(song_id)

for a in annotations:
    print(a)

Does the content from Genius change because users upload new annotations?

In my opinion, since almost anyone can add annotations, both the content and the length of any result can change. Annotations can be added to any fragment of the song, but they seem to be appended to the end of the annotations list (this is not specified by Genius).

If there is date metadata associated with annotations, can you sort by date, ensuring that the earliest post will always be first?

I searched the returned JSON and only found two timestamps, at referent["annotatable"]["client_timestamps"]. For example:

referents = g.get_referents(song_id)
print(referents["referents"][0]["annotatable"]["client_timestamps"])

However, I don't know how to use these timestamps: {'updated_by_human_at': 1556991479, 'lyrics_updated_at': 1560888793} (I didn't find any information about them).

As far as I know, there is no way to sort by date.
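If those values are Unix epoch seconds, which is only an assumption since Genius does not document them, they can at least be decoded for inspection:

from datetime import datetime, timezone

client_timestamps = {"updated_by_human_at": 1556991479, "lyrics_updated_at": 1560888793}

# Interpreting the values as seconds since the Unix epoch (UTC); both decode
# to dates in 2019, around the time of this thread.
for name, ts in client_timestamps.items():
    print(name, datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())

Note that they sit on the annotatable (the lyrics), not on individual annotations, so they would not help with sorting annotations by date.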

Perhaps we could test the first annotation of a famous song that is not likely to change.

johnwmillr commented 5 years ago

@ludehon, thanks for putting this together! I've cleaned up the methods and added tests based on what you've put together. I'm relying on consistent responses from the API for the tests and am happy with that approach until it breaks... 🙄 Going forward, I'd love to have tests using the vcrpy module like @jessestuart suggested.

ludehon commented 5 years ago

@johnwmillr I've been kind of busy lately, but I will have more free time now. I will look into vcrpy for consistent tests!

Should I delete this fork and create another one for the tests? I'm not sure what to do...

johnwmillr commented 5 years ago

Yes, your code will be more consistent if you pull in changes from the master branch. You shouldn't have to create a new fork to do this, but that may be the easiest route. It'd be great if you could work on the vcrpy tests!

ludehon commented 5 years ago

Hi again @johnwmillr, I looked at vcrpy and tested the vcr decorator for consistent tests. It seems to work well, since it stores the cached data (the JSON response) in files.

I pushed the changes to my fork; tell me if it's OK.

johnwmillr commented 5 years ago

Yes, this looks like a good step forward. Are the VCR cache files stored locally? Are they copies of the network traffic?

ludehon commented 5 years ago

By using @vcr.use_cassette() without a filename parameter, the cassette file is stored locally next to the script that calls the decorated function. For example, when testing

@vcr.use_cassette()
def test_get_song_annotations(self):

the file tests/test_get_song_annotations is produced, and it contains the HTTP GET data that was received (see the attached file; I had to change the extension because of GitHub).
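For what it's worth, vcr.use_cassette() also accepts an explicit path, so the cassette can live in a dedicated fixtures directory with a normal extension rather than next to the test module. A small sketch (the path and class name are made up):

import unittest

import vcr

class TestAnnotations(unittest.TestCase):
    # An explicit filename puts the cassette where we choose, and the .yaml
    # extension avoids having to rename the file before attaching it anywhere.
    @vcr.use_cassette("tests/fixtures/test_get_song_annotations.yaml")
    def test_get_song_annotations(self):
        ...  # same test body as before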

johnwmillr commented 5 years ago

@ludehon, you can go ahead and open a PR that adds vcrpy functionality.