CloudFlare protection on spotifycharts.com

kelvingakuo / fycharts

Unofficial Spotify Charts API. Get any and all data for top 200 and viral 50 music on Spotify. 27th Apr 2021 Update - fycharts returns empty dataframes due to CloudFlare protection on spotifycharts.com

https://pypi.org/project/fycharts/

MIT License

53 stars 8 forks

CloudFlare protection on spotifycharts.com #6

Open kelvingakuo opened 3 years ago

kelvingakuo commented 3 years ago

spotifycharts.com now has CloudFlare DDoS protection on its /download page, so fycharts generates an empty dataframe for any date-chart-region combination.
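One way a client could surface this failure instead of silently returning an empty dataframe is to check that the downloaded body looks like CSV before parsing it. The sketch below is not fycharts code; the names `parse_chart_csv` and `ChallengePageError` are hypothetical, and it assumes that a challenge page arrives as an HTML document (so the body starts with `<`). The column names in the usage note are made up for illustration.

```python
import csv
import io
from typing import Dict, List


class ChallengePageError(ValueError):
    """Raised when a response body looks like HTML rather than CSV."""


def parse_chart_csv(body: str) -> List[Dict[str, str]]:
    """Parse a CSV body into a list of row dicts, rejecting empty or HTML bodies."""
    stripped = body.lstrip()
    # Assumption: a protection/challenge page is an HTML document,
    # so its body starts with "<" once leading whitespace is removed.
    if not stripped or stripped.startswith("<"):
        raise ChallengePageError("response is not CSV; possibly a challenge page")
    return list(csv.DictReader(io.StringIO(stripped)))
```

Calling `parse_chart_csv("Position,Track Name\n1,Song A\n")` yields one row dict, while passing an HTML body raises `ChallengePageError`, which makes the blocked case distinguishable from a chart that is genuinely empty.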

Idea: Use cloudscraper

kelvingakuo commented 3 years ago

Welp! cloudscraper needs to be paid for. Implement a custom solution? Scrape spotifycharts.com's HTML? Find the playlists on Spotify via the API?

czoka commented 3 years ago

@kelvingakuo through their website you can only find Top 50 not Top 200 playlists. Do you have any idea how to get URLs for Top 200 playlists?

kelvingakuo commented 3 years ago

They don't publish top 200, so no way to get those @czoka.

gitstelle commented 3 years ago

Hey! I just ran this code in JupyterLab and it returned a table with the charts. I didn't have to pay anything for cloudscraper, but this is my first time using it. I don't really know how fycharts works, but I was wondering whether cloudscraper could be used to fix the issue. Thanks!

import cloudscraper
import pandas as pd

scraper = cloudscraper.create_scraper()  # returns a CloudScraper instance
r = scraper.get("https://spotifycharts.com/regional/no/daily/2020-04-17")
df_list = pd.read_html(r.text)  # parses all the tables on the page into a list
df = df_list[0]
df.head()

kelvingakuo commented 3 years ago

Cloudflare protection is on the /download page of spotifycharts.com.

Go to spotifycharts.com
At the top right, click on Download To CSV
It'll open a new page with CloudFlare checks, then will download a CSV

fycharts downloads the CSV file returned by that page. Your code works because it parses the HTML directly. cloudscraper isn't actually needed for this. It looks like this is the best way forward.

czoka commented 3 years ago

@esthoop I was doing the same: reading and processing the HTML page. After a while the server's IP was probably flagged, and now I'm getting the same CloudFlare protection there too, not only on the /download page. I suggest you don't base your project on this solution, because mine failed because of it. The time and number of requests before the protection kicks in are inconsistent: one server ran for a month, another for a few days, and my dev machine is still not blocked, even though running tests means it requests as much data in a day as the server did in about two weeks.

czoka commented 3 years ago

@kelvingakuo I know they are not displayed in search results and are not publicly shared, but from a developer's standpoint they must have a Top 200 playlist similar to the ones with 50 songs, with spotifycharts being just a history log of that playlist. Or maybe the Top 50 is a truncated version of the Top 200 that only users with specific rights can access completely. This is only speculation and I don't have any proof, but I think this would have been the easiest way to implement it.

kelvingakuo commented 3 years ago

I had a theory that they'll also make scraping the HTML hard. Looks like they already did that! It doesn't make sense to me why they would deprecate the charts API from the endpoint, and then add protections to the only other option. SMH

Regarding the Top 200, I also think the Top 50 is a truncation of the Top 200. But this gets updated daily, so how would we go about generating our own, or figuring out, the Top 200, @czoka?

czoka commented 3 years ago

@kelvingakuo currently I'm scraping the page using a custom-made Chrome extension. This lets the page load all the content, including elements loaded dynamically by JavaScript, and then reads the relevant data from the page. EDIT: This is sadly not a viable solution for this repository.

kelvingakuo commented 3 years ago

fycharts downloads the CSV via the download button. The CSV contains the Spotify URI of each track. From there, I extract the ID.
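As a rough illustration of that extraction step (not the actual fycharts implementation), the sketch below pulls the ID out of a track reference. It assumes the reference is either a `spotify:track:<id>` URI or an `https://open.spotify.com/track/<id>` URL; the function name `extract_track_id` and the ID `abc123` are hypothetical.

```python
from urllib.parse import urlparse


def extract_track_id(ref: str) -> str:
    """Return the track ID from a Spotify track URI or open.spotify.com track URL."""
    ref = ref.strip()
    # URI form, e.g. "spotify:track:abc123"
    if ref.startswith("spotify:track:"):
        return ref.split(":")[2]
    # URL form, e.g. "https://open.spotify.com/track/abc123?si=..."
    parsed = urlparse(ref)
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc == "open.spotify.com" and len(parts) >= 2 and parts[-2] == "track":
        return parts[-1]
    raise ValueError(f"not a Spotify track reference: {ref!r}")
```

Both `extract_track_id("spotify:track:abc123")` and `extract_track_id("https://open.spotify.com/track/abc123")` return `"abc123"`; anything else raises `ValueError` rather than returning a guess.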

On Sun, Jun 13, 2021 at 10:15 PM Matt Beeman @.***> wrote:

@kelvingakuo How does fycharts get the Spotify ID for each of the tracks from the charts page? I looked through the HTML and don't see any ID associated with each of the tracks, just the information we can already see on the page, like rank, song, artist, and number of streams. Thank you in advance!


ooyamatakehisa commented 3 years ago

@kelvingakuo I found a way to get spotify_ids from the global chart! https://github.com/spotify/web-api/issues/33#issuecomment-444423247 As that issue says, I requested https://api.spotify.com/v1/playlists/37i9dQZEVXbMDoHDwVN2tF (37i9dQZEVXbMDoHDwVN2tF is the playlist ID for the global chart) and got a response!
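A minimal sketch of that request, using only the standard library, might look like the following. It is not the code from the linked repository. It assumes you already hold a valid Web API access token (the `access_token` argument is a placeholder) and that the playlist response nests tracks under `tracks.items[].track.id`; the helper names `fetch_playlist` and `extract_track_ids` are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List

PLAYLIST_URL = "https://api.spotify.com/v1/playlists/{playlist_id}"


def fetch_playlist(playlist_id: str, access_token: str) -> Dict[str, Any]:
    """Request a playlist from the Spotify Web API and return the decoded JSON."""
    request = urllib.request.Request(
        PLAYLIST_URL.format(playlist_id=playlist_id),
        # Assumption: the caller obtained this bearer token beforehand.
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def extract_track_ids(playlist: Dict[str, Any]) -> List[str]:
    """Collect track IDs, skipping entries with no track or no ID."""
    ids = []
    for item in playlist.get("tracks", {}).get("items", []):
        track = item.get("track")
        if track and track.get("id"):
            ids.append(track["id"])
    return ids
```

With a token in hand, `extract_track_ids(fetch_playlist("37i9dQZEVXbMDoHDwVN2tF", token))` would return the IDs from whatever page of tracks the response contains. Keeping the parsing separate from the network call means it can be checked against a saved response without hitting the API.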

I implemented it here (https://github.com/ooyamatakehisa/bpm-searcher/blob/main/interactor/ranking_interactor.py) and you can check the result here (https://bpm-searcher.herokuapp.com/api/v1/ranking or https://bpm-searcher.herokuapp.com/). You can also refer to this README. I hope this helps you!