Open kelvingakuo opened 3 years ago
Welp! cloudscraper needs to be paid for. Implement a custom solution? Scrape spotifycharts.com's HTML? Find the playlists on Spotify via the API?
@kelvingakuo through their website you can only find Top 50 not Top 200 playlists. Do you have any idea how to get URLs for Top 200 playlists?
They don't publish top 200, so no way to get those @czoka.
Hey! I just ran this code in JupyterLab and it returned a table with the charts. I didn't have to pay anything for cloudscraper, but this is my first time using it. I don't really know how fycharts works, but I was wondering whether cloudscraper could really be used to fix the issue. Thanks!
import cloudscraper
import pandas as pd

scraper = cloudscraper.create_scraper()  # returns a CloudScraper instance
r = scraper.get("https://spotifycharts.com/regional/no/daily/2020-04-17")
df_list = pd.read_html(r.text)  # parses every table on the page into a list of DataFrames
df = df_list[0]
df.head()
Cloudflare protection is on the /download page of spotifycharts.com.
Download To CSV
fycharts downloads the CSV file returned by that page. Your code works because it parses the HTML directly; cloudscraper isn't actually needed for that. This looks like the best way forward.
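For anyone who wants to try the HTML-parsing route without extra dependencies, here is a minimal sketch that uses only the standard library's `html.parser` as a stand-in for `pd.read_html`. The sample markup and column names are hypothetical; the real spotifycharts.com page may be structured differently, and this does nothing about Cloudflare.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup, only loosely modelled on a chart table.
SAMPLE_HTML = """
<table>
  <tr><th>Position</th><th>Track</th><th>Streams</th></tr>
  <tr><td>1</td><td><strong>Song A</strong> by Artist A</td><td>1,234</td></tr>
  <tr><td>2</td><td><strong>Song B</strong> by Artist B</td><td>987</td></tr>
</table>
"""

rows = parse_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

In practice you would pass `r.text` from the request to `parse_table` instead of `SAMPLE_HTML`. If the page comes back as a Cloudflare challenge, the result will most likely be an empty list rather than chart rows.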
@esthoop I was doing the same: reading and processing the HTML page. After a while the server's IP was probably flagged, and now I'm getting the same Cloudflare protection there too, not only on the /download page. I suggest you don't base your project on this solution, because mine failed for this reason. The time and number of requests before the protection kicks in are also inconsistent: one server ran for a month, another for a few days, and my dev machine is still not blocked even though running tests means it requests as much data in a day as the server did in about two weeks.
@kelvingakuo I know they are not displayed in search results and are not publicly shared, but from a developer's standpoint they must have a Top 200 playlist similar to the ones with 50 songs, and spotifycharts is just a history log of that playlist. Or maybe the Top 50 is a truncated version of the Top 200 that only users with specific rights can access in full. This is only speculation and I don't have any proof, but I think it would have been the easiest way to implement it.
I had a theory that they'll also make scraping the HTML hard. Looks like they already did that! It doesn't make sense to me why they would deprecate the charts API from the endpoint, and then add protections to the only other option. SMH
Regarding the Top 200, I also think the Top 50 is a truncation of the Top 200. But it gets updated daily, so how would we go about generating our own or figuring out the Top 200, @czoka?
@kelvingakuo currently I'm scraping the page using a custom-made Chrome extension. It lets the page load all of its content, including elements loaded dynamically by JavaScript, and then reads the relevant data from the page. EDIT: This is sadly not a viable solution for this repository.
fycharts downloads the CSV via the download button. The CSV contains the Spotify URI of each track. From there, I extract the ID
On Sun, Jun 13, 2021 at 10:15 PM Matt Beeman @.***> wrote:
@kelvingakuo How does fycharts get the Spotify ID for each of the tracks from the charts page? I looked through the HTML and don't see any ID associated with each of the tracks, just the information we can already see on the page, like rank, song, artist, and number of streams. Thank you in advance!
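To illustrate the "extract the ID from the URI" step described in the reply above, here is a rough sketch. It is not fycharts' actual code. The column name `URL`, the sample rows, and the placeholder IDs are all assumptions; the pattern accepts both the `spotify:track:<id>` form and the `https://open.spotify.com/track/<id>` form.

```python
import csv
import io
import re

# Matches "spotify:track:<id>" and "https://open.spotify.com/track/<id>".
TRACK_ID_RE = re.compile(
    r"(?:spotify:track:|open\.spotify\.com/track/)([A-Za-z0-9]+)"
)


def track_id(uri):
    """Return the track ID embedded in a Spotify URI or URL, or None."""
    match = TRACK_ID_RE.search(uri)
    return match.group(1) if match else None


def ids_from_csv(text, column="URL"):
    """Read CSV text and return the track ID found in each row's `column`."""
    reader = csv.DictReader(io.StringIO(text))
    return [track_id(row[column]) for row in reader]


# Hypothetical CSV: the header names and the IDs are placeholders, not real data.
SAMPLE_CSV = """Position,Track Name,Artist,Streams,URL
1,Song A,Artist A,1234,https://open.spotify.com/track/AAAAAAAAAAAAAAAAAAAAA1
2,Song B,Artist B,987,spotify:track:BBBBBBBBBBBBBBBBBBBBB2
"""

print(ids_from_csv(SAMPLE_CSV))
```

A regular expression is used here instead of splitting on `/` or `:` so that a trailing query string (for example `?si=...`) does not end up in the ID.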
@kelvingakuo I found the way to get spotify_ids from the global chart!
https://github.com/spotify/web-api/issues/33#issuecomment-444423247
As this issue says, I requested https://api.spotify.com/v1/playlists/37i9dQZEVXbMDoHDwVN2tF (37i9dQZEVXbMDoHDwVN2tF is the playlist ID for the global chart) and got a response!
I implemented it here (https://github.com/ooyamatakehisa/bpm-searcher/blob/main/interactor/ranking_interactor.py) and you can check the result here (https://bpm-searcher.herokuapp.com/api/v1/ranking or https://bpm-searcher.herokuapp.com/). You can also refer to this README. I hope this helps you!
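For reference, the playlist request described above could look roughly like the sketch below. It is not the linked implementation. It assumes you already have a valid OAuth access token (obtaining one is not shown) and that the response nests tracks under `tracks` → `items` → `track`, as the Web API playlist object did when this was written. The function names are made up for illustration.

```python
import json
import urllib.request

API_URL = "https://api.spotify.com/v1/playlists/{playlist_id}"
GLOBAL_CHART_PLAYLIST_ID = "37i9dQZEVXbMDoHDwVN2tF"  # playlist ID quoted in the comment above


def fetch_playlist(playlist_id, access_token):
    """GET the playlist JSON. `access_token` must be a valid bearer token."""
    request = urllib.request.Request(
        API_URL.format(playlist_id=playlist_id),
        headers={"Authorization": f"Bearer {access_token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_track_ids(playlist):
    """Pull track IDs out of a playlist response (assumed shape, see above)."""
    ids = []
    for item in playlist.get("tracks", {}).get("items", []):
        track = item.get("track")
        # Skip entries with no track object or no ID.
        if track and track.get("id"):
            ids.append(track["id"])
    return ids


# Hand-written stand-in for a response, with placeholder IDs.
SAMPLE_RESPONSE = {
    "tracks": {
        "items": [
            {"track": {"id": "AAAAAAAAAAAAAAAAAAAAA1", "name": "Song A"}},
            {"track": None},
            {"track": {"id": "BBBBBBBBBBBBBBBBBBBBB2", "name": "Song B"}},
        ]
    }
}

print(extract_track_ids(SAMPLE_RESPONSE))
```

With a real token, `extract_track_ids(fetch_playlist(GLOBAL_CHART_PLAYLIST_ID, token))` would be the actual call. Note that, per the earlier comments, the public playlists only cover the Top 50, not the Top 200.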
spotifycharts.com now has Cloudflare DDoS protection on its /download page, which leads to fycharts generating an empty dataframe for any date-chart-region combination.
Idea: Use cloudscraper