rembo10 / headphones

Automatic music downloader for SABnzbd
GNU General Public License v3.0

musicbrainz issue? #445

Closed dny238 closed 12 years ago

dny238 commented 12 years ago

Am I the only one with issues? Worked yesterday just fine...

2012-01-08 18:15:09 INFO Fetch failed, try refreshing. (1c3c7d51-4679-476c-8cb0-580a35da018b) is already in the database. Updating 'have tracks', but not artist information

2012-01-08 18:15:03 WARNING Attempt to query MusicBrainz for Saving Abel failed: HTTP Error 503:

FrAllard commented 12 years ago

Trying Headphones for the first time and I have the same issue...

I'm on the latest update proposed by Headphones when I started it.

I'm on Windows if that matters.

dny238 commented 12 years ago

Must be something new with MusicBrainz. Bummer.

dny238 commented 12 years ago

2nd question: Did it work for a while, then stop? I was wondering if they banned me because I had it process a lot of music in a short period of time.

FrAllard commented 12 years ago

I think that's the case... I too started to fill the database with what I already have... Then I saw that warning... Looking at the issues I found the following link that describes the rate limiting imposed by MusicBrainz. http://musicbrainz.org/doc/XML_Web_Service%2FRate_Limiting

dny238 commented 12 years ago

Seems like the system is throttling the requests to one every 5 seconds. Do you think the user agent is missing or has been banned?

09-Jan-2012 10:13:27 - WARNING :: CP Server Thread-10 : Attempt to retrieve information from MusicBrainz for release group "30890247-9867-44ce-ae0e-091dbfe935e2" failed. Sleeping 5 seconds

WARNING:headphones:CP Server Thread-10 : Attempt to retrieve information from MusicBrainz for release group "30890247-9867-44ce-ae0e-091dbfe935e2" failed. Sleeping 5 seconds

09-Jan-2012 10:13:32 - WARNING :: CP Server Thread-10 : Attempt to retrieve information from MusicBrainz for release group "30890247-9867-44ce-ae0e-091dbfe935e2" failed. Sleeping 5 seconds

WARNING:headphones:CP Server Thread-10 : Attempt to retrieve information from MusicBrainz for release group "30890247-9867-44ce-ae0e-091dbfe935e2" failed. Sleeping 5 seconds

dny238 commented 12 years ago

After leaving the system off for about 30 minutes, I got an idea and changed the user agent. Seems to be working again for now... (see the 2 below)

    def _openUrl(self, url, data=None):
        userAgent = 'python-headphones2/' + musicbrainz2.__version__
        req = urllib2.Request(url)
        req.add_header('User-Agent', userAgent)
        return self._opener.open(req, data)

If you can confirm that this works for you, then it seems like the main code needs to be tweaked to process your existing library slower. I'm sure a few well placed 'sleep' functions would do the job.
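
A minimal sketch of that kind of self-throttling, written in Python 3 rather than the Python 2 of `webservice.py`. The `Throttle` class and its parameter names are hypothetical and not part of Headphones; the one-second default is only an assumption taken from the one-request-per-second guidance on the MusicBrainz rate-limiting page.

```python
import time


class Throttle:
    """Space out calls so they are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the behaviour can be tested
        self._sleep = sleep
        self._last_call = None   # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `wait()` on one shared `Throttle` just before `self._opener.open(req, data)` in `_openUrl` would be one possible place to apply it.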

FrAllard commented 12 years ago

It will probably get banned too. I'm trying to set up a local mirror, which would solve my problem... The problem is that they supply a VirtualBox image for easy setup, but I don't use VirtualBox, so I'm working on converting it to a Hyper-V image.

dny238 commented 12 years ago

we shouldn't have to do that. 10 queries a second seems like a lot to me. The code should rate limit itself better so this doesn't happen. I'll let you know if I get banned.

we should leave this issue open for him to fix.

FrAllard commented 12 years ago

That tells me that MusicBrainz just banned the User-Agent used by Headphones. They found that the users of Headphones were using too much processing power and they shut us down... The way to go is running a local replication server of MusicBrainz; that way you'll never get banned! You receive hourly updates from the site using their VM.

joezorry commented 12 years ago

If I'm not entirely wrong here, if everyone had their own unique userAgent it's going to work for that person. In musicbrainz web service rules it states that a CLIENT should not make a request more than one time per second, but since headphones uses the same userAgent musicbrainz thinks this is from the same client. This is the problem, no?

Is it morally wrong to randomize a userAgent? I'm a beginner in this sort of thing so it is a genuine question.

FrAllard commented 12 years ago

No, all Headphones users should use the same User-Agent... Think about it: every Firefox has the same User-Agent! They just banned the User-Agent that Headphones uses by default, probably because the software hammers their servers too much...

joezorry commented 12 years ago

Thanks for the response. But then the problem can only be fixed by cooperating with musicbrainz, or by setting up a similar service in some way (through cloning, as suggested above), or maybe by limiting the queries for every user, which just pushes the problem further away. Or is there another way?

dny238 commented 12 years ago

MusicBrainz bans on a combination of User-Agent and IP address.

FrAllard commented 12 years ago

They also have a global limit of 2,500 queries in 10 seconds... But I doubt that it's the problem right now for Headphones users...

Quote from their site: Global rate limit: If the total number of requests coming in to MusicBrainz exceeds our global rate limit in 10 seconds, all requests are rejected with a 503 error. This continues until the total count of requests in the last 10 seconds drops below the global rate limit. The current rate limit is set at 2,500 requests per 10 seconds. (equivalent to 250 requests per second)
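
Since a 503 here means "slow down", a client can wait and retry instead of failing outright. A rough Python 3 sketch of that idea follows. `fetch_with_backoff` and its parameters are hypothetical names, the 5-second first delay mirrors the "Sleeping 5 seconds" log lines earlier in this thread, and doubling the delay is an assumption rather than anything MusicBrainz prescribes.

```python
import time
import urllib.error


def fetch_with_backoff(fetch, retries=4, first_delay=5.0, sleep=time.sleep):
    """Call fetch(); on HTTP 503, wait and retry with a doubling delay."""
    delay = first_delay
    for attempt in range(retries + 1):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            # Only a 503 (rate limited) is retried. Any other error, or a 503
            # after the last allowed retry, is passed on to the caller.
            if err.code != 503 or attempt == retries:
                raise
            sleep(delay)
            delay *= 2
```

Here `fetch` stands for any zero-argument callable that performs one web service request, for example a small wrapper around the existing lookup code.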

dny238 commented 12 years ago

Ok, just tested it. I agree they've banned the user-agent for everyone. I set up a new install on my own computer at the office, new IP.

rembo10 needs to contact MusicBrainz about fixing it. In the meantime you can hack the user-agent as described above to get around the problem, or try to get your own mirror set up as MageMinds is discussing. Presumably there will be more to do, since the app is hardcoded to go to the real MusicBrainz servers and you'll have to override DNS.

09-Jan-2012 14:00:33 - INFO :: Thread-12 : Now adding/updating: Pink
09-Jan-2012 14:00:33 - WARNING :: Thread-12 : Attempt to retrieve information from MusicBrainz for release group "771eba59-271c-4724-b952-7a0ce3ffac7c" failed. Sleeping 5 seconds
09-Jan-2012 14:00:39 - INFO :: Thread-12 : Unable to get release information for Cyber - there may not be any official releases in this release group
09-Jan-2012 14:00:39 - INFO :: Thread-12 : Updating complete for: Pink
09-Jan-2012 14:00:39 - WARNING :: Thread-12 : Attempt to query MusicBrainz for The Black Eyed Peas failed:

markzw commented 12 years ago

Where can I find the code where the userAgent is defined?

Just changing the text will do the trick?

dny238 commented 12 years ago

Yes, it fixes it.

/lib/musicbrainz/webservice.py

    def _openUrl(self, url, data=None):
        userAgent = 'python-headphones2/' + musicbrainz2.__version__
        req = urllib2.Request(url)
        req.add_header('User-Agent', userAgent)
        return self._opener.open(req, data)

markzw commented 12 years ago

Thanks. I found it yesterday and noticed some things that rembo10 might want to change in the next release:

  1. When users start filling their database, they'll most likely generate a huge database of wanted albums. This will cause huge amounts of (recurring) traffic on the MusicBrainz web service as well as on the web services of the search providers. I would control this traffic by spreading the searches out better and capping the number of automated searches per batch.
  2. When you add an album, HP will first search the providers for NZBs and after that return the results to the web interface, which causes long waiting times. Why not just change the album to 'wanted' and let a bot in the background search for the NZBs and change the status to 'snatched' when found?

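
The second suggestion could be sketched roughly like this in Python 3. Everything here is hypothetical and not Headphones code: `start_search_worker`, `add_album`, the `statuses` dictionary and the `search` callable are illustrative names, and `search` is assumed to return `True` when an NZB was found.

```python
import queue
import threading


def start_search_worker(search, statuses):
    """Start a daemon thread that searches for queued albums one at a time."""
    pending = queue.Queue()

    def worker():
        while True:
            album = pending.get()
            try:
                # `search` is a hypothetical callable: True means an NZB was found.
                if search(album):
                    statuses[album] = "snatched"
            finally:
                pending.task_done()

    threading.Thread(target=worker, daemon=True).start()
    return pending


def add_album(album, statuses, pending):
    """Mark the album as wanted and return at once; the worker searches later."""
    statuses[album] = "wanted"
    pending.put(album)
```

With this shape the web interface only has to call `add_album`, so it no longer waits on the search providers.
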
dny238 commented 12 years ago

might be worth putting in new feature requests for those.

It doesn't seem like the system caches any of the requests to MusicBrainz. It's possible that a simple caching proxy on the internet that he sponsors would stave off the load.

I can imagine that the nature of the project might not gain it the kind of support from MusicBrainz that we'd appreciate.

FrAllard commented 12 years ago

Just so you know, I tested a local mirror of MusicBrainz and it doesn't work; it seems the supplied API is not exactly the same as the original site's. HP was getting errors when accessing my musicbrainz server... I didn't look into what the problem is though, I just removed my DNS entry to point back to the original site and stopped my local copy.

tfcollins commented 12 years ago

I am also getting "Attempt to retrieve information from MusicBrainz for release group "35bc29f9-4c09-4d07-840a-01ce4b2d4865" failed. Sleeping 5 seconds" for every album I try to process.

janww commented 12 years ago

Changing the userAgent and getting a new IP definitely fixes this. But you should try to seek a solution that sticks to the rules of Musicbrainz...

BTW: y u no using discogs? :-)

dny238 commented 12 years ago

This user-agent thing is simply a workaround, so this is still a bug report. Is this project still under active development? It's very promising.

Dark2004 commented 12 years ago

Same painful issue:

2012-01-16 10:28:12 WARNING Error fetching artist info. ID: 3414d446-735a-443c-931f-10634f57e5b9
2012-01-16 10:28:07 WARNING Attempt to retrieve artist information from MusicBrainz failed for artistid: 3414d446-735a-443c-931f-10634f57e5b9. Sleeping 5 seconds

Hope rembo will find a quick solution for that

ianmcorvidae commented 12 years ago

I'm loosely affiliated with musicbrainz; to give you guys an idea of the issue here, the requests coming from Headphones were causing a number of other things, including the musicbrainz web interface (where people enter the data you use!) to give similar errors, not to mention any other application using the musicbrainz webservice. Thus, you got throttled. Not blocked, note, although initially there was a full-fledged block during some hours of the day, but simply rate-limited. However, I suspect most users of headphones see it as a block, since the number of requests to the musicbrainz webservice headphones generates is rather exorbitant.

There's a few things you can do to help fix this.

One, as mentioned in this thread, is caching; from what I can figure out, every headphones install tries to refresh every artist that's been added to it, once a day. Keeping this data a little longer in the application would help a lot (perhaps simply allowing manual updates at a button-click would remove the need for automatic updating of everything?), as would a caching proxy of some sort.

Two, local musicbrainz-server instances should work fine, although musicbrainz-server can be finicky to set up, and if you're using search queries, you'll still likely get rate-limited when you hit search.musicbrainz.org, unless you set up a search server as well. #musicbrainz-devel on freenode is usually available to help with server setup if there's confusion. I'm not sure what MageMinds was getting that was off, but nothing should be -- as I said, come ask if something's off.

Three, requests should probably be spread out more through the day; while this won't necessarily reduce the number of requests (see caching for the quickest way to do that), it'll mean there's a more even distribution -- which, with a certain number of hits per a given time period, obviously means more things make it through! You'll also see there's a pretty strong spike pattern in the graphs I link below -- clearly headphones could do something to improve this shape.

Four, figure out ways to just plain make fewer requests, if you can. I haven't looked through your code (although it is in fact on my list of things to do -- but that list is long, don't count on it happening soon!) but it's likely there's improvements that could be made.

Five, while harder to make happen, donations to MetaBrainz (the nonprofit that runs musicbrainz) do tend to lead to larger capacity generally. There have been discussions of ways to reward specific applications whose userbases donate more, but nothing has been implemented yet.

As for those graphs: Before we rate-limited python-musicbrainz/0.7.3 separately (later, the headphones UA was added to that same rate-limit; other things using python-musicbrainz/0.7.3 aren't substantial issues), our global rate-limit graph: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?View=-1&Template=1262469543.28951&Base=%2Fvar%2Fwww%2Fmrtg%2F%2Flenny_ratelimit-default-wsglobal-count.rrd&Start=end+-+2+months&End=now+-+1+month&Mode=view

The overall python-musicbrainz/0.7.3 rate-limit: http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Mode=view;Template=1262469543.28951;Base=%2Fvar%2Fwww%2Fmrtg%2F%2Flenny_ratelimit-default-wsuapy73-count.rrd

The python-headphones/0.7.3-specific rate-limit (which leaves actual rate-limiting to the above rate-limiter, hence no 'Refused'), just to show how much of our traffic is from headphones (noting, of course, that older versions of headphones without the new UA won't appear here): http://stats.musicbrainz.org/mrtg/drraw/drraw.cgi?Template=1262469543.28951&Base=%2Fvar%2Fwww%2Fmrtg%2F%2Flenny_ratelimit-default-wsuaheadphones73-count.rrd&Mode=view [edit: correct URL]

Let me also note again (I commented on an earlier Headphones issue/pull request) that ws/1, which is what you're using, is officially deprecated and you might want to update to using ws/2. python-musicbrainz-ngs here on github (and musicbrainzngs on pypi) is a way to do this; beets, which I see in your source tree, has also updated to use it.

rembo10 commented 12 years ago

Finally have some free time to fix this issue. Expect some updates over the next few days.

NoGood commented 12 years ago

Thanks Rembo10! Glad to see you have found some time.

The option to add your own list of backup Musicbrainz servers would be great.

Also, I think it would be wise to remove the daily check as it generates a lot of traffic. Perhaps it could be changed to a monthly check instead.

rembo10 commented 12 years ago

All good ideas. Maybe a global monthly update, and daily or weekly update if there is a change in the number of release groups or something. Def need to sort it out today

rembo10 commented 12 years ago

Some quick fixes that I've just pushed out:

- MB updates are no longer on a fixed time cron job (now an interval job based on when the program is started). Hopefully this will spread the requests out.
- Spread out the updates to 48 hours (will spread this out further if it continues to be a problem).
- Added a donate to MB link at the bottom.

I'm going to keep working on this to reduce the number of hits to MB. Instead of a full update, you'll see artists updating only if there is a change in the number of release groups, or if there is an upcoming album. Hopefully this will reduce the load and the 4am spikes.
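
For illustration only, the two ideas (spreading updates over a window, and skipping artists that have not changed) might look something like the Python 3 sketch below. This is not what the pushed commit does: `update_offset` swaps in a hash-based offset per artist as one alternative way to even out the load, and `needs_full_update` with its arguments is a hypothetical helper.

```python
import hashlib

UPDATE_WINDOW_SECONDS = 48 * 60 * 60  # the 48 hours mentioned above


def update_offset(artist_id, window=UPDATE_WINDOW_SECONDS):
    """Map an artist ID to a stable offset, in seconds, inside the window."""
    digest = hashlib.sha1(artist_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % window


def needs_full_update(known_release_groups, current_release_groups,
                      has_upcoming_album=False):
    """Refresh only when the release group count changed or an album is due."""
    return has_upcoming_album or known_release_groups != current_release_groups
```

Because the offset depends only on the artist ID, each artist keeps the same slot in every cycle, while different artists land at different points in the window.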

avjui commented 12 years ago

Perhaps it would fix the problem if every user had to create an account (I think it's free) and you added an option to set this in Headphones.

Dark2004 commented 12 years ago

Honestly, this issue is more and more painful... I get no updates for any artist I add. Maybe a solution is to have a local MB database that is replicated from the online version once per week, so that Headphones only queries the local MB database...

Rembo10 I really hope you will find a solution for this issue and if you need a tester just let me know.

FrAllard commented 12 years ago

This means that HP would have to do the replication by itself... Download the full dump from MB the first time, then apply replication packets. But I think you need a PostgreSQL server to follow the replication packets... That would be a great solution and a fast one since the database is local, but it would require a lot of coding in HP to get it to work.

Dark2004 commented 12 years ago

Yes, but I think this solution would prevent this horrible message: WARNING Attempt to retrieve artist information from MusicBrainz failed for artistid: 4593d49a-7f67-46ba-9ec0-126bd676286f. Sleeping 5 seconds

I can only offer my help with the testing part, as I am not an expert in development.

Rembo10 what do you think?

dny238 commented 12 years ago

If you can't wait for a real solution, change the user-agent to your name, or something unique.

rembo10 commented 12 years ago

Are you talking about keeping a local copy of the mb database? Or about having a dedicated headphones server that everyone will use?

There are some other solutions to this - like getting timestamps of when MB data was last modified - that I'm working on getting in there to cut down on the amount of requests headphones makes

dny238 commented 12 years ago

Why not just use a local cache of the data you are looking up? Then you can control the age of the files in the cache. You'd have less work to do and could replace the cache with something smarter or distributed later.
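
A small Python 3 sketch of such an age-limited cache, kept in memory for brevity (the suggestion above talks about files, so this is a simplification). `AgeLimitedCache`, `max_age` and `fetch` are hypothetical names, not Headphones code.

```python
import time


class AgeLimitedCache:
    """Keep looked-up values and refetch them only once they are too old."""

    def __init__(self, max_age, clock=time.time):
        self.max_age = max_age   # seconds a cached value stays valid
        self._clock = clock      # injectable so the behaviour can be tested
        self._entries = {}       # key -> (stored_at, value)

    def get(self, key, fetch):
        """Return the cached value for key, calling fetch(key) if it is stale."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]
        value = fetch(key)
        self._entries[key] = (now, value)
        return value
```

Keyed by artist or release group ID, with `fetch` doing the real MusicBrainz lookup, repeated refreshes inside `max_age` would not reach the web service at all.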

NoGood commented 12 years ago

Thanks again Rembo for your time. I see fewer timeouts in the log already. Could also be that HP is just querying less in the time that I let it run. MusicBrainz should implement an API and ban only those who query too often instead of banning the service. I guess I will ask them (too).

sbuser commented 12 years ago

I posted a feature request/issue that I think could help reduce traffic significantly to musicbrainz while also providing headphones users more flexibility when it comes to their collections. Let me know what you think: https://github.com/rembo10/headphones/issues/463

Dark2004 commented 12 years ago

@Rembo10, I'm talking about having a local copy of the MB database and replicating the data once a week. But it's clear that the best solution is to have a dedicated Headphones server that everyone can use (this solution would cost a little money, but I am sure that all Headphones users would chip in...).

@dny238, could you please explain to me how you changed the user-agent used by Headphones (as I have created my own account in MB)?

Thanks in advance for your reply.

dny238 commented 12 years ago

Dark, Look above. January 10th webservice.py...

NoGood commented 12 years ago

Rembo, your update works perfectly. The backup MB server is probably going to get hammered, but still it's a good temporary solution.

rembo10 commented 12 years ago

Got an "official headphones mirror" last night and finally figured out how to set it up :-). You can use that mirror freely - just set mirror to "headphones" in the config