microsoft / appcenter

Central repository for App Center open source resources and planning.
https://appcenter.ms
Creative Commons Attribution 4.0 International

CodePush servers slow and unstable #814

Open boennemann opened 5 years ago

boennemann commented 5 years ago

What App Center service does this affect? CodePush

Describe the bug: As described in https://github.com/microsoft/react-native-code-push/issues/1646 and https://github.com/microsoft/react-native-code-push/issues/1641, the CodePush servers are currently extremely slow and randomly returning server errors, breaking app startup for many workflows.

I'm reopening this issue here, because I fear that the react-native-code-push issue tracker is currently ignored (see the warning at the top of the README).

To Reproduce: See issues above.

Expected behavior: Consistent and fast server responses.

Additional context: This is a server-side problem.

alexcroox commented 5 years ago

Hopefully the +1s on the parent above are enough to show it's affecting others, but I too am seeing sporadic 403s and long "checking for update" delays in an app I'm about to release.

ghost commented 5 years ago

Is this being looked into by someone on the Appcenter team? This is an ongoing problem for us. Happy to help in any way I can.

boennemann commented 5 years ago

It looks like @blparr assigned @botatoes, so it seems like it's not been fully ignored at least … ? I wouldn't be too mad about at least a statement here though, as this is a massive server problem breaking real apps out there.

I realize that CodePush is a free service and that no one is obliged to provide anything here, but at least some clarity and transparency would be highly appreciated.

This is what you get for relying on free services ¯\_(ツ)_/¯ CodePush is amazing technology and becoming super critical for my workflows, so I'd rather pay than have the current situation.

Again. Any response would be highly appreciated. Even a #wontfix, so we can at least remove CodePush from our apps and look for alternatives.

blparr commented 5 years ago

Hi there, we try to respond to as many issues as we can, but unfortunately we can't always respond to all of them. I asked my colleague @botatoes to follow up with you.

Thanks!

botatoes commented 5 years ago

Hi All, our team is currently working on fixing some fundamental issues with CodePush that have historically made it hard for us to hold CodePush to the same bar we hold the rest of App Center services to. This is the first step to allow us to support more React Native in App Center, and the new version we are working on will resolve some of the most common performance issues we've historically seen. Unfortunately, this also meant we pulled the devs who were monitoring issues off that work so they could focus full time on these changes.

That being said, I will investigate this issue some more and see if we need to prioritize it over our current set of work. We are halfway done with the new version so please bear with us in the meantime.

boennemann commented 5 years ago

Thank you for your response!

Hi there, we try to respond as many issues as we can, but unfortunately we can't always respond to all - I asked my colleague @botatoes to follow up with you.

I fully understand that. This is however not a regular "issue", but more like a system outage. In this case it's even worse, because instead of immediately returning a 500 error (the app could quickly go on from here), your servers are doing something for up to half a minute. Just yesterday we had a record-breaking measurement of 28 seconds for checkUpdate.

Our preferred workflow would be to block the entire app until checkUpdate was successful, so we can immediately apply updates marked as mandatory. With the current behavior that's not only impossible, but as said it's blocking every single app start for around half a minute. There is also no configurable timeout option in the client.
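A minimal sketch of the startup flow described above, assuming the update check is passed in as a function. `gateStartup`, `checkForUpdate`, `applyUpdate`, and `startApp` are hypothetical names chosen for illustration, and the shape of the update object (an `isMandatory` flag) is an assumption rather than a statement about the client API.

```javascript
// Hypothetical startup gate: wait for the update check, apply mandatory
// updates before the app starts, and fall through if the check fails.
async function gateStartup({ checkForUpdate, applyUpdate, startApp }) {
  let update = null;
  try {
    // Assumed to resolve to null when the app is already up to date.
    update = await checkForUpdate();
  } catch (err) {
    // A failed check (e.g. a 403 or 5xx) should not lock users out.
    return startApp({ updated: false, reason: "check-failed" });
  }
  if (update && update.isMandatory) {
    await applyUpdate(update);
    return startApp({ updated: true, reason: "mandatory" });
  }
  return startApp({
    updated: false,
    reason: update ? "optional" : "up-to-date",
  });
}
```

A gate like this waits for as long as the check takes, so with check times of up to half a minute it is only practical if the check itself is bounded by some deadline.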

I will investigate this issue some more and see if we need to prioritize it over our current set of work

Thanks for looking into this. I'd strongly suggest giving this the highest priority, because – again – this feels more like a service outage to me.

Please forgive my insistence & frustration here – I really appreciate the service you're providing and how CodePush has tremendously simplified my life as a React Native developer.

If there is any insight or more specifics we can share please let us know. I'd be happy to provide you with as much information as possible.

Thank you very much for getting back to us and looking into the problem.

Best, Stephan

botatoes commented 5 years ago

@boennemann Are you available to get on a call with me and our CodePush devs so we can get a sense of the issue a bit better? I can send the invite to the email on your github profile.

boennemann commented 5 years ago

@botatoes Thanks for the invite. I will respond to that email now.

For the record, I'm starting to see an improvement in loading times, albeit not for all requests. Anyone else?

ghost commented 5 years ago

Thanks for looking into this @botatoes. I just tested the checkUpdate request ten times. This was done on an iPhone 6s (iOS 12) running a debug build of our app. Times are in seconds:

25 12 13 17 10 12 12 19 17 14

Anecdotally, I don't think we're seeing any improved performance. Thanks again for looking into this and let me know if I can be of any additional help.
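For anyone who wants to repeat this kind of measurement, here is a small sketch. `timeCall` and `summarize` are hypothetical helper names, and the function being timed is whatever performs the update check in your app; only the ten sample values come from the numbers reported above.

```javascript
// Time one async call. Resolves to the elapsed seconds whether the call
// succeeds or fails, since failed checks also cost the user time.
async function timeCall(fn) {
  const start = Date.now();
  try {
    await fn();
  } catch (err) {
    // Ignored on purpose: only the duration is of interest here.
  }
  return (Date.now() - start) / 1000;
}

// Summarize a non-empty list of latency samples (here: seconds).
function summarize(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const median =
    sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
  const mean = sorted.reduce((sum, x) => sum + x, 0) / sorted.length;
  return { min: sorted[0], max: sorted[sorted.length - 1], mean, median };
}

// The ten checkUpdate timings reported above, in seconds.
const reported = [25, 12, 13, 17, 10, 12, 12, 19, 17, 14];
console.log(summarize(reported));
// → { min: 10, max: 25, mean: 15.1, median: 13.5 }
```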

davidgruebl commented 5 years ago

Also just tested checkUpdate 10 times in a row. Measured results ranged from 202 milliseconds to 26.26 seconds. Three requests resulted in a 403 Forbidden.

I really appreciate you looking into this @botatoes.

lklepner commented 5 years ago

We are also experiencing a high rate of extremely slow codepush responses and have implemented a 1.5 second timeout to avoid having users wait excessively long for an update while booting the app.
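The thread does not show how this timeout was implemented. One common approach, sketched here under that caveat, is to race the update check against a timer and treat a slow or failed check as "no update". `withTimeout` and `checkWithDeadline` are hypothetical names, and `checkForUpdate` stands for whatever function performs the check in your app.

```javascript
// Resolve with `fallback` if `promise` has not settled within `ms`
// milliseconds. The underlying work is not cancelled; it is just ignored.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Treat a slow or failed update check as "no update available".
async function checkWithDeadline(checkForUpdate, ms = 1500) {
  try {
    return await withTimeout(checkForUpdate(), ms, null);
  } catch (err) {
    return null; // server error: boot the app without updating
  }
}
```

Because the race does not abort the request, the check may still complete in the background after the deadline; its result is simply discarded for that launch.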

Below is a chart of the percentage of users who fail to obtain metadata in this time, which in some cases is as high as 100%:

[image: chart]

The underlying data can be reviewed here - https://docs.google.com/spreadsheets/d/1sgpJTTmaTkaKM7L6GSUb_6cANBRJrZLVqdYpkg0WF4g/edit?usp=sharing

We have been considering putting a proxy between our app and the codepush metadata service as a means of providing faster, more reliable responses to our users, but of course we'd prefer not to go this route.

@botatoes Would it be possible for us to join the conference call?

codeflows commented 5 years ago

We started tracking CodePush update duration last Thursday after noticing it's been getting slower and slower. Median latency for the check is 1.5 s, but every day between approximately 12:00 and 16:00 (noon to 4pm) UTC the median latency jumps to ~5 s.

[image: Screenshot 2019-08-12 at 12 16 31]
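A per-hour breakdown like the one described here can be computed from raw samples with a few lines. This is only a sketch: `medianByUtcHour` is a hypothetical helper, and the sample shape (`timestamp` plus `seconds`) is an assumption, not the format used for the chart above.

```javascript
// Group latency samples by the UTC hour in which they were taken and
// return the median of each bucket, keyed by hour (0-23).
function medianByUtcHour(samples) {
  const buckets = new Map();
  for (const { timestamp, seconds } of samples) {
    const hour = new Date(timestamp).getUTCHours();
    if (!buckets.has(hour)) buckets.set(hour, []);
    buckets.get(hour).push(seconds);
  }
  const result = {};
  for (const [hour, values] of buckets) {
    values.sort((a, b) => a - b);
    const mid = Math.floor(values.length / 2);
    result[hour] =
      values.length % 2 === 0 ? (values[mid - 1] + values[mid]) / 2 : values[mid];
  }
  return result;
}
```

Using the median rather than the mean keeps a handful of very slow requests from dominating each hourly figure.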

lklepner commented 5 years ago

@botatoes Have you had a chance to discuss this problem with the codepush developers? If so are there any updates you can share with us?

ghost commented 5 years ago

@botatoes Has there been any progress on fixing this issue, or allocating resources to look into the problem? Thank you.

ekrapfl commented 5 years ago

I would just like to put in that I am investigating using CodePush for my company for an Ionic/Cordova app, and I am seeing the same issues pop up fairly frequently (really slow update checks, and frequent 403 errors). The service is pretty awesome, but it is a non-starter if we have these sorts of performance and reliability issues.

Is the major upgrade that is being worked on planning on addressing these sorts of issues, and is there any ballpark ETA on that?

Thanks!

JakubOleksy commented 5 years ago

@bluto56's team is currently working on this and he can offer a bit more on timeline, we're getting close.

ghost commented 5 years ago

Would it be possible to get an update on this in the next few days? Really appreciate you all looking into this and I understand if the resourcing isn't there to fix the stability issues at the moment. Just need some transparency to plan accordingly on how we will use Code Push. If you are planning to sunset the current implementation and no longer fix bugs, I think the right thing to do is announce that to the community.

botatoes commented 5 years ago

Quick update on this. We believe this problem is ultimately caused by the legacy service we are trying to deprecate. The problem mostly surfaces during heavy traffic hours. Our current set of work is to make a new CodePush service that is more scalable and resolve issues around uploading and latency.

We've completed work around uploads and CLIs for the new service. We are currently working on changes to the Cordova and React Native SDK to target the new service. After the SDK work, we have some work around migrating existing CLI / SDK calls to the new service and work around deprecating the legacy service. Once that's done, we'll roll out the new service and we believe it'll resolve this slow and unstable issue everyone on this thread is experiencing.

To be completely transparent, we have decided not to put a bandaid on the legacy service and to focus on completing the new service to get it out as soon as possible.

mattKorwel commented 5 years ago

Thanks @botatoes. Also, once the new versions of the CLI and SDKs roll out, anyone who starts using them should gain the benefits of the new system immediately, as these new versions will not point to the legacy service.

I realize this does not help existing clients who haven't updated or can't update (thus the work to redirect the existing services to the new one), but using the new SDK/CLI might help verify that the scale issues are indeed resolved in the meantime. Let me or @botatoes know if you'd like to be updated when the new versions are available, @Dnld et al.

ekrapfl commented 5 years ago

Thanks for the updates @bluto56 and @botatoes! Do you have any rough estimates on time frame for the new version? 1 week, 1 month, 6 months? Nothing I will hold anyone to, but my company is investigating CodePush vs Ionic Appflow right now, and these issues are obviously a big deciding factor for us.

Thanks!

alexcroox commented 5 years ago

Is this a problem that can be mitigated in the short term through vertical scaling of resources to handle those peak hours?

botatoes commented 5 years ago

@ekrapfl should be done in the next month or two 👀 but that's also super rough estimate. We're on the final 4 features, assuming we don't discover anything we weren't aware of already.

botatoes commented 5 years ago

@alexcroox we did dig a bit more into this and there's no easy way to do this without pulling off some people working on the new service for a while. We inherited the legacy service from another team at MS and have only recently decided to put more effort and improvements into CodePush. The old service was never optimized for scale, unfortunately :(

ekrapfl commented 5 years ago

Thanks for the insights @botatoes. Not to worry, I will not be holding you to that estimate, but I appreciate the transparency.

lpikora commented 5 years ago

Now it seems that your update servers are completely down :( just getting infinite checking for update or an error @botatoes

ghost commented 5 years ago

We're seeing the same.

lklepner commented 5 years ago

My team has stood up an in-house metadata service to handle these requests until the new version of Codepush is ready to go.

This was accomplished by forking the react-native-code-push repo and editing request-fetch-adapter.js so that requests to https://codepush.azurewebsites.net/updateCheck? are rewritten to point to our in-house API service.

On the server side we are obtaining a list of codepush releases from the AppCenter API https://api.appcenter.ms/v0.1/apps/{owner_name}/{app_name}/deployments/{deployment_name}/releases.

When clients send a metadata request to our in-house metadata service, we use the URL variables label and appVersion to check which codepush release (now stored in our db) is applicable to that client and return the relevant metadata.
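The two halves of this workaround could look roughly like the sketch below. It is not the code this team wrote: the in-house host name, the function names, and the simplified release fields (`label`, `targetAppVersion`) are all assumptions made for illustration, and the exact-match version check ignores version ranges.

```javascript
// Client side: point update-check requests at a self-hosted service.
const LEGACY_PREFIX = "https://codepush.azurewebsites.net/updateCheck?";
// Hypothetical host; replace with your own service.
const IN_HOUSE_PREFIX = "https://updates.example.com/updateCheck?";

function rewriteUpdateCheckUrl(url) {
  return url.startsWith(LEGACY_PREFIX)
    ? IN_HOUSE_PREFIX + url.slice(LEGACY_PREFIX.length)
    : url;
}

// Server side: pick the release to offer for a request.
// `releases` is assumed to be ordered oldest to newest and to use the
// simplified, hypothetical fields { label, targetAppVersion }.
function selectRelease(releases, { label, appVersion }) {
  const candidates = releases.filter((r) => r.targetAppVersion === appVersion);
  const latest = candidates[candidates.length - 1];
  if (!latest || latest.label === label) return null; // nothing newer to offer
  return latest;
}
```

Keeping the selection logic as a pure function over a locally stored release list means the update check no longer depends on the upstream metadata service being responsive at request time.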

JakubOleksy commented 5 years ago

We are investigating.

torontoerik commented 5 years ago

@lklepner did your metadata service fix this issue for you? Beyond checking for updates, is downloading a needed update also experiencing issues?

lklepner commented 5 years ago

@torontoerik From what we've experienced, the problem largely resides on the metadata end of things, and the bundle download itself happens fairly quickly. The self-hosted metadata service has significantly improved the situation for us.

codeflows commented 5 years ago

We also tackled this issue by writing a caching proxy for the checkUpdate API. Bundle downloads are not a problem since they seem to come straight from the Azure CDN.
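The core of such a caching proxy is a short-lived cache in front of the upstream call. The sketch below shows only that part, with the HTTP layer left out; it is not the proxy described in this comment. `createCachedCheck`, `fetchUpstream`, and the choice of cache key are assumptions.

```javascript
// Cache update-check responses per key for `ttlMs` milliseconds and share
// one in-flight upstream request among concurrent callers.
function createCachedCheck(fetchUpstream, ttlMs, now = Date.now) {
  const cache = new Map(); // key -> { expires, promise }
  return function cachedCheck(key) {
    const hit = cache.get(key);
    if (hit && hit.expires > now()) return hit.promise;
    const promise = fetchUpstream(key).catch((err) => {
      // Do not keep failures around; let the next caller retry.
      const current = cache.get(key);
      if (current && current.promise === promise) cache.delete(key);
      throw err;
    });
    cache.set(key, { expires: now() + ttlMs, promise });
    return promise;
  };
}
```

The key should be built only from the parameters that determine the answer (for example deployment key, app version, and package hash); a per-device identifier in the key would defeat the cache.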

szh commented 5 years ago

The checkUpdate API seems to be back up

szh commented 5 years ago

@JakubOleksy Is the outage officially resolved?

bbialas commented 5 years ago

@JakubOleksy @botatoes https://codepush.azurewebsites.net/v0.1/public/codepush/update_check?deployment_key=XXX&app_version=1.4&package_hash=YYY&is_companion=false&client_unique_id=ZZZ

This endpoint returns 404... Is it possible to add any redirect on your side to use the latest appcenter domain https://codepush.appcenter.ms/ ??

I created a separate ticket for that because I believe it's a critical issue... :( https://github.com/microsoft/cordova-plugin-code-push/issues/567

----EDIT---- Issue fixed!

botatoes commented 5 years ago

Hi all, we have released a new version of the SDK. We're currently working on proxying the old API calls as part of a migration plan, but the slow and unstable server issue should be resolved if you upgrade to the newest SDK.

cjonsmith commented 5 years ago

Hey all, in an attempt to be more transparent about the issues with the existing Code Push service, I’d like to go over what we found while investigating the performance degradation of API requests made from our Cordova and React Native plugins/SDKs.

Since ~August 1st 2019, we’ve seen an increase in request durations and 4xx responses. At the time, we were unable to identify any issues in our service that could be the root cause of the failing requests. However, on August 22nd we experienced a complete outage that caused our service to fail to process the vast majority of requests, see https://github.com/microsoft/code-push/issues/659. At this point, we identified that one of our agents had exhausted all of its TCP ports, presumably when attempting to request information from our backend. We also identified this agent was serving almost all requests made to the Code Push service, which leads us to believe that we are failing to round robin our requests efficiently. We were able to redeploy the service successfully, freeing up the opened ports and are seeing request times return to normal.

[image: request_duration] Read chart as date over average request duration of all client based API requests. Date range is ~Aug 1 - Aug 29
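For readers unfamiliar with the term, round-robin distribution means handing requests to instances in rotation so that no single one serves nearly everything. The sketch below is a generic illustration only; it says nothing about how the Code Push service is actually load-balanced, which this thread does not describe, and `createRoundRobin` is a hypothetical name.

```javascript
// Hand out backends in rotation so that load is spread evenly.
function createRoundRobin(backends) {
  if (backends.length === 0) throw new Error("at least one backend required");
  let next = 0;
  return function pick() {
    const backend = backends[next];
    next = (next + 1) % backends.length;
    return backend;
  };
}
```

With three instances, consecutive calls to `pick()` cycle through all three before returning to the first, which keeps the number of open connections per instance roughly equal.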

Going forward, we’ll monitor the service more closely knowing that it may not be configured properly to handle the current volume of requests it’s seeing for an extended period of time, but as @botatoes has pointed out, we’re focusing nearly all of our attention on migrating all of our clients to use our new Code Push service. We’ve recently released a new version of both our Cordova plugin and our React Native SDK that depend on our new service, which we have more control over and can offer more immediate support for. That said, we will continue to support the existing Code Push service where we can, but we recommend upgrading to the newest versions of our plugin and SDK at your earliest convenience.

We apologize for the slow turn around on this issue, but we have identified the problem and are aware of how to mitigate the issue should it resurface in the future. That said, at this time, we are not planning on attempting to correct the root cause, as we hope to be fully cut over to the new service before the issue can resurface and wish to invest as much time as possible to get there as soon as possible. Thank you for your patience while we work to migrate to our new Code Push service and for reporting this issue! Feel free to continue the conversation if you have any additional questions.

lklepner commented 4 years ago

We have upgraded to the new service and things are working well. Many thanks @cjonsmith, @botatoes, @JakubOleksy

torontoerik commented 4 years ago

@cjonsmith @botatoes

The old endpoint instability is back as of yesterday. Can you mitigate like you described in your Aug 29th comment?

Unfortunately, we still have some apps pointing to the old end point. We are pushing through updates but if you can help stabilize that would be great!

BbFGE commented 4 years ago

Hi, I have been getting a 503 error for nearly a day now, which is worrying as we are imminently launching a new app. I've seen mention that the new SDK fixes this, but I'm not sure what this means in terms of the code. We are using Cordova (via PhoneGap): does this mean adding a different CodePush plugin? We are using cordova-plugin-zip 3.1.0.

Any help gratefully received!

andrei-m-code commented 4 years ago

I've been seeing code-push being slow or not delivering latest version at all for quite some time and only updating after multiple retries/restarts. Please look into the issues. It affects ALL our clients - it's a huge concern for our business. Thank you.

dukeflyheli commented 4 years ago

@cjonsmith We are also seeing our apps get stuck in the current "checking for update". Is it possible for your team to repatch the issue as before? We have been unable to get a patch out to our mobile app. Also, are there any plans to add SLAs on these services? Our application has a large user base and we need to be able to rely on this service. Thanks

learnyst commented 4 years ago

@cjonsmith @botatoes we are using the latest version of react-native-code-push and are still facing slow responses. Also, in the App Center stats the rollback percentage is ~10%.

learnyst commented 4 years ago

@botatoes @cjonsmith any update on this issue?

foolishsailor commented 4 years ago

We are experiencing this currently as well for the last day - apps getting 503 Service unavailable error when checking for updates. We are using latest version of cordova code push.

AppCenter status shows no outages - any updates on this issue?

hamam99 commented 4 years ago

We have the same issue: 504 service unavailable. We are using RN.

vicary commented 3 years ago

My team is experiencing this issue in a PoC app. In the meantime we want to upgrade to a paid subscription to see if it improves (albeit not likely), but we are encountering https://github.com/microsoft/appcenter/issues/1759 and we can't even pay for it.

fauzymk commented 2 years ago

@lklepner Hi, may I ask how you implemented the timeout?

frozencap commented 1 year ago

It's been over 5 days now that the status page has been showing "Distribute is Experiencing Issues"

All of my apps using CodePush have become categorically unusable

If I release a new bundle and try and download it (4.5MB) over LTE, I get 10-15kbps download speeds.

I have seen other threads suggesting this may be the start of Microsoft sunsetting CodePush.

Could we get some update on the current status of the service AND the project ?

BbFGE commented 1 year ago

I have seen other threads suggesting this may be the start of Microsoft sunsetting CodePush.

I hope not! We have apps depending on this. Can anyone from Appcenter/Microsoft comment?