microsoft / chat-copilot

Initial Deployment Issues #488

Closed raffertyuy closed 11 months ago

raffertyuy commented 11 months ago

This may not be a bug, but I'm having trouble deploying the app from scratch. (Note: I was able to deploy successfully a month ago, so I can confirm my AAD configurations are working).

Help?

Attempt 1: Error using ./deploy-azure.ps1. This is my command:

./deploy-azure.ps1 -Subscription {VALUE} -DeploymentName sk-chatcopilot-20231010codebase -AIService AzureOpenAI -AIApiKey {VALUE} -AIEndpoint "https://resource.openai.azure.com/" -BackendClientId {VALUE} -FrontendClientId {VALUE} -TenantId common -ResourceGroup razcopilot-rg -Region eastus -WebAppServiceSku S1

This is the error message that I'm getting

{"status":"Failed","error":{"code":"DeploymentFailed","target":"/subscriptions/7308e0b7-489d-4f8b-80b7-832b0662d47d/resourceGroups/razcopilot-rg/providers/Microsoft.Resources/deployments/sk-chatcopilot-20231010codebase","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":[{"code":"BadRequest","target":"/subscriptions/7308e0b7-489d-4f8b-80b7-832b0662d47d/resourceGroups/razcopilot-rg/providers/Microsoft.Resources/deployments/sk-chatcopilot-20231010codebase","message":"{\r\n  \"Code\": \"BadRequest\",\r\n  \"Message\": \"Encountered an error (InternalServerError) from host runtime.\",\r\n  \"Target\": null,\r\n  \"Details\": [\r\n    {\r\n      \"Message\": \"Encountered an error (InternalServerError) from host runtime.\"\r\n    },\r\n    {\r\n      \"Code\": \"BadRequest\"\r\n    },\r\n    {\r\n      \"ErrorEntity\": {\r\n        \"Code\": \"BadRequest\",\r\n        \"Message\": \"Encountered an error (InternalServerError) from host runtime.\"\r\n      }\r\n    }\r\n  ],\r\n  \"Innererror\": null\r\n}"}]}}
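The useful part of a payload like this is buried: the inner "message" is itself a JSON document serialized as a string. Below is a minimal sketch, in Python and not part of the repo, of how one might unpack it. The function name flatten_arm_error is made up for illustration, and the traversal assumes only the key names visible in the payload above (error, details, code/Code, message/Message, ErrorEntity).

```python
import json


def flatten_arm_error(raw: str) -> list[tuple[str, str]]:
    """Return (code, message) pairs found in an ARM error payload, outermost first."""
    found: list[tuple[str, str]] = []

    def visit(node: dict) -> None:
        # Key casing differs between layers ("code" vs "Code"), so normalise it.
        lowered = {key.lower(): value for key, value in node.items()}
        code = lowered.get("code") or ""
        message = lowered.get("message") or ""

        # A message may itself be a serialized JSON object; unpack it if so.
        nested = None
        if isinstance(message, str) and message.lstrip().startswith("{"):
            try:
                nested = json.loads(message)
            except json.JSONDecodeError:
                nested = None

        if isinstance(nested, dict):
            visit(nested)
        elif code or message:
            found.append((str(code), str(message)))

        # Walk the child collections seen in the payload above.
        for child in lowered.get("details") or []:
            if isinstance(child, dict):
                visit(child)
        entity = lowered.get("errorentity")
        if isinstance(entity, dict):
            visit(entity)

    payload = json.loads(raw)
    visit(payload.get("error", payload))
    return found
```

Fed the payload above, this would surface the "InternalServerError ... from host runtime" text next to its code without reading through the escaped \r\n sequences by eye.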

Attempt 2: Using Deploy to Azure from this page

The API is running. (screenshot)

I believe the SWA is now hosted on the same endpoint as the API, but nothing shows up... (screenshot)

I also tried the SWA that is in the RG, but nothing is deployed there. (screenshots)

threegitty350 commented 11 months ago

I have also encountered deployment issues. They began several weeks ago: I can neither update an existing deployment nor deploy from scratch to Azure. I also hit the issue where the deployment could not find the Bing Search resource. After commenting out that part of the deployment, manually deploying the resource, and configuring the key, the deployment succeeds, but the app fails to start and shows the same screen as OP. Eventually it did start (not sure how, but it did after the weekend), but when I click on sign in with Microsoft it returns the error "AADSTS900144: The request body must contain the following parameter: 'client_id'."

Any thoughts? I have not been able to get a good deployment since Thursday the 21st, I believe.

TaoChenOSU commented 11 months ago

Hello @raffertyuy,

Thank you for opening the issue!

Our main branch is evolving quickly, and it is not guaranteed to be stable. Please use the deploy-to-azure button to deploy a stable version of Copilot Chat. Please note that if you deploy to an existing resource group, you may still see resources that are no longer needed as deploying doesn't wipe out existing resources.

TaoChenOSU commented 11 months ago

Hi @threegitty350,

Could you please open a new issue for the problem you encountered with the Bing Search resource?

glahaye commented 11 months ago

@raffertyuy Notwithstanding what Tao wrote above, the deployment script shouldn't fail. I'll take a look into this.

glahaye commented 11 months ago

OK. There are a few things to unpack here...

First, I'll assume you have the required permissions to deploy all the resource types needed for a Chat Copilot deployment.

One way to get more information on attempted deployments is to add the -DebugDeployment switch at the end of the deploy-azure script invocation. This makes the deployment of the underlying ARM template display (hopefully) more useful information.

Now on to the deployments themselves... As Tao mentioned, the code has changed a lot recently and some of the binaries used to support the deployment were obsolete. I have just updated them and you should be good to go now.

As a reminder, there is no more Static Web App resource as of release 0.7 (last week). Instead, the static files are now hosted by the backend by default (though you could host them elsewhere if you wanted). So make sure you point to your Web App Service to see whether your deployment works.

The number of resources deployed has grown a lot recently, and I see that race conditions are possible and would cause the ARM template deployment to fail (especially for Application Insights-related resources). I will have to look into this.

Also, the reason you didn't get a frontend using the Deploy to Azure button is that a second step needed to be done manually after clicking the "Deploy to Azure" button (using the deploy-webapp script, which no longer exists). Simplifying this is actually one of the reasons the default hosting of the frontend files is now done from the Web App.

So, hopefully, you should be good now although the ARM template can still use some streamlining and protection against deployment race conditions.

Also, the safest way to avoid problems should any of the binaries lag again in the deployment resources would be to invoke the following scripts in this order:

.\deploy-azure.ps1
.\package-webapi.ps1
.\deploy-webapi.ps1
.\package-memorypipeline.ps1
.\deploy-memorypipeline.ps1
.\package-plugins.ps1
.\deploy-plugins.ps1

An overall script to do all this is coming next week along with the streamlining of the ARM template.
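Until that overall script exists, the run order above can be sketched as a small driver. This is an illustrative Python wrapper, not something from the repo. It assumes PowerShell is available as pwsh on the PATH and that the scripts sit in the current directory. build_commands and run_in_order are hypothetical names, and any per-script arguments (subscription, resource group, and so on) have to be supplied by the caller because they are not spelled out here.

```python
import subprocess
from typing import Callable, Optional

# Order quoted from the comment above; only the script names come from the thread.
DEPLOY_STEPS = (
    "deploy-azure.ps1",
    "package-webapi.ps1",
    "deploy-webapi.ps1",
    "package-memorypipeline.ps1",
    "deploy-memorypipeline.ps1",
    "package-plugins.ps1",
    "deploy-plugins.ps1",
)


def build_commands(extra_args: Optional[dict] = None, shell: str = "pwsh") -> list:
    """Build one argv list per step; extra_args maps a script name to its arguments."""
    extra_args = extra_args or {}
    return [[shell, "-File", step, *extra_args.get(step, [])] for step in DEPLOY_STEPS]


def run_in_order(commands: list, runner: Callable = subprocess.run) -> Optional[int]:
    """Run the commands sequentially and stop at the first failure.

    Returns the index of the failing command, or None if every step succeeded.
    """
    for index, command in enumerate(commands):
        completed = runner(command)
        if completed.returncode != 0:
            return index
    return None
```

Stopping at the first non-zero exit code matters here because each deploy step presumably depends on the package produced by the step before it.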

douglasware commented 11 months ago

You should not be pulling such problematic changes into the main branch when they don't work. Moving fast is not a polite justification. Branches are free to create. Thank you for all you're doing, but please do better. <3

raffertyuy commented 11 months ago

Thanks for the tip on script run-order. This will come in handy with my initial objectives.

Anyway, I followed @TaoChenOSU's advice and deployed through the deploy-to-azure button first. I am getting intermittent errors though...

My first deployment attempt resulted in errors deploying 3 Log Analytics workspaces. (screenshot)

After deleting/purging and trying again, I got a different Application Insights deployment error. (screenshot)

Help? :)

douglasware commented 11 months ago

@raffertyuy I think the webapi config is not getting set up right, which causes the web app to start and immediately fail. It doesn't emit any useful telemetry (that I could see) to indicate what the specific problem is. The most recent version I can get to work at all (when deployed to Azure) is c9e585d6 from September 19, which is the last commit before they added the new memory stuff. Running locally is fine.

TaoChenOSU commented 11 months ago

Hello @raffertyuy,

Thank you for uploading the screenshot! I just deployed using the deploy-to-azure button and it worked. As @glahaye mentioned, there might be a race condition in the deployment template. I know it's not ideal, but I believe the best bet to unblock you is to try redeploying while we investigate.

TaoChenOSU commented 11 months ago

Also, if you are still seeing issues, could you please post the options you set in the deployment template (with the secrets hidden)? (screenshot)

douglasware commented 11 months ago

Here are my settings. The one-click deployment failed the first time and worked the second time. But, as with every commit I have tried since the first memory commit, the web app crashes on startup. The last build I have seen work in Azure with my own eyes is the one from Sep 19.

(two screenshots)

douglasware commented 11 months ago

BTW... if you figure out what this issue is, I beg of you, fix your logging and error handling to make it possible to tell instead of just falling over dead with no output. :)

TaoChenOSU commented 11 months ago

Please refer to this post on how to view the logs: https://github.com/microsoft/chat-copilot/issues/423.

Could you please try deploying with memoryStore set to AzureCognitiveSearch?

TaoChenOSU commented 11 months ago

I believe I have found the issue. We no longer support the Volatile memory store. Will issue a fix soon. (screenshot)

raffertyuy commented 11 months ago

Thanks for investigating. My screenshots above were from deploying to Azure Cognitive Search as the memory store. I will try again soon and give a screenshot of my inputs... maybe after your fix :)

TaoChenOSU commented 11 months ago

Task to remove Volatile and Postgres tracked by: https://github.com/microsoft/chat-copilot/issues/510

glahaye commented 11 months ago

@raffertyuy The blocking aspect of this issue should be fixed now. It was due to some obsolete default settings. Please give it another try and let me know how it went. I'm not closing this issue until your deployment works!

I eliminated some sources of resource deployment race conditions but believe there is still one I need to address. I hope to have a PR for that today. In the meantime, you might see transient failures when deploying, which can be mitigated by making a second deployment attempt.
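For anyone scripting their deployments, the "try a second time" mitigation can be expressed as a small retry wrapper. This is only a sketch in Python: deploy_with_retry is a hypothetical helper, and the deploy callable stands in for whatever actually launches the deployment (for example, a subprocess call to the deploy script) and returns True on success.

```python
import time
from typing import Callable, Optional


def deploy_with_retry(
    deploy: Callable[[], bool],
    attempts: int = 2,
    delay_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call deploy() up to `attempts` times.

    Returns the 1-based number of the attempt that succeeded, or None if
    every attempt failed. Nothing is deleted between attempts.
    """
    for attempt in range(1, attempts + 1):
        if deploy():
            return attempt
        if attempt < attempts:
            # Pause before retrying; skipped after the final attempt.
            sleep(delay_seconds)
    return None
```

The default of two attempts mirrors the advice above. A retry only papers over a transient failure, so a deployment that fails on every attempt still needs its error output inspected.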

@TaoChenOSU is also currently working on making the web search plugin optional (and I believe off by default), which will eliminate another common source of deployment problems.

@douglasware Admittedly, the past couple of weeks have shown us that we need to better deal with change velocity and how it can affect stability. We are actually looking at implementing a branching scheme more sophisticated than just putting everything in main.

As for the logging, did you check in App Insights to get visibility into the problem? Suggestions as to where / how to log are welcome since we might be tunnel-visioned in debugging a certain way which doesn't necessarily correspond to how folks would like to do so.

glahaye commented 11 months ago

The web search plugin is now optional (and off by default) so that shouldn't be an issue anymore either.

threegitty350 commented 11 months ago

Deploys for me just fine now. Thank you for your amazing work! :)

douglasware commented 11 months ago

Thanks for the great work! I'll have a go later today

douglasware commented 11 months ago

My theory that day was that it was crashing setting the middleware up and that was why turning up the app insights logging was not giving me anything to go on, i.e. it was crashing too soon to talk to app insights. I was able to use the filesystem logs and system events but the output wasn't much to go on.

Cat-Vader commented 11 months ago

@douglasware Did you get an Azure deployment working? I noticed that, once deployed, a few configuration settings in the Azure App Service need to be changed. Some of the environment variables are still set for local deployments: the server listens on localhost for incoming requests, when it should really be the App Service URL with the corresponding port. I have also run into an error where there isn't a chat directory in the file structure, yet that path is called when you try to initiate a conversation.
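A quick way to spot leftover local values is to scan the app settings for localhost-style addresses. The Python sketch below is illustrative only: find_local_values is a made-up helper, it takes the settings as a plain name-to-value dict (however you export them), and the setting names in the usage example are hypothetical rather than taken from the repo.

```python
import re

# Matches the two most common ways a local address shows up in a setting value.
_LOCAL_PATTERN = re.compile(r"\b(localhost|127\.0\.0\.1)\b", re.IGNORECASE)


def find_local_values(settings: dict) -> list:
    """Return the names of settings whose value still points at a local address."""
    return [
        name
        for name, value in settings.items()
        if isinstance(value, str) and _LOCAL_PATTERN.search(value)
    ]


# Hypothetical setting names, for illustration only.
example = {
    "Example__BackendUrl": "https://localhost:40443/",
    "Example__PublicUrl": "https://example.azurewebsites.net",
}
suspects = find_local_values(example)
```

Any name that ends up in suspects is a candidate for replacement with the App Service URL.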

Cat-Vader commented 11 months ago

Other than that, the plugins work fine, the health test is also fine, and auth is okay. I might have missed something; if you got your instance working, please let me know how.

raffertyuy commented 11 months ago

Thank you @glahaye and @TaoChenOSU for the latest updates. It is working for me now.

There are 2 minor issues, but easily resolved:

  1. Race condition errors. Workaround: redeploy (without deleting anything).
  2. Deployment did not automatically update my Entra ID App Registration with the new redirect URI. I believe this was done automatically before. Anyway, easy manual step after deployment.

Feel free to close this issue if you think these should be tracked separately.

glahaye commented 11 months ago

@raffertyuy I have a potential fix for the race condition which I am testing now.

As for the App Registration, you can turn it on by using the -EnsureUriInAppRegistration flag with the deploy-webapi script. Your experience suggests this should be the other way around: done by default with the possibility to skip. I will make that change.

glahaye commented 11 months ago

@raffertyuy -EnsureUriInAppRegistration is no longer needed (and in fact no longer exists).

By default now, just invoking the deploy-webapi script takes care of all that's needed.

Also, I'll have a PR for the race condition shortly.

glahaye commented 11 months ago

Closing this issue.

Opened #539 to track deployment race conditions specifically.