microsoft / CloudAdoptionFramework

Code samples and extended documentation to support the guidance provided in the Microsoft Cloud Adoption Framework
https://aka.ms/CAF
Creative Commons Attribution 4.0 International
864 stars 621 forks source link

AzNamingTool - Application is crashing because of logfiles? #116

Closed ciprianglg closed 1 year ago

ciprianglg commented 1 year ago

I have the app running in a Web App for container running in Azure. After a period of time, while you use the application, and you generate some names for your resources, application is crashing, with the following error in the logs:

`2022-11-22T13:16:18.981924037Z warn: Microsoft.AspNetCore.Components.Server.Circuits.RemoteRenderer[100]

2022-11-22T13:16:18.981974937Z Unhandled exception rendering component: Object reference not set to an instance of an object.

2022-11-22T13:16:18.986191553Z System.NullReferenceException: Object reference not set to an instance of an object.

2022-11-22T13:16:18.986258153Z at AzureNamingTool.Pages.Generate.b__0_2(RenderTreeBuilder __builder2) in /src/Pages/Generate.razor:line 51

2022-11-22T13:16:18.987751559Z at Microsoft.AspNetCore.Components.CascadingValue`1.Render(RenderTreeBuilder builder)

2022-11-22T13:16:18.987771959Z at Microsoft.AspNetCore.Components.Rendering.ComponentState.RenderIntoBatch(RenderBatchBuilder batchBuilder, RenderFragment renderFragment, Exception& renderFragmentException)

2022-11-22T13:16:18.995392088Z fail: Microsoft.AspNetCore.Components.Server.Circuits.CircuitHost[111]

2022-11-22T13:16:18.995411488Z Unhandled exception in circuit 'oxQ_b7CYUxOGTM_tNLhnMp3yEgo1tB9rEyTJR77pVis'.

2022-11-22T13:16:18.996416992Z System.NullReferenceException: Object reference not set to an instance of an object.

2022-11-22T13:16:18.996460292Z at AzureNamingTool.Pages.Generate.b__0_2(RenderTreeBuilder __builder2) in /src/Pages/Generate.razor:line 51

2022-11-22T13:16:18.996466892Z at Microsoft.AspNetCore.Components.CascadingValue`1.Render(RenderTreeBuilder builder)

2022-11-22T13:16:18.997848997Z at Microsoft.AspNetCore.Components.Rendering.ComponentState.RenderIntoBatch(RenderBatchBuilder batchBuilder, RenderFragment renderFragment, Exception& renderFragmentException)

2022-11-22T13:16:19.205078882Z info: Microsoft.AspNetCore.Routing.EndpointMiddleware[1]

2022-11-22T13:16:19.205128082Z Executed endpoint '/_blazor'

2022-11-22T13:16:19.206160786Z info: Microsoft.AspNetCore.Hosting.Diagnostics[2]`

or

`2022-11-22T13:46:40.396486506Z fail: Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware[1]

2022-11-22T13:46:40.396514007Z An unhandled exception has occurred while executing the request.

2022-11-22T13:46:40.396520307Z System.NullReferenceException: Object reference not set to an instance of an object.

2022-11-22T13:46:40.396524807Z at AzureNamingTool.Pages.Reference.GenerateName(ResourceType type, String shortName, String staticvalues) in /src/Pages/Reference.razor:line 389

2022-11-22T13:46:40.396529107Z at AzureNamingTool.Pages.Reference.BuildRenderTree(RenderTreeBuilder __builder) in /src/Pages/Reference.razor:line 92

2022-11-22T13:46:40.396533407Z at Microsoft.AspNetCore.Components.Rendering.ComponentState.RenderIntoBatch(RenderBatchBuilder batchBuilder, RenderFragment renderFragment, Exception& renderFragmentException)

2022-11-22T13:46:40.396537907Z --- End of stack trace from previous location ---

2022-11-22T13:46:40.396541807Z at Microsoft.AspNetCore.Components.Rendering.HtmlRenderer.HandleException(Exception exception)

2022-11-22T13:46:40.396545907Z at Microsoft.AspNetCore.Components.RenderTree.Renderer.HandleExceptionViaErrorBoundary(Exception error, ComponentState errorSourceOrNull)

2022-11-22T13:46:40.396549707Z at Microsoft.AspNetCore.Components.RenderTree.Renderer.ProcessRenderQueue()

2022-11-22T13:46:40.396553607Z --- End of stack trace from previous location ---

2022-11-22T13:46:40.396557407Z at Microsoft.AspNetCore.Components.Rendering.HtmlRenderer.HandleException(Exception exception)

2022-11-22T13:46:40.396561307Z at Microsoft.AspNetCore.Components.RenderTree.Renderer.ProcessRenderQueue()

2022-11-22T13:46:40.396582707Z at Microsoft.AspNetCore.Components.RenderTree.Renderer.AddToRenderQueue(Int32 componentId, RenderFragment renderFragment)

2022-11-22T13:46:40.396587407Z at Microsoft.AspNetCore.Components.ComponentBase.StateHasChanged()

2022-11-22T13:46:40.396591207Z at Microsoft.AspNetCore.Components.ComponentBase.CallOnParametersSetAsync()

2022-11-22T13:46:40.396595007Z at Microsoft.AspNetCore.Components.ComponentBase.RunInitAndSetParametersAsync()

2022-11-22T13:46:40.396599307Z at Microsoft.AspNetCore.Components.Rendering.HtmlRenderer.HandleException(Exception exception)

2022-11-22T13:46:40.396603407Z at Microsoft.AspNetCore.Components.RenderTree.Renderer.HandleExceptionViaErrorBoundary(Exception error, ComponentState errorSourceOrNull)

2022-11-22T13:46:40.396608307Z at Microsoft.AspNetCore.Components.Rendering.ComponentState.SetDirectParameters(ParameterView parameters)

2022-11-22T13:46:40.396612707Z at Microsoft.AspNetCore.Components.RenderTree.Renderer.RenderRootComponentAsync(Int32 componentId, ParameterView initialParameters)

2022-11-22T13:46:40.396616707Z at Microsoft.AspNetCore.Components.Rendering.HtmlRenderer.CreateInitialRenderAsync(Type componentType, ParameterView initialParameters)

2022-11-22T13:46:40.396620607Z at Microsoft.AspNetCore.Components.Rendering.HtmlRenderer.RenderComponentAsync(Type componentType, ParameterView initialParameters)

2022-11-22T13:46:40.396624507Z at Microsoft.AspNetCore.Components.Rendering.RendererSynchronizationContext.<>c11`1.<b11_0>d.MoveNext()

2022-11-22T13:46:40.396628907Z --- End of stack trace from previous location ---

2022-11-22T13:46:40.396632807Z at Microsoft.AspNetCore.Mvc.ViewFeatures.StaticComponentRenderer.PrerenderComponentAsync(ParameterView parameters, HttpContext httpContext, Type componentType)

2022-11-22T13:46:40.396636707Z at Microsoft.AspNetCore.Mvc.ViewFeatures.ComponentRenderer.PrerenderedServerComponentAsync(HttpContext context, ServerComponentInvocationSequence invocationId, Type type, ParameterView parametersCollection)

2022-11-22T13:46:40.396640807Z at Microsoft.AspNetCore.Mvc.ViewFeatures.ComponentRenderer.RenderComponentAsync(ViewContext viewContext, Type componentType, RenderMode renderMode, Object parameters)

2022-11-22T13:46:40.396644807Z at Microsoft.AspNetCore.Mvc.TagHelpers.ComponentTagHelper.ProcessAsync(TagHelperContext context, TagHelperOutput output)

2022-11-22T13:46:40.396648707Z at Microsoft.AspNetCore.Razor.Runtime.TagHelpers.TagHelperRunner.g__Awaited|0_0(Task task, TagHelperExecutionContext executionContext, Int32 i, Int32 count)

2022-11-22T13:46:40.396653207Z at AzureNamingTool.Pages.Pages__Host.ExecuteAsync() in /src/Pages/_Host.cshtml:line 10

2022-11-22T13:46:40.396657007Z at Microsoft.AspNetCore.Mvc.Razor.RazorView.RenderPageCoreAsync(IRazorPage page, ViewContext context)

2022-11-22T13:46:40.396661007Z at Microsoft.AspNetCore.Mvc.Razor.RazorView.RenderPageAsync(IRazorPage page, ViewContext context, Boolean invokeViewStarts)

2022-11-22T13:46:40.396664908Z at Microsoft.AspNetCore.Mvc.Razor.RazorView.RenderAsync(ViewContext context)

2022-11-22T13:46:40.396672608Z at Microsoft.AspNetCore.Mvc.ViewFeatures.ViewExecutor.ExecuteAsync(ViewContext viewContext, String contentType, Nullable`1 statusCode)

2022-11-22T13:46:40.396677308Z at Microsoft.AspNetCore.Mvc.ViewFeatures.ViewExecutor.ExecuteAsync(ViewContext viewContext, String contentType, Nullable`1 statusCode)

2022-11-22T13:46:40.396681408Z at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.g__Awaited|30_0[TFilter,TFilterAsync](ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)

2022-11-22T13:46:40.396685708Z at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Rethrow(ResultExecutedContextSealed context)

2022-11-22T13:46:40.396689608Z at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.ResultNext[TFilter,TFilterAsync](State& next, Scope& scope, Object& state, Boolean& isCompleted)

2022-11-22T13:46:40.396694008Z at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.InvokeResultFilters()

2022-11-22T13:46:40.396697808Z --- End of stack trace from previous location ---

2022-11-22T13:46:40.396701608Z at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.g__Awaited|25_0(ResourceInvoker invoker, Task lastTask, State next, Scope scope, Object state, Boolean isCompleted)

2022-11-22T13:46:40.396705808Z at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Rethrow(ResourceExecutedContextSealed context)

2022-11-22T13:46:40.396709708Z at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)

2022-11-22T13:46:40.396714308Z at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.InvokeFilterPipelineAsync()

2022-11-22T13:46:40.396718308Z --- End of stack trace from previous location ---

2022-11-22T13:46:40.396724208Z at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.g__Logged|17_1(ResourceInvoker invoker)

2022-11-22T13:46:40.396728308Z at Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.g__Logged|17_1(ResourceInvoker invoker)

2022-11-22T13:46:40.396732408Z at Microsoft.AspNetCore.Routing.EndpointMiddleware.g__AwaitRequestTask|6_0(Endpoint endpoint, Task requestTask, ILogger logger)

2022-11-22T13:46:40.396736608Z at Swashbuckle.AspNetCore.SwaggerUI.SwaggerUIMiddleware.Invoke(HttpContext httpContext)

2022-11-22T13:46:40.396741008Z at Swashbuckle.AspNetCore.Swagger.SwaggerMiddleware.Invoke(HttpContext httpContext, ISwaggerProvider swaggerProvider)

2022-11-22T13:46:40.396745308Z at Microsoft.AspNetCore.Diagnostics.ExceptionHandlerMiddleware.g__Awaited|6_0(ExceptionHandlerMiddleware middleware, HttpContext context, Task task)

2022-11-22T13:46:40.396750008Z info: Microsoft.AspNetCore.Routing.EndpointMiddleware[0]

2022-11-22T13:46:40.396754208Z Executing endpoint '/Error'`

image

If i go and delete generatednames.json and adminlogmessages.json, and then restart web app, application is working fine again.

In case you don't delete the log files i receive a stack overflow error which is attached, and application is crashing constantly. stackOverFlowIssue.txt

How i use application? I'm having a function that calls the api to generate names for azure resources, using this api call /api/ResourceNamingRequests/RequestName

How you can replicate?

Just create a function to generate a name for all resource types using the following api call /api/ResourceNamingRequests/RequestName

"as","apim","apivs","apis","apii","apiia","apiic","apio","apiot","apir","apiss","apitd","apitg","apias","apib","apic","apid","apig","apigu","apiip","apil","apin","recem","apioi","apipo","apipr","api","grp","aspt","prop","subsr","ast","temp","user","appcs","lock","pasm","pasr","pasd","policyr","policyd","initm","initr","initd","aa","aacert","aaconn","cred","runbk","sched","var","watch","wbhk","ba","appl","bbcert","pool","blchm","bpas","bp","bpa","bpv","botsvc","chan","bbconn","botec","redis","crrule","cdnp","cdne","cerco","cog","avail","cld","des","osdisk","disk","gal","cgap","apver","gimg","ver","cimg","snap","slc","smc","vmssl","vmssw","ci","cr","btsk","step","rep","scmap","task","tokn","webhk","aks","osmc","cihub","authp","conn","chcmap","int","kpi","link","pred","prof","rlink","rel","role","view","cpas","cprep","dbjob","dbw","adf","flow","dset","run","lksvc","pipe","trig","rerun","dla","cmpol","dlsta","dlarule","dlasa","dls","dlsrule","dlsrulev","dms","proj","mdbsv","mdbdb","dmdsfrule","dmdsvrule","mysql","dbmysql","dmsfrule","dmsvrule","psql","dbpsql","dpsfrule","dpsvrule","iot","dicert","congp","provs","pcert","dtlab","img","form","vml","vmw","dbact","cosmos","evgd","evgt","evgs","egtp","evhcl","evhns","enarule","drcfg","evh","enearule","cgrp","mls","spark","storm","hadoop","hbase","kafka","arcs","iejob","ag","inala","appi","inma","insqr","iotapp","kv","kvs","arck","dec","dedb","dcon","ehcon","ia","assem","bconf","cert","limap","part","rpcnf","schem","ses","lise","mapi","logic","mlcp","mlweb","mlws","mlw","mlswc","id","mg","macc","mms","lievt","liout","strep","migr","sac","asg","afw","bas","con","dnsz","erc","waf","wafrg","fd","fdfp","sac","sac","nlirule","lgw","nic","nsg","nsgsr","nw","pdnsz","pdnsv","pl","pip","ippre","rf","nrfrr","rt","udr","se","traf","vgw","vnet","snet","peer","vwan","vpng","vcn","vpn","ntfns","nnarule","ntf","nnnarule","opic","log","opms","pdash","pbiwc","pbi","pview","rsv","rsvbp","rns","rnarule","hycon","rnharule","wcfr","wcfra","deloy","rg","rtn","rtnv","tspec","srch","sb","snarule","sbndc","migc","sbq","snqarule","sbt","sntarule","tsub","sntsrule","sf","sigr","sqlmi","sqldw","sql","sqlsa","sqldb","sqlsg","sqlep","sqlfg","sqlfw","sbl","squ","stvm","st","stbl","stblc","stfs","stfss","stmp","sta","sss","sgrp","ssimp","asa","sasj","sjf","sjin","sjout","sjtrf","subcr","synw","synsp","syndp","tsi","tsiea","tsiee","tsier","wbcer","plan","stapp","app","func","ase","aswba","slot"

Api call json body :

{"resourceType": "shortNameFromTheList","resourceUnitDept": "sud","resourceProjAppSvc": "spa","resourceEnvironment": "dev","resourceLocation": "we","resourceInstance": "001"}

iic-eon commented 1 year ago

Dear team, it will be helpful if the above can be solved. Thank you.

jamasten commented 1 year ago

Thank you for reporting the issue. We are looking into it and will respond here once we've implemented a fix.

ciprianglg commented 1 year ago

@jamasten if required i can provide de app url, and give credentials if will speed up, at least to understand what the problem might be, in case something requires more understanding.

BryanSoltis commented 1 year ago

Hi @ciprianglg,

Thank you for the feedback. I have been looking at the code and may have some ideas as to the cause, but need to clarify some aspects first.

  1. Your stack traces in your original message show errors on the Generate and Reference pages. However, later in your report under the "how to replicate", you mention calling the API. Can you please clarify how you are using the site and receiving the error?

  2. If you are using the site (not the API), are you receiving the error after some time of not accessing the site? The stack trace indicates the "delimiter" value is empty, so I believe the root cause is that the site no longer has the correct configuration data. I have added code that would reload this data if it's ever empty but wanted to confirm this was the scenario you are seeing.

    The site by default will attempt to cache configuration for 5 minutes (I have increased this to 10 minutes in the next update). When the configuration data is requested and it's empty/expired, then the site should repull the configuration data. This would require something happening on the site (reloading a page, navigating to another page, etc.). Can you detail the steps you take when you see the errors?

  3. When you stated you cleared the admin log/generated names log files, are you also restarting the application? If so, I believe the log aspect is unrelated, as I think the cause it the configuration not being loaded (see # 3 above).

  4. Your stacktrace.txt file shows errors when the site attempts to deserialize the data from the JSON files. If you are running the site as a container, these files would be stored in the mounted storage account. Can you confirm that the storage account contains the JSON files? You will need to browse to the storage account directly to look at the file share files.

Thank you for any additional details you can provide!

-Bryan

ciprianglg commented 1 year ago

@BryanSoltis sorry for late response, i was off for a couple of days. Point 1: On how to replicate is just how you can make the application hanging and not working any more. On my side I'm using the application mostly using API calls to generate names, and suddenly i don't get a reply back, and in the logs i can see the stack overflow error when i don't get the response back. Then I'm moving to the web app, to see what is going on, and there i can see those errors from screen shot and also posted when i navigate through the menu especially to Reference/Generate

Point 2: As mentioned in point 1, is just immediately after the API call is not working, I'm switching to the web interface to see if application is still working, and navigating through the interface

Point 3: Application is restarting very often because it says that is not healthy. I'm doing a stop and start, but still when it comes back is still crashing because it is not healthy. Only way to make it working properly again, is stopping the web app, delete those two log files, and start it again.

Point 4: I'm running it in Azure App Service in a web app for container, and to the app service i have mounted the file share from a storage account, and all the files are there all the time, as I'm using it to get rid of the log files to make app working again.

If required, we can have a session together for a couple of minutes, to go through the errors. Idea is that i want to use this app, especially the API in provisioning azure resources programmatically, and I'm finding it fit for my purpose, as i don't need to worry about what names i should provide to resources.

BryanSoltis commented 1 year ago

Hi @ciprianglg,

Thank you for the details! It definitely helps me understand better about what might be happening. In reviewing the stack trace again, I can see it's failing repeatedly to write to the Admin Log. When a name is requested via the API and the process fails, it will attempt to do this. The actual "RequestName" function has a single try/catch that will attempt this if anything goes wrong. What's odd is that your stack trace file has MANY occurrences of this in a short succession (milliseconds apart).

The "Write to Admin Log" process goes like this:

I think the issue might be when the site is attempting to pull those Admin Log Messages, it's having some error doing so. But, that function is also in a try catch and should just error out to a single response if it does.

During the development of the tool and after the first launch, I hosted it in an Azure App Service container, as well. While this worked almost every time, occasionally I would see an issue with it being able to connect to the storage mount. I was never able to fully identify the circumstances that lead to the issue, and restarting it usually resolved it. I'm wondering if that issue is what you are experiencing, as well.

I am looking at the code now to see if I can add any sort of check to "verify" that the storage is available, in the event that is causing an issue. It's not a solution, but would at least avoid the problem you are seeing, as the site should self-correct if it happens.

Once I get that fix in place, I will send you an invite to a private repo with the change, if you don't mind testing. I will attempt to recreate the problem, but it would be awesome if you could, as well.

ciprianglg commented 1 year ago

Hi @BryanSoltis imagine, that I'm trying to provision resources for a project, and in that project i have different kind of resources. Each time i do a change, there will be lots of request to get the names for me, and this will explain a lots of calls. In the example i have to reproduce the error quickly I'm just trying to request a name for all those 350+ short names, and this is why you can see those occurrences (not ideal) Also i was thinking that the problem is with the storage account, but i succeed to crash the app without storage account being mounted, so I've just eliminated that one, so don't think you should focus on that one. Bad part about not having the file share mounted was that, each time, the app crashes, the image is retrieved from registry and app is like new, and also I'm not able to find the logs in the container. About testing I'm ready when you are :).

BryanSoltis commented 1 year ago

OK, good to know. In looking at it some more, I think the issue may actually be with the caching. I'll be sharing a private repo shortly with the updates for you to test.

NOTE You will see tool/file version messages, as this code is ahead of the official release.

BryanSoltis commented 1 year ago

This has been fixed in the following PR: https://github.com/microsoft/CloudAdoptionFramework/pull/124

ciprianglg commented 1 year ago

@BryanSoltis I still have the issue that the application is crashing from time to time with the following stack overflow message: stackOverflowIssue31.01.2023.txt

I tried several time to restart the app service, stop it, start it, but i was not able to make it working again, as the container is crashing. Only way to solve it was to delete admin log and then application was started to work again. Any hints about it?

BryanSoltis commented 1 year ago

Hello Cirprian,

Sorry to hear you are still having issues. From the logs, it appears to have a problem with the Serialization/De-serialization of the Admin Log messages. Then, when it has this issue, it attempts to log the message in the Amin Log, further compounding the problem. Additional logic was added int eh previous PR to address this issue, but it seems like something else is still not working.

Can you please provide the following:

Also, I have created a new private GitHub and added you as a collaborator. Can you please add your /settings/adminlogmessages.json file to the repo so I can review. Ideally, I would like the log in the state when you experience the issue to determine if there is specific content.

Once I have the above, I will research this more to try and determine what the problem may be.

-Bryan

ciprianglg commented 1 year ago

@BryanSoltis I'm using the latest version of the tool, and is running inside a web app service. Unfortunately i didn't saved the adminlogmessages.json from storage account, and realized this later that i should keep it. In this case i will wait for a new failure and then, i will uploaded together with multiple logs from the app service, and in case we will not have a urgent need, i will share with you the links to see behavior. Is this ok with you?

BryanSoltis commented 1 year ago

Sounds good. Let me know when you upload the files to the private repo and I'll check it out. Thank you!

-Bryan

ciprianglg commented 1 year ago

To make a summary, i used self written code through which i called the api to return me generated name. My code was sending requests too fast, that made strange behavior to the application, which in the end will result in a stack overflow message (this message was generated because the log files especially adminlogmessages.json had a bad formatting, containing an extra "]"). My code was refactored and for the moment i was not able to generate the stack overflow message anymore.