khellang / Middleware

Various ASP.NET Core middleware
MIT License
811 stars 112 forks source link

Randomly occurring NullReferenceException #165

Open antonpopov opened 2 years ago

antonpopov commented 2 years ago

Hi! All of our webservices use ProblemDetails middleware. However, going through the logs we found out that in some cases (but not always) when an exception occurs during the execution of the application another exception occurs in the ProblemDetails middleware. Here's the stack trace:

An exception was thrown attempting to execute the problem details middleware.

System.NullReferenceException: Object reference not set to an instance of an object. at Microsoft.AspNetCore.Mvc.Infrastructure.DefaultOutputFormatterSelector.ValidateContentTypes(MediaTypeCollection contentTypes) at Microsoft.AspNetCore.Mvc.Infrastructure.DefaultOutputFormatterSelector.SelectFormatter(OutputFormatterCanWriteContext context, IList`1 formatters, MediaTypeCollection contentTypes) at Microsoft.AspNetCore.Mvc.Infrastructure.ObjectResultExecutor.ExecuteAsyncCore(ActionContext context, ObjectResult result, Type objectType, Object value) at Hellang.Middleware.ProblemDetails.ProblemDetailsMiddleware.WriteProblemDetails(HttpContext context, ProblemDetails details) at Hellang.Middleware.ProblemDetails.ProblemDetailsMiddleware.HandleException(HttpContext context, ExceptionDispatchInfo edi)

What's really odd is that we couldn't find a pattern. We couldn't reproduce the issue by executing some of the failing requests in debug or when the app is deployed. All we can see is that sometimes the beforementioned NullReference exception is thrown.

Any help or guidance will be much appreciated!

Target runtime: NET6.0 Hellang.Middleware.ProblemDetails version: 6.4.0

khellang commented 2 years ago

Hmm. That's a really weird one. Looks like it's happening deep in MVC code. Forunately, the ValidateContentTypes only has a single call site:

https://github.com/dotnet/aspnetcore/blob/ecc4d17263bad39769bb4b71c9cc2a0d2281e931/src/Mvc/Mvc.Core/src/Infrastructure/DefaultOutputFormatterSelector.cs#L57-L74

It has a precondition which asserts that contentTypes is not null, which leaves very little to be null in this method:

https://github.com/dotnet/aspnetcore/blob/ecc4d17263bad39769bb4b71c9cc2a0d2281e931/src/Mvc/Mvc.Core/src/Infrastructure/DefaultOutputFormatterSelector.cs#L284-L299

🤔

khellang commented 2 years ago

And the code for ValidateContentTypes hasn't been touched since 2017. Could it be some race condition cause by HttpContext being used outside a request scope or something?

khellang commented 2 years ago

Hmm, could it be that the requests have a wildcard Accept header and it's the exception message that fails?

It tries to get the FormatObjectResult_MatchAllContentType resource for the exception message, but I can't find any reference to that resource in the repo. However, I found this MatchAllContentTypeIsNotAllowed resource, which I can't see used anywhere in code. I wonder if that's the issue?

In any case, this is probably an issue for dotnet/aspnetcore 😄

khellang commented 2 years ago

Nope, that's not it. When generating methods for strings containing formatting arguments for the Resources class, the generator adds a Format prefix, which explains why I didn't find any references to it. There's indeed an ObjectResult_MatchAllContentType resource 😞

antonpopov commented 2 years ago

And the code for ValidateContentTypes hasn't been touched since 2017. Could it be some race condition cause by HttpContext being used outside a request scope or something?

I think that's not the case. However we are still investigating and trying to reproduce the issue. Thanks for the time.

kamilk commented 2 years ago

We experienced the same problem but we figured out it was caused by being on an old version of ProblemDetails. What worries me is that the OP reports using a version that, according to my investigation, should have this problem fixed. We haven't had time to thoroughly check the issue no longer occurs but early results look promising.

Maybe some information about the problem will be helpful:

The NullReferenceException happens if the ContentTypes property contains a null element. This causes DefaultOutputFormatterSelector.ValidateContentTypes (source) to throw when calling the constructor of MediaType, as it attempts to access the Length property of the null string (source). Which leads to the question, how does an null element end up in ContentTypes?

ProblemDetails 5.4.0 provided the Executor with a reference to Options.ContentTypes (source). This was problematic, as this is effectively a singleton object, reused for every request, but is implemented as a thread-unsafe List<>. ObjectResultExecutor modifies this collection by adding two elements to it in InferContentTypes (source). Growing the singleton collection by two elements every time the middleware formats a problem response is bad enough but the real problem is the risk of a race condition. If two threads execute InferContentTypes concurrently, then the _size of the list is incremented twice but both threads may end up assigning a new element to the same index of the array - leaving one space in the array unassigned. Hence a null value in ContentTypes.

Once a null element ends up in ContentTypes, it stays there until the process is restarted, and until then the exception occurs every time. If the service is load-balanced, then only some instances may be affected, giving an impression of randomness.

But this appears to have been fixed by ProblemDetails v5.4.1: the middleware no longer passes the original singleton collection to the Executor but it passes its clone instead.

Therefore I'm interested to know, is the OP definitely on 6.4.0 and not 5.4.0? Is it possible the race condition on ContentType is caused elsewhere in their application?

I will post again on how we fare with the newer version of the NuGet.

khellang commented 2 years ago

Thanks for the excellent analysis @kamilk! 🙏🏻 The fact that the executor actually mutates the ContentTypes collection is definitely a gotcha. Good thing it seems to be fixes, which probably was an accident 😂