Open tormnator opened 10 months ago
Hi there @tormnator!
Firstly, a big thank you for raising this issue. Every piece of feedback we receive helps us to make Umbraco better.
We really appreciate your patience while we wait for our team to have a look at this but we wanted to let you know that we see this and share with you the plan for what comes next.
We wish we could work with everyone directly and assess your issue immediately but we're in the fortunate position of having lots of contributions to work with and only a few humans who are able to do it. We are making progress though and in the meantime, we will keep you in the loop and let you know when we have any questions.
Thanks, from your friendly Umbraco GitHub bot :robot: :slightly_smiling_face:
Hey Torm, thank you for this report. Our resident deploy expert has also been able to trigger this error on a CMS 13.0.0 & Deploy 13.0.3 cloud project, so it's definitely not a one-off.
I will check with the team on when we might be able to have a look at this and your suggestions around extra error catching/logging.
Which Umbraco version are you using? (Please write the exact version, example: 10.1.0)
12.3.6
Bug summary
This issue shows up in Umbraco Cloud, via Umbraco Deploy, but the relevant source code is in Umbraco CMS so I'm posting here.
During a deploy from my local Umbraco Deploy environment to my Development environment, the Cloud dashboard shows an alarming error stating that the deploy failed. The error shown is "failed to get child with id=2458". It suggests both that there is a problem with the deployment data and that the deployment failed. Neither is actually true. It turns out the problem is that the NuCache is unable to refresh itself after the deploy has succeeded. Also, the error uses an int node id, which makes it hard to search for and identify which node is being referenced (e.g. you can't search the .uda files for int node ids).
To help out, I think a few changes/additions could be made to the relevant source code so that the error shows 1) the Guid of the node, 2) the type of node (content, media, content type, media type, etc.), and 3) the part of the deploy process during which the error occurred.
Specifics
The exception is thrown in the ContentStore.GetRequiredLinkedNode method. This method doesn't have much helpful information to provide, but the calling ContentStore.ClearBranchLocked method has a reference to the parent ContentNode with more information. A try/catch wrapper here could add a message like "Unable to get reference to child content of type {content.ContentType.Name}. The parent node's id is {content.Uid}, path is {content.Path}." (pseudocode).
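To make the idea concrete, here is a minimal, self-contained sketch of the wrap-and-rethrow pattern. It does not use the real Umbraco classes: DemoNode, DemoStore, GetRequiredNode and GetFirstChild are hypothetical stand-ins invented for illustration, and the actual ContentStore code may be structured quite differently. The point is only that the caller, which still holds the parent node, can add the Guid, path and type name while keeping the original exception as the inner exception.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for illustration only; these are NOT the Umbraco types.
public sealed record DemoNode(int Id, Guid Uid, string Path, string ContentTypeName, int FirstChildId);

public sealed class DemoStore
{
    private readonly Dictionary<int, DemoNode> _nodes = new();

    public void Add(DemoNode node) => _nodes[node.Id] = node;

    // Mirrors the situation in the report: the lookup only knows the int id.
    public DemoNode GetRequiredNode(int id, string description)
    {
        if (_nodes.TryGetValue(id, out var node)) return node;
        throw new InvalidOperationException($"failed to get {description} with id={id}");
    }

    // The caller still holds the parent, so it can add the Guid, path and type name.
    public DemoNode GetFirstChild(DemoNode parent)
    {
        try
        {
            return GetRequiredNode(parent.FirstChildId, "child");
        }
        catch (InvalidOperationException ex)
        {
            throw new InvalidOperationException(
                $"Unable to get reference to child content of type {parent.ContentTypeName}. " +
                $"The parent node's id is {parent.Uid}, path is {parent.Path}.",
                ex); // keep the original exception as InnerException
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var store = new DemoStore();
        // The parent points at a child id that was never added to the store.
        var parent = new DemoNode(1, Guid.NewGuid(), "-1,1", "Folder", FirstChildId: 2458);
        store.Add(parent);
        try
        {
            store.GetFirstChild(parent);
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.Message);                 // message with Guid, path and type
            Console.WriteLine(ex.InnerException?.Message); // original int-id message
        }
    }
}
```

Keeping the original exception as the inner exception means the int id is still available in the details, while the outer message carries values (the Guid and path) that can be searched for in the .uda files.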
If we go further up the call stack, we see that the PublishedSnapshotService.RefreshMediaTypesLocked method is called. In this method we could add another try/catch wrapper with more useful information, e.g. "An exception was thrown while refreshing media types in the NuCache" (or something similar, I'm not sure if one can say NuCache here, or if this method is too general).
Finally, closer to the top of the exception call stack, we see that DistributedCacheExtensions.RefreshContentTypeCache() is called. This might be a good place to add more accurate information about where/when in the deployment process the error occurred.
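The three levels described above could each add their own context, so that the final error reads from "where in the deploy process" down to "which node". The following is only a sketch of that layering under stated assumptions: RunStep and Describe are hypothetical helpers, the outermost message text is made up for illustration, and none of this is taken from the Umbraco source.

```csharp
using System;
using System.Collections.Generic;

public static class ContextDemo
{
    // Hypothetical helper: run a step and, on failure, rethrow with the step's description.
    public static void RunStep(string description, Action step)
    {
        try
        {
            step();
        }
        catch (Exception ex)
        {
            throw new InvalidOperationException(description, ex);
        }
    }

    // Flatten the exception chain into messages, outermost first.
    public static List<string> Describe(Exception ex)
    {
        var messages = new List<string>();
        for (var e = ex; e != null; e = e.InnerException)
        {
            messages.Add(e.Message);
        }
        return messages;
    }

    public static void Main()
    {
        try
        {
            // Outer step: hypothetical wording for the top of the call stack.
            RunStep("Refreshing the content type cache after deployment failed.", () =>
                // Middle step: wording suggested in the report.
                RunStep("An exception was thrown while refreshing media types in the NuCache.", () =>
                    // Innermost failure: the message currently shown in the dashboard.
                    throw new InvalidOperationException("failed to get child with id=2458")));
        }
        catch (Exception ex)
        {
            foreach (var message in Describe(ex))
            {
                Console.WriteLine(message);
            }
        }
    }
}
```

Running Main prints the three messages from the outermost step to the innermost failure, which is roughly the order a reader of the Cloud dashboard would want: the phase, then the subsystem, then the node.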
Here's the entire error message and exception call stack as seen when digging into the error details (I have annotated the call stack a bit with line number info):
Steps to reproduce
I don't know how to reproduce the error. As mentioned above, it happens during the Umbraco Cloud deployment process, and might be related to other bugs not yet identified.
There are at least a couple of other issues that would probably have benefited from improved error messages; studying these may help reproduce the error situation:
Expected result / actual result
The desired outcome is error information that is clear enough to help you understand why the error occurred (a system failure when attempting to refresh the NuCache after the deployment completed) and what corrective action to take: until the underlying bug(s) are fixed, force a rebuild of the cache (how to do that is another issue).