umbraco / Umbraco-CMS

Umbraco is a free and open source .NET content management system helping you deliver delightful digital experiences.
https://umbraco.com
MIT License

This document is published but its url cannot be routed #9514

Closed sbosell closed 3 years ago

sbosell commented 3 years ago

Sometimes a node in Umbraco ends up published but not routable, a state that should never be allowed to occur. In this case Umbraco returns a 404 for that route.

There are several cases of this being reported as well as other issues reported here (duplicate)

Related to issue #7575

Umbraco version

I am seeing this issue on Umbraco version: 8.6.2, 8.7.x, and 8.9.1

Reproduction

I have a backend process that syncs data to a node's descendants every 30 minutes; it then calls SaveAndPublish and/or SaveAndPublishBranch through the content API (I have tested the sync with both). The sync runs as a recurring task, and everything it uses is a standard Umbraco feature. We have no cultures defined and only one root site, the most basic Umbraco setup. Others are reporting the same issue, so it may have nothing to do with the content API.

About once per day the root node of the sync ends up in the state where its URL can't be routed. It happens only every once in a while, which leads me to believe the cause is a locking mechanism in how the cache works. This should never happen: there should be no code path in save and publish that unpublishes or deroutes a node like this.

Hosted on Azure in a single site (not load balanced) with all the appropriate settings set via the Umbraco Documentation.

Bug summary

A node becomes not routable when this should never happen.

Specifics

I can privately provide a URL

Steps to reproduce

It happens randomly

Expected result

A valid published node in Umbraco should NEVER be in a state of the URL not being routable.

Actual result

Umbraco backend reports the URL not being routable and the node is effectively unpublished.


_This item has been added to our backlog AB#9840_

sbosell commented 3 years ago

This error just occurred right now.

image

sbosell commented 3 years ago

This happened almost daily this week and is being reported by others. Duplicate ticket: https://github.com/umbraco/Umbraco-CMS/issues/9523

image

Jay-umbr commented 3 years ago

Seen several cases of this, too. Could not nail down the cause - seems to happen totally randomly, occurred sometimes after a deployment, sometimes after a content transfer on Cloud, sometimes just after leaving the site be and coming back to it after a few days. This happens regardless of models builder mode (had it happen on sites using PureLive and AppData), and on plain/package-less Umbraco installations.

sbosell commented 3 years ago

This continues to happen almost daily. Umbraco randomly unpublishes nodes and the content returns a 404 Not Found.

Nicholas-Westby commented 3 years ago

FYI, I've seen this with an Umbraco 8.6.2 install, so it has been happening at least since then.

sbosell commented 3 years ago

@Nicholas-Westby Thanks. I updated the ticket description to include 8.6.2.

kjac commented 3 years ago

In the related issues someone noted that they could fix the broken nodes by rebuilding the cache. I wonder if disabling the cache DB (thus forcing it to rebuild upon startup) is a workaround for this?

If anyone should be inclined to test this, try adding the following to an IUserComposer:

```csharp
public void Compose(Composition composition)
{
    // Ignore the local cache DB so the cache is rebuilt on startup.
    composition.Register(factory => new PublishedSnapshotServiceOptions
    {
        IgnoreLocalDb = true
    });
    // ...
}
```

sbosell commented 3 years ago

@kjac what related issue are you referring to?

I can test the change this week.

sbosell commented 3 years ago

I deployed the change this morning and the documentation for it is here. If the only consequence is a slower startup time on a new server I can live with that if it addresses this problem. I should point out we do not have replica servers.

sbosell commented 3 years ago

The change made no difference as we experienced the same issue this afternoon.

Shazwazza commented 3 years ago

Do you have multiple root nodes? And what domains do you have assigned to them? If you have multiple root nodes, only one of them can have no domains, all other ones must have domains.

sbosell commented 3 years ago

I have the simplest setup of one root and one domain and no language variants.

vaags commented 3 years ago

I'm seeing this problem on Umbraco 8.6.1. There's only one root node and only one domain, but there are multiple language variants. On several pages, only one of the languages is routable, despite all of them being published.

nikolajlauridsen commented 3 years ago

It would be helpful if we can determine if it's caused indirectly by some action in the backoffice, or by some background task or something similar.

Do you know if the issue occurs at any time of the day, does it happen in the middle of the night when no one is using the back office as well?

Would it be possible for you to check the umbracoAudit table in the database as well? Maybe something happens leading up to the bug that might be able to help us.

sbosell commented 3 years ago

@nikolajlauridsen

The issue for us happens randomly and there are very few updates in the backoffice (maybe one time per week at the most). We do have a process that runs every 30 minutes that's outlined in the ticket. Here are some of the times of day it has occurred:

Here is the full audit log with the user info removed (don't want to get spammed).

kimschurmann commented 3 years ago

Could this issue please get some attention? We do not dare to upgrade from 8.5.5 to the next version. We are kinda stuck: there are issues in 8.5.5 that are fixed in later versions, but if we upgrade we risk new and more serious issues :sigh:

bergmania commented 3 years ago

We are continuously looking at this issue. Sadly we can't reproduce and we are therefore looking for a needle in a haystack.

@vaags and @sbosell, when you say "one domain" are you talking about domains/hostnames in Umbraco or did you not add any domains in "Culture and Hostnames"?

image

sbosell commented 3 years ago

@bergmania in my case we have English as the only language, and we actually have domain.com and www.domain.com set up on that screen. I'm happy to share the backend info privately with someone from the Umbraco team.

orsinic commented 3 years ago

We also see nodes randomly losing the url ("This document is published but its url cannot be routed").

In our case we have been seeing it for several months on two separate Umbraco Cloud projects that are always on the latest version (so the current latest version has the issue, but so do several earlier versions). Unfortunately I cannot tell in which version exactly this started to occur.

It only seems to happen after a (code) deployment from a lower environment. I haven't seen this occur randomly without a deployment.

Both projects are set up with multiple subsites at the top level, each with its own domain configured in "Culture and Hostnames". Both projects use multiple languages, but in a v7 way (i.e. not making use of v8 variants). One of the projects is a baseline project (where we see this behavior on all child projects) and the other is a plain/single project with two environments. In one project we have jobs running that create content using the content service (similar to some other cases mentioned here). But in the other project we don't create content programmatically.

Rebuilding the cache fixes the issue for us. So unfortunately this is now part of our workflow after each deployment.

Shared info about this with @Jay-umbr via Umbraco Cloud support previously. If needed I can share extra information to find the root cause of this.

sbosell commented 3 years ago

@orsinic I can fix the issue by rebuilding the cache or by publishing the node that was derouted. It happens randomly for us (self hosted azure). Derouting shouldn't even be possible.

kimschurmann commented 3 years ago

Good to know there is a workaround if the issue occurs - but I don't think I will upgrade just yet if part of the deployment routine is to rebuild the cache or republish nodes after each deployment.

bergmania commented 3 years ago

For those of you with only a single hostname assigned, could you try to remove the hostname and see if the error goes away? There's not really a reason to have just one hostname with one root node. If that 'fixes' the issue, then we know it's a domain issue. That would help a lot.

Zweben commented 3 years ago

I was directed here by Umbraco support after encountering an issue with content suddenly disappearing from our production site the second time this month. The two times we had issues had similar circumstances and resolutions, but presented as different front-end problems, so I'll list them separately:

Earlier instance:

Latest instance:

Info about the site:

Errors I am seeing in the logs:

bergmania commented 3 years ago

Hi @Zweben.. Thanks for the detailed description. Do you have the stack trace from the following exception: System.Data.SqlClient.SqlException (0x80131904): Lock request time out period exceeded.

bergmania commented 3 years ago

The next time anyone sees this issue, please go to the content in the backoffice and post the response of this call here: {domain}/umbraco/backoffice/UmbracoApi/Content/GetById?id={contentId}. I don't understand how the link from @sbosell can be shown at the same time as the message..

sbosell commented 3 years ago

@bergmania

The response from the API call you asked for is attached to the ticket. I anonymized three entries in the file: owner names and the domain were changed (the domain to mydomain.com instead of the actual domain). Also attached are a screenshot of the backend and the GetById response after publishing/fixing the node.

image GetById.txt GetById-AfterFixing-ByPublishings.txt
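
To compare two saved GetById responses such as the before and after files attached above, a small diff can show which fields changed once the node was republished. This is a minimal sketch under stated assumptions: the files are assumed to contain JSON, optionally preceded by an Angular-style `)]}',` prefix, and the `urls` and `text` field names in the sample data are hypothetical rather than taken from the attachments.

```python
import json

# Assumption: backoffice responses may start with this anti-hijacking prefix.
XSSI_PREFIX = ")]}',"


def load_response(text):
    """Parse a saved response, dropping the prefix if it is present."""
    text = text.lstrip()
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)


def diff_json(before, after, path=""):
    """Yield (path, before_value, after_value) for every value that differs."""
    if isinstance(before, dict) and isinstance(after, dict):
        for key in sorted(set(before) | set(after)):
            child = f"{path}.{key}" if path else key
            # A missing key is treated the same as an explicit null.
            yield from diff_json(before.get(key), after.get(key), child)
    elif (isinstance(before, list) and isinstance(after, list)
          and len(before) == len(after)):
        for index, (b, a) in enumerate(zip(before, after)):
            yield from diff_json(b, a, f"{path}[{index}]")
    elif before != after:
        yield (path, before, after)


# Hypothetical sample data standing in for the two attached files.
broken = load_response(")]}',\n" + '{"id": 1, "name": "Locations", "urls": [{"text": "#"}]}')
fixed = load_response('{"id": 1, "name": "Locations", "urls": [{"text": "/locations/"}]}')

for change in diff_json(broken, fixed):
    print(change)
```

Lists of different lengths are reported as a single changed value rather than compared element by element, which keeps the sketch short.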

Zweben commented 3 years ago

> Hi @Zweben.. Thanks for the detailed description. Do you have the stack trace from the following exception: System.Data.SqlClient.SqlException (0x80131904): Lock request time out period exceeded.

I do not. This has only happened on our Live site recently, where debug mode is off. Is there a way I can get a stack trace without debug mode / detailed error pages enabled?

bergmania commented 3 years ago

> I do not. This has only happened on our Live site recently, where debug mode is off. Is there a way I can get a stack trace without debug mode / detailed error pages enabled?

The stack trace should still be part of the log entry in the log file..

Zweben commented 3 years ago

> > I do not. This has only happened on our Live site recently, where debug mode is off. Is there a way I can get a stack trace without debug mode / detailed error pages enabled?
>
> The stack trace should still be part of the log entry in the log file..

Got it. There are 3 instances of this near the time of the issue, and one of the traces is a little different, so here are the two versions:

Stack Trace 1.txt

Stack Trace 2.txt

bergmania commented 3 years ago

Thanks @Zweben.. Both of these seem to come from a user interaction in the backoffice where content is published. Sadly, nothing in the stack trace tells us whether it was the same page that later became unavailable.

kimschurmann commented 3 years ago

I think we just experienced this problem in 8.5.5 - we busted the caches and it worked again. We have two root nodes.

bergmania commented 3 years ago

> I think we just experienced this problem in 8.5.5 - we busted the caches and it worked again. We have two root nodes.

Interesting, it looks like this problem has been around for a very long time. Unfortunately, that does not make it any easier to find the root cause.. We are still searching for the issue, without being able to reproduce it.

kimschurmann commented 3 years ago

Honestly I don't think it's a problem related to multiple root nodes - but not sure :)

bergmania commented 3 years ago

> Honestly I don't think it's a problem related to multiple root nodes - but not sure :)

That was just one question to narrow down the issue

Shazwazza commented 3 years ago

@sbosell re: the above question from @bergmania

> For those of you with only a single hostname assigned, could you try to remove the hostname and see if the error goes away? There's not really a reason to have just one hostname with one root node. If that 'fixes' the issue, then we know it's a domain issue. That would help a lot.

Are you able to try without having a hostname assigned to your single root node? We want to see if this has anything to do with domain caches.

Shazwazza commented 3 years ago

@sbosell you mentioned in #9523 that this happens every day for you, which is interesting since for most others it is very sporadic/random. As we need to reproduce this but can't, it seems like you are best able to assist. For more logging output that may help, you can add this line to your serilog.config file:

<add key="serilog:minimum-level:override:Umbraco.Web.Routing.PublishedRouter" value="Debug" />

This will log A LOT of info, but if this happens every day for you, perhaps you can enable it for a day until the issue occurs and share the log output?

sbosell commented 3 years ago

@Shazwazza I enabled the serilog setting. As long as it doesn't negatively impact the site it isn't a problem.

sbosell commented 3 years ago

@Shazwazza Is there a private way I can share the file with the team?

bergmania commented 3 years ago

> @Shazwazza Is there a private way I can share the file with the team?

@sbosell, feel free to send a mail to me at bmb@umbraco.dk, and I will distribute it internally to those helping on this task.

orsinic commented 3 years ago

@sbosell quick question, how do you get a list of all nodes having the "This document is published but its url cannot be routed" issue?

As noted before (and as far as I know) we do not randomly get this, but only after code deployments. Being able to have a report with the impacted nodes might help digging a bit deeper or trying to get this to be reproduced.

Thanks in advance!

sbosell commented 3 years ago

@orsinic I have an uptime monitor on every page on the site but we only see the issue on one page so I immediately know when it is down. You can see that in the graphic posted above in my last message and we get notified via a Clickup Ticket (teams/slack as well) and an email.
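
The uptime-monitor approach described above can be approximated with a short script that requests a known list of page URLs and reports the ones answering with 404. This is only a sketch, not the monitoring product used here: the URLs are hypothetical placeholders, the list of pages has to be supplied by you, and a 404 shows that a page is unavailable without proving that this particular routing issue is the cause.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url; 4xx/5xx codes are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def find_not_found(urls, fetch=http_status):
    """Return the URLs that currently answer with 404."""
    return [url for url in urls if fetch(url) == 404]


# Offline demonstration with a stubbed fetch function (hypothetical URLs).
fake_statuses = {
    "https://mydomain.com/": 200,
    "https://mydomain.com/locations": 404,
}
down = find_not_found(fake_statuses, fetch=fake_statuses.get)
print(down)
```

In real use, call `find_not_found(urls)` with the default `fetch` on a schedule and forward the result to whatever alerting you use. Connection failures (`urllib.error.URLError`) are deliberately not caught here, so they surface as errors instead of being mistaken for a 404.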

@bergmania Sent you the log file.

Shazwazza commented 3 years ago

@sbosell thanks for enabling the logging and sending those logs. Here you've mentioned you use a custom UmbracoVirtualNodeRouteHandler. Based on your logs I'm guessing that the URL that stops working is based on this custom UmbracoVirtualNodeRouteHandler? If you can confirm, then this is the problem area for your specific issue and it will help with the investigation.

@Zweben here you mention you use a custom IContentFinder. Do you also have a custom URL provider? Is it possible to disable these to see if the error goes away? You also mention you use Hangfire - what exactly is it doing, and when? You've also mentioned "Uses multiple root nodes, one with hostnames assigned, three without"... In theory there should only ever be one root node without a domain, otherwise there can be problems with inbound routing and naming conflicts, which can end up causing routing errors such as these. Can you please ensure there is only one root node without a domain (alternatively, all root nodes can have domains) and see if that resolves the issue?

There are a lot of varying setups, questions and issues on this one thread, and I don't really think they are all caused by the exact same thing. There are also a few suggestions above for things to try that we haven't heard back about yet. To recap:

sbosell commented 3 years ago

@Shazwazza - The node that is being derouted does not have a custom UmbracoVirtualNodeRouteHandler, but some of its children do. For instance the route in my case that is derouted is for the /locations node and all the custom route handlers are for /locations/{id}, etc but not ever for /locations.

benbracedigital commented 3 years ago

We're getting the same issue with many pages on a site we look after. The site is running 8.9.1. We noticed this a couple of months ago and put it down to the cache needing to be rebuilt and/or the page needing to be published.

We also thought it was a conflict between controller names with doc types (even though they were surface controllers) so we changed the name of those classes but it didn't help.

This is now happening several times a day and causing big issues for our client. The site is hosted on a virtual server (not azure or cloud based) and is not load balanced.

Any update would be much appreciated.

MartinThomasCW commented 3 years ago

Same here - we're on 8.6.6. Originally on 8.6.1 we assumed we just had an issue with the indexes (fixed in either 8.6.2 or 8.6.3), but that was just masking this problem. As per benbrace, this is a real issue for our client as the area of the site that's disappearing is the section where people have paid good money to access the content as members.

It happens sporadically - sometimes multiple times daily, then nothing for a week etc. There's no real discernible pattern - it happens when the site has users and when there's no traffic. With and without content editors changing / uploading content etc.

It's a single root node site, single language, with no exotic elements at all. We did try to think laterally and work around the issue temporarily by installing Hangfire, detecting the page going down, and automating an HTTP POST to the /BackOffice/Api/NuCacheStatus/ReloadCache endpoint, but to no avail. The call returns HTTP 200, but I suspect the security context needed to authenticate the Hangfire user as admin is missing. (This issue was happening well before Hangfire was installed, by the way; we only installed it as a last-gasp attempt.)

If you need any information (nucache dumps, access to the site etc) please don't hesitate to reach out - the sooner we get this sorted the better for the whole community!

Many thanks

Zweben commented 3 years ago

> @Zweben here you mention you use a custom IContentFinder. Do you also have a custom URL provider? Is it possible to disable these to see if the error goes away? You also mention you use Hangfire - what exactly is it doing, and when? You've also mentioned "Uses multiple root nodes, one with hostnames assigned, three without"... In theory there should only ever be one root node without a domain, otherwise there can be problems with inbound routing and naming conflicts, which can end up causing routing errors such as these. Can you please ensure there is only one root node without a domain (alternatively, all root nodes can have domains) and see if that resolves the issue?

@Shazwazza: I don't remember exactly, but I think my IContentFinder doesn't have a custom URL provider that corresponds to it (it's a bit of an unusual case), but I believe I do have a custom URL provider used elsewhere. Unfortunately, I'm not able to disable either as they're both being used in production.

Hangfire is running a few tasks: importing JSON records into nodes, and sending out a few email reports. I've checked them fairly recently and they weren't hitting any errors and didn't appear to be causing any issues.

Regarding the multiple root nodes, I wasn't aware of that... there should probably be some warning in the CMS, as it's quite easy to configure things this way. I should be able to move things around so that there is only one root node without a domain assigned. Is it safe to have a second set of domainless nodes in the root if that second set is purely "settings" nodes that have no template assigned and are therefore not routed to directly?

sbosell commented 3 years ago

Does anyone know if you can inject an IAppPolicyCache into an UmbracoApiController like the following, and will clearing it cause the NuCache to rebuild?

```csharp
private readonly IAppPolicyCache _runtimeCache;

public ResyncController(IAppPolicyCache cache)
{
    _runtimeCache = cache;
    // This would be called in another method; it is only here for this post.
    _runtimeCache.Clear();
}
```

Shazwazza commented 3 years ago

> Does anyone know if you can inject an IAppPolicyCache into an UmbracoApiController like the following, and will clearing it cause the NuCache to rebuild?

No, that is not how NuCache works.

Shazwazza commented 3 years ago

> Regarding the multiple root nodes, I wasn't aware of that... there should probably be some warning in the CMS, as it's quite easy to configure things this way. I should be able to move things around so that there is only one root node without a domain assigned. Is it safe to have a second set of domainless nodes in the root if that second set is purely "settings" nodes that have no template assigned and are therefore not routed to directly?

The reason multiple root nodes without domains are not supported is that you can easily end up with ambiguous route URLs based on the names of nodes. This isn't guaranteed, but it can happen (I have seen it often), and there is no guarantee that routing will pick the 'first' ambiguous URL over the 2nd or 3rd. This has previously resulted in nodes going 'missing' until that URL cache was cleared. You can always assign dummy domains to settings root nodes, etc... This also depends on whether your settings nodes have templates and/or can be routed to, what their names are, etc... My point is that it's impossible to understand everyone's particular website, so the general rule is: do not have more than one root node without a domain = safe.
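
The ambiguity described above can be illustrated with a small standalone sketch. It is a deliberately simplified model, not Umbraco's actual routing code: it assumes top-level nodes are hidden from the path, so every root node without a domain shares the same empty URL prefix. The node names and the `settings.local` dummy domain are hypothetical.

```python
from collections import defaultdict


def collect_routes(roots):
    """Map every URL to the list of root nodes whose content claims it.

    roots is a list of (root_name, domain_or_None, children), where children
    is a nested dict of {url_segment: children}.
    """
    routes = defaultdict(list)

    def walk(prefix, owner, children):
        for segment, sub_children in children.items():
            url = f"{prefix}/{segment}"
            routes[url].append(owner)
            walk(url, owner, sub_children)

    for root_name, domain, children in roots:
        prefix = domain or ""  # assumption: domainless roots share the empty prefix
        routes[f"{prefix}/"].append(root_name)
        walk(prefix, root_name, children)
    return routes


def ambiguous_routes(roots):
    """Return only the URLs claimed by more than one root node."""
    return {url: owners
            for url, owners in collect_routes(roots).items()
            if len(owners) > 1}


two_domainless = [
    ("Home", None, {"about": {}, "locations": {}}),
    ("Settings", None, {"about": {}}),
]
with_dummy_domain = [
    ("Home", None, {"about": {}, "locations": {}}),
    ("Settings", "settings.local", {"about": {}}),
]

print(ambiguous_routes(two_domainless))
print(ambiguous_routes(with_dummy_domain))
```

In this model the two domainless roots collide on `/` and `/about`, while giving the second root a dummy domain removes every collision, which mirrors the general rule stated above.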

Shazwazza commented 3 years ago

@MartinThomasCW this point is also interesting:

> As per benbrace, this is a real issue for our client as the area of the site that's disappearing is the section where people have paid good money to access the content as members.

It seems peculiar that only this section does this. I know this thread is super long but as I've mentioned above I suspect that there can be multiple different issues here because each person's site is different. Is there something special about this section? As above, do you have custom IContentFinder, UrlProviders, UmbracoVirtualNodeRouteHandler, etc... ?