kontent-ai / delivery-sdk-net

Kontent.ai Delivery .NET SDK
https://www.nuget.org/packages/Kontent.Ai.Delivery
MIT License

Create an Asset Link URL Resolver #193

Open xantari opened 4 years ago

xantari commented 4 years ago

Motivation

Our current website serves many PDF files. They can easily be found using Google.

Such as this: https://www.google.com/search?sxsrf=ALeKk032G0vKTmeNlvlepndVXztTtzAJpg%3A1582830339114&source=hp&ei=AhNYXrCKN5nN0PEP7eyykA8&q=ARRT+rules+and+regulations&oq=ARRT+rules+and+regulations&gs_l=psy-ab.3..0.8034.11005..11150...2.0..0.209.4317.0j24j2......0....1..gws-wiz.....10..35i39j0i131j35i362i39j0i10j0i22i30j0i22i10i30.0Ws3sH2JO30&ved=0ahUKEwiwk9zAtvLnAhWZJjQIHW22DPIQ4dUDCAg&uact=5

You will see that one of the links is this: https://www.arrt.org/docs/default-source/governing-documents/arrt-rules-and-regulations.pdf?sfvrsn=3f9e02fc_42

As you can see, it is being served from our domain.

However, it seems that Kentico Kontent does not give us the flexibility to define an Asset Link URL Resolver.

We can create Content Item Resolvers, but not Asset Link Resolvers.

Instead what is being surfaced right now in the above example might be something like this:

https://preview-assets-us-01.kc-usercontent.com:443/406ac8c6-58e8-00b3-e3c1-0c312965deb2/ba8df0e1-69da-44f4-b197-9aceafb39979/arrt-rules-and-regulations.pdf

This has some negative ramifications for our business as follows:

  1. Google will drop the PDF from their index and people will no longer be able to find it from our domain.
  2. Google might index it, but users will think it comes from an unofficial source (the kc-usercontent.com domain).
  3. When we get around to building search for our website, will it find this content even though it is not served from our domain? (This is probably specific to the search engine setup.)

We need the ability to define an asset link resolver in the Kentico client builder registration.

Right now the approach we are attempting is the following:

  1. Create .NET Core middleware that intercepts the response body content and does string substitution.
            app.Use(async (context, next) =>
            {
                var body = context.Response.Body;

                using (var updatedBody = new MemoryStream())
                {
                    context.Response.Body = updatedBody;

                    await next();

                    context.Response.Body = body;

                    updatedBody.Seek(0, SeekOrigin.Begin);
                    var newContent = new StreamReader(updatedBody).ReadToEnd();

                    await context.Response.WriteAsync(
                        newContent.Replace("preview-assets-us-01.kc-usercontent.com:443", "localdevwwwcore.arrt.org/Assets"));
                }
            });

Is this the best way to do it for now?

I'm not sure about the performance of this approach. I'm also not sure whether those kc-usercontent.com hosts can change dynamically, in which case we would need to define a whole list of potential CDN servers.

Proposed solution

Introduce an Asset Link URL Resolver, much like the Content Item resolver you already have.
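To make the request more concrete, here is a rough sketch of what such a hook might look like. `IAssetLinkUrlResolver` and `OwnDomainAssetLinkResolver` are hypothetical names that do not exist in the SDK; the sketch only illustrates the idea of mapping a CDN asset URL to a URL on our own domain, and the base URL passed in is an assumption.

```csharp
using System;

// Hypothetical hook: none of these types are part of the Delivery SDK.
public interface IAssetLinkUrlResolver
{
    // Receives the CDN URL of an asset and returns the URL to render instead.
    string ResolveAssetUrl(string cdnUrl);
}

// Example implementation: keep the asset path, swap scheme and host for our own base URL.
public sealed class OwnDomainAssetLinkResolver : IAssetLinkUrlResolver
{
    private readonly string _baseUrl;

    public OwnDomainAssetLinkResolver(string baseUrl)
    {
        // Normalize so that "https://host/Assets/" and "https://host/Assets" behave the same.
        _baseUrl = baseUrl.TrimEnd('/');
    }

    public string ResolveAssetUrl(string cdnUrl)
    {
        var original = new Uri(cdnUrl);
        // AbsolutePath always starts with '/', so plain concatenation is enough.
        return _baseUrl + original.AbsolutePath;
    }
}
```

With a base URL of `https://localdevwwwcore.arrt.org/Assets`, the project and asset GUID segments of the CDN path are preserved, so a controller on our side could still reconstruct the original CDN URL from the request path.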

Alternative solutions:

  1. Specify the URL link in Kentico.Ai portal to define this information?
  2. Another alternative is to allow customers to map their own domain to your CDN servers. For instance we could map https://assets.arrt.org to resolve to https://preview-assets-us-01.kc-usercontent.com:443/406ac8c6-58e8-00b3-e3c1-0c312965deb2/ba8df0e1-69da-44f4-b197-9aceafb39979/ so that what is returned in your JSON is already pointing to the correct location

xantari commented 4 years ago

Update. I created a middleware as follows:

    /// <summary>
    /// Asset Link Middleware
    /// 2/27/2020 - MRO: Custom middleware that alters the response body to make the Kentico assets show up from our own domain.
    /// See: https://github.com/Kentico/kontent-delivery-sdk-net/issues/193
    /// https://jeremylindsayni.wordpress.com/2019/02/18/adding-middleware-to-your-net-core-mvc-pipeline-that-formats-and-indents-html-output/
    /// https://docs.microsoft.com/en-gb/aspnet/core/fundamentals/middleware/index?view=aspnetcore-2.2
    /// </summary>
    public class AssetLinkMiddleware
    {
        private readonly RequestDelegate _next;
        private readonly ILogger<AssetLinkMiddleware> _logger;

        public AssetLinkMiddleware(RequestDelegate next, ILogger<AssetLinkMiddleware> logger)
        {
            _next = next;
            _logger = logger;
        }

        public async Task InvokeAsync(HttpContext httpContext, IOptions<ProjectOptionsBase> projectOptions)
        {
            var body = httpContext.Response.Body;

            using (var updatedBody = new MemoryStream())
            {
                httpContext.Response.Body = updatedBody;

                _logger.LogInformation("here {@httpContext}", httpContext.Response.ContentType);
                await _next(httpContext);

                _logger.LogInformation("after next {@httpContext}", httpContext.Response.ContentType);
                httpContext.Response.Body = body;

                updatedBody.Seek(0, SeekOrigin.Begin);

                if (httpContext.Response.ContentType?.Contains("text/html") == true) //Only manipulate the html data coming back, ignore all other file types (such as jpg, png, etc). ContentType can be null, hence the null-conditional check.
                {
                    var newContent = await new StreamReader(updatedBody).ReadToEndAsync();

                    string assetUrl = projectOptions.Value.AssetUrl;
                    string[] kenticoCdnUrls = projectOptions.Value.KenticoAssetCDNUrls;

                    for (int i = 0; i < kenticoCdnUrls.Length; i++)
                        newContent = newContent.Replace(kenticoCdnUrls[i], assetUrl);

                    //The rewritten body can differ in length, so drop any Content-Length set further down the pipeline
                    httpContext.Response.ContentLength = null;

                    await httpContext.Response.WriteAsync(newContent);
                }
                else //Return the asset bytes (jpg, pdf, other binary content)
                {
                    await updatedBody.CopyToAsync(httpContext.Response.Body);
                }

            }
        }
    }

Registered in Startup.cs as follows. The two key points are the registration of the AssetLinkMiddleware and the Asset link controller routing.

            if (env.IsDevelopment())
            {
                app.UseDeveloperExceptionPage();
            }
            else
            {
                app.UseStatusCodePagesWithReExecute("/Error/{0}");
                app.UseExceptionHandler("/Error/500"); //Internal server error occurred
                // The default HSTS value is 30 days. You may want to change this for production scenarios, see https://aka.ms/aspnetcore-hsts.
                app.UseHsts();
            }

            // Add IIS URL Rewrite list
            // See https://docs.microsoft.com/en-us/aspnet/core/fundamentals/url-rewriting
            var options = new RewriteOptions().AddIISUrlRewrite(env.ContentRootFileProvider, "IISUrlRewrite.xml");
            app.UseRewriter(options);

            app.UseHttpsRedirection();

            // Write streamlined request completion events, instead of the more verbose ones from the framework.
            // To use the default framework request logging instead, remove this line and set the "Microsoft"
            // level in appsettings.json to "Information".
            app.UseSerilogRequestLogging();

            app.UseStaticFiles();
            app.UseRouting();

            app.UseWhen(context => context.Request.Path.StartsWithSegments("/webhooks/webhooks", StringComparison.OrdinalIgnoreCase), appBuilder =>
            {
                appBuilder.UseMiddleware<SignatureMiddleware>();
            });

            app.UseMiddleware<AssetLinkMiddleware>();

            app.UseEndpoints(endpoints =>
            {
                endpoints.MapControllerRoute(
                    name: "areas",
                    pattern: "{area:exists}/{controller=Home}/{action=Index}/{id?}");

                endpoints.MapControllerRoute(
                    name: "sitemap",
                    defaults: new { controller = "Sitemap", action = "Index" },
                    pattern: "sitemap.xml");

                endpoints.MapControllerRoute(
                    name: "preview",
                    pattern: "preview/{*urlSlug}", defaults: new { controller = "Preview", action = "Index" });

                endpoints.MapControllerRoute(
                     name: "news",
                     pattern: "news/{action=Index}/{page?}", defaults: new { controller = "News", action = "Index" });

                endpoints.MapControllerRoute(
                    name: "assets",
                    pattern: "assets/{**urlFragment}", defaults: new { controller = "Assets", action = "Index" });

                endpoints.MapControllerRoute(
                    name: "default",
                    pattern: "{controller=LandingPage}/{action=Index}/{codename?}");

                endpoints.MapDynamicControllerRoute<SitemapRouteTransformer>("pages/{**slug}");
            });

My ProjectOptionsBase:

    public class ProjectOptionsBase
    {
        .....
        .....
        public string[] KenticoAssetCDNUrls { get; set; }
        public string AssetUrl { get; set; }
    }

My appsettings.json section that maps to my ProjectOptionsBase:

  "KenticoAssetCDNUrls": [
    "preview-assets-us-01.kc-usercontent.com:443",
    "assets-us-01.kc-usercontent.com:443"
  ],
  "AssetUrl": "localdevwwwcore.arrt.org/Assets",

petrsvihlik commented 4 years ago

So let me summarize the requirements and let's see if I understand the problem correctly:

  1. You want to serve assets from your own domain (arrt.org) for SEO reasons (basically)
  2. You want a single source of truth for PDFs.
  3. You want to be able to work with the assets in Kontent - use them in rich text, etc.
  4. You want to store the PDFs in Kontent.

Solutions: You suggested creating an Asset Link Url Resolver. This is only part of the solution as it'd only fix the links in the rich text elements but the PDFs still wouldn't be served from the correct domain.

Since Kontent doesn't allow mapping custom domains to the CDN servers (which is something that you can request in the roadmap) you'd need to create a proxy/middleware that would translate arrt.org links to CDN links or that would serve the PDFs directly from the arrt.org domain (by downloading the content from the CDN first - you'd probably lose some performance advantage here). In order to create such a translation table, you'd need to identify the asset in question to get its URL. To do that, you can utilize the Management API's Asset endpoint or traverse through all content items using the items-feed endpoint. I'd go for the former as the latter is more performance demanding. The same algorithm should then be used in the Asset Link Url Resolver to create the hrefs.

It is possible to bypass all this by creating a wrapping content type called "PDF" with fields like "Name", "URL slug", and "Asset". This way, you can easily render the rich text with correct links + query PDF assets by "URL slug" (which could be extracted from a route and passed to the controller).
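A rough sketch of that idea follows, with the items held in memory rather than fetched through the Delivery API. `PdfItem` and `PdfLookup` are made-up names for illustration, and the element names simply mirror the "Name", "URL slug", and "Asset" fields mentioned above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model for a "PDF" content type with "Name", "URL slug" and "Asset" elements.
public sealed class PdfItem
{
    public string Name { get; set; }
    public string UrlSlug { get; set; }
    public string AssetUrl { get; set; } // CDN URL of the linked asset
}

// Hypothetical helper that resolves a route to the wrapped asset by its URL slug.
public sealed class PdfLookup
{
    private readonly Dictionary<string, PdfItem> _bySlug =
        new Dictionary<string, PdfItem>(StringComparer.OrdinalIgnoreCase);

    public PdfLookup(IEnumerable<PdfItem> items)
    {
        foreach (var item in items)
            _bySlug[item.UrlSlug] = item;
    }

    // Takes a request path such as "/docs/arrt-rules-and-regulations" and
    // looks up the item whose slug matches the last path segment.
    public bool TryResolve(string requestPath, out PdfItem item)
    {
        var trimmed = requestPath.TrimEnd('/');
        var slug = trimmed.Substring(trimmed.LastIndexOf('/') + 1);
        return _bySlug.TryGetValue(slug, out item);
    }
}
```

In a real application the dictionary would be replaced by a query against the content items, but the routing shape stays the same: extract the slug from the route, find the item, then serve or redirect to its asset.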

Similarly, if (4) is not true and you want to serve the PDF from where they already are you can create a custom content type called "PDF" with "Name" and "URL" elements and link them from your rich text fields.

One more question for you: do you want to preserve the current URLs of the assets? Because if your current route pattern is not deterministic/predictable this can be potentially quite complicated.

xantari commented 4 years ago

The middleware I posted has actually solved this issue for us. (FYI, I updated it above since it had a bug with binary data, made it target only text/html now and that seems to fix it).

What it does is the following:

  1. It rewrites the HttpContext.Response.Body by looking for all instances of the following:

preview-assets-us-01.kc-usercontent.com:443
assets-us-01.kc-usercontent.com:443

It then remaps those hosts to this URL:

localdevwwwcore.arrt.org/Assets

This makes the following URL:

https://preview-assets-us-01.kc-usercontent.com:443/406ac8c6-58e8-00b3-e3c1-0c312965deb2/ba8df0e1-69da-44f4-b197-9aceafb39979/arrt-rules-and-regulations.pdf

Turn into this:

https://localdevwwwcore.arrt.org/Assets/406ac8c6-58e8-00b3-e3c1-0c312965deb2/ba8df0e1-69da-44f4-b197-9aceafb39979/arrt-rules-and-regulations.pdf
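The substitution itself can be sketched as a pure function, separate from the middleware plumbing. `AssetHostRewriter` is a hypothetical helper, not the code used in the middleware; it assumes the same host list and asset URL as the settings shown earlier. One detail it guards against: `assets-us-01.kc-usercontent.com:443` is a substring of `preview-assets-us-01.kc-usercontent.com:443`, so the sketch replaces longer hosts first to stay independent of the order in configuration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating the host substitution described above.
public static class AssetHostRewriter
{
    // Replaces every configured CDN host with the local asset base,
    // e.g. "localdevwwwcore.arrt.org/Assets".
    public static string Rewrite(string html, IEnumerable<string> cdnHosts, string assetUrl)
    {
        // Longest host first, so a shorter host that is a substring of a longer one
        // cannot partially rewrite it (leaving a stray "preview-" prefix behind).
        foreach (var host in cdnHosts.OrderByDescending(h => h.Length))
        {
            html = html.Replace(host, assetUrl, StringComparison.OrdinalIgnoreCase);
        }
        return html;
    }
}
```

Because only the host portion is replaced, the rest of the path (project ID, asset ID, file name) is carried over unchanged behind the `/Assets` prefix.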

Then our AssetsController.cs:

    public class AssetsController : BaseController<AssetsController>
    {
        private readonly IOptions<ProjectOptions> _options;
        private readonly IAssetService _assetService;
        public AssetsController(IDeliveryClient deliveryClient, ILogger<AssetsController> logger, IOptions<ProjectOptions> options,
            IAssetService assetService) : base(deliveryClient, logger)
        {
            _options = options;
            _assetService = assetService;
        }

        public async Task<IActionResult> Index(string urlFragment)
        {
            //Check if the binary data is in our distributed cache based off of the unique url Id
            Logger.LogInformation("Retrieving asset URL: {urlFragment}", urlFragment);

            //TODO: Check if this data is in the distributed cache to avoid a call to Kentico Kontent
            var asset = await _assetService.GetKenticoAsset(_options.Value.KenticoAssetCDNUrls, urlFragment);

            //Bail out before touching the asset if nothing was found
            if (asset == null)
            {
                Logger.LogInformation("Could not find Kentico Asset at URL: {urlFragment}", urlFragment);
                return new NotFoundResult();
            }

            //If not in there, go fetch the binary data using an HttpClient and put into cache 
            System.IO.File.WriteAllBytes("C:\\Temp\\" + asset.FileName, asset.FileData);

            //Return the file result data
            return File(asset.FileData, asset.ContentType, asset.FileName);
        }
    }

Code of GetKenticoAsset:

    /// <summary>
    /// Kentico Asset Service
    /// Provides a way to get information about kentico assets to serve up from our own domain
    /// </summary>
    /// <remarks>
    /// 2/27/2020 - MRO: Initial creation
    /// </remarks>
    public class AssetService : IAssetService
    {
        private readonly HttpClient _httpClient;
        private readonly IMimeMappingService _mimeMappingService;

        public AssetService(HttpClient httpClient, IMimeMappingService mimeMappingService)
        {
            _httpClient = httpClient;
            _mimeMappingService = mimeMappingService;
        }

        public async Task<KenticoAsset> GetKenticoAsset(string[] cdnUrls, string urlFragement)
        {
            var asset = new KenticoAsset();

            //Build all the original possible kentico cdn Urls
            List<string> cdnFullUrls = new List<string>();
            foreach (var itm in cdnUrls)
            {
                cdnFullUrls.Add($"https://{itm}/{urlFragement}");
            }

            byte[] data = await _httpClient.GetByteArrayAsync(cdnFullUrls[0]);

            string fileName = Path.GetFileName(urlFragement);

            asset.ContentType = _mimeMappingService.Map(fileName);
            asset.FileData = data;
            asset.FileName = fileName;
            asset.KenticoUrls = cdnFullUrls.ToArray();

            return asset;
        }
    }

    public interface IAssetService
    {
        Task<KenticoAsset> GetKenticoAsset(string[] cdnUrls, string urlFragement);
    }
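Note that `GetKenticoAsset` builds a URL for every configured CDN host but only ever requests the first one, and `GetByteArrayAsync` throws on a non-success status instead of returning null. A possible variation, shown here only as a sketch, tries each candidate host in turn and returns null when none of them serves the asset. `CdnFallback` and its method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper: tries each configured CDN host until one returns the asset.
public static class CdnFallback
{
    // Builds "https://{host}/{fragment}" for every configured CDN host.
    public static List<string> BuildCandidateUrls(IEnumerable<string> cdnHosts, string urlFragment)
    {
        var fragment = urlFragment.TrimStart('/');
        var urls = new List<string>();
        foreach (var host in cdnHosts)
            urls.Add($"https://{host}/{fragment}");
        return urls;
    }

    // Returns the body of the first successful response, or null if no host has the asset.
    // Network-level failures (HttpRequestException) are not caught here and will propagate.
    public static async Task<byte[]> FetchFirstAvailableAsync(
        HttpClient httpClient, IEnumerable<string> cdnHosts, string urlFragment)
    {
        foreach (var url in BuildCandidateUrls(cdnHosts, urlFragment))
        {
            using (var response = await httpClient.GetAsync(url))
            {
                if (response.IsSuccessStatusCode)
                    return await response.Content.ReadAsByteArrayAsync();
            }
        }
        return null;
    }
}
```

Returning null for a missing asset lets the calling controller answer with a 404 rather than surfacing an exception as a 500.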

I hope what I'm asking for makes sense: either an Asset Link resolver, so that I don't have to write all this code to rewrite the HttpContext.Response.Body, or something like my alternative solution (shown below).

Here is a mockup of a proposal for Alternative Solution # 1 above (Specify the URL link in Kentico.Ai portal to define this information?)

So when the JSON comes back it's already altered to go to the correct place. We would then just go construct the CDN url and fetch it with our own HttpClient. Asset Url Rewrite is the box I mocked up in the below image :-)

[image: mockup of the proposed "Asset Url Rewrite" setting]

I really like the concept of Alternative Solution # 1, as it requires almost no changes to your APIs, just to the URLs that your API currently surfaces. I would think that route needs the fewest code changes.

xantari commented 4 years ago

BTW, I updated the code in my replies above, so please re-review.

xantari commented 4 years ago

Here is some information that might be helpful for Alternative Solution # 2 (using DNS CNAMEs to point to your Fastly CDN): https://docs.fastly.com/en/guides/working-with-domains

If you surfaced a way to map our domain to your CDN in app.kontent.ai settings page perhaps that is another way to solve this. Then we could use your CDN servers (which would be ideal) and we still would get the benefit of the content appearing to come from our domain (assets.arrt.org for example)

lukasturek commented 4 years ago

Hi, I did some investigation into how other customers build their applications with custom domains for assets. They mentioned using a CDN and setting caching rules to save some cross-site requests and traffic. At the moment, building a middleware service, as you did, would be the best way to do asset white labeling. By the way, since there are already some requests for this feature, we've added it to the "Considering" section of our roadmap. You can vote there for features that are important to you as well as submit your own ideas.

Best regards,

Lukas Turek, Product Manager

xantari commented 4 years ago

@lukasturek I tried finding that feature request on your roadmap but couldn't find it. The link appears to be broken.

Simply007 commented 4 years ago

> @lukasturek I tried finding that feature request on your roadmap but couldn't find it. The link appears to be broken.

I have adjusted the link

xantari commented 4 years ago

FYI, we had to back out the middleware we were using. I couldn't get it to work 100% of the time when it was used in conjunction with Microsoft's response caching middleware (pages would randomly come back blank) :(

petrsvihlik commented 4 years ago

So you're back to using the CDN links, right? We have no immediate plan to address this; we first need to stabilize the v14 release. I guess we can take a look at this one after that. Could you perhaps draft a PR addressing the issue here? It might speed things up.

(Sorry for my late responses... We just had a baby, so I was busy changing diapers :))

xantari commented 4 years ago

Congrats!

Yeah, we are using the CDN links now. What would you like the pull request to contain?

petrsvihlik commented 4 years ago

Thanks! :)

A PR of the suggested "Asset Link Url Resolver", if that's what would help you solve the issue (e.g. replace the current middleware). We can at least start iterating from there to see if it's the way to go.

xantari commented 4 years ago

So I played with this a bit more and I can't repro it anymore (at least locally). It may have something to do with our hardware load balancer as well. So, more investigation is needed.

petrsvihlik commented 4 years ago

OK. I'll leave this open and wait for more input.