prestaconcept / PrestaSitemapBundle

A symfony bundle that provides tools to build a rich application sitemap. The main goals are : simple, no databases, various namespace (eg. google image), respect constraints etc.
MIT License
347 stars 100 forks source link

Wrong Domain in Index Sitemap if multiple Tenants are in use #310

Closed rliebi closed 12 months ago

rliebi commented 1 year ago

PHP version(s) affected: 8.1.20

Package version(s) affected: 3.3.1

Description
We have multiple Tenants with multiple locales.

Tenant A
    -- EN_US
    -- DE_DE
    -- ES_ES
Tenant B
    -- EN_US
    -- DE_DE
    -- ES_ES
Tenant C
    -- EN_US

To write our sitemap we iterate over our tenants and dump into a tenant specific directory. This works fine until a Tenant has no more then one locale. In that case the base_url of Tenant C is the base_url of Tenant B

How to reproduce

we did something like this:

        $tenants = ['A' => ['de_ch', 'en_us'], 'B' => ['de_ch', 'en_us'], 'C' => ['de_ch']];
        foreach ($tenants as $tenant => $locales) {
            foreach ($locales as $locale) {
                // add all relevant urls
                $this->dumper->addUrl(
                    new UrlConcrete(
                        sprintf(
                            'https://%s.%s',
                            $tenant,
                            TenantService::getActiveSiteRaw()->getMainDomain()
                        )
                    ),
                    $locale
                );

            }
            $this->dumper->dump(
                '/var/www/html/web/sitemap/'.$tenant,
                sprintf('https://%s', TenantService::getActiveSiteRaw()->getMainDomain()),
                $locale
            );
        }

Possible Solution
the error does not occur if we do a dummy dump first

            $this->dumper->dump('/tmp/dispose', sprintf('https://%s', TenantService::getActiveSiteRaw()->getMainDomain()), 'a');
yann-eugone commented 12 months ago

Hello

I did a simple test script using your script

<?php

require __DIR__.'/vendor/autoload.php';

use Presta\SitemapBundle\Service\Dumper;
use Presta\SitemapBundle\Sitemap\Url\UrlConcrete;
use Symfony\Component\EventDispatcher\EventDispatcher;
use Symfony\Component\Filesystem\Filesystem;
use Symfony\Component\Routing\Loader\ClosureLoader;
use Symfony\Component\Routing\Router;

$eventDispatcher = new EventDispatcher();
$filesystem = new Filesystem();
$router = new Router(new ClosureLoader(), null);
$dumper = new Dumper($eventDispatcher, $filesystem, 'sitemap', 5, $router);

$tenants = ['A' => ['de_ch', 'en_us'], 'B' => ['de_ch', 'en_us'], 'C' => ['de_ch']];
foreach ($tenants as $tenant => $locales) {
    foreach ($locales as $locale) {
        // add all relevant urls
        $dumper->addUrl(
            new UrlConcrete(
                sprintf(
                    'https://%s.%s',
                    $tenant,
                    'tenant-main-domain'
                )
            ),
            $locale
        );
    }
    $dumper->dump(
        __DIR__ . '/' . $tenant,
        sprintf('https://%s', 'tenant-main-domain'),
        $locale
    );
}

And all the dumped sitemaps looks OK : sitemaps.zip

Is there any chance the issue is coming from your TenantService::getActiveSiteRaw()->getMainDomain() ?

rliebi commented 12 months ago

Hey @yann-eugone Thanks for testing this!

They do not really seem to be OK, it seems the problem occurs in a way too with your code example. Have a look at the different sitemap.xml files: Tenant A:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd"
              xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>sitemap.de_ch.xml</loc>
        <lastmod>2023-06-30T13:33:40+00:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>sitemap.en_us.xml</loc>
        <lastmod>2023-06-30T13:33:40+00:00</lastmod>
    </sitemap>
</sitemapindex>

Tenant B:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd"
              xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://tenant-main-domain/sitemap.de_ch.xml</loc>
        <lastmod>2023-06-30T13:33:40+00:00</lastmod>
    </sitemap>
    <sitemap>
        <loc>https://tenant-main-domain/sitemap.en_us.xml</loc>
        <lastmod>2023-06-30T13:33:40+00:00</lastmod>
    </sitemap>
</sitemapindex>

Tenant C:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd"
              xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>https://tenant-main-domain/sitemap.de_ch.xml</loc>
        <lastmod>2023-06-30T13:33:40+00:00</lastmod>
    </sitemap>
</sitemapindex>

They are at best inconsistent and the domain does not match the domain defined. TENANT => SHOULD ;; IS A => A.tenant-main-domain ;; / B => B.tenant-main-domain ;; tenant-main-domain C => C.tenant-main-domain ;; tenant-main-domain

I just saw, that you did not include the subdomain in the dump part. If you change your line

$dumper->dump(
        __DIR__ . '/' . $tenant,
        sprintf('https://%s', 'tenant-main-domain'),
        $locale
    );

to

$dumper->dump(
                __DIR__ . '/' . $tenant,
                sprintf(
                    'https://%s.%s',
                    $tenant,
                    'tenant-main-domain'
                ),
                $locale
            );

The output is this: sitemap.zip

TENANT => SHOULD ;; IS A => A.tenant-main-domain ;; / B => B.tenant-main-domain ;; A.tenant-main-domain C => C.tenant-main-domain ;; B.tenant-main-domain

Where as in our Project I came along this output: A => A.tenant-main-domain ;; A.tenant-main-domain B => B.tenant-main-domain ;; B.tenant-main-domain C => C.tenant-main-domain ;; B.tenant-main-domain

regards Remo

yann-eugone commented 12 months ago

I think I get why we have this issue Base URL is initialized in Dumper::dump, and used when we create a new Urlset Dumper::newUrlset Because in the script we are adding URLs before the dump is called, we have no base url set when we are adding the first urls

I think you should try to add your URLs using an event listener, because in that case, dump starts first and base url will be defined.

Something like this could work (but you should pay attention to what URLs you should add : only those of the tenant that is currently being dumped)

<?php

require __DIR__ . '/vendor/autoload.php';

use Presta\SitemapBundle\Event\SitemapPopulateEvent;
use Presta\SitemapBundle\Service\Dumper;
use Presta\SitemapBundle\Sitemap\Url\UrlConcrete;
use Symfony\Component\EventDispatcher\EventDispatcher;
use Symfony\Component\Filesystem\Filesystem;
use Symfony\Component\Routing\Loader\ClosureLoader;
use Symfony\Component\Routing\Router;

$eventDispatcher = new EventDispatcher();
$filesystem = new Filesystem();
$router = new Router(new ClosureLoader(), null);
$dumper = new Dumper($eventDispatcher, $filesystem, 'sitemap', 5, $router);

$eventDispatcher->addListener(
    SitemapPopulateEvent::ON_SITEMAP_POPULATE,
    function (SitemapPopulateEvent $event) {
        $tenants = ['A' => ['de_ch', 'en_us'], 'B' => ['de_ch', 'en_us'], 'C' => ['de_ch']];
        foreach ($tenants as $tenant => $locales) {
            foreach ($locales as $locale) {
                // add all relevant urls
                $event->getUrlContainer()->addUrl(
                    new UrlConcrete(
                        sprintf(
                            'https://%s.%s',
                            $tenant,
                            'tenant-main-domain'
                        )
                    ),
                    $locale
                );
            }
        }
    }
);

$tenants = ['A', 'B', 'C'];

foreach ($tenants as $tenant) {
    $dumper->dump(
        __DIR__ . '/sitemaps/' . $tenant,
        sprintf(
            'https://%s.%s',
            $tenant,
            'tenant-main-domain'
        )
    );
}
rliebi commented 12 months ago

Thank you for your workaround. For now I'll just do a "dummy" dump before adding URLs. But the event listener approach seems cleaner. I'll keep it in mind.