prestaconcept / PrestaSitemapBundle

A Symfony bundle that provides tools to build a rich application sitemap. The main goals are: simple, no databases, various namespaces (e.g. google image), respect constraints etc.
MIT License

Multi-domain errors cause sitemapindex XML confusion #305

Open NiklasBr opened 1 year ago

NiklasBr commented 1 year ago

PHP version(s) affected: 8.1.13

Package version(s) affected: 3.3.0

Description
In a Symfony 5.4-based application, multiple sites with separate domains (1.example.com, 2.example.com and 3.example.com in the commands below) share a single /public directory.

For each of these sites we run the following command (manually or via cron):

bin/console presta:sitemaps:dump --section site_1 --base-url https://1.example.com/ var/tmp/sitemaps
bin/console presta:sitemaps:dump --section site_2 --base-url https://2.example.com/ var/tmp/sitemaps
bin/console presta:sitemaps:dump --section site_3 --base-url https://3.example.com/ var/tmp/sitemaps

After the first command (--section site_1) has completed, the index XML is updated as expected:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://1.example.com/sitemap.site_1.xml</loc>
    <lastmod>2023-01-18T16:20:49+01:00</lastmod>
  </sitemap>
</sitemapindex>

After the second command (--section site_2) has completed, every domain in the index XML file changes. The urlset at https://2.example.com/sitemap.site_2.xml is correct and has the right base URL for all locations, but the index now uses the site_2 base URL for every entry, including the one for site_1:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://2.example.com/sitemap.site_2.xml</loc>
    <lastmod>2023-01-18T16:23:22+01:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://2.example.com/sitemap.site_1.xml</loc>
    <lastmod>2023-01-18T16:20:49+01:00</lastmod>
  </sitemap>
</sitemapindex>

After the third command (--section site_3) has completed, the same thing happens again. The urlset at https://3.example.com/sitemap.site_3.xml is correct and has the right base URL for all locations, but the index again rewrites every entry to the most recent base URL:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://3.example.com/sitemap.site_3.xml</loc>
    <lastmod>2023-01-18T16:27:28+01:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://3.example.com/sitemap.site_2.xml</loc>
    <lastmod>2023-01-18T16:23:22+01:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://3.example.com/sitemap.site_1.xml</loc>
    <lastmod>2023-01-18T16:20:49+01:00</lastmod>
  </sitemap>
</sitemapindex>

Now to where the error occurs: when the commands are run again, e.g. the next day to periodically regenerate the files, the new entry is added on top of the previous ones instead of replacing the existing entry for that section, so sitemap.site_1.xml is listed twice:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://1.example.com/sitemap.site_1.xml</loc>
    <lastmod>2023-01-18T16:33:48+01:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://1.example.com/sitemap.site_3.xml</loc>
    <lastmod>2023-01-18T16:27:28+01:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://1.example.com/sitemap.site_2.xml</loc>
    <lastmod>2023-01-18T16:23:22+01:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://1.example.com/sitemap.site_1.xml</loc>
    <lastmod>2023-01-18T16:20:49+01:00</lastmod>
  </sitemap>
</sitemapindex>
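
To check an existing index file for this symptom, something like the following Python sketch could be used. It is not part of the bundle; the function names are made up for illustration, and it assumes that file names follow the sitemap.<section>.xml pattern seen in the output above.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def index_entries(xml_text):
    """Return a (host, section) tuple for every <loc> in a sitemap index."""
    # Encode to bytes so the XML declaration's encoding is honoured.
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for loc in root.findall("sm:sitemap/sm:loc", NS):
        url = urlparse(loc.text.strip())
        # Assumption: file names look like sitemap.<section>.xml
        filename = url.path.rsplit("/", 1)[-1]
        section = filename[len("sitemap."):-len(".xml")]
        entries.append((url.netloc, section))
    return entries


def duplicated_sections(entries):
    """Return the sections that appear more than once in the index."""
    counts = Counter(section for _, section in entries)
    return sorted(name for name, n in counts.items() if n > 1)


# Shortened copy of the index after the second run of the site_1 command.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://1.example.com/sitemap.site_1.xml</loc></sitemap>
  <sitemap><loc>https://1.example.com/sitemap.site_3.xml</loc></sitemap>
  <sitemap><loc>https://1.example.com/sitemap.site_2.xml</loc></sitemap>
  <sitemap><loc>https://1.example.com/sitemap.site_1.xml</loc></sitemap>
</sitemapindex>"""

entries = index_entries(SAMPLE)
print(duplicated_sections(entries))         # ['site_1']
print(sorted({host for host, _ in entries}))  # ['1.example.com']
```

A single host across all sections, together with a repeated section, matches both halves of the problem described here.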

How to reproduce
I think the full description above should do it.

Possible Solution
Maybe tag each <sitemap> in the index XML with its section, such as <sitemap id="site_1">, and use that to decide whether to update an existing entry or add a new one to the file?
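
A minimal sketch of that idea, in Python rather than the bundle's PHP and not based on the bundle's actual dumper code: entries are keyed by a hypothetical id attribute, an existing entry for the section is replaced, and all other entries (including their URLs) are left alone. The function name upsert_section is invented for this example, and whether an extra attribute is acceptable under the sitemap index schema is not checked here.

```python
import xml.etree.ElementTree as ET

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Serialize with a default namespace instead of an ns0: prefix.
ET.register_namespace("", SM_NS)


def upsert_section(index_xml, section, loc, lastmod):
    """Replace the <sitemap id=section> entry if present, else append it.

    Entries that belong to other sections are not modified.
    """
    root = ET.fromstring(index_xml.encode("utf-8"))
    # findall() returns a list, so removing nodes while looping is safe.
    for node in root.findall(f"{{{SM_NS}}}sitemap"):
        if node.get("id") == section:
            root.remove(node)
    entry = ET.SubElement(root, f"{{{SM_NS}}}sitemap", {"id": section})
    ET.SubElement(entry, f"{{{SM_NS}}}loc").text = loc
    ET.SubElement(entry, f"{{{SM_NS}}}lastmod").text = lastmod
    return ET.tostring(root, encoding="unicode")


index = f'<sitemapindex xmlns="{SM_NS}"></sitemapindex>'
index = upsert_section(index, "site_1",
                       "https://1.example.com/sitemap.site_1.xml",
                       "2023-01-18T16:20:49+01:00")
index = upsert_section(index, "site_2",
                       "https://2.example.com/sitemap.site_2.xml",
                       "2023-01-18T16:23:22+01:00")
# Regenerating site_1 later replaces its entry instead of adding a second one.
index = upsert_section(index, "site_1",
                       "https://1.example.com/sitemap.site_1.xml",
                       "2023-01-18T16:33:48+01:00")
print(index)
```

With this approach each section keeps the base URL it was dumped with, and a repeated run only refreshes that section's lastmod.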

Additional Context
n/a