pkp / pkp-lib

The library used by PKP's applications OJS, OMP and OPS, open source software for scholarly publishing.
https://pkp.sfu.ca
GNU General Public License v3.0

For sites with very large content, split sitemap into multiple sitemaps #5655

Open jmacgreg opened 4 years ago

jmacgreg commented 4 years ago

This is more an OPS issue than an OJS/OMP one, but it could be done across the board. If the application has a very large amount of content, the resulting XML sitemap may be too large (anything over 5MB is too large). The top-level sitemap can link out to child sitemaps - see https://dspace.mit.edu/sitemap for an example. This should be considered especially in the OPS context, where all articles should be listed in the sitemap.
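To illustrate the parent/child layout described above, here is a minimal sketch of generating a sitemap index in PHP. It assumes the `sitemapindex` format from the sitemaps.org protocol; `buildSitemapIndex()` and the example URLs are hypothetical and not part of pkp-lib.

```php
<?php
// Hypothetical helper, not part of pkp-lib: builds a sitemap index that
// points at child sitemaps instead of listing every URL itself.
function buildSitemapIndex(array $childSitemapUrls): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;

    $root = $doc->createElement('sitemapindex');
    $root->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    $doc->appendChild($root);

    foreach ($childSitemapUrls as $url) {
        $sitemap = $doc->createElement('sitemap');
        $loc = $doc->createElement('loc');
        // createTextNode escapes characters such as "&" in query strings
        $loc->appendChild($doc->createTextNode($url));
        $sitemap->appendChild($loc);
        $root->appendChild($sitemap);
    }

    return $doc->saveXML();
}

// Example call with made-up child sitemap URLs
echo buildSitemapIndex([
    'https://example.org/index.php/server/sitemap?page=1',
    'https://example.org/index.php/server/sitemap?page=2',
]);
```

Each `<loc>` in the index would point at a child sitemap that holds one slice of the article URLs.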

ajnyga commented 4 years ago

Is the 5 MB limit set by Google?

ajnyga commented 4 years ago

Looking at the OJS sitemap code, it creates links to both landing pages and single galleys.

    foreach($submissionsIterator as $submission) {
        // Abstract
        $root->appendChild($this->_createUrlTree($doc, $request->url($journal->getPath(), 'article', 'view', array($submission->getBestId()))));
        // Galley files
        $galleys = $galleyDao->getByPublicationId($submission->getCurrentPublication()->getId());
        while ($galley = $galleys->next()) {
            $root->appendChild($this->_createUrlTree($doc, $request->url($journal->getPath(), 'article', 'view', array($submission->getBestId(), $galley->getBestGalleyId()))));
        }
    }

Is this what we want for OPS as well?

jmacgreg commented 4 years ago

Regarding sitemap size: I was told by the Scholar folks in our call that the recommendation is no bigger than 5MB, but they didn't provide any citation for that. I can ask. I see lots of conflicting info online about how big they can be, but maybe the better way to think of them is how many total URLs they can include, which seems to be max 50,000 URLs/file: https://stackoverflow.com/questions/2887358/limitation-for-google-sitemap-xml-file-size.
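If the limit is treated as a URL count, the split could look something like the sketch below. It only assumes the 50,000 URLs/file figure quoted above; the constant and both function names are hypothetical, and a real implementation would more likely page the database query than chunk an in-memory array.

```php
<?php
// Hypothetical constant: the per-file URL cap mentioned above.
const MAX_URLS_PER_SITEMAP = 50000;

// Split a flat URL list into pages that each fit in one sitemap file.
function paginateSitemapUrls(array $urls, int $perPage = MAX_URLS_PER_SITEMAP): array
{
    if ($perPage < 1) {
        throw new InvalidArgumentException('perPage must be at least 1');
    }
    return array_chunk($urls, $perPage);
}

// Number of child sitemaps needed for a given URL count.
function sitemapPageCount(int $urlCount, int $perPage = MAX_URLS_PER_SITEMAP): int
{
    return (int) ceil($urlCount / $perPage);
}
```

With these assumptions, a server holding 120,001 URLs would need three child sitemaps: two full ones and a third with the remaining 20,001 entries.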

Regarding the code: I am actually not seeing individual article or galley links in our OJS 3 demo: https://demo.publicknowledgeproject.org/ojs3/demo/index.php/manuscript/sitemap. Only an issue link. I think having a link to the article landing page is fine - a galley-level link is unnecessary.

ajnyga commented 4 years ago

Probably a bug in the code then. At least someone was planning to show those URLs in the sitemap. Not sure whether it is a good thing or a bad thing that the links are missing, though...

jmacgreg commented 4 years ago

I think it would be reasonable to include them even for OJS (i.e. fix the bug, if that is what it is), but maybe don't push the fix until some sort of paging solution is in place at the same time. Does that make sense as an approach?

jonasraoni commented 5 months ago

I just noticed that the sitemaps aren't listed in robots.txt and that they are generated dynamically, which led me to a couple of issues on GitHub.

Heavy journals might have a hard time generating them, so it definitely makes sense to add caching and pagination.
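As a rough sketch of the caching side, something like the following would regenerate the XML only when the cached copy is older than a chosen lifetime. `getCachedSitemap()`, the cache file path and the one-hour TTL are all assumptions for illustration; this uses plain file functions rather than any existing PKP cache class.

```php
<?php
// Hypothetical helper: return the cached XML if it is still fresh,
// otherwise regenerate it and store the new copy.
function getCachedSitemap(string $cacheFile, int $ttlSeconds, callable $generate): string
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttlSeconds) {
        $cached = file_get_contents($cacheFile);
        if ($cached !== false) {
            return $cached;
        }
    }

    $xml = $generate();
    // LOCK_EX keeps two requests from interleaving writes to the same file
    file_put_contents($cacheFile, $xml, LOCK_EX);
    return $xml;
}

// Example call: the closure stands in for the expensive sitemap build
$xml = getCachedSitemap(
    sys_get_temp_dir() . '/sitemap-page-1.xml',
    3600,
    function () {
        return '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"/>';
    }
);
```

Combined with pagination, each child sitemap could get its own cache file, so a newly published article would only invalidate one page.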