Open jmacgreg opened 4 years ago
Is the 5 MB limit set by Google?
Looking at the OJS sitemap code, it creates links to both landing pages and single galleys.
foreach($submissionsIterator as $submission) {
    // Abstract
    $root->appendChild($this->_createUrlTree($doc, $request->url($journal->getPath(), 'article', 'view', array($submission->getBestId()))));
    // Galley files
    $galleys = $galleyDao->getByPublicationId($submission->getCurrentPublication()->getId());
    while ($galley = $galleys->next()) {
        $root->appendChild($this->_createUrlTree($doc, $request->url($journal->getPath(), 'article', 'view', array($submission->getBestId(), $galley->getBestGalleyId()))));
    }
}
Is this what we want for OPS as well?
Regarding sitemap size: I was told by the Scholar folks in our call that the recommendation is no bigger than 5MB, but they didn't provide any citation for that. I can ask. I see lots of conflicting info online about how big they can be, but maybe the better way to think of them is how many total URLs they can include, which seems to be max 50,000 URLs/file: https://stackoverflow.com/questions/2887358/limitation-for-google-sitemap-xml-file-size.
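To make the URL-count framing concrete, here is a rough sketch of the arithmetic, assuming the 50,000 URLs/file cap mentioned above. It is plain Python for illustration only, not OJS/OPS code, and the names `sitemap_files_needed` and `chunk_urls` are hypothetical.

```python
from math import ceil

# Assumption: the per-file cap of 50,000 URLs discussed in this thread.
MAX_URLS_PER_SITEMAP = 50_000


def sitemap_files_needed(total_urls, per_file=MAX_URLS_PER_SITEMAP):
    """Return how many sitemap files are needed to list total_urls entries."""
    if total_urls < 0:
        raise ValueError("total_urls must not be negative")
    return ceil(total_urls / per_file)


def chunk_urls(urls, per_file=MAX_URLS_PER_SITEMAP):
    """Yield consecutive slices of urls, each holding at most per_file entries."""
    for start in range(0, len(urls), per_file):
        yield urls[start:start + per_file]
```

For example, a server with 120,000 preprint landing pages would need three files under this cap: two full ones and one with the remaining 20,000 URLs.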
Regarding the code: I am actually not seeing individual article or galley links in our OJS 3 demo: https://demo.publicknowledgeproject.org/ojs3/demo/index.php/manuscript/sitemap. Only an issue link. I think having a link to the article landing page is fine - a galley-level link is unnecessary.
Probably a bug in the code then. At least someone was planning to show those URLs in the sitemap. Not sure whether it is a good thing or a bad thing that the links are missing, though...
I think it would be reasonable to include them even for OJS (i.e., fix the bug, if that is what it is), but maybe hold the fix until some sort of paging solution can ship at the same time. Does that make sense as an approach?
I just noticed that the sitemap isn't listed in robots.txt and that sitemaps are generated dynamically, which led me to a couple of issues on GitHub.
Large journals might have a hard time generating them, so it definitely makes sense to add caching and pagination.
This is more an OPS issue than an OJS/OMP one, but it could be done across the board. If the application has a very large amount of content, the resulting XML sitemap may be too large (anything over 5MB is too large). The top-level sitemap can link out to child sitemaps - see https://dspace.mit.edu/sitemap for an example. This should be considered especially in the OPS context, where all articles should be listed in the sitemap.
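As a rough illustration of the parent/child layout described above, the following sketch builds a top-level sitemap index that points at paged child sitemaps. It is plain Python rather than OJS/OPS code; the function name `build_sitemap_index`, the `?page=N` URL scheme, and the 50,000-URL page size are assumptions made for the example, not an existing API.

```python
import xml.etree.ElementTree as ET

# Namespace used by sitemap and sitemap index files (sitemaps.org protocol).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(base_url, total_urls, per_file=50_000):
    """Return sitemap index XML with one <sitemap> entry per child page.

    base_url and the "?page=N" query parameter are hypothetical; a real
    application would use whatever route serves its child sitemaps.
    """
    # Ceiling division, with at least one child sitemap even when empty.
    pages = max(1, -(-total_urls // per_file))
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for page in range(1, pages + 1):
        child = ET.SubElement(root, "sitemap")
        ET.SubElement(child, "loc").text = f"{base_url}?page={page}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap_index("https://example.org/index.php/server/sitemap", 120_000))
```

Each child page would then list only its own slice of article URLs, which keeps every individual file small and also makes per-page caching straightforward.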