Open kelson42 opened 3 years ago
Is this issue based around this code, or something different? https://github.com/openzim/zim-tools/blob/master/src/zimwriterfs/tools.cpp#L253-L259
auto index_of_last_dot = filename.find_last_of(".");
if (index_of_last_dot != std::string::npos) {
mimeType = filename.substr(index_of_last_dot + 1);
try {
return extMimeTypes.at(mimeType);
} catch (std::out_of_range&) {}
}
While trying to build a new zim file of stackoverflow it looked like the mimeType list being generated was way too long and the ASSERT at the end just fails and kills the build..
There are lots of links to sites like yuml.me where people are using long URL's that include dots to produce UML diagrams. Most of the ridiculously long lines here are from yuml.me in this comment. These URL's end up serving an svg image. https://github.com/openzim/zim-tools/issues/126#issuecomment-718119293
At least two of the numeric strings are lat/long from google maps.
Many of the others are obvious typos or copy/paste errors.
Looking at the list I didn't really see anything legitimate under 3 or more than 4 characters so I patched it like so to let it build. Without a doubt that still leaves a lot of garbage, but I can at least build stackoverflow. Maybe there is a better approach.
diff --git a/src/zimwriterfs/tools.cpp b/src/zimwriterfs/tools.cpp
index 06aa8ce..df6f9b2 100644
--- a/src/zimwriterfs/tools.cpp
+++ b/src/zimwriterfs/tools.cpp
@@ -276,9 +276,11 @@ std::string getMimeTypeForFile(const std::string &directoryPath, const std::stri
auto index_of_last_dot = filename.find_last_of(".");
if (index_of_last_dot != std::string::npos) {
mimeType = filename.substr(index_of_last_dot + 1);
- try {
- return extMimeTypes.at(mimeType);
- } catch (std::out_of_range&) {}
+ if (mimeType.length() >2 && mimeType.length() <= 4) {
+ try {
+ return extMimeTypes.at(mimeType);
+ } catch (std::out_of_range&) {}
+ }
}
This ticket is not a bug, this is a feature request. I believe your comment to be unrealted to this.
The bug is more that we should not use the extension as mimetype (if we cannot found the mimetype). The code to change is more at the end of the function than checking the length of the extension.
@jmontleon if you have a working copy of a recent stackoverflow dump, would love to get a hold of it :)