openzim / zim-tools

Various ZIM command line tools
https://download.openzim.org/release/zim-tools/
GNU General Public License v3.0
118 stars 34 forks source link

Check if filename extensions are coherent with mime-type #187

Open kelson42 opened 3 years ago

kelson42 commented 3 years ago
jmontleon commented 3 years ago

Is this issue based around this code, or something different? https://github.com/openzim/zim-tools/blob/master/src/zimwriterfs/tools.cpp#L253-L259

  auto index_of_last_dot = filename.find_last_of(".");
  if (index_of_last_dot != std::string::npos) {
    mimeType = filename.substr(index_of_last_dot + 1);
    try {
      return extMimeTypes.at(mimeType);
    } catch (std::out_of_range&) {}
  }

While trying to build a new zim file of stackoverflow it looked like the mimeType list being generated was way too long and the ASSERT at the end just fails and kills the build..

There are lots of links to sites like yuml.me where people are using long URL's that include dots to produce UML diagrams. Most of the ridiculously long lines here are from yuml.me in this comment. These URL's end up serving an svg image. https://github.com/openzim/zim-tools/issues/126#issuecomment-718119293

At least two of the numeric strings are lat/long from google maps.

Many of the others are obvious typos or copy/paste errors.

Looking at the list I didn't really see anything legitimate under 3 or more than 4 characters so I patched it like so to let it build. Without a doubt that still leaves a lot of garbage, but I can at least build stackoverflow. Maybe there is a better approach.

diff --git a/src/zimwriterfs/tools.cpp b/src/zimwriterfs/tools.cpp
index 06aa8ce..df6f9b2 100644
--- a/src/zimwriterfs/tools.cpp
+++ b/src/zimwriterfs/tools.cpp
@@ -276,9 +276,11 @@ std::string getMimeTypeForFile(const std::string &directoryPath, const std::stri
   auto index_of_last_dot = filename.find_last_of(".");
   if (index_of_last_dot != std::string::npos) {
     mimeType = filename.substr(index_of_last_dot + 1);
-    try {
-      return extMimeTypes.at(mimeType);
-    } catch (std::out_of_range&) {}
+    if (mimeType.length() >2 && mimeType.length() <= 4) {
+      try {
+        return extMimeTypes.at(mimeType);
+      } catch (std::out_of_range&) {}
+    }
   }
kelson42 commented 3 years ago

This ticket is not a bug, this is a feature request. I believe your comment to be unrealted to this.

mgautierfr commented 3 years ago

The bug is more that we should not use the extension as mimetype (if we cannot found the mimetype). The code to change is more at the end of the function than checking the length of the extension.

venkatd commented 2 years ago

@jmontleon if you have a working copy of a recent stackoverflow dump, would love to get a hold of it :)