openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Wrong mimetype for CSS file #38

Closed (satyamtg closed 4 years ago)

satyamtg commented 4 years ago

This file in the PHZH ZIM made during this run in the ZimFarm has the wrong mimetype, and hence we get an error message in the console for it:

http://localhost:5232/phzh_core-english-one_en_2020-08/-/instance_assets/lms-style-vendor.68e48093f5dd.css

The file needs to be detected as text/css to work, but it gets the mimetype text/troff instead. This comes from the magic output, which gives the following for this file:

>>> magic.detect_from_filename("lms-style-vendor.68e48093f5dd.css")
FileMagic(mime_type='text/troff', encoding='us-ascii', name='troff or preprocessor input, ASCII text, with very long lines')

rgaudin commented 4 years ago

Why are we using magic for those files? Wouldn't it be faster to use the filenames? Also, magic is notoriously poor quality with text files (no magic number obviously!)

satyamtg commented 4 years ago

That's due to these lines: https://github.com/openzim/python_scraperlib/blob/master/src/zimscraperlib/zim/filesystem.py#L66-L70

I think we should use the filename-based guess whenever the magic mime type is any text type, not only when it is text/plain.

rgaudin commented 4 years ago

Exactly, we should use if self.mime_type.startswith("text/") instead.
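
For illustration, the fallback discussed here could look roughly like the sketch below. It is not the actual code from filesystem.py: `pick_mimetype` is a hypothetical helper, the magic result is passed in as a plain string so the snippet runs without libmagic, and the standard-library `mimetypes` module stands in for whatever filename-based guess the library really uses.

```python
import mimetypes


def pick_mimetype(filename: str, magic_mime: str) -> str:
    """Hypothetical helper: prefer the filename-based guess for text files."""
    # Text files carry no magic number, so any text/* result from magic
    # (text/plain, text/troff, ...) is treated as unreliable here.
    if magic_mime.startswith("text/"):
        guessed, _encoding = mimetypes.guess_type(filename)
        # Keep the magic result when the extension is unknown.
        return guessed or magic_mime
    # Non-text results from magic are kept as-is.
    return magic_mime


# The file from this issue, with the mime type magic reported for it.
print(pick_mimetype("lms-style-vendor.68e48093f5dd.css", "text/troff"))
```

With this check the CSS file from the report would no longer be stored as text/troff, while binary files still rely on the magic result.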