openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

mailto: links being rewritten while creating ZIM #46

Closed (satyamtg closed this issue 3 years ago)

satyamtg commented 3 years ago

scraperlib accidentally rewrites mailto: links while creating ZIMs. This happens because urllib.parse.urlparse puts the e-mail address in the path component, so the link gets treated like a relative file path.

>>> import urllib.parse
>>> urllib.parse.urlparse("mailto:io.satyamtg@gmail.com?subject=haha")
ParseResult(scheme='mailto', netloc='', path='io.satyamtg@gmail.com', params='', query='subject=haha', fragment='')

This is the cause of some odd invalid links in openedx2zim, such as this one:

  The following links:
- ../../I/9a122b295d484793bbf1a33ab0217a69/digitallearning@phzh.ch?Subject=Feedback%20CORE%20English%20#01,%20v1:PHZH+W-IB+2019_E_1
(I/9a122b295d484793bbf1a33ab0217a69/digitallearning@phzh.ch) were not found in article A/9a122b295d484793bbf1a33ab0217a69/index.html

rgaudin commented 3 years ago

OK, we shall exclude links starting with mailto: then.

kelson42 commented 3 years ago

Just in case: tel:, geo: and of course data: links should not be rewritten either (in case this is not handled yet).
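
For illustration, the exclusion discussed in this thread could look roughly like the sketch below. This is not scraperlib's actual code: the names NON_REWRITABLE_SCHEMES and should_rewrite are hypothetical, and the scheme list is only the one mentioned in the comments above. It relies on urlparse reporting the scheme separately from the path, so a link can be skipped before its path is ever looked at.

```python
from urllib.parse import urlparse

# Hypothetical set (an assumption, not taken from scraperlib): schemes whose
# "path" is not a file path, so the link must be left untouched.
NON_REWRITABLE_SCHEMES = {"mailto", "tel", "geo", "data"}


def should_rewrite(link: str) -> bool:
    """Return False when the link's scheme marks it as non-rewritable."""
    # urlparse lowercases the scheme; .lower() is kept only for clarity.
    scheme = urlparse(link.strip()).scheme.lower()
    return scheme not in NON_REWRITABLE_SCHEMES


if __name__ == "__main__":
    for link in (
        "mailto:digitallearning@phzh.ch?Subject=Feedback",
        "tel:+41441234567",
        "../../I/9a122b295d484793bbf1a33ab0217a69/style.css",
    ):
        print(link, "->", "rewrite" if should_rewrite(link) else "leave as-is")
```

A caller would run this check first and only pass links for which it returns True to the path-rewriting step. Whether absolute http(s) links should also be skipped is a separate decision that this sketch does not make.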