openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0
18 stars 16 forks source link

Fix HTML link rewriting and CSS link rewriting #34

Closed satyamtg closed 3 years ago

satyamtg commented 4 years ago

This fixes the following -

Additional changes -

rgaudin commented 3 years ago

Thank you. There a a couple things I don't understand:

satyamtg commented 3 years ago
  • what's the purpose of the regexp change? I'm not sure I understand what greedy means in this context. I don't understand what the ? will trigger here.

By greedy I mean, without the ?, the regexp would match the longest match. Say if we have - url("../../font.ttf") format("TrueType"), the match would return "../../font.ttf") format("TrueType". The question mark allows to match the shortest possible match. You can read more here

  • You should not have to change existing tests to add your feature. Looking at it, I see that the tests were incorrect as not passing expected Path. That should be explained in the PR.

Agrees. Updated the main comment and will take care of this in the future

  • Absolute URLs are forbidden in zim. I think those should not be written in the first time. This rewriting stuff should just take take of handling the namespace but not fixing errors in HTML. libzim will soon drop namespaces so we won't be rewritting anymore and that would be an invalid link. whats's the use case?

Okay, so rewriting root-relative URLs made the stuff in openedx a bit easier to implement. I did know that root-relative is forbidden in ZIM, but thought that since we are rewriting links anyway, it could help (didn't know libzim is about to drop namespaces). If that's the case then I think I shall handle root-relative in scrapers itself. So, if we chose to drop root-relatives, https://github.com/openzim/openedx/pull/70 wont work at the moment, unless I put another patch. What do you suggest?

rgaudin commented 3 years ago

By greedy I mean, without the ?, the regexp would match the longest match. Say if we have - url("../../font.ttf") format("TrueType"), the match would return "../../font.ttf") format("TrueType". The question mark allows to match the shortest possible match. You can read more here

Thanks for pointing that out. I did not expect that. Could you please add such a test then?

Okay, so rewriting root-relative URLs made the stuff in openedx a bit easier to implement. I did know that root-relative is forbidden in ZIM, but thought that since we are rewriting links anyway, it could help (didn't know libzim is about to drop namespaces). If that's the case then I think I shall handle root-relative in scrapers itself. So, if we chose to drop root-relatives, openzim/openedx#70 wont work at the moment, unless I put another patch. What do you suggest?

I think we should take care of this now. Where are those links coming from btw? Our templates of edx ?

https://github.com/openzim/libzim/issues/325

codecov[bot] commented 3 years ago

Codecov Report

Merging #34 into master will not change coverage. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master       #34   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           19        19           
  Lines          754       757    +3     
=========================================
+ Hits           754       757    +3     
Impacted Files Coverage Δ
src/zimscraperlib/zim/rewriting.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update a032a69...c5d7368. Read the comment docs.

satyamtg commented 3 years ago

Pushed some changes to allow external URLs and raise an exception if root-relative URLs are found.

rgaudin commented 3 years ago

Pushed some changes to allow external URLs and raise an exception if root-relative URLs are found.

Hum, I don't see the added value of this. This module's role is to rewrite URLs to handle the namespace stuff. That's it.

What you added is a validation check. While this can be handy in that it lets you know at scraping time that you have an incorrect link, I don't think it's the path to follow as it implies the scraperlib does some kind of validation while it doesn't and probably won't.

There are many things that can be checked and that's why zimcheck exists. We don't use zimcheck enough for now but we should use it more while developing scrapers ; that's why we have a ticket to integrate it with zimfarm.

I'd prefer we stick to “we rewrite what we know needs rewriting and leave the rest as is”.

Also, changelog messages should be more specific here as this project is developer-oriented and the changelog would be first stop for people with an issue to look for answers.