satyamtg commented 4 years ago

This fixes the following -

~~Proper rewriting of root-relative links~~ - Dropped as its forbidden in ZIM and shall be handled by scrapers given that we plan to get rid of namespaces
CSS URLs are now found in a non greedy manner

Additional changes -

Changed str to Path in test_fix_target_for() as the function it tests expects a Path

rgaudin commented 3 years ago

Thank you. There a a couple things I don't understand:

what's the purpose of the regexp change? I'm not sure I understand what greedy means in this context. I don't understand what the ? will trigger here.
Absolute URLs are forbidden in zim. I think those should not be written in the first time. This rewriting stuff should just take take of handling the namespace but not fixing errors in HTML. libzim will soon drop namespaces so we won't be rewritting anymore and that would be an invalid link. whats's the use case?
You should not have to change existing tests to add your feature. Looking at it, I see that the tests were incorrect as not passing expected Path. That should be explained in the PR.

satyamtg commented 3 years ago

what's the purpose of the regexp change? I'm not sure I understand what greedy means in this context. I don't understand what the ? will trigger here.

By greedy I mean, without the ?, the regexp would match the longest match. Say if we have - url("../../font.ttf") format("TrueType"), the match would return "../../font.ttf") format("TrueType". The question mark allows to match the shortest possible match. You can read more here

You should not have to change existing tests to add your feature. Looking at it, I see that the tests were incorrect as not passing expected Path. That should be explained in the PR.

Agrees. Updated the main comment and will take care of this in the future

Absolute URLs are forbidden in zim. I think those should not be written in the first time. This rewriting stuff should just take take of handling the namespace but not fixing errors in HTML. libzim will soon drop namespaces so we won't be rewritting anymore and that would be an invalid link. whats's the use case?

Okay, so rewriting root-relative URLs made the stuff in openedx a bit easier to implement. I did know that root-relative is forbidden in ZIM, but thought that since we are rewriting links anyway, it could help (didn't know libzim is about to drop namespaces). If that's the case then I think I shall handle root-relative in scrapers itself. So, if we chose to drop root-relatives, https://github.com/openzim/openedx/pull/70 wont work at the moment, unless I put another patch. What do you suggest?

rgaudin commented 3 years ago

By greedy I mean, without the ?, the regexp would match the longest match. Say if we have - url("../../font.ttf") format("TrueType"), the match would return "../../font.ttf") format("TrueType". The question mark allows to match the shortest possible match. You can read more here

Thanks for pointing that out. I did not expect that. Could you please add such a test then?

Okay, so rewriting root-relative URLs made the stuff in openedx a bit easier to implement. I did know that root-relative is forbidden in ZIM, but thought that since we are rewriting links anyway, it could help (didn't know libzim is about to drop namespaces). If that's the case then I think I shall handle root-relative in scrapers itself. So, if we chose to drop root-relatives, openzim/openedx#70 wont work at the moment, unless I put another patch. What do you suggest?

I think we should take care of this now. Where are those links coming from btw? Our templates of edx ?

https://github.com/openzim/libzim/issues/325

codecov[bot] commented 3 years ago

Codecov Report

Merging #34 into master will not change coverage. The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master       #34   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           19        19           
  Lines          754       757    +3     
=========================================
+ Hits           754       757    +3

Impacted Files	Coverage Δ
src/zimscraperlib/zim/rewriting.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update a032a69...c5d7368. Read the comment docs.

satyamtg commented 3 years ago

Pushed some changes to allow external URLs and raise an exception if root-relative URLs are found.

rgaudin commented 3 years ago

Pushed some changes to allow external URLs and raise an exception if root-relative URLs are found.

Hum, I don't see the added value of this. This module's role is to rewrite URLs to handle the namespace stuff. That's it.

What you added is a validation check. While this can be handy in that it lets you know at scraping time that you have an incorrect link, I don't think it's the path to follow as it implies the scraperlib does some kind of validation while it doesn't and probably won't.

There are many things that can be checked and that's why zimcheck exists. We don't use zimcheck enough for now but we should use it more while developing scrapers ; that's why we have a ticket to integrate it with zimfarm.

I'd prefer we stick to “we rewrite what we know needs rewriting and leave the rest as is”.

Also, changelog messages should be more specific here as this project is developer-oriented and the changelog would be first stop for people with an issue to look for answers.

openzim / python-scraperlib

Fix HTML link rewriting and CSS link rewriting #34

Codecov Report