j0k3r / graby

Graby helps you extract article content from web pages
MIT License

ContentExtractor: fix handling of data-srcset #248

Closed Kdecherf closed 3 years ago

Kdecherf commented 3 years ago

The extractor was incorrectly copying the value of data-srcset into the src attribute. However, srcset does not use the same format as src (it holds a comma-separated list of URLs with descriptors rather than a single URL), so the resulting paths were invalid and the images failed to load. Now the value of data-srcset is set on srcset, which lets the browser use the paths from either the src or the srcset attribute.
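The intended mapping can be sketched as follows. This is a minimal illustration in Python, not Graby's actual PHP code: the function name `promote_data_srcset` and the dictionary representation of an `<img>` element's attributes are assumptions made for the example.

```python
def promote_data_srcset(attrs):
    """Return a copy of attrs where data-srcset has been moved to srcset.

    The src attribute is deliberately left untouched: a srcset value is a
    list of candidates, not a single URL, so it must never be written to src.
    """
    result = dict(attrs)
    value = result.pop("data-srcset", None)
    if value is not None:
        result["srcset"] = value
    return result


# Hypothetical attribute set for a lazy-loaded image.
img = {
    "src": "placeholder.jpg",
    "data-srcset": "a-680x383.jpg 680w, a-1536x864.jpg 1536w",
}
fixed = promote_data_srcset(img)
```

After the call, `fixed` keeps the original `src` and carries the candidate list in `srcset`, so the browser can pick whichever source suits it.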

This PR may require new tests to ensure non-regression.

Here is an example of the incorrect output with https://www.numerama.com/sciences/661291-covid-19-le-modele-emmental-montre-aussi-pourquoi-la-desinformation-est-si-grave.html:

<img class="size-large wp-image-661339" alt="" src="https://www.numerama.com/wp-content/uploads/2020/11/sgr-1935-2154-sursauts-radio-rapides-1024x576.jpg%201024w,%20https://www.numerama.com/content/uploads/2020/10/gestes_barrieres_emmental-680x383.jpg%20680w,%20https://www.numerama.com/content/uploads/2020/10/gestes_barrieres_emmental-1536x864.jpg%201536w,%20https://www.numerama.com/content/uploads/2020/10/gestes_barrieres_emmental.jpg%201920w" referrerpolicy="no-referrer" width="1024" height="576">
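To see why such a src value cannot load, it helps to decode it: the `%20` sequences are encoded spaces, and the decoded string is a srcset candidate list rather than a single URL. The sketch below uses shortened, hypothetical example.com URLs in place of the real ones, and `parse_srcset` is an illustrative helper that splits naively on commas (a simplification, since URLs may themselves contain commas).

```python
from urllib.parse import unquote


def parse_srcset(value):
    """Split a srcset string into (url, descriptor) pairs.

    Simplified: assumes candidates are separated by commas and that no URL
    contains a comma or whitespace.
    """
    candidates = []
    for part in value.split(","):
        pieces = part.split()
        if not pieces:
            continue
        url = pieces[0]
        descriptor = pieces[1] if len(pieces) > 1 else ""
        candidates.append((url, descriptor))
    return candidates


# Hypothetical src value with the same shape as the broken one above.
broken_src = (
    "https://example.com/a-680x383.jpg%20680w,"
    "%20https://example.com/a-1536x864.jpg%201536w"
)
candidates = parse_srcset(unquote(broken_src))
```

Here `candidates` holds two separate image URLs with their width descriptors, which confirms that the value belongs in srcset and not in src.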

Fixes https://github.com/wallabag/wallabag/issues/4914

coveralls commented 3 years ago

Coverage increased (+0.01%) to 96.469% when pulling 20d62507b8993de7157ab537ee95b7a6a51ee3b4 on Kdecherf:fix/data-srcset into 1c581bb80076d933d184dbaf540ee63dec74501c on j0k3r:master.