remove copyright notices from images (may be other similar situations)

yehosef commented 7 years ago

For the example URL http://www.bbc.com/news/entertainment-arts-32547474, the HTML that is returned looks like:

In the original HTML, there are some extra tags for copyright notices that could/should be replaced

[image: ben_e_king__r_b_legend_dies_at_76_-_bbc_news]

I'm not sure of a generic way to do this. Perhaps match on text in the figure tag, perhaps on the specific text (or similar variations).

j0k3r commented 7 years ago

This could be removed by a siteconfig, but it seems that the page retrieved by graby isn't really the same as the one you visit using your browser. This text isn't wrapped in an HTML tag of its own, so it can't be removed that way. And I don't want to introduce dirty string replacement to remove such content from a page, because it might lead to side effects.

yehosef commented 7 years ago

I'm not saying it's trivial, but this is a real problem and it's not that hard to fix. Closing the issue just hides the problem.

You could check the text inside a "figure" tag and, if it matches basic criteria like " copyright" or " images" without other significant text, you can presumably safely remove it.

If you close issues just because you don't want to deal with them, it doesn't encourage others to fix the problem. I think it would be better to leave it open and say - "This is a problem, but I don't know how to solve it in a safe, generic way - looking for ideas/PRs".

On Fri, Sep 1, 2017 at 11:33 AM, Jérémy Benoist notifications@github.com wrote:

This could be removed by a siteconfig, but it seems that the page retrieved by graby isn't really the same as the one you visit using your browser. This text isn't wrapped in an HTML tag of its own, so it can't be removed that way. And I don't want to introduce dirty string replacement to remove such content from a page, because it might lead to side effects.

[image: image] https://user-images.githubusercontent.com/62333/29961734-cd2ac4b6-8f00-11e7-8ef6-ebf75c4d9a2e.png


j0k3r commented 7 years ago

Well OK, I can re-open the issue, but I think no one will take a closer look at it and it'll stay open forever without anybody trying to solve the problem.

Even the figure check you're talking about might be a bit tricky. What if the whole content is an article about copyright madness? Then I guess most of the text included in the figure tag might be relevant and shouldn't be removed from the content. What would you do in that case?

yehosef commented 7 years ago

Thanks for reopening.

You can get the text content inside any element. The suggestion I was making is that if there is no other text in the figure element, you can safely delete it. If there is a lot of other content, then you might be right to assume that the text there is significant and not delete it.
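Here is a rough sketch of that check, written in Python with only the standard library rather than as a patch against graby. The `NOTICE` pattern, the `MAX_OTHER_WORDS` threshold, and the function name `strip_figure_notices` are all made up for illustration, and the input is assumed to be well-formed markup that `xml.etree.ElementTree` can parse, which real pages often are not.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical heuristics (assumptions, not graby settings): wording that marks
# a credit line, and how many other words a figure may hold before its text is
# treated as real content.
NOTICE = re.compile(r"\b(?:image\s+)?copyright\b|\bimages\b", re.IGNORECASE)
MAX_OTHER_WORDS = 3


def strip_figure_notices(root):
    """Clear the text of <figure> elements that hold little besides a notice.

    Returns the number of figures whose text was cleared.
    """
    cleared = 0
    for figure in root.iter("figure"):
        # Collapse all text inside the figure into one whitespace-normalized string.
        text = " ".join("".join(figure.itertext()).split())
        if not NOTICE.search(text):
            continue  # no copyright wording, leave the figure alone
        other_words = NOTICE.sub(" ", text).split()
        if len(other_words) > MAX_OTHER_WORDS:
            continue  # substantial text, e.g. an article about copyright
        for node in figure.iter():
            node.text = None
            if node is not figure:
                node.tail = None  # keep the text that follows the figure itself
        cleared += 1
    return cleared


if __name__ == "__main__":
    sample = (
        "<article>"
        "<figure><img src='a.jpg'/>Image copyright Getty Images</figure>"
        "<figure><img src='b.jpg'/>Copyright law has changed a great deal "
        "since the first statutes were drafted</figure>"
        "</article>"
    )
    tree = ET.fromstring(sample)
    print(strip_figure_notices(tree))
    print(ET.tostring(tree, encoding="unicode"))
```

The first figure keeps its image but loses the credit line, while the second is left untouched because it has too much other text. The threshold is the knob that trades missed notices against the "article about copyright" case raised above.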

I'm not saying there aren't edge cases in the other direction. But this is a real problem (one that happens to be in the example URL for the lib), whereas those are possible problems which may or may not exist.

From https://developer.mozilla.org/en/docs/Web/HTML/Element/figure:

Usually a <figure> is an image, illustration, diagram, code snippet, etc., that is referenced in the main flow of a document, but that can be moved to another part of the document or to an appendix without affecting the main flow.
- Being a sectioning root, the outline of the content of the <figure> element is excluded from the main outline of the document.

I would think that if a figure doesn't have a lot of text, and especially if it contains some references to copyright, then that text can be safely deleted. As a comparison, you apparently delete the figcaption even though it has real text in it (and I think that's the right decision).

You may end up deleting things that shouldn't be deleted, but I think it's more likely that the text in the figure is not useful and the overall quality will be better.

Kdecherf commented 2 years ago

I guess this issue can be closed, as there has been no activity for the past 5 years.

j0k3r / graby

remove copyright notices from images (may be other similar situations) #117