Open yehosef opened 7 years ago
This could be removed by a siteconfig, but it seems that the page retrieved by graby isn't really the same as the one you see in your browser. This text also isn't wrapped in its own HTML tag, so it can't be targeted for removal. And I don't want to introduce dirty string replacement to remove such content from a page because it might lead to side effects.
I'm not saying it's trivial, but this is a real problem and it's not that hard to fix. Closing it just hides the fact that it's a problem.
You could check the text inside a "figure" tag, and if it matches basic criteria like " copyright" or " images" without other significant text, you can presumably safely remove it.
If you close issues just because you don't want to deal with them, it doesn't encourage others to fix the problem. I think it would be better to leave it open and say: "This is a problem, but I don't know how to solve it in a safe, generic way. Looking for ideas/PRs".
On Fri, Sep 1, 2017 at 11:33 AM, Jérémy Benoist notifications@github.com wrote:
[image: image] https://user-images.githubusercontent.com/62333/29961734-cd2ac4b6-8f00-11e7-8ef6-ebf75c4d9a2e.png
Well OK, I can re-open that issue, but I think no one will take a closer look at it and it'll stay open forever without anybody trying to solve that problem.
Because even the figure check you're talking about might be a bit tricky too. What if the whole content is an article about copyright madness? Then I guess most of the text included in the figure tag might be relevant and shouldn't be removed from the content. What would you do in that case?
Thanks for reopening.
You can get the text content inside any element. The suggestion I was making is that if there is no other text in the figure element, you can safely delete it. If there is a lot of other content, then you might be right to assume that the text there is significant and not delete it.
I'm not saying there aren't edge cases in the other direction. But this is a real problem (one that happens to show up in the example URL for the lib...), while those are possible problems which may or may not exist.
From https://developer.mozilla.org/en/docs/Web/HTML/Element/figure:
I would think that if the figure doesn't have a lot of text, and especially if it contains some references to copyright, then the text can be safely deleted. As a comparison, you apparently delete the figcaption even though it has real text in it (and I think that's the right decision).
You may end up deleting things that shouldn't be deleted, but I think it's more likely that the text in the figure is not useful and the overall quality will be better.
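To make the suggestion concrete, here is a rough sketch of the heuristic, written in Python with only the standard library for brevity (graby itself is PHP, so this is not a patch). It drops the text inside a figure only when that text is short and mentions copyright or images. The names `looks_like_credit`, `FigureCreditStripper`, `strip_figure_credits`, the keyword pattern, and the 8-word threshold are all made up for illustration and would need tuning against real pages.

```python
import re
from html.parser import HTMLParser

# Hypothetical criteria: both the keywords and the threshold are assumptions.
CREDIT_PATTERN = re.compile(r"\b(copyright|images?)\b", re.IGNORECASE)
MAX_CREDIT_WORDS = 8


def looks_like_credit(text):
    """True when the text is short and mentions copyright/images."""
    words = text.split()
    return 0 < len(words) <= MAX_CREDIT_WORDS and bool(CREDIT_PATTERN.search(text))


class FigureCreditStripper(HTMLParser):
    """Re-emits the HTML, dropping text inside credit-only <figure> elements.

    Comments and doctype declarations are not re-emitted in this sketch.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []      # finished output chunks
        self.buffer = []   # (is_text, chunk) pairs for the currently open <figure>
        self.depth = 0     # nesting depth of <figure>

    def _emit(self, chunk, is_text=False):
        if self.depth:
            self.buffer.append((is_text, chunk))
        else:
            self.out.append(chunk)

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self.depth += 1
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self._emit(self.get_starttag_text())

    def handle_endtag(self, tag):
        self._emit(f"</{tag}>")
        if tag == "figure" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self._flush_figure()

    def handle_data(self, data):
        self._emit(data, is_text=True)

    def handle_entityref(self, name):
        self._emit(f"&{name};", is_text=True)

    def handle_charref(self, name):
        self._emit(f"&#{name};", is_text=True)

    def _flush_figure(self):
        # Judge the figure by all of its text, so a real caption keeps it intact.
        text = " ".join(chunk for is_text, chunk in self.buffer if is_text)
        drop = looks_like_credit(text)
        self.out.extend(c for is_text, c in self.buffer if not (is_text and drop))
        self.buffer = []


def strip_figure_credits(html_text):
    parser = FigureCreditStripper()
    parser.feed(html_text)
    parser.close()
    parser.out.extend(chunk for _, chunk in parser.buffer)  # unclosed figure: keep as is
    return "".join(parser.out)
```

With these assumed thresholds, a figure containing only an image plus "Image copyright Getty Images" loses the text, while a figure with a longer sentence that happens to mention copyright is left untouched. That is the trade-off discussed above: an article about copyright could still lose a very short figure text.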
I guess this issue can be closed as there was no activity for the past 5 years
In the example url http://www.bbc.com/news/entertainment-arts-32547474
The html that is returned looks like:
In the original HTML, there are some extra tags for copyright notices that could/should be replaced
I'm not sure of generic ways to do this. Perhaps match on text in the figure tag, or perhaps on the specific text (or similar variations).