scrapy / scrapely

A pure-python HTML screen-scraping library
1.86k stars 315 forks source link

safehtml: enclose table elements between table/tbody when they are not present, added tests (#24) #25

Closed kalessin closed 11 years ago

shaneaevans commented 12 years ago

It doesn't handle nesting of tables correctly and allows many cases where incorrect html is still produced from table tags.

>>> t(u'<span>pre text</span><tr>hello world</td>')
u'pre text<table><tbody><tr>hello world</tbody></table>'
>>> t(u'<table>hello world')
u'<table>hello world</table>'

It should either remove the invalid nesting, or insert the missing tags, but never produce incorrect HTML.

Is there any reason why you don't make this more generic and also use it for lists (ul, li), which have similar nesting problems.

I have some comments on the source code too, but I think it's important to get the algorithm correct first. One way to help testing would be to generate random HTML, feed it to safe_html, and check the output is always valid.

kalessin commented 12 years ago

A brief think before answering the observations, and let me know what do you think.

I think the purpose of this method is to produce safe code (that is, something that does not break the layout when rendered), not "correct" html, and we must carefully think how to exactly transform the input, because there are several ways to produce safe code from an unsafe one, and limit this kind of transformations as much as possible. For several reasons:

  1. what to produce at output can depend on user side needs (that is why i first suggested in solving this problem at client side, not safehtml, although i agree that this case should be fixed in a general way here, because it is the purpose of this method)
  2. if we will try to fix any wrong code, we will end up in a source fixing code like the one that BeautifulSoup has (may be we can use its methods instead?)
  3. do we really need that, anyway? as the purpose of safehtml is to create code that can be safely rendered, we can (and i think, we should) leave the rest of problems with the source to the browser itself, which already does that.

So we should consider the scope of this method and carefully weight which cases do we really want to fix and how, because the most complex, the less efficient and less maintainable. Consider that in AS we are surely coping with lots of cases of bad html code at output, but this is the first time that we see a problem manifested in the rendering on tables. That is because the browser makes most of the job.

Said that, the first case you show is clearly wrong, but safe. It is the kind of things that the browser can safely render because it adds by itself the lacking tags and the problematic source is enclosed among < table>< /table>. However it is an ugly output and seems to imply a very easy fix of the algorithm, so i anyway think that is worth to correctly handle that case.

In the second case, i even doubt that it is really a wrong html code, so i think that we should not worry about this one while safehtml produces a safe code. And i even think that the current output is a possible correct one. However, most probably output will be different with the fix of the first case. The problem is: should we allow to create the entire table structure for this case? if so, no problem. If not, the code will be a bit more complex (and less efficient). Here appears clearly a compromise between code efficiency loss by doing things that seems not really necessary to do,

About lists, i purposedly avoided to handle this case and other possible cases of nested structures, for the same reasons above and because we are not sure that all nested structures must be handled in the same way. This depends on user side needs. Also, the additional problem with < li> elements is that they can belong to < ul> or to < ol> lists, so, how should really be handled the case when we encounter a sole < li> without its parent? And if one of the < ul> or < ol> appears, it is clear that the algorithm will not be the same as the table one, because extra decisions has to be taken. The table is a more simple case than lists and is treated in a different way.