scrapy / scrapely

A pure-python HTML screen-scraping library

safehtml omits some important (all) attributes of tags #79

Open SirbitoX opened 8 years ago

SirbitoX commented 8 years ago

Suppose someone (like me) wants to keep an img tag, so the src attribute of that tag matters to them. But the safehtml() function omits all attributes of the tags it keeps. I think it would be better to keep the attributes of allowed_tags, or to add another parameter named allowed_attributes to specify which attributes to keep.

ruairif commented 8 years ago

Hi @SirbitoX. I was having a discussion about this last week, and we were thinking about adding a new, less strict version of safe html. The new type would sit somewhere between raw html and safe html, keeping img tags and possibly other tags too.

Other than img tags what other tags do you add? Would you mind explaining your specific use case? Are you extracting articles or products or leads?

SirbitoX commented 8 years ago

Hi @ruairif, I'm extracting articles, and I keep all the images in the description of the scraped article, so I need the src attribute, and possibly the height and width attributes, of the img tag. I will probably keep embedded videos in the description as well. But that wouldn't be an issue if we supported something like allowed_attributes.
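
To make the allowed_attributes idea concrete, here is a minimal standalone sketch using only the standard library's html.parser. It is not scrapely's safehtml() and does not reflect its actual signature: filter_html, ALLOWED_TAGS, ALLOWED_ATTRIBUTES and VOID_TAGS are hypothetical names chosen for illustration. It also makes no attempt to validate attribute values (for example a javascript: URL in src) or to purge the text inside dropped tags, so treat it as an outline of the behaviour rather than a hardened sanitizer.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical defaults for illustration only; not scrapely's configuration.
ALLOWED_TAGS = frozenset({"p", "br", "strong", "em", "img"})
ALLOWED_ATTRIBUTES = {"img": frozenset({"src", "width", "height"})}
VOID_TAGS = frozenset({"br", "img"})  # tags that take no closing tag


class _AttributeFilter(HTMLParser):
    """Re-emit allowed tags, keeping only the attributes allowed per tag."""

    def __init__(self, allowed_tags, allowed_attributes):
        super().__init__(convert_charrefs=True)
        self.allowed_tags = allowed_tags
        self.allowed_attributes = allowed_attributes
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.allowed_tags:
            return  # drop the tag itself; its text content is still kept
        keep = self.allowed_attributes.get(tag, frozenset())
        rendered = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in keep
        )
        self.parts.append("<{}{}>".format(tag, rendered))

    def handle_startendtag(self, tag, attrs):
        # Treat <img ... /> the same as <img ...>.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in self.allowed_tags and tag not in VOID_TAGS:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        # Character references were decoded by the parser; escape them again.
        self.parts.append(escape(data, quote=False))


def filter_html(html, allowed_tags=ALLOWED_TAGS,
                allowed_attributes=ALLOWED_ATTRIBUTES):
    """Return html with disallowed tags and attributes stripped."""
    parser = _AttributeFilter(allowed_tags, allowed_attributes)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```

With these defaults, an img tag keeps src, width and height but loses anything else (such as onclick), while tags outside ALLOWED_TAGS are removed and only their text remains. Embedded videos could be covered the same way by adding the relevant tag and its attributes to the two collections.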