scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Option to include all tags and attrs in LinkExtractor with specified exclusions #6321

Open User087 opened 3 weeks ago

User087 commented 3 weeks ago

Summary

Add an option to the LinkExtractor class to consider all tags and attributes (e.g. passing None for tags or attrs would mean "consider all"), plus deny_tags and deny_attrs arguments (or similar) so that you can consider all tags and attributes except those explicitly passed.
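To make the requested semantics concrete, here is a minimal stdlib-only sketch. It is not Scrapy code: extract_links, deny_tags and deny_attrs are hypothetical names taken from the proposal, and the tags/attrs defaults simply mirror LinkExtractor's documented defaults. The assumption is that None means "all" and that the deny lists are applied on top of whatever is included.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AttrCollector(HTMLParser):
    """Records (tag, attribute, value) for every attribute in the document."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value:  # skip valueless attributes such as "disabled"
                self.found.append((tag, name, value))


def extract_links(html, base_url, tags=("a", "area"), attrs=("href",),
                  deny_tags=(), deny_attrs=()):
    """Hypothetical semantics: None means "all"; deny_* lists carve out exceptions."""
    collector = _AttrCollector()
    collector.feed(html)
    collector.close()
    links = []
    for tag, name, value in collector.found:
        if tag in deny_tags or name in deny_attrs:
            continue  # explicitly excluded
        if tags is not None and tag not in tags:
            continue  # tag not in the include list
        if attrs is not None and name not in attrs:
            continue  # attribute not in the include list
        links.append(urljoin(base_url, value))
    return links


html = '<a href="/a">A</a><link href="/style.css"><img src="/logo.png">'
base = "https://example.com/"

print(extract_links(html, base))             # defaults: only <a>/<area> href
print(extract_links(html, base, tags=None))  # href on any tag
print(extract_links(html, base, tags=None, attrs=None, deny_tags=("link",)))
```

With attrs=None every attribute value is treated as a candidate link, so in practice a deny list (or a narrower attrs) would probably be needed to keep out attributes such as class or id.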

Motivation

It allows adopting a strategy of extracting all links by default and then excluding the specific tags and attributes you don't want considered. Currently, it seems the user has to work out every tag and attribute in which their desired links appear and pass them explicitly to tags and attrs to have them considered.

Describe alternatives you've considered

To include all tags, you could use the Selector class instead of LinkExtractor and select, for example, all href attributes regardless of which tag they appear in: response.xpath('//@href'). However, using Selector means losing the various convenient arguments of LinkExtractor and filtering the results manually with regexes etc. instead. It also means converting relative links into absolute links yourself whenever you want regexes that match the entire URL, whereas LinkExtractor already handles that automatically.
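For illustration, the manual work this alternative involves looks roughly like the sketch below. The hrefs list stands in for what response.xpath('//@href').getall() would return; filter_hrefs and the example URLs are made up for this comment.

```python
import re
from urllib.parse import urljoin


def filter_hrefs(hrefs, base_url, allow):
    """Resolve each href against base_url and keep URLs matching the allow regex."""
    pattern = re.compile(allow)
    absolute = (urljoin(base_url, href.strip()) for href in hrefs)
    return [url for url in absolute if pattern.search(url)]


# Stand-in for response.xpath('//@href').getall()
hrefs = ["/docs/intro", "news/today.html", "https://other.example/docs/x"]

print(filter_hrefs(hrefs, "https://example.com/section/",
                   r"^https://example\.com/docs/"))
```

Only the first href survives, because the regex is matched against the full absolute URL after resolution.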

PJ1256 commented 2 weeks ago

I would like to try and work on this if that's ok.

PredictiveManish commented 2 weeks ago

I am trying to solve this issue.

parthvichare commented 2 weeks ago

It's a great, challenging problem. I'd love to work on it.

Noman654 commented 2 weeks ago

@Laerte, I'd like to tackle this issue! As it's my first contribution to the project, any pointers to get me started would be much appreciated.

Laerte commented 2 weeks ago

Hi @Noman654, it seems that we already have an open PR for it:

Gallaecio commented 1 week ago

https://github.com/scrapy/scrapy/pull/6327 is about the first part, making None include all.

The deny part could be implemented separately by someone else, I think. There could be conflicts, but they should be easy to resolve. I do think a boolean reverse_filter parameter would be better than 2 new parameters to implement that behavior, though.
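One possible reading of the reverse_filter idea, sketched as a standalone helper rather than actual LinkExtractor code: the existing tags and attrs lists would be reinterpreted as deny lists when the flag is set. The helper name is_considered and the choice that None always means "all" are assumptions, not something settled in this thread.

```python
def is_considered(name, names, reverse_filter=False):
    """Decide whether a tag or attribute name passes the filter.

    names=None is assumed to mean "everything", regardless of reverse_filter.
    With reverse_filter=True, the names in the list are excluded instead of included.
    """
    if names is None:
        return True
    listed = name in names
    return not listed if reverse_filter else listed


print(is_considered("a", ("a", "area")))                          # include mode
print(is_considered("img", ("a", "area")))                        # not listed
print(is_considered("img", ("script",), reverse_filter=True))     # deny mode
print(is_considered("script", ("script",), reverse_filter=True))  # denied
```

A single flag keeps the signature small, but it flips both tags and attrs at once; allowing some tags while denying some attributes would still need separate parameters.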