spider-rs / spider-py

Spider ported to Python
https://spider-rs.github.io/spider-py/
MIT License

Whitelisting, strange results #4

Closed: LiamK closed this issue 4 months ago

LiamK commented 4 months ago

I'm getting very strange results with the new with_whitelist_url feature. Are there any working examples? Or am I using it incorrectly? The same str argument that works for blacklisting doesn't work for whitelisting.

website = Website("https://choosealicense.com").with_whitelist_url(['/license'])
website.scrape()
website.get_pages()
[]

However, blacklisting successfully eliminates the pages with "/license" in the url and returns everything else.

website = Website("https://choosealicense.com").with_blacklist_url(['/license'])
website.scrape()
website.get_pages()
[<builtins.NPage object at 0x28d35ff30>, <builtins.NPage object at 0x28d36c030>, <builtins.NPage object at 0x28d36c0a0>, <builtins.NPage object at 0x28d36c110>, <builtins.NPage object at 0x28d36c180>, <builtins.NPage object at 0x28d36c1f0>, <builtins.NPage object at 0x28d36c260>, <builtins.NPage object at 0x28d36c2d0>]

I ran into something similar with regular expressions. Initially I was trying to pass Python regular expression objects, e.g. re.compile('/lic.*'), but it appears the argument has to be a plain string containing the regular expression? I found that confusing. How do you know whether the user is providing a literal string or a regular expression? There are characters that are valid in urls but have special meaning in a regular expression.

>>> url = 'https://choosealicense.com/licenses/'
>>> s1 = '/license?'
>>> re_s1 = re.compile(s1)
>>> s1 in url
False
>>> re_s1.findall(url)
['/license']
>>> s2 = '/licenses+'
>>> re_s2 = re.compile(s2)
>>> re_s2.findall(url)
['/licenses']
>>> s2 in url
False
>>> 

I really want the whitelisting to work; it would clean up my code a lot. But so far, it doesn't.

j-mendez commented 4 months ago

Hello, there is no url on the website with /license. There is a /licenses path. The string compiles to a regex.

Here is an example.

working example of spider-rs whitelisting
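
A minimal sketch of that kind of setup, using the same calls as above (the import path is an assumption on my part; '/licenses' is used because that path exists on the site):

from spider_rs import Website  # import path assumed for the Python bindings

# The whitelist string is compiled into an unanchored regex on the Rust side,
# so "/licenses" matches anywhere in a URL.
website = Website("https://choosealicense.com").with_whitelist_url(["/licenses"])
website.crawl()

for link in website.get_links():
    print(link)  # only URLs matching the whitelist are visited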

LiamK commented 4 months ago

First of all, my expectation was that '/license' would also match '/licenses'. If you are turning the string into a regular expression, then it should match.

It's not documented whether you're anchoring the re at the start. I've been assuming it can be found anywhere in the path part of the url.

There is a discrepancy between the behavior with scrape() and crawl(). crawl() does find some links, as you demonstrated in the example above. I had expected that scrape() would apply the same rules in selecting the pages to return. It does not. It returns an empty list.

However, it appears that crawl() misses a bunch of links that match the whitelist string. It finds these:

https://choosealicense.com - status: 200
https://choosealicense.com/licenses/ - status: 200
https://choosealicense.com/licenses/mit/ - status: 200
https://choosealicense.com/licenses/unlicense/ - status: 200

However, if you don't whitelist anything, get_links() or get_pages() return links/pages for all of the following, many of which have '/licenses/' in the url:

https://choosealicense.com - status: 200
https://choosealicense.com/licenses/mit/ - status: 200
https://choosealicense.com/licenses/ - status: 200
https://choosealicense.com/no-permission/ - status: 200
https://choosealicense.com/community/ - status: 200
https://choosealicense.com/terms-of-service/ - status: 200
https://choosealicense.com/non-software/ - status: 200
https://choosealicense.com/about/ - status: 200
https://choosealicense.com/ - status: 200
https://choosealicense.com/licenses/unlicense/ - status: 200
https://choosealicense.com/appendix/ - status: 200
https://choosealicense.com/licenses/isc/ - status: 200
https://choosealicense.com/licenses/isc - status: 200
https://choosealicense.com/appendix - status: 200
https://choosealicense.com/licenses/bsd-2-clause/ - status: 200
https://choosealicense.com/licenses/unlicense - status: 200
https://choosealicense.com/licenses/mit - status: 200
https://choosealicense.com/licenses/bsd-2-clause - status: 200
https://choosealicense.com/licenses/vim - status: 200
https://choosealicense.com/licenses/wtfpl - status: 200
https://choosealicense.com/licenses/zlib - status: 200
https://choosealicense.com/licenses/ms-pl - status: 200
https://choosealicense.com/licenses/bsd-3-clause - status: 200
https://choosealicense.com/licenses/postgresql - status: 200
https://choosealicense.com/licenses/bsd-3-clause-clear - status: 200
https://choosealicense.com/licenses/bsd-2-clause-patent - status: 200
https://choosealicense.com/licenses/ms-rl - status: 200
https://choosealicense.com/licenses/0bsd - status: 200
https://choosealicense.com/licenses/mit-0 - status: 200
https://choosealicense.com/licenses/ncsa - status: 200
https://choosealicense.com/licenses/bsd-4-clause - status: 200
https://choosealicense.com/licenses/bsd-3-clause/ - status: 200
https://choosealicense.com/licenses/ms-pl/ - status: 200

I expected scrape() with get_pages() to get all of the pages with '/licenses' in the url if the whitelist was set to ['/licenses'], but it does not.

j-mendez commented 4 months ago

Try to add the sitemap. The link has to be found on the page. Feel free to push a PR for any changes under scrape. The API needs refactoring.
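
Something like this (a sketch only; with_sitemap is the builder option in the Rust crate, and I'm assuming the Python bindings expose it under the same name):

from spider_rs import Website  # import path assumed

# Also read URLs from the sitemap, so whitelisted pages that are not
# linked from the start page can still be discovered.
website = (
    Website("https://choosealicense.com")
    .with_whitelist_url(["/licenses"])
    .with_sitemap("sitemap.xml")
)
website.crawl()
print(website.get_links())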

j-mendez commented 4 months ago

license example for the regex. By default the regex is unanchored: https://docs.rs/regex/latest/regex/struct.RegexSet.html
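
In Python terms, unanchored just means the pattern can match anywhere in the URL:

import re

url = "https://choosealicense.com/licenses/mit/"

# Unanchored: the pattern may match anywhere in the string, which is how the
# whitelist/blacklist strings behave once they are compiled into a regex set.
print(bool(re.search("/license", url)))   # True, matches inside "/licenses/"
print(bool(re.search("^/license", url)))  # False, the URL starts with "https://"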

LiamK commented 4 months ago

I'm not a Rust programmer, so my capacity to help is limited to reporting my experience using the Python bindings.

If you look at the page, the links returned are almost correct for that specific page. I say almost correct because there's one that it misses.

https://choosealicense.com/licenses/gpl-3.0/
in the html it's:
<a href="licenses/gpl-3.0/">GNU GPLv3</a>

But the main problem is that apparently it doesn't continue crawling the whitelisted urls! That would explain the discrepancy in crawl() vs scrape() behavior. Yes, it can get the links on that first page, but since it doesn't follow the /licenses/ path, it never sees the pages that are linked to, so get_pages() returns an empty list.

I think the whitelisting and blacklisting have to be approached differently.

If you're blacklisting you can examine each url as you get to it, to see if it matches and ignore it if it does. If you're whitelisting, you have to crawl the whole site first, and then only follow/scrape the ones that match. That's because there's no guarantee that the whitelisted links will be on the site's '/' page.

This is the approach that I was taking previously. First, I got all the links for the entire site, then I filtered them based on my own regular expressions, and used Page to get them individually. It works, but it seems cumbersome. [I tried the alternative of blacklisting everything that wasn't what I wanted, but that seemed like it would be hard to maintain.]
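
Roughly, that workaround looks like this (a sketch; the import path and the Page method names are assumptions and may differ slightly from the actual bindings):

import re
from spider_rs import Website, Page  # import path assumed

# 1. Crawl the whole site and collect every link.
website = Website("https://choosealicense.com")
website.crawl()
links = website.get_links()

# 2. Filter the links with my own regular expressions.
pattern = re.compile(r"/licenses/")
wanted = [link for link in links if pattern.search(link)]

# 3. Fetch each matching page individually with Page
#    (the fetch/get calls here are assumptions).
pages = []
for url in wanted:
    page = Page(url)
    page.fetch()
    pages.append(page)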

I was a little surprised that website = Website("https://choosealicense.com/licenses/") apparently didn't start by crawling that page. It would be useful to me to start on a particular page and then only crawl/scrape pages that match the whitelist regular expressions.

It seems like the string that's passed in is used in a regex, but is not converted to a regex itself.

It would be very convenient to pass in an actual regex, or to have the string converted to a Rust regex. Then you could blacklist/whitelist everything that started with some pattern, or get as granular as necessary.
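
For literal URL fragments, one way around the special-character problem would be to escape the string before handing it to the builder, so that regex metacharacters that are valid in urls get treated literally (assuming the Rust regex engine accepts the same backslash escapes that Python's re.escape produces):

import re

# A literal path fragment containing regex metacharacters ("?", "+", ".").
literal = "/licenses/gpl-3.0/?page=1+2"

# re.escape() backslash-escapes the metacharacters so the pattern matches
# the literal text only; the result could then be passed to
# with_whitelist_url / with_blacklist_url, which compile strings as regexes.
escaped = re.escape(literal)
print(escaped)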

In my case what I would like would be to start on a particular page, and only crawl/scrape pages accessible from that page that matched the whitelisted url.

website = Website('https://eventlisting.com/all_events/').with_whitelist_path(['^/event/'])
website.crawl()
website.get_links()
['https://eventlisting.com/event/event1/', 'https://eventlisting.com/event/event2/', ...]