Closed connorskees closed 6 years ago
Thanks for fixing all that stuff :heart:
I will merge the pull request as soon as I finish evaluating and testing the code.
Meanwhile, can you please mention the changes in the `changes` variable in `photon.py`?
The `is_link()` function accepts a string as an argument, i.e. a single URL, while your `parse()` function accepts a list of URLs. What's the point of accepting a list when only a single URL is available inside `is_link()`?
There isn't really a point in having it accept a list of URLs. I originally wrote the function thinking it would be called on `storage` before the links in `storage` were crawled. I instead used it with `is_link()` and did not edit the function.
How about this? I guess this is more efficient :)
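For reference, the simplification being discussed could look something like this sketch: a predicate that takes a single URL instead of a list, so the caller never needs to wrap one URL. The function and constant names are illustrative stand-ins, not Photon's actual implementation.

```python
# Hypothetical sketch: classify a single URL instead of a list of URLs.
# BAD_EXTENSIONS and is_crawlable are illustrative names, not Photon's API.
BAD_EXTENSIONS = ('.png', '.jpg', '.css', '.js', '.pdf')

def is_crawlable(url):
    """Return True if a single URL looks worth crawling."""
    return not url.lower().endswith(BAD_EXTENSIONS)

print(is_crawlable('https://example.com/page'))      # True
print(is_crawlable('https://example.com/logo.png'))  # False
```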
Is there a reason to use a regex for finding `a` tags over bs4?
bs4 is slower.
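To illustrate the trade-off: both approaches extract the same links, but the regex avoids building a parse tree. This is a minimal sketch, using the stdlib `HTMLParser` as a stand-in for bs4; the regex shown is illustrative, not Photon's actual pattern.

```python
import re
from html.parser import HTMLParser

HTML = """<p>See <a href="https://example.com/a">A</a> and <A HREF='/b'>B</A>.</p>"""

# Regex approach: match href values inside <a> tags directly.
ANCHOR_RE = re.compile(r"""<a[^>]*href=["']?([^"'>\s]+)""", re.IGNORECASE)
regex_links = ANCHOR_RE.findall(HTML)

# Parser approach (slower, but tolerant of messy markup), with the stdlib
# HTMLParser standing in for bs4 here.
class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

parser = LinkParser()
parser.feed(HTML)

print(regex_links)   # ['https://example.com/a', '/b']
print(parser.links)  # ['https://example.com/a', '/b']
```

The regex skips tokenization and tree construction entirely, which is where the speed difference comes from, at the cost of robustness on malformed HTML.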
It's not working as intended, can you please check what's wrong?
It was adding URLs from zap. Now, doing `--exclude .*` gives
It doesn't seem to work :/
I have a 3 level deep mirror of stackoverflow on my localhost where I test Photon.
It's confusing! It's working perfectly with the real stackoverflow. Thanks for everything. I will merge it in 5.
Alright, I found the bug. It doesn't work if the website doesn't have a sitemap.xml or robots.txt
Checking every URL against the regex is a bad idea because it will reduce performance. Can't you implement it in `extractor`, where links are returned, so we can quickly return the ones which don't match?
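The suggestion above could be sketched like this: apply the exclusion pattern once, at the point where the extractor returns its batch of links, rather than testing each URL deep inside the crawl logic. `filter_links` and the sample URLs are hypothetical, not Photon's actual code.

```python
import re

def filter_links(links, exclude=None):
    """Drop links matching the user-supplied exclusion regex, if any.

    The pattern is compiled once per batch instead of per URL.
    """
    if not exclude:
        return set(links)
    pattern = re.compile(exclude)
    return {link for link in links if not pattern.search(link)}

links = {'https://example.com/keep', 'https://example.com/logout'}
print(filter_links(links, exclude=r'logout'))  # {'https://example.com/keep'}
```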
It works perfectly now. I have one more change to make, but I will make it myself because I have bothered you enough already. Thanks for everything bro :heart: You rock!
sorry, just wondering what about the excluding feature was poorly implemented (for my own reference)
I suggested that checking URLs one by one is slow, and that's why I asked you to move it from `is_link()` to `extractor`, but you still put it in a place where the URLs will be checked one by one, i.e. in a `for` loop.
I didn't want to bother you anymore, and it would have been unfair if I didn't merge your request after this much hard work, so I kept the refactor and merged it.
However, I have now realized the best place to exclude the URLs is this.
I'm really interested now: how can you exclude the URLs matching the regex without checking them all one by one?
It's totally fine bothering me :) this is a great learning experience
It's not really about matching them one by one, it's about the optimal location to do so. If you match them while our precious HTTP requests are being made, you will end up eating into their time. That's why I suggested another location in the previous comment.
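The timing argument can be illustrated with a small sketch: filter the queue before the request loop starts, so no regex work happens while responses are pending. `crawl`, `fetch`, and the sample URLs are hypothetical stand-ins for Photon's internals.

```python
import re

def crawl(queue, exclude=None, fetch=lambda url: None):
    """Fetch every allowed URL; exclusion happens before any request is made."""
    if exclude:
        pattern = re.compile(exclude)
        queue = [url for url in queue if not pattern.search(url)]
    for url in queue:  # the request loop now only sees allowed URLs
        fetch(url)
    return queue

remaining = crawl(['https://a.test/x', 'https://a.test/admin'], exclude='admin')
print(remaining)  # ['https://a.test/x']
```

Either way each URL is tested once; the difference is whether that work interleaves with network I/O or is finished up front.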
Added option to skip crawling of URLs that match a regex pattern, changed handling of seeds, and removed double spaces.
I am assuming this is what you meant in the ideas column; tell me if I'm way off though :)