Closed by sblondon 10 years ago
I did some refactoring and created a pylinkchecker.api module where we will put all functions that need to be backward compatible across releases. There is already one small function, crawl(url), which is called in the unit tests. My plan is to augment existing objects with methods that are easier to use through an API. Once the API is implemented, I'll generate the documentation with Sphinx.
crawl_with_options now supports the same options as the command line interface. I added a small code example in the README in the branch.
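For illustration, usage looks roughly like this (the option key and the result attribute below are placeholders; the exact shape of the returned object may still change):

```python
# Rough usage sketch. crawl() and crawl_with_options() are the functions in
# pylinkchecker.api; the option key and the attribute read at the end are
# placeholders, not the final API.
from pylinkchecker.api import crawl, crawl_with_options

# Default case: crawl one site from a single starting URL with default options.
site = crawl("http://www.example.com/")

# Advanced case: several starting URLs plus command-line-style options.
site = crawl_with_options(
    ["http://www.example.com/", "http://www.example.org/"],
    {"workers": 2})

# Placeholder attribute name: the returned object is still evolving.
print(site.error_pages)
```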
@sblondon, if that's an acceptable API to start with, I'll merge the branch into master.
Thank you for the branch! :-)
I have two remarks/questions.

First, is the crawl(url) function really necessary? It seems equivalent to:

```python
crawl_with_options([only_one_url])
```

Of course, if the first function can set options, the second one is badly named. The names could become crawl_url and crawl_urls, for example.
Second, crawl_with_options could accept explicit keyword arguments instead of an options dict, for example:

```python
def crawl_with_options(urls, test_outside=False, ignored_prefixes=[], username=None,
                       password=None, types=["a", "img", "script"], timeout=DEFAULT_TIMEOUT,
                       run_once=True, workers=1, mode=MODE_THREAD, parser=PARSER_STDLIB,
                       logger=None):
    ...
```

However, the function signature is much longer, so I'm not sure it's a real improvement.

Note that I didn't include the progress option; perhaps it could be a problem for the API.
With a merge of the two remarks, the signature and code could become:

```python
def crawl_url(url, test_outside=False, ignored_prefixes=[], username=None, password=None,
              types=["a", "img", "script"], timeout=DEFAULT_TIMEOUT, run_once=True,
              workers=1, mode=MODE_THREAD, parser=PARSER_STDLIB, logger=None):
    return crawl_urls([url], test_outside=test_outside, ignored_prefixes=ignored_prefixes,
                      username=username, password=password, types=types, timeout=timeout,
                      run_once=run_once, workers=workers, mode=mode, parser=parser,
                      logger=logger)


def crawl_urls(urls, test_outside=False, ignored_prefixes=[], username=None, password=None,
               types=["a", "img", "script"], timeout=DEFAULT_TIMEOUT, run_once=True,
               workers=1, mode=MODE_THREAD, parser=PARSER_STDLIB, logger=None):
    # build the dict from the optional parameters
    # current code
    ...
```

The example is only a draft; I didn't really test it. There also seems to be a special case with the logger in the second function, but I don't understand it clearly.
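For the "build the dict" part, I imagine something along these lines (also untested; the helper and the key names are only guesses at what the current code expects):

```python
# Untested sketch of the "build the dict from the optional parameters" step.
# _build_options is a hypothetical helper; the key names would have to match
# whatever the current crawl_with_options code expects internally.
def _build_options(test_outside, ignored_prefixes, username, password, types,
                   timeout, run_once, workers, mode, parser):
    return {
        "test_outside": test_outside,
        "ignored_prefixes": ignored_prefixes,
        "username": username,
        "password": password,
        "types": types,
        "timeout": timeout,
        "run_once": run_once,
        "workers": workers,
        "mode": mode,
        "parser": parser,
    }
```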
Off-topic: do you have a specific way to run the tests? (I used nosetests and it worked, but perhaps you do it differently.)
Hi,
The rationale behind the two functions is that the first one covers the default case: crawl one site (from one starting URL) with default options. The default options might evolve as we stabilize pylinkchecker, but the simple function makes it very easy for anybody to try pylinkchecker without bothering about the options.
Once you have more complex requirements (more than one starting URL, specifying the number of workers, etc.), then you need to use the more advanced function (crawl_with_options). The crawl function is really a shortcut.
Regarding the parameters, I should have explained the rationale as well:
Using a dict was not an easy decision, but after considering the pros and cons, we believed the dict was a better choice.
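To make the trade-off concrete, this is roughly how the two styles compare from the caller's side (the option keys and the exact call shape below are illustrative):

```python
from pylinkchecker.api import crawl_with_options

# Dict style (what the branch does): the options mirror the command line
# interface. The key names below are examples, not a specification.
options = {"workers": 4, "types": ["a", "img"], "timeout": 15}
site = crawl_with_options(["http://www.example.com/"], options)

# Keyword-argument style (the alternative proposed above) would look like:
# site = crawl_urls(["http://www.example.com/"], workers=4, types=["a", "img"], timeout=15)
```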
Regarding the tests, we created the test suite so that nosetests would work out of the box.
I understand your point and I'm ok with your choice.
What do you plan to do now? Do you want some help?
I need to fix the logger attribute (it should be a function, not an object), then I'll merge this branch into master. Other improvements (sphinx doc, programmatic progress monitor) will come later with other issues! Thanks for your input on this!
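Concretely, the change means passing something like this (the parameter name is a guess at what the merged code will end up using):

```python
import logging

from pylinkchecker.api import crawl_with_options

# Pass a function that builds a logger rather than a logger instance.
# The "logger_builder" parameter name is a guess, not the final API.
def build_logger():
    return logging.getLogger("pylinkchecker")

site = crawl_with_options(["http://www.example.com/"], logger_builder=build_logger)
```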
Now in master.
It could be nice to have a public API, so pylinkchecker could be easily used as a python module.
For example:

```python
import pylinkchecker

checker = pylinkchecker.Checker("http://website.tld")
checker.crawl()             # do requests and parse responses
checker.errors()            # get error requests (4xx, 5xx)
checker.success()           # get 2xx requests
checker.errors().to_html()  # return a string of HTML of the error requests
```
(This is just a draft.)