mtlevolio / pylinkchecker

standalone and pure python link checker and crawler that traverses a web site and reports errors

providing an API to use pylinkchecker as a python module #14

Closed sblondon closed 10 years ago

sblondon commented 10 years ago

It would be nice to have a public API so that pylinkchecker could easily be used as a Python module.

For example:

import pylinkchecker

checker = pylinkchecker.Checker("http://website.tld")
checker.crawl()              # do requests and parse responses
checker.errors()             # get error requests (4xx, 5xx)
checker.success()            # get 2xx requests
checker.errors().to_html()   # return a string of HTML for the error requests

(This is just a draft.)

bartdag commented 10 years ago

I did some refactoring and created a pylinkchecker.api module where we will put all functions that need to be backward compatible across releases. There is already one small function, crawl(url) that is called in the unit tests. My plan is to augment existing objects with methods that are easier to use with an API. Once the API is implemented, I'll generate the documentation with Sphinx.
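
For illustration, a minimal sketch of how that function might be used; the shape of the return value (and the error_pages attribute) is an assumption, not something stated in this thread:

from pylinkchecker.api import crawl

# Crawl a site from one starting URL with the default options.
crawled_site = crawl("http://www.example.com/")

# Inspect the result (attribute names here are assumptions).
print(len(crawled_site.error_pages))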

bartdag commented 10 years ago

crawl_with_options now supports the same options as the command line interface. I added a small code example in the README in the branch.
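
As a rough idea of what such a call might look like (assuming the starting URLs are passed as a list and the CLI-style options as a dict; the exact keys are assumptions):

from pylinkchecker.api import crawl_with_options

# One starting URL plus CLI-style options (option keys assumed).
crawled_site = crawl_with_options(["http://www.example.com/"], {"workers": 4})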

@sblondon , if that's an acceptable API to start with, I'll merge the branch into master.

sblondon commented 10 years ago

Thank you for the branch! :-)

I have two remarks/questions:

1. Is a separate crawl() function really needed? For the single-URL case it seems equivalent to:

crawl_with_options([only_one_url])

Of course, if the first function can set options, the second one is badly named. The names could become 'crawl_url' and 'crawl_urls', for example.

2. Could the options be explicit keyword parameters instead of an options dict? The signature would become something like:

def crawl_with_options(urls, test_outside=False, ignored_prefixes=[], username=None, password=None,
                       types=["a", "img", "script"], timeout=DEFAULT_TIMEOUT, run_once=True, workers=1,
                       mode=MODE_THREAD, parser=PARSER_STDLIB, logger=None)

However, the function signature is much longer, so I'm not sure it's a real improvement.

Note that I didn't include the progress option; perhaps it would be a problem for the API.

With a merge of the two remarks, the signature and code could become:

def crawl_url(url, test_outside=False, ignored_prefixes=[], username=None, password=None,
              types=["a", "img", "script"], timeout=DEFAULT_TIMEOUT, run_once=True, workers=1,
              mode=MODE_THREAD, parser=PARSER_STDLIB, logger=None):
    return crawl_urls([url], test_outside=test_outside, ignored_prefixes=ignored_prefixes,
                      username=username, password=password, types=types, timeout=timeout,
                      run_once=run_once, workers=workers, mode=mode, parser=parser, logger=logger)

def crawl_urls(urls, test_outside=False, ignored_prefixes=[], username=None, password=None,
               types=["a", "img", "script"], timeout=DEFAULT_TIMEOUT, run_once=True, workers=1,
               mode=MODE_THREAD, parser=PARSER_STDLIB, logger=None):
    # build the dict from the optional parameters
    # current code (previously in crawl_with_options)

The example is only a draft; I didn't really test it. It seems there is a special case with the logger in the second function, but I don't understand it clearly.

Off-topic: do you have a recommended way to run the tests? (I used nosetests and it works, but perhaps you do it another way.)

bartdag commented 10 years ago

Hi,

the rationale behind the two functions is that the first one is for the default case: crawl one site (from one starting URL) with default options. The default options might evolve as we stabilize pylinkchecker, but the goal is to make it very easy for anybody to try pylinkchecker without worrying about the options.

Once you start having more complex requirements (more than one starting URL, specifying the number of workers, etc.), then you need to use the more advanced function (crawl_with_options). The crawl function is really a shortcut.
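
For what it's worth, the shortcut relationship could look roughly like this (an assumption about the shape, not the actual source):

from pylinkchecker.api import crawl_with_options

# Illustration only: crawl() as a thin shortcut that delegates to
# crawl_with_options() with a single starting URL and default options.
def crawl(url):
    return crawl_with_options([url])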

Regarding the parameters, I should have explained the rationale as well:

  1. There are already too many options to declare them all explicitly as parameters.
  2. If an option in the option_dict is wrong, the option parser will raise an exception, so at least you get some validation.
  3. In the future, if we add more options (and we will), we won't have to modify multiple entry points (CLI + API).
  4. The API will probably have even more options than the CLI. These "extra" options will be parameters to the crawl_with_options function. For example, you can pass a logger instance if you don't like our default logging policy. You'll probably be able to pass a progress monitor object in the future as well.

Using a dict was not an easy decision, but after considering the pros and cons, we believed the dict was a better choice.
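
To make that concrete, here is a hedged sketch of the dict-based pattern (the option keys are assumptions modelled on the CLI flags, and the argument layout is assumed as well):

from pylinkchecker.api import crawl_with_options

# Regular options travel in a dict: a future option is just another key here,
# and an invalid key is rejected by the option parser instead of being silently ignored.
options = {
    "workers": 4,
    "timeout": 10,
}

crawled_site = crawl_with_options(["http://www.example.com/"], options)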

Regarding the tests, we created the test suite so that nosetests would work out of the box.

sblondon commented 10 years ago

I understand your point and I'm ok with your choice.

What do you plan to do now? Do you want some help?

bartdag commented 10 years ago

I need to fix the logger attribute (it should be a function, not an object), then I'll merge this branch into master. Other improvements (Sphinx documentation, a programmatic progress monitor) will come later in other issues! Thanks for your input on this!

bartdag commented 10 years ago

Now in @master