luohaha / CSpider

A scalable and convenient crawler framework in C:).
https://github.com/luohaha/CSpider
MIT License

CSpider

A scalable and convenient crawler framework based on C:).
Documentation in Chinese.

Examples

If your program uses cspider, you are welcome to add a link to it here.

INSTALL

make
make install
gcc -o test test.c -lcspider -I /usr/include/libxml2

-lcspider links against the cspider dynamic library, and -I /usr/include/libxml2 lets the compiler find libxml2's header files.

API

Initial settings

More functions

Tools

  1. Regular expressions:

    • int regexAll(const char *regex, char *str, char **res, int num, int flag);
      regex : the regular expression to match.
      str : the source string.
      res : array that receives the matched strings.
      num : size of the res array.
      flag : either REGEX_ALL or REGEX_NO_ALL, which selects whether the whole string is returned.
      This function returns the number of matched strings.

    • int match(char *regex, char *str);
      Tests whether str matches regex. Returns 1 if it matches, 0 if it does not.

  2. Using xpath to deal with html and xml:

    • int xpath(char *xml, char *path, char **res, int num);
      xml : the document to parse.
      path : the xpath expression.
      res : array that receives the resulting strings.
      num : size of the res array.
      This function returns the number of strings stored in res.

  3. Json:

    cspider bundles cJSON, which can be used to parse JSON data. See the cJSON documentation for usage.

  4. Uriparser:

    • void joinall(char *baseuri, char **uris, int size);
      Joins every URI in uris relative to baseuri.
      baseuri : the base URI (the char *url argument of the process function).
      uris : URLs extracted with regex or xpath.
      size : length of uris.

    • char *join(char *baseuri, char *rel);
      Joins a relative string to the base string.
      baseuri : the base URI (for example http://test.com/).
      rel : the relative URL (for example /a/b, ./a/b, or ../a/b/./).
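
To make the relative forms listed for join concrete, here is a minimal, self-contained sketch of resolving a relative URL against a base. It is not cspider's implementation: resolve_uri and remove_dot_segments are hypothetical names, and the sketch assumes an absolute base such as http://test.com/ and ignores query strings, fragments, and protocol-relative references.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: remove "." and ".." segments from an absolute path,
   in place. The write position never passes the read position. */
static void remove_dot_segments(char *path) {
  char *in = path;   /* read position */
  char *out = path;  /* write position */
  while (*in != '\0') {
    if (strncmp(in, "/../", 4) == 0 || strcmp(in, "/..") == 0) {
      /* drop the last segment already written, including its '/' */
      while (out > path) {
        out--;
        if (*out == '/') break;
      }
      if (in[3] == '\0') { *out++ = '/'; break; }  /* trailing "/.." */
      in += 3;
    } else if (strncmp(in, "/./", 3) == 0) {
      in += 2;                                     /* skip "/." */
    } else if (strcmp(in, "/.") == 0) {
      *out++ = '/';                                /* trailing "/." */
      break;
    } else {
      /* copy one segment: the leading '/' and everything up to the next '/' */
      do { *out++ = *in++; } while (*in != '\0' && *in != '/');
    }
  }
  *out = '\0';
}

/* Hypothetical stand-in for join (assumption, not the cspider source).
   Returns a newly allocated string that the caller must free, or NULL. */
char *resolve_uri(const char *base, const char *rel) {
  if (strstr(rel, "://") != NULL) {        /* rel is already absolute */
    char *copy = malloc(strlen(rel) + 1);
    if (copy != NULL) strcpy(copy, rel);
    return copy;
  }
  const char *authority = strstr(base, "://");
  if (authority == NULL) return NULL;      /* base must be absolute */
  const char *path = strchr(authority + 3, '/');
  size_t origin_len = path ? (size_t)(path - base) : strlen(base);

  char *out = malloc(strlen(base) + strlen(rel) + 2);
  if (out == NULL) return NULL;
  memcpy(out, base, origin_len);           /* scheme and host */
  char *p = out + origin_len;
  if (rel[0] == '/') {
    strcpy(p, rel);                        /* rel replaces the whole path */
  } else if (path == NULL) {
    *p = '/';                              /* base has no path at all */
    strcpy(p + 1, rel);
  } else {
    /* keep the base path up to and including its last '/' */
    size_t dir_len = (size_t)(strrchr(path, '/') - path) + 1;
    memcpy(p, path, dir_len);
    strcpy(p + dir_len, rel);
  }
  remove_dot_segments(out + origin_len);
  return out;
}
```

With this sketch, resolving ./a/b against http://test.com/ gives http://test.com/a/b, and resolving ../a/b/./ gives http://test.com/a/b/. The real join may treat edge cases differently.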

After calling regexAll or xpath, use freeStrings to free the string array that they filled.
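
The contract shared by regexAll and xpath (the caller passes a fixed-size array, the function stores up to num strings and returns the count, and the caller frees them afterwards) can be sketched without cspider. The code below is an illustration only: collect_matches and free_matches are hypothetical names, and it uses POSIX regex.h as a stand-in, which is an assumption and not necessarily the engine cspider uses.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the regexAll contract: copy up to num matches of
   pattern found in str into res and return how many were stored.
   Uses POSIX extended regular expressions (an assumption for this sketch). */
int collect_matches(const char *pattern, const char *str, char **res, int num) {
  regex_t re;
  regmatch_t m;
  int count = 0;
  int flags = 0;

  if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
    return 0;  /* invalid pattern: nothing stored */
  }
  while (count < num && regexec(&re, str, 1, &m, flags) == 0) {
    size_t len = (size_t)(m.rm_eo - m.rm_so);
    if (len == 0) {
      break;  /* stop on an empty match to avoid looping forever */
    }
    char *copy = malloc(len + 1);
    if (copy == NULL) {
      break;  /* out of memory: return what has been stored so far */
    }
    memcpy(copy, str + m.rm_so, len);
    copy[len] = '\0';
    res[count++] = copy;
    str += m.rm_eo;      /* continue searching after this match */
    flags = REG_NOTBOL;  /* later searches no longer start at line start */
  }
  regfree(&re);
  return count;
}

/* Hypothetical counterpart to freeStrings: release each stored string. */
void free_matches(char **res, int count) {
  for (int i = 0; i < count; i++) {
    free(res[i]);
  }
}
```

A caller would declare an array such as char *res[4], pass it with its size, loop over the returned count, and then call the free helper once. The same shape applies to the array filled by xpath.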

Example

Print the source code of GitHub's main page.

#include <stdio.h>
#include <cspider/spider.h>
/*
    custom process function
*/
void p(cspider_t *cspider, char *d, char *url, void *user_data) {

  printf("url -> %s\n", url);
  saveString(cspider, d, LOCK);

}
/*
    custom data persistence function
*/
void s(void *str, void *user_data) {
  char *get = (char *)str;
  FILE *file = (FILE*)user_data;
  fprintf(file, "%s\n", get);
  return;
}

int main() {
  cspider_t *spider = init_cspider(); 
  char *agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:42.0) Gecko/20100101 Firefox/42.0";

  cs_setopt_url(spider, "github.com");

  cs_setopt_useragent(spider, agent);
  // register the process and save callbacks
  cs_setopt_process(spider, p, NULL);
  cs_setopt_save(spider, s, stdout);
  // set the number of download and save threads
  cs_setopt_threadnum(spider, DOWNLOAD, 2);
  cs_setopt_threadnum(spider, SAVE, 2);

  return cs_run(spider);
}