istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

maxdepth cannot be larger than 2 #241

Closed anthony9981 closed 3 years ago

anthony9981 commented 3 years ago

Hi, when I try to feed a URL with this: curl localhost:5343/feed -H "Content-Type: application/json" -d '{"url": "https://domain.com", "appid": "gd2", "crawlid": "gdcrawl2", "maxdepth": 5, "allowed_domains": ["domain.com"]}'

I always got: {"message": "Did not find schema to validate request", "parsed": true, "valid": false, "logger": "kafka-monitor", "timestamp": "2020-12-18T03:06:51.564078Z", "data": {"url":...

But when I decreased maxdepth to 2, the crawler worked.

So what exactly is maxdepth for? As I understand it, maxdepth=0 means only the current page is fetched, and I guess that is the default value. What do maxdepth=1, maxdepth=2, and so on actually mean?

mrasoolmirzaei commented 3 years ago

By default, scrapy_cluster won't crawl websites with a maxdepth larger than 3. You need to change the schema first. To do this, log in to your kafka_monitor container:

1. docker exec -it container_id bash
2. cd plugins
3. Edit scraper_schema.json (change the max value for maxdepth from 3 to anything you want)

From this point on, you can crawl websites with any maxdepth between the schema's minimum value and the new maximum you just set.
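
For reference, here is a sketch of what the maxdepth entry in scraper_schema.json typically looks like (field names follow standard JSON Schema; the exact defaults and limits may differ in your version, so check the file itself):

```json
{
    "maxdepth": {
        "type": "integer",
        "minimum": 0,
        "default": 0,
        "maximum": 3
    }
}
```

Raising "maximum" here (e.g. to 10) is what lets a request like the maxdepth: 5 example above pass validation; you may need to restart the kafka-monitor process for the change to take effect.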

anthony9981 commented 3 years ago

Hi @NeoArio, thanks for your reply. Your answer helps me a lot. I have a question, if you don't mind:

I have a set of known websites, and I need to crawl the title and content with an exact selector for each of them. Q1: How can I predefine a CSS selector for each site and then feed the monitor only the domain? Q2: And where can I pick up the scraped items to store them in a database like Elasticsearch? I tried pipelines (scrapy-elasticsearch), but that sends a ton of additional requests to the ES server.

Sorry, I'm new to Scrapy. This is awesome! Best regards,

mrasoolmirzaei commented 3 years ago

Hi! I hope you enjoy scraping :D

Q1: I keep a separate database that stores each website's CSS and XPath patterns. I'm not sure whether what you want to do is really applicable, though.

Q2: Scraped items are pushed to the demo.crawled_firehose topic: https://scrapy-cluster.readthedocs.io/en/latest/topics/kafka-monitor/api.html#kafka-topics. Write code that consumes from this topic, do whatever you want with the data, and finally send it to Elasticsearch through another Kafka pipeline. I think it is better to insert into Elasticsearch with bulk requests, meaning each insertion contains 100 crawled links. Add a timeout alongside this and you are set: 100 crawled links or 10 minutes are good conditions for inserting into Elasticsearch (see the sketch below).
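
A minimal sketch of that consume-then-bulk-insert pattern (this is not scrapy-cluster code; it assumes the kafka-python and elasticsearch client libraries, and the broker address, group id, and index name are placeholders you would replace):

```python
import json
import time

from kafka import KafkaConsumer              # assumed library: kafka-python
from elasticsearch import Elasticsearch      # assumed library: elasticsearch-py
from elasticsearch.helpers import bulk

BATCH_SIZE = 100        # flush after 100 crawled pages...
FLUSH_INTERVAL = 600    # ...or after 10 minutes, whichever comes first

# Consume the crawl results scrapy-cluster publishes to demo.crawled_firehose
consumer = KafkaConsumer(
    "demo.crawled_firehose",
    bootstrap_servers="localhost:9092",       # placeholder broker address
    group_id="es-indexer",                    # placeholder consumer group
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
es = Elasticsearch("http://localhost:9200")   # placeholder ES endpoint

def flush(docs):
    """Index the buffered pages with a single bulk request."""
    if docs:
        bulk(es, [{"_index": "crawled-pages", "_source": d} for d in docs])

buffer = []
last_flush = time.time()

for message in consumer:
    buffer.append(message.value)
    # Note: the time check only runs when a new message arrives; a production
    # consumer would poll with a timeout so idle periods still trigger a flush.
    if len(buffer) >= BATCH_SIZE or time.time() - last_flush >= FLUSH_INTERVAL:
        flush(buffer)
        buffer = []
        last_flush = time.time()
```

The point of the design is simply to trade one request per document for one bulk request per batch, which keeps the extra load on Elasticsearch low.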

anthony9981 commented 3 years ago

Hi @NeoArio, the idea of a database that stores the selectors is really nice; why didn't I think of it before :+1: Could you please show me yours? I came from PHP to Python and then ended up here, so Kafka is new to me :) Thanks for pointing me in the right direction :) Let me learn it more deeply. Best regards,

madisonb commented 3 years ago

I'm happy to chat through custom implementations on Gitter, but per the guidelines I am going to close this issue as a "custom implementation" question, which is beyond the scope of a true bug ticket/problem.

More generally - crawling at a depth beyond 2 gets your spider way into the weeds of the internet and is, 99% of the time, not useful for your actual request. If you wish to crawl at a greater depth, you should also implement an allowed_domains filter or regex in the crawl API request to limit your crawler to a specific domain.

If you need to change anything else in the API spec for the request, you can do so in this file: https://github.com/istresearch/scrapy-cluster/blob/master/kafka-monitor/plugins/scraper_schema.json