Hey! So what kind of behavior do you want from your crawler? It finds a redirect and follows it, and it's up to you how to handle such situations.
Well, first of all, I think it's strange that http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit
is the only website that keeps being requested by the crawler. The website http://www.reg.ru/domain/new-gtlds
was discarded after the maximum number of redirects was reached, but with http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit
this is not the case.
I do not want this kind of behaviour from my crawler (especially my second point).
Please fix the links, I can't get anything from them. Do you only have 2 seeds?
Oh sorry, my comment was wrong. I have edited it now. :P What do you mean by "fix the links"? No, I have ~400 seeds. This behaviour occurs after crawling for a while. I could send you the seeds if you want. The URL is actually not part of my seeds file.
Your crawler found a (probably) infinite redirect chain. Such WWW artefacts consume crawler resources and do not produce any value, which is why Scrapy has protection against them. I don't know your complete use case, so I can't recommend anything in particular. But you have options: continue following redirects indefinitely (so your crawl will never end), stop after N redirects (like now), or postpone downloading of such URLs (hoping that the redirect will disappear or be fixed).
If you want to tweak this mechanism, try tuning the REDIRECT_MAX_TIMES
setting (http://doc.scrapy.org/en/latest/topics/settings.html#redirect-max-times).
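A minimal sketch of how that could look in a project's settings.py (the value 5 is just an example; Scrapy's default is 20):

# settings.py -- hedged example of tightening the redirect limit
REDIRECT_ENABLED = True    # keep following redirects at all
REDIRECT_MAX_TIMES = 5     # give up on a request after 5 redirects instead of the default 20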
Thanks for your explanation.
The issue was that my crawler kept getting redirected indefinitely although it should have stopped after 20 redirects. However, I may have overlooked that the DW (DB worker) shut down unexpectedly during that time. I will have to look further into that. You can close this issue if you want and I will open a new one when I know the exact cause.
Right now I have every worker running in its own window. What is a good practice for noticing the latter behaviour (e.g. the DB worker shutting down) without having to switch between multiple windows all the time (spiders, SWs, DWs and the broker each have their own window)?
First, please report any errors causing workers to crash; this would help to make Frontera more reliable. There are a few options I see: use a process supervisor such as upstart (with respawn) or samza for running the crawler processes, and log the output to log files. You will be able to track problems with the logs, and upstart will restart the process if it dies.
Thanks a lot :) . I still seem to be having the same issue. One of my crawlers seems to get stuck in a redirect loop after crawling for a while. When I disable AJAX crawl and redirects altogether, one of my spiders just seems to stop crawling after a while. I will debug a bit further and report back whenever I find out more about the cause. Of course any ideas/help from your side would be awesome :) .
The spider which stops crawling after a while produces this output when I hit Ctrl+C:
Unhandled Error
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 58, in run
self.crawler_process.start()
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 269, in start
reactor.run(installSignalHandlers=False) # blocking call
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1194, in run
self.mainLoop()
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1203, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 798, in runUntilCurrent
f(*a, **kw)
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 283, in _graceful_stop_reactor
d = self.stop()
File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 192, in stop
return defer.DeferredList([c.stop() for c in self.crawlers])
exceptions.RuntimeError: Set changed size during iteration
This is a Scrapy artefact; it shouldn't be connected with the crawl stopping.
Last week I did some debugging and I now think that it is related to this issue. I deactivated redirection, and what happens now is that my spiders stop getting new batches after a while. After some debugging I found out that the spiders are getting marked as busy, but they are not getting marked as available again after a few iterations (see here). Because of this my DW does not push anything to the partitions after a while. My workaround is to restart the DW every few minutes, but that is not a proper solution. Since I get a lot of those "missing messages" warnings, I think it might be related. Now I do not know how best to debug this. Do you have any insight?
Please see my comment at https://github.com/scrapinghub/distributed-frontera/issues/24#issuecomment-170623431. It's strange that after a few iterations your spiders are still marked as busy. Maybe it's time to start monitoring the Scrapy downloader queue and the overused-buffer contents. You can dump their contents from spider code or use the scrapy-jsonrpc extension (https://github.com/scrapy-plugins/scrapy-jsonrpc). Perhaps crawling is stuck for some reason.
Sounds like a good idea. I will do that.
I just found out that my spider is crawling the same URL multiple times (which is probably related to this issue):
2016-01-14 11:36:59 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None)
2016-01-14 11:37:00 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:00 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:02 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:03 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:03 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:06 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:07 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:08 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:09 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:10 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:11 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:12 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:13 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:15 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:16 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:17 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:18 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:19 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:20 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:21 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:21 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None)
2016-01-14 11:37:22 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:23 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:23 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:23 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:24 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:26 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:27 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:28 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:29 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:30 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:31 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:35 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:36 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:37 [manager] DEBUG: (4) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:38 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:39 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:40 [manager] DEBUG: (4) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:41 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:43 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:44 [manager] DEBUG: (4) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:45 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:46 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:47 [manager] DEBUG: (4) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:48 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
So it would definitely be a good idea to have a look at the queue. However, I was unable to locate a function which allows me to print out the queue, either in the spider code or in the jsonrpc extension. Could you help me and point me in the right direction?
You need to dump the contents of the crawler.engine.downloader.slots
dictionary. A reference to the crawler is available in scrapy-jsonrpc and in spider code.
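For example, a minimal sketch of doing this from spider code, assuming Scrapy 1.x where each downloader slot exposes its queue and active collections (the spider name and URL are made up):

# Hedged sketch: log the size of every downloader slot on each parsed response,
# to see whether requests pile up in a single slot.
import scrapy

class SlotDumpSpider(scrapy.Spider):
    name = 'slot_dump_example'            # hypothetical name
    start_urls = ['http://example.com/']  # placeholder seed

    def parse(self, response):
        # self.crawler is attached by Scrapy; each slot holds the queued and
        # in-flight requests for one host/slot key.
        for key, slot in self.crawler.engine.downloader.slots.items():
            self.logger.debug("slot %s: queued=%d active=%d",
                              key, len(slot.queue), len(slot.active))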
With max_next_requests of 160, my DW is telling me:
[backend] Finished: tries 3, hosts 7, requests 135
...
[backend] Got 9595 items for partition id 0
And I can see multiple duplicates in these items.
Shouldn't the DW push a maximum of 160 items? I do not understand the difference between items and requests here.
Here is an example of a duplicate:
['\xb3\x82\xb9\xe6\xe1\x8c\xa0\xc0M3R\xe5\x1e=\r\x8f%T\n\xa3', -1580664942, 'http://www.monsterenergy.com/', 0.47619047619047616]
['\x80-^M\xe0Z\xf1Bu\x98\xd6c/\xc9\xb2\x06G]\x9b?', -1580664942, 'https://www.monsterenergy.com/', 0.47619047619047616]
['\x80-^M\xe0Z\xf1Bu\x98\xd6c/\xc9\xb2\x06G]\x9b?', -1580664942, 'https://www.monsterenergy.com/', 0.47619047619047616]
EDIT: I just found out that these duplicates might come from a different referrer. Still, I am not sure whether the DW is pushing too many items or whether that is the default behavior.
EDIT2: Okay they do not come from a different referrer. Here is my output:
Item# URL hexlify(rk_fprint) Score
27741:http://www.altmp3.com/indian, 961fda478797d14d3e420f2b1a8994ee5e6b24f0, 0.37037037037
27747:http://www.altmp3.com/indian, 961fda478797d14d3e420f2b1a8994ee5e6b24f0, 0.37037037037
27753:http://www.altmp3.com/indian, 961fda478797d14d3e420f2b1a8994ee5e6b24f0, 0.37037037037
27759:http://www.altmp3.com/indian, 961fda478797d14d3e420f2b1a8994ee5e6b24f0, 0.37037037037
27765:http://www.altmp3.com/indian, 961fda478797d14d3e420f2b1a8994ee5e6b24f0, 0.37037037037
Requests are items in this case; we can fix that to avoid confusing other people. Do you use the HBase backend? It looks like a bug. It would be nice if you could reproduce it. Are you using the latest Frontera?
Yes, I am using the latest Frontera version and the HBase backend. I can easily reproduce it by using your master branch instead of the hbasefix branch. In this line it fills the results with duplicates.
The question is whether you previously put duplicate results into the queue table, because if so, then that's normal behavior. Can you debug it?
I just did a scan of my HBase queue table and cannot find duplicate URLs or fingerprints.
To fix this, we need to reproduce it. The HBase queue first retrieves URLs and then removes them from the table. Therefore, if you first retrieved and removed your duplicates, you will not see them with later scans. So you could try disabling generation of new batches after some moment (there is a command-line option in the DB worker), or modify the HBase queue code to also write data to a second table on scheduling; this table would not be used for retrieving queue items, only for debugging.
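A rough sketch of the second suggestion, assuming a happybase connection like the one the HBase backend already holds; the table name 'crawler:queue_debug' and the column layout are invented for this example:

# Hedged sketch: mirror every scheduled queue row into a separate debug table,
# so the data survives the delete the real queue performs on retrieval.
import happybase

def mirror_scheduled_rows(connection, row_key, fingerprints):
    # connection: happybase.Connection; the debug table must exist beforehand
    debug_table = connection.table('crawler:queue_debug')
    with debug_table.batch() as batch:
        for i, fprint in enumerate(fingerprints):
            batch.put(row_key, {'f:fprint_%d' % i: fprint})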
These are some duplicates I could find:
In https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/backends/hbase.py#L228 the meta_map looks like this:
fprint ö¯àÓË+ì!¥âf½µ½ûc references this list:
[('0_0.52_0.53_1453117831535941', ['\x15\xf6\x91\xaf\xe0\xd3\xcb+\xec!\x04\xa5\xe2\x05f\xbd\xb5\xbd\xfbc', -1589438432, 'http://3u3a.deviantart.com/?offset=2490', 0.47619047619047616])]
fprint ¨JÆùÁóS =¯Ü references this list:
[('0_0.52_0.53_1453117831535941', ['\x1f\x89\x8e\x87\xa8J\xc6\xf9\xc1\xf3S =\x86\x8e\xaf\x11\x89\xdc\x0e', -1589438432, 'http://3u3a.deviantart.com/?offset=2480', 0.47619047619047616])]
fprint ø*^@Úа³-V0ës~ì references this list:
[('0_0.52_0.53_1453117831535941', ['\xf8*\xad\x18^@\xda\x07\x93\xd0\xb0\xb3-V\x020\xebs~\xec', -1079107546, 'http://dark-yoolia.deviantart.com/', 0.47619047619047616])]
In https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/backends/hbase.py#L230 fprint_map looks like this:
rk 0_0.52_0.53_1453119214454922 references this map:
['\x89\x9ed\x95j\x89J\x8b\xc7vzW|\x07\\\xc0\xaeN\x92\xe0', '\x94\xda\x0cU\x03=\xe3NB\\[\xea*Gx\xeb\xac|\xc1\xab', 'xQ\xfe?\x07\xc6\x10t\x89:\xbd\xe5T\x96>_\x14\xf0\x8bh', 'L\x06\xd8#\x7f\xab\xbe\xe0j\xcd\xf5\xb7\x12\xae\x97\x92\x18\xa0\x96)', 'w@l\t\xcc\xb6\x08$\xf90cF\xa7s\x92\x1e\x89\xdcin', '\xb9\xc3\xf0\xa3\x1f \xa6\xe7\x02\x8c9\xe0\x9e\xb5\x99\x8c\xdc\x99\xf9\x93', '\xdd^:\x83\xe7\xc3U\x07\x18\xa1\x0b~?`\xbeM29\x8a\xd3', '#/\xe9\xe0UZ\xce$b\xaa\xac\x87\r&\x85\x9e\x8ff\xd1\xa9', '\x12o\xbbQ\x04\xb0x\x8b\xef\xc2`\xeb\xa5>\rR\xcbg~\x84', ']\xa0\xd2\xa7P\x8b\x91R\xde\x18\xb3\xaf"\x0eK\x8e\xfb\xb4\xf3\x9c', '`\x14\xcf\x02\x8a\x19\xdd\rWp\xf4\x7f\xa9\xef\x88S\t\xdd@\x01', '^\xb6\xf35\x06#\xfaq\x97N\xcct\xa8`\x8a0n\xa2\x1f\xc1', '\xfc,\xa1\xec\x1a\x93_\xb3_\xe5\xe8\xe3C.M\x97\xce\x08Zq', '\x99M\x87\xbe?R\xca]\xbb\xd2\xa1\x13\xfb\xa2\x8a\xe4K\x91\xf9`', ';M\xe1\xf5\xd2\xc4\x93S\xc69!9\x83V\xb2\n\xa6\xea\x00\xb7', '\x10\xf5cHka^\xc5G|\x90\xd1\xbe\xa5\xc0\x85\x03\xa1`\xb7', 'T\xa3\xb6\xacg\xcf4B\x8a\xfbY\xe5\xfbk@\xf2\xaa\xc8\x83\xb0', '\x85\xdc\x8aB\x1c\xb9~!\x10\xce\x91\xf7w\xb8c\xb1\xb8\xe4*C', 'h\x94M\x1b\xa6:\t\xb7{\xe4\x93\x08\xf3O\x1e&\xef\xb3\x1cR', '\xa8?\xde\xebN\x00\x06o\xf0\x1b\x8eP\xc13\\\x1e\x9a ds', '4\xbc\x1b\x94d928$\xe8\x02\x9b\x03,Ts_\x9fb\xdf', "\xa56Fx\x17?\xad\xc2J\xbd7\x93x'\xd4\xee\xd1ncn", '\x16\xceI\xf2\x94I5\xc04E\xf1\xe47\x88\xae\x8d+\x997\xb3', 'Yi\xa6\xfd\x9b\xe8\x9c\x8e\xac\xe9\x0fx\xa1\xcb\xfb\xb0!\xbe\x8d\xbe', '\xa2\xe9\xa97\x07ghF(\xbf\x93,U\xb8W\x94\x01/\xf0\xd9', '\xd4\x98\xf3N\xab\xb4\xc9\xb6\x8bL*I\xda|\xc5f\x97\xc5J\x90', '\xa0m6W\xd4\x8b\xa8q\xf5\x1b\xd794\xec\xee{\xdb\xeb\xd9\xbd', '\x91*wG\xa4\x11\xbe\x9a\x18(\xc6\xeb\xa5\x9f_\x04\x03\xf3b\xf3', '+P"\xd1\xe3\x1b[ \x1f\x8aL\xa1\xd7\xda-\xa8\xef\'\xa9\xaf', '\xe9\x8d\x86\x83\xa1\xf3\xb2!1\x91\x8c\xf4\x1df\xf6U\x19\xfb\x12\xb0', '\x82\xc3\x17\xe3Iz\x95\xeam]-\x9a\xcb\xfe\x16\x1c\xcd\x9d\xf1\xa5', '\xc2\x93\xd9\x19q`[\xe7\x07\xc8k\xad>I\x19\\\xb2\xc3N\x1c', '*k6\xfa\xa1\x93\xec\xaf(\x12\xabS\xbc-\x92f\x8a&\xd8\xac', '\xb5+"\xdb\xf3\x1er:\x9b\x92X=\n\xeeaI\xa3f c', '\xf5p\xd9\x9e)\x9a\xda\x8c\x1d\xc5\x88\t\x1b\xa7\xde?\xa2\x98D\xd5', '`\x1d\xb8X\xa6q\x07\x8c\xeb\xaf\xb0\xc5\xc4\xa8\tx\x07\x17\x11\x10', '\xefH\x95\xf7\xf0\xf4\xbb\xffQ\xef1\xf3\x00\x05~\xbb\x05\x15\x14\xb1', '\xf5\xb1\x0c.D\x124\x9e\xc8\xd1\xed\xae5\xf3\x0b\x92P\xe2\xde,', ')\xdb{\xe4\xa2\x08\xb4&9\xcc\xe9\x11\x019\x95$Tp\xfe\xb3', '\xfe\xa4\xcb\xb64X\x06DP\xc4/)\x01\xdd\xd3{D\x98{\x07', 'T\xf8\xa9T\x103b\xd4\xdf\xbb\xc6\xba\xf3\xeb\x88\xcd\xc0\xf5\x12\xd5', '/\xaf\x929\x9b\x86J\xdc3o5\xefB\x94\x86\x16B\x7ff\x83', '\xfa\x18I\x0f\xd4\xc5\x0fj\\\x08\x87pP^\xe8\xcc9^JO', '\xda\xb6\xb6[\x12\xd6\x01@\xc1",\xa2R\x06\x8bo{\xc2\x07\x0c', 'e"\xf3j#y1\xcb\xbf\xbeQ\x12\x14S\xce\xe8y\n{s', '?\x90"h~\x01\xeb\xb4\xe9\xcc\xd8\x10xA\xbb\x9a<\x17+\x95', " \x1a'\xe1\xcc6\xf9\x87\x11]\x9a\xcf\xcdj\xdd\xd9\xc2\xb6:'", '\x1f\xd1\xda\xbbZ\xe2\x81\x08(\x9c\xceI\x1c\xa3\x82\xc4\xa3`\x9a\xd2', '\x07&;\xd3\xa5\x05\xc3\x0e\x1b?3$\x93tUi\xbc\xed*\x1a', "\xeb>%H\xaf.\xb7\xe1'\xad]\x8c\x81@E\x17\xe2P\xe9\x07", '\xa2!O9I$\x17\xf0C|\xce\x9d\xaa\xf4\x0c\x15\x997\xde\x1e', '&\x0e\t\xde\xef\xb2D\xd3!?\xedY\x00\xa5`\n\x97\xf7\x01\x06', 'a\x1c\xc1\xcc\xe0\x8b]\xe4\xeb\xef(\x80\xa0\x8ap(\x87D\xf4\xfa', "\x85\x88\xb0\x9b\x89K'\xa3,\x81\xa3g\xd3(\xf5H}(2\xda", '\x9b\xf1\x02"y\xf9\x94^\xbaG\xfdV\xb6\xef\xad8T\xb1_\xc9', '}\x05fI\xe2>\xf5\xe6\xe6\xa2\xaf\xb5\xcb\x1d\x8f\xf4|\xdf\xc4\xe7', 'N\xa9\xb7\xa1\x8e\xa3\xae\x9f\x1dx\xcbG\xdc\xedU\xda\xf7c\xc2p', ' 
\xc7zz\xc4\x05\xe1h\xa0\x9cb\x80\x9e\xf1\x9aTtk\x03*', '\xf9\xb9\xa72M\xa3\xdf\xad\x1e\xa7\xa9\x953Y&Q\x0f\x9c\xb1?', 'V\xad\xe3\xed\xa4\x9fe~\x92\x0e\xcf\x05\xa3\xde\xf3G2\xc5\x082', '\x16w\xd2\xe9\xec\n\x05k\x14\x95\xcb\xdat\x1aF\xe4G\x86\x9d-', '-\x7f\x1d\n\xda\xdf*\xcf{\xa5\xb9\xb3\xcd\xdf9M\xafw\xfd\xb1', 'y\xcd@w\xf2\x89\xdd\xb9\xd2w\xa9rD\xc8\xefJ\x97\x0c\xaf\xfb', '\xeaG2\x1f\x0f\x8b\x9b\xba\xc8\x8d{\xbfv\xf65Um\x14\xc3\xa0', '\tvp\x8d\xa3\xbecY<\xec\xc9\x8e\x10\xce!\xd6\x8a@HR', '\xf5\x1d\xdd\xd4\xb3\x98\x89\xee\x83\xee\xcd3\x13\xd3\x99\rN\x9e\xcbr', '\xeb\x19\x97>\xe5\xfe^\xa3R\xaa0E\x10\xfdEaGv\x07s', '4\xf1\x85\x8fs-\xcd\x8b\x19\x13\x98\xb6\x18(\xd7o\x1a\xa1\x7f\xbe', "\xd3\xd8*\xda\x13\x8c5K\xad\xb2S.+\xda\xd2'\xaa\xcd\xb5\xe1", '\xb7\x85\xd6\x01S\xee\xb7\xfc\x1a\xa2(\x94\x05N\xebF&O\xea\xa4', 'U\xef\xb2\xa4b\\\x83\x15|{\x9e[\xf5\x8b\x8a0\xb2\x02\xd9\xbe', '\xea6\x16\x0f\xa7\xfa1a\xf7\xb9\x17\xb0\xedP\x8d\x0c\x99+[o', '\xe9\xfa\xc1GV`\x9c\xbc\xf0\x80D\x02\xfd\xcd.\x06\xcdn\x03>', '>\xaa\x92\x08[\x16\xc8\xe9\x16\xe9\xf7D#\xae\xb1l\x90\xbb\xb0\xbc', '\xca\x83o^|\x97e#\xeeT@\xf4\xbb\x89\x06E\xf0{\x82)', '\xcf\xdf*\xc5[\x83\xab\x7f\xe1\x9c\xb3\xb2\xf6\xa4\x1c\xd2\xcd\x0c\xeb\x8d', 'G%F\x82\xf5\xa8\x9b\xaa\xec\xd0\xe48P\xfb\xb7%c\xc68%', '\xb9\x1d\x1e`\xa5\xea\x0b\xcds\x94\xa2\xe8\xe7\x07|\x85nZ,\xcd', '\xd92\x89+\xbc_\xf9\x00\xdf80\xc7PT\x19\xd4\x8c3\xf4\x87', '\xa0\xf6!\xfeu<\xa1\xc5f\xf4zv\x8c6S\xabh\xe0\x82m', '\x1d5\xea\xfa\x0c\xf4r\x84\x87\xf1\xb1pcK\x1d\x0c\x05t\x00\xfd', 'Z\xa4r/\x94\x07\xde\xf9Z-\xaf\x96\x89\xdfM\xb5%\xbb?\x1a', '\xc8\x89\xda\x9e\x95-\x1cV\xe3\xb3\x1c\xca\xf4s\xe3Hl1\x05d', 'V\x96\xb7\n\xb9\x1b\xf5\x1f"I\xc8\x94m\xfe\xe0\xf6%*;\xf4', 'y\xe7\x7f\xea-\x93\xd0R$\xd0\x9a\xb4\x0f\xd3p\x15\xc4\xadFh', '\xf6\xa5\x0f\xcf@G\xe8\x1e\xc6{\x94\xdd\xd5\x10Wx\xf9I\x8fM', '\x0f\xda\x98\xba\xd3\xc9\x9f\xd6\xc9g\x18\xfd\x17\xb2\x90E\xfb\x954\xea', '\xe6\xa3\xd5)\x83\x86\xe3~"\x9c\xe5\x9a\xdc}K\xd4\xd1\x13\x01\x94', 'f\x95\\\xb6\xb4e\xf8\x16\xc2s\xfb\xaaF-\xe1=q_\xf1\x8b', '\x8df#\xd6)\xfb\xf3\xf8\x1c\x0f\x92\xd9\xb3p5\xaez\xd7\t\xdc', '\xc7\xfc\xae\xc8S\x1b.\xf3\x1c\x0b\x8f\x16T\x87X\x16\xd2]U\x9f', '\xa6P\x17\x84?\xb0l\x888\xb6\x8c\x82\xf1(\x14m\x89\xae$\x01', '\xc1\x0b\x00\xf8\xec\xf3(\x1cR\xf4\xbea\n\xfb\xa5\xc4\x9d\x10\xf9\x0c', 'Nq#\x8ey\xea4L\x94G)\\\x12\x7f1\x9a\x94\xc2=\xb2', '\xf1\x8b\xc8\xcf\xce\xe7\x9f\x88\xe3\xb3v\r\x00\x9f\x8f\xb5\x94\x91=[', '\xbem\x80Oi\x8e\xbd\xd3\xe6iv\x1cx\xe5\xfb\xc0J\x9f*\xab', '\x136\xfc\xe3D\x8a\x17\x9e;\xb3\x8c\xf4\xa3\x1f\xce\xd9\xb3\x0e@p', 'v\xb5\xf4!\r\x8d\x1a\x92\xb6\x9b\xe3\xe5\x8b\xb4\xa6\x1a:Z\xb3\t', 'n\xde3\xfdY_\xca\x15\xa9\x9e\xcf+n\x1f\x11\x9b\xcb\x11D\xbd', '"\xefV\xce\x02\x91\xae\xdd\xa3\xee\xe2\xf8\xc4$\xa1\xfdx}\x17\xcf']
rk 0_0.52_0.53_1453119214454922 references this map:
['\x89\x9ed\x95j\x89J\x8b\xc7vzW|\x07\\\xc0\xaeN\x92\xe0', '\x94\xda\x0cU\x03=\xe3NB\\[\xea*Gx\xeb\xac|\xc1\xab', 'xQ\xfe?\x07\xc6\x10t\x89:\xbd\xe5T\x96>_\x14\xf0\x8bh', 'L\x06\xd8#\x7f\xab\xbe\xe0j\xcd\xf5\xb7\x12\xae\x97\x92\x18\xa0\x96)', 'w@l\t\xcc\xb6\x08$\xf90cF\xa7s\x92\x1e\x89\xdcin', '\xb9\xc3\xf0\xa3\x1f \xa6\xe7\x02\x8c9\xe0\x9e\xb5\x99\x8c\xdc\x99\xf9\x93', '\xdd^:\x83\xe7\xc3U\x07\x18\xa1\x0b~?`\xbeM29\x8a\xd3', '#/\xe9\xe0UZ\xce$b\xaa\xac\x87\r&\x85\x9e\x8ff\xd1\xa9', '\x12o\xbbQ\x04\xb0x\x8b\xef\xc2`\xeb\xa5>\rR\xcbg~\x84', ']\xa0\xd2\xa7P\x8b\x91R\xde\x18\xb3\xaf"\x0eK\x8e\xfb\xb4\xf3\x9c', '`\x14\xcf\x02\x8a\x19\xdd\rWp\xf4\x7f\xa9\xef\x88S\t\xdd@\x01', '^\xb6\xf35\x06#\xfaq\x97N\xcct\xa8`\x8a0n\xa2\x1f\xc1', '\xfc,\xa1\xec\x1a\x93_\xb3_\xe5\xe8\xe3C.M\x97\xce\x08Zq', '\x99M\x87\xbe?R\xca]\xbb\xd2\xa1\x13\xfb\xa2\x8a\xe4K\x91\xf9`', ';M\xe1\xf5\xd2\xc4\x93S\xc69!9\x83V\xb2\n\xa6\xea\x00\xb7', '\x10\xf5cHka^\xc5G|\x90\xd1\xbe\xa5\xc0\x85\x03\xa1`\xb7', 'T\xa3\xb6\xacg\xcf4B\x8a\xfbY\xe5\xfbk@\xf2\xaa\xc8\x83\xb0', '\x85\xdc\x8aB\x1c\xb9~!\x10\xce\x91\xf7w\xb8c\xb1\xb8\xe4*C', 'h\x94M\x1b\xa6:\t\xb7{\xe4\x93\x08\xf3O\x1e&\xef\xb3\x1cR', '\xa8?\xde\xebN\x00\x06o\xf0\x1b\x8eP\xc13\\\x1e\x9a ds', '4\xbc\x1b\x94d928$\xe8\x02\x9b\x03,Ts_\x9fb\xdf', "\xa56Fx\x17?\xad\xc2J\xbd7\x93x'\xd4\xee\xd1ncn", '\x16\xceI\xf2\x94I5\xc04E\xf1\xe47\x88\xae\x8d+\x997\xb3', 'Yi\xa6\xfd\x9b\xe8\x9c\x8e\xac\xe9\x0fx\xa1\xcb\xfb\xb0!\xbe\x8d\xbe', '\xa2\xe9\xa97\x07ghF(\xbf\x93,U\xb8W\x94\x01/\xf0\xd9', '\xd4\x98\xf3N\xab\xb4\xc9\xb6\x8bL*I\xda|\xc5f\x97\xc5J\x90', '\xa0m6W\xd4\x8b\xa8q\xf5\x1b\xd794\xec\xee{\xdb\xeb\xd9\xbd', '\x91*wG\xa4\x11\xbe\x9a\x18(\xc6\xeb\xa5\x9f_\x04\x03\xf3b\xf3', '+P"\xd1\xe3\x1b[ \x1f\x8aL\xa1\xd7\xda-\xa8\xef\'\xa9\xaf', '\xe9\x8d\x86\x83\xa1\xf3\xb2!1\x91\x8c\xf4\x1df\xf6U\x19\xfb\x12\xb0', '\x82\xc3\x17\xe3Iz\x95\xeam]-\x9a\xcb\xfe\x16\x1c\xcd\x9d\xf1\xa5', '\xc2\x93\xd9\x19q`[\xe7\x07\xc8k\xad>I\x19\\\xb2\xc3N\x1c', '*k6\xfa\xa1\x93\xec\xaf(\x12\xabS\xbc-\x92f\x8a&\xd8\xac', '\xb5+"\xdb\xf3\x1er:\x9b\x92X=\n\xeeaI\xa3f c', '\xf5p\xd9\x9e)\x9a\xda\x8c\x1d\xc5\x88\t\x1b\xa7\xde?\xa2\x98D\xd5', '`\x1d\xb8X\xa6q\x07\x8c\xeb\xaf\xb0\xc5\xc4\xa8\tx\x07\x17\x11\x10', '\xefH\x95\xf7\xf0\xf4\xbb\xffQ\xef1\xf3\x00\x05~\xbb\x05\x15\x14\xb1', '\xf5\xb1\x0c.D\x124\x9e\xc8\xd1\xed\xae5\xf3\x0b\x92P\xe2\xde,', ')\xdb{\xe4\xa2\x08\xb4&9\xcc\xe9\x11\x019\x95$Tp\xfe\xb3', '\xfe\xa4\xcb\xb64X\x06DP\xc4/)\x01\xdd\xd3{D\x98{\x07', 'T\xf8\xa9T\x103b\xd4\xdf\xbb\xc6\xba\xf3\xeb\x88\xcd\xc0\xf5\x12\xd5', '/\xaf\x929\x9b\x86J\xdc3o5\xefB\x94\x86\x16B\x7ff\x83', '\xfa\x18I\x0f\xd4\xc5\x0fj\\\x08\x87pP^\xe8\xcc9^JO', '\xda\xb6\xb6[\x12\xd6\x01@\xc1",\xa2R\x06\x8bo{\xc2\x07\x0c', 'e"\xf3j#y1\xcb\xbf\xbeQ\x12\x14S\xce\xe8y\n{s', '?\x90"h~\x01\xeb\xb4\xe9\xcc\xd8\x10xA\xbb\x9a<\x17+\x95', " \x1a'\xe1\xcc6\xf9\x87\x11]\x9a\xcf\xcdj\xdd\xd9\xc2\xb6:'", '\x1f\xd1\xda\xbbZ\xe2\x81\x08(\x9c\xceI\x1c\xa3\x82\xc4\xa3`\x9a\xd2', '\x07&;\xd3\xa5\x05\xc3\x0e\x1b?3$\x93tUi\xbc\xed*\x1a', "\xeb>%H\xaf.\xb7\xe1'\xad]\x8c\x81@E\x17\xe2P\xe9\x07", '\xa2!O9I$\x17\xf0C|\xce\x9d\xaa\xf4\x0c\x15\x997\xde\x1e', '&\x0e\t\xde\xef\xb2D\xd3!?\xedY\x00\xa5`\n\x97\xf7\x01\x06', 'a\x1c\xc1\xcc\xe0\x8b]\xe4\xeb\xef(\x80\xa0\x8ap(\x87D\xf4\xfa', "\x85\x88\xb0\x9b\x89K'\xa3,\x81\xa3g\xd3(\xf5H}(2\xda", '\x9b\xf1\x02"y\xf9\x94^\xbaG\xfdV\xb6\xef\xad8T\xb1_\xc9', '}\x05fI\xe2>\xf5\xe6\xe6\xa2\xaf\xb5\xcb\x1d\x8f\xf4|\xdf\xc4\xe7', 'N\xa9\xb7\xa1\x8e\xa3\xae\x9f\x1dx\xcbG\xdc\xedU\xda\xf7c\xc2p', ' 
\xc7zz\xc4\x05\xe1h\xa0\x9cb\x80\x9e\xf1\x9aTtk\x03*', '\xf9\xb9\xa72M\xa3\xdf\xad\x1e\xa7\xa9\x953Y&Q\x0f\x9c\xb1?', 'V\xad\xe3\xed\xa4\x9fe~\x92\x0e\xcf\x05\xa3\xde\xf3G2\xc5\x082', '\x16w\xd2\xe9\xec\n\x05k\x14\x95\xcb\xdat\x1aF\xe4G\x86\x9d-', '-\x7f\x1d\n\xda\xdf*\xcf{\xa5\xb9\xb3\xcd\xdf9M\xafw\xfd\xb1', 'y\xcd@w\xf2\x89\xdd\xb9\xd2w\xa9rD\xc8\xefJ\x97\x0c\xaf\xfb', '\xeaG2\x1f\x0f\x8b\x9b\xba\xc8\x8d{\xbfv\xf65Um\x14\xc3\xa0', '\tvp\x8d\xa3\xbecY<\xec\xc9\x8e\x10\xce!\xd6\x8a@HR', '\xf5\x1d\xdd\xd4\xb3\x98\x89\xee\x83\xee\xcd3\x13\xd3\x99\rN\x9e\xcbr', '\xeb\x19\x97>\xe5\xfe^\xa3R\xaa0E\x10\xfdEaGv\x07s', '4\xf1\x85\x8fs-\xcd\x8b\x19\x13\x98\xb6\x18(\xd7o\x1a\xa1\x7f\xbe', "\xd3\xd8*\xda\x13\x8c5K\xad\xb2S.+\xda\xd2'\xaa\xcd\xb5\xe1", '\xb7\x85\xd6\x01S\xee\xb7\xfc\x1a\xa2(\x94\x05N\xebF&O\xea\xa4', 'U\xef\xb2\xa4b\\\x83\x15|{\x9e[\xf5\x8b\x8a0\xb2\x02\xd9\xbe', '\xea6\x16\x0f\xa7\xfa1a\xf7\xb9\x17\xb0\xedP\x8d\x0c\x99+[o', '\xe9\xfa\xc1GV`\x9c\xbc\xf0\x80D\x02\xfd\xcd.\x06\xcdn\x03>', '>\xaa\x92\x08[\x16\xc8\xe9\x16\xe9\xf7D#\xae\xb1l\x90\xbb\xb0\xbc', '\xca\x83o^|\x97e#\xeeT@\xf4\xbb\x89\x06E\xf0{\x82)', '\xcf\xdf*\xc5[\x83\xab\x7f\xe1\x9c\xb3\xb2\xf6\xa4\x1c\xd2\xcd\x0c\xeb\x8d', 'G%F\x82\xf5\xa8\x9b\xaa\xec\xd0\xe48P\xfb\xb7%c\xc68%', '\xb9\x1d\x1e`\xa5\xea\x0b\xcds\x94\xa2\xe8\xe7\x07|\x85nZ,\xcd', '\xd92\x89+\xbc_\xf9\x00\xdf80\xc7PT\x19\xd4\x8c3\xf4\x87', '\xa0\xf6!\xfeu<\xa1\xc5f\xf4zv\x8c6S\xabh\xe0\x82m', '\x1d5\xea\xfa\x0c\xf4r\x84\x87\xf1\xb1pcK\x1d\x0c\x05t\x00\xfd', 'Z\xa4r/\x94\x07\xde\xf9Z-\xaf\x96\x89\xdfM\xb5%\xbb?\x1a', '\xc8\x89\xda\x9e\x95-\x1cV\xe3\xb3\x1c\xca\xf4s\xe3Hl1\x05d', 'V\x96\xb7\n\xb9\x1b\xf5\x1f"I\xc8\x94m\xfe\xe0\xf6%*;\xf4', 'y\xe7\x7f\xea-\x93\xd0R$\xd0\x9a\xb4\x0f\xd3p\x15\xc4\xadFh', '\xf6\xa5\x0f\xcf@G\xe8\x1e\xc6{\x94\xdd\xd5\x10Wx\xf9I\x8fM', '\x0f\xda\x98\xba\xd3\xc9\x9f\xd6\xc9g\x18\xfd\x17\xb2\x90E\xfb\x954\xea', '\xe6\xa3\xd5)\x83\x86\xe3~"\x9c\xe5\x9a\xdc}K\xd4\xd1\x13\x01\x94', 'f\x95\\\xb6\xb4e\xf8\x16\xc2s\xfb\xaaF-\xe1=q_\xf1\x8b', '\x8df#\xd6)\xfb\xf3\xf8\x1c\x0f\x92\xd9\xb3p5\xaez\xd7\t\xdc', '\xc7\xfc\xae\xc8S\x1b.\xf3\x1c\x0b\x8f\x16T\x87X\x16\xd2]U\x9f', '\xa6P\x17\x84?\xb0l\x888\xb6\x8c\x82\xf1(\x14m\x89\xae$\x01', '\xc1\x0b\x00\xf8\xec\xf3(\x1cR\xf4\xbea\n\xfb\xa5\xc4\x9d\x10\xf9\x0c', 'Nq#\x8ey\xea4L\x94G)\\\x12\x7f1\x9a\x94\xc2=\xb2', '\xf1\x8b\xc8\xcf\xce\xe7\x9f\x88\xe3\xb3v\r\x00\x9f\x8f\xb5\x94\x91=[', '\xbem\x80Oi\x8e\xbd\xd3\xe6iv\x1cx\xe5\xfb\xc0J\x9f*\xab', '\x136\xfc\xe3D\x8a\x17\x9e;\xb3\x8c\xf4\xa3\x1f\xce\xd9\xb3\x0e@p', 'v\xb5\xf4!\r\x8d\x1a\x92\xb6\x9b\xe3\xe5\x8b\xb4\xa6\x1a:Z\xb3\t', 'n\xde3\xfdY_\xca\x15\xa9\x9e\xcf+n\x1f\x11\x9b\xcb\x11D\xbd', '"\xefV\xce\x02\x91\xae\xdd\xa3\xee\xe2\xf8\xc4$\xa1\xfdx}\x17\xcf']
Different fingerprints reference the same row key, which is then fetched multiple times, and that creates the duplicates. The row key should only be fetched once in line 230. I will now look further into this. Any input is appreciated :) .
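For illustration only (this is not the actual patch that fixed it), the "fetch each row key once" idea could look roughly like this, with meta_map shaped like the debug output above (fingerprint -> list of (row_key, item) tuples):

# Hypothetical helper: collect each queue row key once, in order of first appearance,
# so a single fetch per row key can replace one fetch per fingerprint.
from collections import OrderedDict

def unique_row_keys(meta_map):
    seen = OrderedDict()
    for records in meta_map.values():
        for row_key, _item in records:
            seen.setdefault(row_key, None)
    return list(seen)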
I printed out my HBase queue after disabling batch creation, using echo "scan 'crawler:queue'" | ./hbase shell > myTest. I was still unable to locate any duplicates within the table with grep. The duplicates should be easily noticeable, since nearly every URL was sent to my crawler more than 4 times.
Right now I am thinking that this is a bug in the HBase backend (as I mentioned in my pull request). Is there any more info you want from me?
Fixed with #94
I am using the development version of distributed-frontera, frontera and scrapy for crawling. After a while my spider keeps getting stuck in a redirect loop. Restarting the spider helps, but after a while this happens:
This does not seem to be an issue with distributed-frontera since I could not find any code related to redirecting there.