scrapinghub / frontera

A scalable frontier for web crawlers

Redirect loop when using distributed-frontera #87

Closed lljrsr closed 8 years ago

lljrsr commented 8 years ago

I am using the development versions of distributed-frontera, frontera and scrapy for crawling. After a while my spider gets stuck in a redirect loop. Restarting the spider helps, but after a while this happens again:

2015-12-21 17:23:22 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:23 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:24 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:26 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:27 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:32 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-21 17:23:32 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
2015-12-21 17:23:32 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
2015-12-21 17:23:33 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
2015-12-21 17:23:34 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
2015-12-21 17:23:34 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
2015-12-21 17:23:35 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
2015-12-21 17:23:35 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
2015-12-21 17:23:36 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
2015-12-21 17:23:37 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
2015-12-21 17:23:38 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
2015-12-21 17:23:43 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-21 17:23:43 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
...
2015-12-21 17:45:38 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
2015-12-21 17:45:43 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

This does not seem to be an issue with distributed-frontera since I could not find any code related to redirecting there.

sibiryakov commented 8 years ago

Hey! So what kind of behavior do you want from your crawler? It finds a redirect and follows it; it's up to you how to handle such situations.

lljrsr commented 8 years ago

First of all, I find it strange that http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit is the only URL still being requested by the crawler. Secondly, http://www.reg.ru/domain/new-gtlds was discarded after the maximum number of redirects was reached, but this did not happen for http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit. I do not want this kind of behaviour from my crawler (especially the second point).

sibiryakov commented 8 years ago

Please fix the links; I can't make anything of them. Do you have only 2 seeds?

lljrsr commented 8 years ago

Oh sorry, my comment was wrong. I have edited it now. :P What do you mean by "fix the links"? No, I have ~400 seeds. This behaviour occurs after crawling for a while. I could send you the seeds if you want. The URL is actually not part of my seeds file.

sibiryakov commented 8 years ago

Your crawler found (probably) infinite redirect chains. Such WWW artefacts consume crawler resources and do not produce any value, which is why Scrapy has protection against them. I don't know your complete use case, so I can't recommend anything in particular. But you have options: continue following redirects indefinitely (so your crawl will never end), stop after N redirects (as now), or postpone downloading such URLs (hoping that the redirect will disappear or be fixed).

If you want to tweak this mechanism, try tuning the REDIRECT_MAX_TIMES setting (http://doc.scrapy.org/en/latest/topics/settings.html#redirect-max-times).
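
For reference, a minimal sketch of how this could be tuned in a Scrapy project's settings.py (REDIRECT_MAX_TIMES defaults to 20; the values below are only illustrative):

# settings.py (sketch)
REDIRECT_ENABLED = True      # keep the redirect middleware active
REDIRECT_MAX_TIMES = 5       # give up on a URL after 5 redirects instead of the default 20
# Alternatively, disable redirect handling entirely:
# REDIRECT_ENABLED = False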

lljrsr commented 8 years ago

Thanks for your explanation.

The issue was that my crawler kept getting redirected indefinitely, although it should have stopped after 20 redirects. However, I may have overlooked that the DB worker shut down unexpectedly during that time. I will have to look further into that. You can close this issue if you want, and I will open a new one once I know the exact cause.

Right now I have every worker running in its own window. What is a good practice for noticing the latter behaviour (e.g. the DB worker shutting down) without having to switch between multiple windows all the time (spiders, SWs, DWs and the broker each have their own window)?

sibiryakov commented 8 years ago

First, please report any errors causing workers to crash; this would help make Frontera more reliable. There are a few options I see:

lljrsr commented 8 years ago

Thanks a lot :) . I still seem to be having the same issue: one of my crawlers gets stuck in a redirect loop after crawling for a while. When I disable AJAX crawling and redirects altogether, one of my spiders simply stops crawling after a while. I will debug a bit further and report back whenever I find out more about the cause. Of course, any ideas/help from your side would be awesome :) .

lljrsr commented 8 years ago

The spider which stops crawling after a while sends this output when I hit ctrl+c:

Unhandled Error
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 58, in run
    self.crawler_process.start()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 269, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1194, in run
    self.mainLoop()
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1203, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 798, in runUntilCurrent
    f(*a, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 283, in _graceful_stop_reactor
    d = self.stop()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 192, in stop
    return defer.DeferredList([c.stop() for c in self.crawlers])
exceptions.RuntimeError: Set changed size during iteration
2015-12-30 14:41:29 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/scrapy/commands/crawl.py", line 58, in run
    self.crawler_process.start()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 269, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1194, in run
    self.mainLoop()
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 1203, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/base.py", line 798, in runUntilCurrent
    f(*a, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 283, in _graceful_stop_reactor
    d = self.stop()
  File "/usr/local/lib/python2.7/dist-packages/scrapy/crawler.py", line 192, in stop
    return defer.DeferredList([c.stop() for c in self.crawlers])
exceptions.RuntimeError: Set changed size during iteration

sibiryakov commented 8 years ago

This is a Scrapy artefact; it shouldn't be connected with the crawl stopping.

lljrsr commented 8 years ago

Last week I did some debugging and I now think that it is related to this issue. I deactivated redirection, and what happens now is that my spiders stop getting new batches after a while. After some debugging I found out that the spiders get marked as busy, but they are not marked as available again after a few iterations (see here). Because of this, my DB worker stops pushing anything to the partitions after a while. My workaround is to restart the DB worker every few minutes, but that is not a proper solution. Since I also get a lot of those "missing messages" warnings, I think it might be related. I do not know how to best debug this. Do you have any insight?

sibiryakov commented 8 years ago

Please see my comment at https://github.com/scrapinghub/distributed-frontera/issues/24#issuecomment-170623431. It's strange that after a few iterations your spiders are still marked as busy. Maybe it's time to start monitoring the Scrapy downloader queue and the overused buffer contents. You can dump their contents from spider code or use the Scrapy extension https://github.com/scrapy-plugins/scrapy-jsonrpc. Perhaps crawling is stuck for some reason.
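
Enabling the JSON-RPC web service is just a settings change; a minimal sketch, assuming the setting names documented in the scrapy-jsonrpc README:

# settings.py (sketch)
EXTENSIONS = {
    'scrapy_jsonrpc.webservice.WebService': 500,
}
JSONRPC_ENABLED = True
JSONRPC_PORT = [6023, 6073]  # port range to try when binding the service

The running crawler can then be inspected over HTTP without switching terminal windows.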

lljrsr commented 8 years ago

Sounds like a good idea. I will do that.

lljrsr commented 8 years ago

I just found out that my spider is crawling the same URL multiple times (which is probably related to this issue):

2016-01-14 11:36:59 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None)
2016-01-14 11:37:00 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:00 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:02 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:03 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:03 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:06 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:07 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:08 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:09 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:10 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:11 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:12 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:13 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:15 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:16 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:17 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:18 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:19 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:20 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:21 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:21 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None)
2016-01-14 11:37:22 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:23 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:23 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:23 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:24 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:26 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:27 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:28 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:29 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:30 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:31 [manager] DEBUG: (3) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:35 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:36 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:37 [manager] DEBUG: (4) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:38 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:39 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:40 [manager] DEBUG: (4) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:41 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:43 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:44 [manager] DEBUG: (4) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:45 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']
2016-01-14 11:37:46 [scrapy] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=Ff4MmpnUai8>
{'url': 'https://www.youtube.com/watch?v=Ff4MmpnUai8'}
2016-01-14 11:37:47 [manager] DEBUG: (4) PAGE_CRAWLED url=https://www.youtube.com/watch?v=Ff4MmpnUai8 status=200 links=68
2016-01-14 11:37:48 [scrapy] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=Ff4MmpnUai8> (referer: None) ['cached']

So it would definitely be a good idea to have a look at the queue. However, I was unable to locate a function that would allow me to print out the queue, either in the spider code or in the jsonrpc extension. Could you help me and point me in the right direction?

sibiryakov commented 8 years ago

You need to dump the contents of the crawler.engine.downloader.slots dictionary. A reference to the crawler is available in scrapy_jsonrpc and in spider code.
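
A minimal sketch of doing this from spider code, assuming Scrapy 1.x internals (engine.downloader.slots, slot.active, slot.queue) and a hypothetical helper name:

import logging

def dump_downloader_slots(crawler):
    # crawler.engine.downloader.slots maps a slot key (usually the domain)
    # to a Slot holding the active and queued requests for that host
    for key, slot in crawler.engine.downloader.slots.items():
        logging.debug("slot %r: active=%d queued=%d delay=%s",
                      key, len(slot.active), len(slot.queue), slot.delay)

# Inside a spider callback, self.crawler is available, so for example:
#     def parse(self, response):
#         dump_downloader_slots(self.crawler)
#         ...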

lljrsr commented 8 years ago

With max_next_requests set to 160, my DB worker is telling me:

[backend] Finished: tries 3, hosts 7, requests 135
...
[backend] Got 9595 items for partition id 0 

And I can see multiple duplicates among these items. Shouldn't the DB worker push a maximum of 160 items? I do not understand the difference between items and requests here.

lljrsr commented 8 years ago

Here is an example of a duplicate:

['\xb3\x82\xb9\xe6\xe1\x8c\xa0\xc0M3R\xe5\x1e=\r\x8f%T\n\xa3', -1580664942, 'http://www.monsterenergy.com/',  0.47619047619047616]
['\x80-^M\xe0Z\xf1Bu\x98\xd6c/\xc9\xb2\x06G]\x9b?',            -1580664942, 'https://www.monsterenergy.com/', 0.47619047619047616]
['\x80-^M\xe0Z\xf1Bu\x98\xd6c/\xc9\xb2\x06G]\x9b?',            -1580664942, 'https://www.monsterenergy.com/', 0.47619047619047616]

EDIT: I just found out that these duplicates might come from a different referrer. Still, I am not sure whether the DB worker is pushing too many items or if that is the default behaviour.

EDIT2: Okay, they do not come from a different referrer. Here is my output:

Item# URL                           hexlify(rk_fprint)                        Score
27741:http://www.altmp3.com/indian, 961fda478797d14d3e420f2b1a8994ee5e6b24f0, 0.37037037037
27747:http://www.altmp3.com/indian, 961fda478797d14d3e420f2b1a8994ee5e6b24f0, 0.37037037037
27753:http://www.altmp3.com/indian, 961fda478797d14d3e420f2b1a8994ee5e6b24f0, 0.37037037037
27759:http://www.altmp3.com/indian, 961fda478797d14d3e420f2b1a8994ee5e6b24f0, 0.37037037037
27765:http://www.altmp3.com/indian, 961fda478797d14d3e420f2b1a8994ee5e6b24f0, 0.37037037037

sibiryakov commented 8 years ago

Requests are items in this case; we can fix that to avoid confusing other people. Do you use the HBase backend? It looks like a bug. It would be nice if you could reproduce it. Are you using the latest Frontera?

lljrsr commented 8 years ago

Yes, I am using the latest Frontera version and the HBase backend. I can easily reproduce it by using your master branch instead of the hbasefix branch. In this line it fills the results with duplicates.

sibiryakov commented 8 years ago

The question is whether you previously put duplicate results into the queue table. If so, then that's normal behaviour. Can you debug it?

lljrsr commented 8 years ago

I just did a scan of my HBase queue table and cannot find duplicate URLs or fingerprints.

sibiryakov commented 8 years ago

To fix this, we need to reproduce it. The HBase queue first retrieves URLs and then removes them from the table, so if your duplicates were already retrieved and removed you will not see them in later scans. You could try disabling the generation of new batches after some moment (there is a command-line option in the DB worker), or modify the HBase queue code to write data to a second table on scheduling; this table would not be used for retrieving queue items, only for debugging.
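
A hypothetical sketch of that second-table idea (not Frontera's actual code; the table name and helper are made up), mirroring scheduled batches into a debug table via happybase so they survive after the real queue rows are consumed:

import happybase

connection = happybase.Connection('localhost')  # assumed HBase Thrift host

def mirror_scheduled_batch(batch, table_name='crawler:queue_debug'):
    # batch: iterable of (row_key, {column: value}) pairs, i.e. the same puts
    # the real queue table receives on scheduling
    table = connection.table(table_name)
    with table.batch() as b:
        for row_key, data in batch:
            b.put(row_key, data)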

lljrsr commented 8 years ago

These are some duplicates I could find:

In https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/backends/hbase.py#L228 the meta_map looks like this:

fprint ö‘¯àÓË+ì!¥âf½µ½ûc  references this list: 
[('0_0.52_0.53_1453117831535941', ['\x15\xf6\x91\xaf\xe0\xd3\xcb+\xec!\x04\xa5\xe2\x05f\xbd\xb5\xbd\xfbc', -1589438432, 'http://3u3a.deviantart.com/?offset=2490', 0.47619047619047616])]
fprint ‰Ž‡¨JÆùÁóS =†Ž¯‰Ü  references this list: 
[('0_0.52_0.53_1453117831535941', ['\x1f\x89\x8e\x87\xa8J\xc6\xf9\xc1\xf3S =\x86\x8e\xaf\x11\x89\xdc\x0e', -1589438432, 'http://3u3a.deviantart.com/?offset=2480', 0.47619047619047616])]
fprint ø*­^@Ú“а³-V0ës~ì  references this list: 
[('0_0.52_0.53_1453117831535941', ['\xf8*\xad\x18^@\xda\x07\x93\xd0\xb0\xb3-V\x020\xebs~\xec', -1079107546, 'http://dark-yoolia.deviantart.com/', 0.47619047619047616])]

In https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/backends/hbase.py#L230 fprint_map looks like this:

rk 0_0.52_0.53_1453119214454922  references this map: 
['\x89\x9ed\x95j\x89J\x8b\xc7vzW|\x07\\\xc0\xaeN\x92\xe0', '\x94\xda\x0cU\x03=\xe3NB\\[\xea*Gx\xeb\xac|\xc1\xab', 'xQ\xfe?\x07\xc6\x10t\x89:\xbd\xe5T\x96>_\x14\xf0\x8bh', 'L\x06\xd8#\x7f\xab\xbe\xe0j\xcd\xf5\xb7\x12\xae\x97\x92\x18\xa0\x96)', 'w@l\t\xcc\xb6\x08$\xf90cF\xa7s\x92\x1e\x89\xdcin', '\xb9\xc3\xf0\xa3\x1f \xa6\xe7\x02\x8c9\xe0\x9e\xb5\x99\x8c\xdc\x99\xf9\x93', '\xdd^:\x83\xe7\xc3U\x07\x18\xa1\x0b~?`\xbeM29\x8a\xd3', '#/\xe9\xe0UZ\xce$b\xaa\xac\x87\r&\x85\x9e\x8ff\xd1\xa9', '\x12o\xbbQ\x04\xb0x\x8b\xef\xc2`\xeb\xa5>\rR\xcbg~\x84', ']\xa0\xd2\xa7P\x8b\x91R\xde\x18\xb3\xaf"\x0eK\x8e\xfb\xb4\xf3\x9c', '`\x14\xcf\x02\x8a\x19\xdd\rWp\xf4\x7f\xa9\xef\x88S\t\xdd@\x01', '^\xb6\xf35\x06#\xfaq\x97N\xcct\xa8`\x8a0n\xa2\x1f\xc1', '\xfc,\xa1\xec\x1a\x93_\xb3_\xe5\xe8\xe3C.M\x97\xce\x08Zq', '\x99M\x87\xbe?R\xca]\xbb\xd2\xa1\x13\xfb\xa2\x8a\xe4K\x91\xf9`', ';M\xe1\xf5\xd2\xc4\x93S\xc69!9\x83V\xb2\n\xa6\xea\x00\xb7', '\x10\xf5cHka^\xc5G|\x90\xd1\xbe\xa5\xc0\x85\x03\xa1`\xb7', 'T\xa3\xb6\xacg\xcf4B\x8a\xfbY\xe5\xfbk@\xf2\xaa\xc8\x83\xb0', '\x85\xdc\x8aB\x1c\xb9~!\x10\xce\x91\xf7w\xb8c\xb1\xb8\xe4*C', 'h\x94M\x1b\xa6:\t\xb7{\xe4\x93\x08\xf3O\x1e&\xef\xb3\x1cR', '\xa8?\xde\xebN\x00\x06o\xf0\x1b\x8eP\xc13\\\x1e\x9a ds', '4\xbc\x1b\x94d928$\xe8\x02\x9b\x03,Ts_\x9fb\xdf', "\xa56Fx\x17?\xad\xc2J\xbd7\x93x'\xd4\xee\xd1ncn", '\x16\xceI\xf2\x94I5\xc04E\xf1\xe47\x88\xae\x8d+\x997\xb3', 'Yi\xa6\xfd\x9b\xe8\x9c\x8e\xac\xe9\x0fx\xa1\xcb\xfb\xb0!\xbe\x8d\xbe', '\xa2\xe9\xa97\x07ghF(\xbf\x93,U\xb8W\x94\x01/\xf0\xd9', '\xd4\x98\xf3N\xab\xb4\xc9\xb6\x8bL*I\xda|\xc5f\x97\xc5J\x90', '\xa0m6W\xd4\x8b\xa8q\xf5\x1b\xd794\xec\xee{\xdb\xeb\xd9\xbd', '\x91*wG\xa4\x11\xbe\x9a\x18(\xc6\xeb\xa5\x9f_\x04\x03\xf3b\xf3', '+P"\xd1\xe3\x1b[ \x1f\x8aL\xa1\xd7\xda-\xa8\xef\'\xa9\xaf', '\xe9\x8d\x86\x83\xa1\xf3\xb2!1\x91\x8c\xf4\x1df\xf6U\x19\xfb\x12\xb0', '\x82\xc3\x17\xe3Iz\x95\xeam]-\x9a\xcb\xfe\x16\x1c\xcd\x9d\xf1\xa5', '\xc2\x93\xd9\x19q`[\xe7\x07\xc8k\xad>I\x19\\\xb2\xc3N\x1c', '*k6\xfa\xa1\x93\xec\xaf(\x12\xabS\xbc-\x92f\x8a&\xd8\xac', '\xb5+"\xdb\xf3\x1er:\x9b\x92X=\n\xeeaI\xa3f c', '\xf5p\xd9\x9e)\x9a\xda\x8c\x1d\xc5\x88\t\x1b\xa7\xde?\xa2\x98D\xd5', '`\x1d\xb8X\xa6q\x07\x8c\xeb\xaf\xb0\xc5\xc4\xa8\tx\x07\x17\x11\x10', '\xefH\x95\xf7\xf0\xf4\xbb\xffQ\xef1\xf3\x00\x05~\xbb\x05\x15\x14\xb1', '\xf5\xb1\x0c.D\x124\x9e\xc8\xd1\xed\xae5\xf3\x0b\x92P\xe2\xde,', ')\xdb{\xe4\xa2\x08\xb4&9\xcc\xe9\x11\x019\x95$Tp\xfe\xb3', '\xfe\xa4\xcb\xb64X\x06DP\xc4/)\x01\xdd\xd3{D\x98{\x07', 'T\xf8\xa9T\x103b\xd4\xdf\xbb\xc6\xba\xf3\xeb\x88\xcd\xc0\xf5\x12\xd5', '/\xaf\x929\x9b\x86J\xdc3o5\xefB\x94\x86\x16B\x7ff\x83', '\xfa\x18I\x0f\xd4\xc5\x0fj\\\x08\x87pP^\xe8\xcc9^JO', '\xda\xb6\xb6[\x12\xd6\x01@\xc1",\xa2R\x06\x8bo{\xc2\x07\x0c', 'e"\xf3j#y1\xcb\xbf\xbeQ\x12\x14S\xce\xe8y\n{s', '?\x90"h~\x01\xeb\xb4\xe9\xcc\xd8\x10xA\xbb\x9a<\x17+\x95', " \x1a'\xe1\xcc6\xf9\x87\x11]\x9a\xcf\xcdj\xdd\xd9\xc2\xb6:'", '\x1f\xd1\xda\xbbZ\xe2\x81\x08(\x9c\xceI\x1c\xa3\x82\xc4\xa3`\x9a\xd2', '\x07&;\xd3\xa5\x05\xc3\x0e\x1b?3$\x93tUi\xbc\xed*\x1a', "\xeb>%H\xaf.\xb7\xe1'\xad]\x8c\x81@E\x17\xe2P\xe9\x07", '\xa2!O9I$\x17\xf0C|\xce\x9d\xaa\xf4\x0c\x15\x997\xde\x1e', '&\x0e\t\xde\xef\xb2D\xd3!?\xedY\x00\xa5`\n\x97\xf7\x01\x06', 'a\x1c\xc1\xcc\xe0\x8b]\xe4\xeb\xef(\x80\xa0\x8ap(\x87D\xf4\xfa', "\x85\x88\xb0\x9b\x89K'\xa3,\x81\xa3g\xd3(\xf5H}(2\xda", '\x9b\xf1\x02"y\xf9\x94^\xbaG\xfdV\xb6\xef\xad8T\xb1_\xc9', '}\x05fI\xe2>\xf5\xe6\xe6\xa2\xaf\xb5\xcb\x1d\x8f\xf4|\xdf\xc4\xe7', 'N\xa9\xb7\xa1\x8e\xa3\xae\x9f\x1dx\xcbG\xdc\xedU\xda\xf7c\xc2p', ' 
\xc7zz\xc4\x05\xe1h\xa0\x9cb\x80\x9e\xf1\x9aTtk\x03*', '\xf9\xb9\xa72M\xa3\xdf\xad\x1e\xa7\xa9\x953Y&Q\x0f\x9c\xb1?', 'V\xad\xe3\xed\xa4\x9fe~\x92\x0e\xcf\x05\xa3\xde\xf3G2\xc5\x082', '\x16w\xd2\xe9\xec\n\x05k\x14\x95\xcb\xdat\x1aF\xe4G\x86\x9d-', '-\x7f\x1d\n\xda\xdf*\xcf{\xa5\xb9\xb3\xcd\xdf9M\xafw\xfd\xb1', 'y\xcd@w\xf2\x89\xdd\xb9\xd2w\xa9rD\xc8\xefJ\x97\x0c\xaf\xfb', '\xeaG2\x1f\x0f\x8b\x9b\xba\xc8\x8d{\xbfv\xf65Um\x14\xc3\xa0', '\tvp\x8d\xa3\xbecY<\xec\xc9\x8e\x10\xce!\xd6\x8a@HR', '\xf5\x1d\xdd\xd4\xb3\x98\x89\xee\x83\xee\xcd3\x13\xd3\x99\rN\x9e\xcbr', '\xeb\x19\x97>\xe5\xfe^\xa3R\xaa0E\x10\xfdEaGv\x07s', '4\xf1\x85\x8fs-\xcd\x8b\x19\x13\x98\xb6\x18(\xd7o\x1a\xa1\x7f\xbe', "\xd3\xd8*\xda\x13\x8c5K\xad\xb2S.+\xda\xd2'\xaa\xcd\xb5\xe1", '\xb7\x85\xd6\x01S\xee\xb7\xfc\x1a\xa2(\x94\x05N\xebF&O\xea\xa4', 'U\xef\xb2\xa4b\\\x83\x15|{\x9e[\xf5\x8b\x8a0\xb2\x02\xd9\xbe', '\xea6\x16\x0f\xa7\xfa1a\xf7\xb9\x17\xb0\xedP\x8d\x0c\x99+[o', '\xe9\xfa\xc1GV`\x9c\xbc\xf0\x80D\x02\xfd\xcd.\x06\xcdn\x03>', '>\xaa\x92\x08[\x16\xc8\xe9\x16\xe9\xf7D#\xae\xb1l\x90\xbb\xb0\xbc', '\xca\x83o^|\x97e#\xeeT@\xf4\xbb\x89\x06E\xf0{\x82)', '\xcf\xdf*\xc5[\x83\xab\x7f\xe1\x9c\xb3\xb2\xf6\xa4\x1c\xd2\xcd\x0c\xeb\x8d', 'G%F\x82\xf5\xa8\x9b\xaa\xec\xd0\xe48P\xfb\xb7%c\xc68%', '\xb9\x1d\x1e`\xa5\xea\x0b\xcds\x94\xa2\xe8\xe7\x07|\x85nZ,\xcd', '\xd92\x89+\xbc_\xf9\x00\xdf80\xc7PT\x19\xd4\x8c3\xf4\x87', '\xa0\xf6!\xfeu<\xa1\xc5f\xf4zv\x8c6S\xabh\xe0\x82m', '\x1d5\xea\xfa\x0c\xf4r\x84\x87\xf1\xb1pcK\x1d\x0c\x05t\x00\xfd', 'Z\xa4r/\x94\x07\xde\xf9Z-\xaf\x96\x89\xdfM\xb5%\xbb?\x1a', '\xc8\x89\xda\x9e\x95-\x1cV\xe3\xb3\x1c\xca\xf4s\xe3Hl1\x05d', 'V\x96\xb7\n\xb9\x1b\xf5\x1f"I\xc8\x94m\xfe\xe0\xf6%*;\xf4', 'y\xe7\x7f\xea-\x93\xd0R$\xd0\x9a\xb4\x0f\xd3p\x15\xc4\xadFh', '\xf6\xa5\x0f\xcf@G\xe8\x1e\xc6{\x94\xdd\xd5\x10Wx\xf9I\x8fM', '\x0f\xda\x98\xba\xd3\xc9\x9f\xd6\xc9g\x18\xfd\x17\xb2\x90E\xfb\x954\xea', '\xe6\xa3\xd5)\x83\x86\xe3~"\x9c\xe5\x9a\xdc}K\xd4\xd1\x13\x01\x94', 'f\x95\\\xb6\xb4e\xf8\x16\xc2s\xfb\xaaF-\xe1=q_\xf1\x8b', '\x8df#\xd6)\xfb\xf3\xf8\x1c\x0f\x92\xd9\xb3p5\xaez\xd7\t\xdc', '\xc7\xfc\xae\xc8S\x1b.\xf3\x1c\x0b\x8f\x16T\x87X\x16\xd2]U\x9f', '\xa6P\x17\x84?\xb0l\x888\xb6\x8c\x82\xf1(\x14m\x89\xae$\x01', '\xc1\x0b\x00\xf8\xec\xf3(\x1cR\xf4\xbea\n\xfb\xa5\xc4\x9d\x10\xf9\x0c', 'Nq#\x8ey\xea4L\x94G)\\\x12\x7f1\x9a\x94\xc2=\xb2', '\xf1\x8b\xc8\xcf\xce\xe7\x9f\x88\xe3\xb3v\r\x00\x9f\x8f\xb5\x94\x91=[', '\xbem\x80Oi\x8e\xbd\xd3\xe6iv\x1cx\xe5\xfb\xc0J\x9f*\xab', '\x136\xfc\xe3D\x8a\x17\x9e;\xb3\x8c\xf4\xa3\x1f\xce\xd9\xb3\x0e@p', 'v\xb5\xf4!\r\x8d\x1a\x92\xb6\x9b\xe3\xe5\x8b\xb4\xa6\x1a:Z\xb3\t', 'n\xde3\xfdY_\xca\x15\xa9\x9e\xcf+n\x1f\x11\x9b\xcb\x11D\xbd', '"\xefV\xce\x02\x91\xae\xdd\xa3\xee\xe2\xf8\xc4$\xa1\xfdx}\x17\xcf']
rk 0_0.52_0.53_1453119214454922  references this map: 
['\x89\x9ed\x95j\x89J\x8b\xc7vzW|\x07\\\xc0\xaeN\x92\xe0', '\x94\xda\x0cU\x03=\xe3NB\\[\xea*Gx\xeb\xac|\xc1\xab', 'xQ\xfe?\x07\xc6\x10t\x89:\xbd\xe5T\x96>_\x14\xf0\x8bh', 'L\x06\xd8#\x7f\xab\xbe\xe0j\xcd\xf5\xb7\x12\xae\x97\x92\x18\xa0\x96)', 'w@l\t\xcc\xb6\x08$\xf90cF\xa7s\x92\x1e\x89\xdcin', '\xb9\xc3\xf0\xa3\x1f \xa6\xe7\x02\x8c9\xe0\x9e\xb5\x99\x8c\xdc\x99\xf9\x93', '\xdd^:\x83\xe7\xc3U\x07\x18\xa1\x0b~?`\xbeM29\x8a\xd3', '#/\xe9\xe0UZ\xce$b\xaa\xac\x87\r&\x85\x9e\x8ff\xd1\xa9', '\x12o\xbbQ\x04\xb0x\x8b\xef\xc2`\xeb\xa5>\rR\xcbg~\x84', ']\xa0\xd2\xa7P\x8b\x91R\xde\x18\xb3\xaf"\x0eK\x8e\xfb\xb4\xf3\x9c', '`\x14\xcf\x02\x8a\x19\xdd\rWp\xf4\x7f\xa9\xef\x88S\t\xdd@\x01', '^\xb6\xf35\x06#\xfaq\x97N\xcct\xa8`\x8a0n\xa2\x1f\xc1', '\xfc,\xa1\xec\x1a\x93_\xb3_\xe5\xe8\xe3C.M\x97\xce\x08Zq', '\x99M\x87\xbe?R\xca]\xbb\xd2\xa1\x13\xfb\xa2\x8a\xe4K\x91\xf9`', ';M\xe1\xf5\xd2\xc4\x93S\xc69!9\x83V\xb2\n\xa6\xea\x00\xb7', '\x10\xf5cHka^\xc5G|\x90\xd1\xbe\xa5\xc0\x85\x03\xa1`\xb7', 'T\xa3\xb6\xacg\xcf4B\x8a\xfbY\xe5\xfbk@\xf2\xaa\xc8\x83\xb0', '\x85\xdc\x8aB\x1c\xb9~!\x10\xce\x91\xf7w\xb8c\xb1\xb8\xe4*C', 'h\x94M\x1b\xa6:\t\xb7{\xe4\x93\x08\xf3O\x1e&\xef\xb3\x1cR', '\xa8?\xde\xebN\x00\x06o\xf0\x1b\x8eP\xc13\\\x1e\x9a ds', '4\xbc\x1b\x94d928$\xe8\x02\x9b\x03,Ts_\x9fb\xdf', "\xa56Fx\x17?\xad\xc2J\xbd7\x93x'\xd4\xee\xd1ncn", '\x16\xceI\xf2\x94I5\xc04E\xf1\xe47\x88\xae\x8d+\x997\xb3', 'Yi\xa6\xfd\x9b\xe8\x9c\x8e\xac\xe9\x0fx\xa1\xcb\xfb\xb0!\xbe\x8d\xbe', '\xa2\xe9\xa97\x07ghF(\xbf\x93,U\xb8W\x94\x01/\xf0\xd9', '\xd4\x98\xf3N\xab\xb4\xc9\xb6\x8bL*I\xda|\xc5f\x97\xc5J\x90', '\xa0m6W\xd4\x8b\xa8q\xf5\x1b\xd794\xec\xee{\xdb\xeb\xd9\xbd', '\x91*wG\xa4\x11\xbe\x9a\x18(\xc6\xeb\xa5\x9f_\x04\x03\xf3b\xf3', '+P"\xd1\xe3\x1b[ \x1f\x8aL\xa1\xd7\xda-\xa8\xef\'\xa9\xaf', '\xe9\x8d\x86\x83\xa1\xf3\xb2!1\x91\x8c\xf4\x1df\xf6U\x19\xfb\x12\xb0', '\x82\xc3\x17\xe3Iz\x95\xeam]-\x9a\xcb\xfe\x16\x1c\xcd\x9d\xf1\xa5', '\xc2\x93\xd9\x19q`[\xe7\x07\xc8k\xad>I\x19\\\xb2\xc3N\x1c', '*k6\xfa\xa1\x93\xec\xaf(\x12\xabS\xbc-\x92f\x8a&\xd8\xac', '\xb5+"\xdb\xf3\x1er:\x9b\x92X=\n\xeeaI\xa3f c', '\xf5p\xd9\x9e)\x9a\xda\x8c\x1d\xc5\x88\t\x1b\xa7\xde?\xa2\x98D\xd5', '`\x1d\xb8X\xa6q\x07\x8c\xeb\xaf\xb0\xc5\xc4\xa8\tx\x07\x17\x11\x10', '\xefH\x95\xf7\xf0\xf4\xbb\xffQ\xef1\xf3\x00\x05~\xbb\x05\x15\x14\xb1', '\xf5\xb1\x0c.D\x124\x9e\xc8\xd1\xed\xae5\xf3\x0b\x92P\xe2\xde,', ')\xdb{\xe4\xa2\x08\xb4&9\xcc\xe9\x11\x019\x95$Tp\xfe\xb3', '\xfe\xa4\xcb\xb64X\x06DP\xc4/)\x01\xdd\xd3{D\x98{\x07', 'T\xf8\xa9T\x103b\xd4\xdf\xbb\xc6\xba\xf3\xeb\x88\xcd\xc0\xf5\x12\xd5', '/\xaf\x929\x9b\x86J\xdc3o5\xefB\x94\x86\x16B\x7ff\x83', '\xfa\x18I\x0f\xd4\xc5\x0fj\\\x08\x87pP^\xe8\xcc9^JO', '\xda\xb6\xb6[\x12\xd6\x01@\xc1",\xa2R\x06\x8bo{\xc2\x07\x0c', 'e"\xf3j#y1\xcb\xbf\xbeQ\x12\x14S\xce\xe8y\n{s', '?\x90"h~\x01\xeb\xb4\xe9\xcc\xd8\x10xA\xbb\x9a<\x17+\x95', " \x1a'\xe1\xcc6\xf9\x87\x11]\x9a\xcf\xcdj\xdd\xd9\xc2\xb6:'", '\x1f\xd1\xda\xbbZ\xe2\x81\x08(\x9c\xceI\x1c\xa3\x82\xc4\xa3`\x9a\xd2', '\x07&;\xd3\xa5\x05\xc3\x0e\x1b?3$\x93tUi\xbc\xed*\x1a', "\xeb>%H\xaf.\xb7\xe1'\xad]\x8c\x81@E\x17\xe2P\xe9\x07", '\xa2!O9I$\x17\xf0C|\xce\x9d\xaa\xf4\x0c\x15\x997\xde\x1e', '&\x0e\t\xde\xef\xb2D\xd3!?\xedY\x00\xa5`\n\x97\xf7\x01\x06', 'a\x1c\xc1\xcc\xe0\x8b]\xe4\xeb\xef(\x80\xa0\x8ap(\x87D\xf4\xfa', "\x85\x88\xb0\x9b\x89K'\xa3,\x81\xa3g\xd3(\xf5H}(2\xda", '\x9b\xf1\x02"y\xf9\x94^\xbaG\xfdV\xb6\xef\xad8T\xb1_\xc9', '}\x05fI\xe2>\xf5\xe6\xe6\xa2\xaf\xb5\xcb\x1d\x8f\xf4|\xdf\xc4\xe7', 'N\xa9\xb7\xa1\x8e\xa3\xae\x9f\x1dx\xcbG\xdc\xedU\xda\xf7c\xc2p', ' 
\xc7zz\xc4\x05\xe1h\xa0\x9cb\x80\x9e\xf1\x9aTtk\x03*', '\xf9\xb9\xa72M\xa3\xdf\xad\x1e\xa7\xa9\x953Y&Q\x0f\x9c\xb1?', 'V\xad\xe3\xed\xa4\x9fe~\x92\x0e\xcf\x05\xa3\xde\xf3G2\xc5\x082', '\x16w\xd2\xe9\xec\n\x05k\x14\x95\xcb\xdat\x1aF\xe4G\x86\x9d-', '-\x7f\x1d\n\xda\xdf*\xcf{\xa5\xb9\xb3\xcd\xdf9M\xafw\xfd\xb1', 'y\xcd@w\xf2\x89\xdd\xb9\xd2w\xa9rD\xc8\xefJ\x97\x0c\xaf\xfb', '\xeaG2\x1f\x0f\x8b\x9b\xba\xc8\x8d{\xbfv\xf65Um\x14\xc3\xa0', '\tvp\x8d\xa3\xbecY<\xec\xc9\x8e\x10\xce!\xd6\x8a@HR', '\xf5\x1d\xdd\xd4\xb3\x98\x89\xee\x83\xee\xcd3\x13\xd3\x99\rN\x9e\xcbr', '\xeb\x19\x97>\xe5\xfe^\xa3R\xaa0E\x10\xfdEaGv\x07s', '4\xf1\x85\x8fs-\xcd\x8b\x19\x13\x98\xb6\x18(\xd7o\x1a\xa1\x7f\xbe', "\xd3\xd8*\xda\x13\x8c5K\xad\xb2S.+\xda\xd2'\xaa\xcd\xb5\xe1", '\xb7\x85\xd6\x01S\xee\xb7\xfc\x1a\xa2(\x94\x05N\xebF&O\xea\xa4', 'U\xef\xb2\xa4b\\\x83\x15|{\x9e[\xf5\x8b\x8a0\xb2\x02\xd9\xbe', '\xea6\x16\x0f\xa7\xfa1a\xf7\xb9\x17\xb0\xedP\x8d\x0c\x99+[o', '\xe9\xfa\xc1GV`\x9c\xbc\xf0\x80D\x02\xfd\xcd.\x06\xcdn\x03>', '>\xaa\x92\x08[\x16\xc8\xe9\x16\xe9\xf7D#\xae\xb1l\x90\xbb\xb0\xbc', '\xca\x83o^|\x97e#\xeeT@\xf4\xbb\x89\x06E\xf0{\x82)', '\xcf\xdf*\xc5[\x83\xab\x7f\xe1\x9c\xb3\xb2\xf6\xa4\x1c\xd2\xcd\x0c\xeb\x8d', 'G%F\x82\xf5\xa8\x9b\xaa\xec\xd0\xe48P\xfb\xb7%c\xc68%', '\xb9\x1d\x1e`\xa5\xea\x0b\xcds\x94\xa2\xe8\xe7\x07|\x85nZ,\xcd', '\xd92\x89+\xbc_\xf9\x00\xdf80\xc7PT\x19\xd4\x8c3\xf4\x87', '\xa0\xf6!\xfeu<\xa1\xc5f\xf4zv\x8c6S\xabh\xe0\x82m', '\x1d5\xea\xfa\x0c\xf4r\x84\x87\xf1\xb1pcK\x1d\x0c\x05t\x00\xfd', 'Z\xa4r/\x94\x07\xde\xf9Z-\xaf\x96\x89\xdfM\xb5%\xbb?\x1a', '\xc8\x89\xda\x9e\x95-\x1cV\xe3\xb3\x1c\xca\xf4s\xe3Hl1\x05d', 'V\x96\xb7\n\xb9\x1b\xf5\x1f"I\xc8\x94m\xfe\xe0\xf6%*;\xf4', 'y\xe7\x7f\xea-\x93\xd0R$\xd0\x9a\xb4\x0f\xd3p\x15\xc4\xadFh', '\xf6\xa5\x0f\xcf@G\xe8\x1e\xc6{\x94\xdd\xd5\x10Wx\xf9I\x8fM', '\x0f\xda\x98\xba\xd3\xc9\x9f\xd6\xc9g\x18\xfd\x17\xb2\x90E\xfb\x954\xea', '\xe6\xa3\xd5)\x83\x86\xe3~"\x9c\xe5\x9a\xdc}K\xd4\xd1\x13\x01\x94', 'f\x95\\\xb6\xb4e\xf8\x16\xc2s\xfb\xaaF-\xe1=q_\xf1\x8b', '\x8df#\xd6)\xfb\xf3\xf8\x1c\x0f\x92\xd9\xb3p5\xaez\xd7\t\xdc', '\xc7\xfc\xae\xc8S\x1b.\xf3\x1c\x0b\x8f\x16T\x87X\x16\xd2]U\x9f', '\xa6P\x17\x84?\xb0l\x888\xb6\x8c\x82\xf1(\x14m\x89\xae$\x01', '\xc1\x0b\x00\xf8\xec\xf3(\x1cR\xf4\xbea\n\xfb\xa5\xc4\x9d\x10\xf9\x0c', 'Nq#\x8ey\xea4L\x94G)\\\x12\x7f1\x9a\x94\xc2=\xb2', '\xf1\x8b\xc8\xcf\xce\xe7\x9f\x88\xe3\xb3v\r\x00\x9f\x8f\xb5\x94\x91=[', '\xbem\x80Oi\x8e\xbd\xd3\xe6iv\x1cx\xe5\xfb\xc0J\x9f*\xab', '\x136\xfc\xe3D\x8a\x17\x9e;\xb3\x8c\xf4\xa3\x1f\xce\xd9\xb3\x0e@p', 'v\xb5\xf4!\r\x8d\x1a\x92\xb6\x9b\xe3\xe5\x8b\xb4\xa6\x1a:Z\xb3\t', 'n\xde3\xfdY_\xca\x15\xa9\x9e\xcf+n\x1f\x11\x9b\xcb\x11D\xbd', '"\xefV\xce\x02\x91\xae\xdd\xa3\xee\xe2\xf8\xc4$\xa1\xfdx}\x17\xcf']

Different fingerprints reference the same row key, so that row key is fetched multiple times, which then creates the duplicates. The row key should only be fetched once in line 230. I will now look further into this. Any input is appreciated :) .
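
A simplified, hypothetical illustration of that diagnosis (not the actual hbase.py code): if the per-fingerprint loop fetches a queue row once for every fingerprint referencing it, every item in that row is returned multiple times; guarding the fetch with a set of already-seen row keys avoids this:

def fetch_queue_items(meta_map, fetch_row):
    # meta_map: fingerprint -> [(row_key, item), ...]
    # fetch_row: row_key -> list of queue items stored under that row
    results, seen = [], set()
    for fprint, refs in meta_map.items():
        for row_key, _ in refs:
            if row_key in seen:
                continue  # this row was already fetched for another fingerprint
            seen.add(row_key)
            results.extend(fetch_row(row_key))
    return results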

lljrsr commented 8 years ago

I printed out my HBase queue after disabling batch creation with: echo "scan 'crawler:queue'" | ./hbase shell > myTest. I was still unable to locate any duplicates within the table with grep. The duplicates should be easy to notice, since nearly every URL was sent to my crawler more than 4 times. Right now I am thinking that this is a bug in the HBase backend (as I mentioned in my pull request). Is there any more info you want from me?

lljrsr commented 8 years ago

Fixed with #94