scrapinghub / frontera

A scalable frontier for web crawlers
BSD 3-Clause "New" or "Revised" License

More efficient memory backends #36

Open shaneaevans opened 9 years ago

shaneaevans commented 9 years ago

The memory backends are all implemented using heapq. This allows for some succinct code when supporting different crawl orderings, but it's less efficient than choosing more appropriate data structures for each specific ordering (LIFO, FIFO, etc.).
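
For context, the current pattern looks roughly like this (a sketch, not Frontera's actual code): every ordering is expressed as a sort key on a single heap, so even a plain FIFO or LIFO pays O(log n) per push and pop:

import heapq
import itertools

# Rough illustration of the heapq pattern, not Frontera's actual code:
# every ordering is encoded in the sort key of one heap. An increasing
# counter yields FIFO order; a negated counter yields LIFO order.
counter = itertools.count()

def push_fifo(heap, request):
    heapq.heappush(heap, (next(counter), request))    # oldest popped first

def push_lifo(heap, request):
    heapq.heappush(heap, (-next(counter), request))   # newest popped first

def pop(heap):
    return heapq.heappop(heap)[1]                     # O(log n) either way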

sibiryakov commented 9 years ago

IMO, it's a premature optimization. Most of the time people use the memory backend just to check that the scrapy+frontera+spider mix is working, and to debug issues in other backends. No one expects performance from the memory backend, and it's not used in production. Any other use cases?

shaneaevans commented 9 years ago

It's currently being used internally at Scrapinghub, @sibiryakov, and people are worrying about CPU and memory usage. If these are toy or test implementations, they should be very clearly marked. I favour making some small changes so that they perform as you'd expect. I am not talking about low-level optimizations, but things like using a list (instead of a heap) for a LIFO.
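
For example (a minimal sketch), a plain Python list already behaves as a constant-time stack:

stack = []
stack.append("https://example.com/a")  # push is O(1) amortized
stack.append("https://example.com/b")
newest = stack.pop()                   # pop returns the newest item in O(1)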

kalessin commented 9 years ago

It is also the superclass of the hcf backend, which is really a hybrid (it does not always store requests in hcf; it can also send requests directly to memory).

sibiryakov commented 8 years ago

I rewrote memory LIFO and FIFO using deque in https://github.com/scrapinghub/frontera/pull/81
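
The idea is roughly the following (an illustrative sketch with a hypothetical class name, not the actual PR code): a single deque gives O(1) pushes and pops at either end, so it covers both orderings:

from collections import deque

# Illustrative sketch of the idea, not the actual PR code: one deque
# supports both orderings, with O(1) pushes and pops at either end.
class MemoryDequeQueue:
    def __init__(self, fifo=True):
        self._queue = deque()
        self._fifo = fifo

    def push(self, request):
        self._queue.append(request)

    def pop(self):
        # FIFO takes the oldest end, LIFO the newest; both are O(1).
        return self._queue.popleft() if self._fifo else self._queue.pop()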

sibiryakov commented 8 years ago

Any ideas on what could be more efficient for DFS and BFS than heapq?

wilfre commented 8 years ago

Hi @sibiryakov, I think that lists and sets/dicts (depending on the implementation) should be fine (or any other constant-time-access collection), but -- maybe this is a very silly question -- given that efficient implementations of DFS/BFS use stacks/queues, what's the main difference between FIFO/LIFO and DFS/BFS?

sibiryakov commented 8 years ago

@wilfre DFS stands for depth-first and BFS for breadth-first. If you imagine a link graph, then DFS crawls the links with the greatest distance from the root document (the seeds in our case) first, and BFS the opposite, the smallest. An efficient data structure for such a problem is likely a heap ordered by distance, which is what Frontera currently implements via the heapq module.
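
Roughly like this (a sketch with hypothetical names, not Frontera's actual API):

import heapq
import itertools

# Sketch of the distance-ordered heap idea (hypothetical names, not
# Frontera's API). BFS pops the smallest depth first; negating the
# depth turns the same heap into DFS (deepest-first) ordering.
heap, tiebreak = [], itertools.count()

def schedule_bfs(url, depth):
    heapq.heappush(heap, (depth, next(tiebreak), url))

def schedule_dfs(url, depth):
    heapq.heappush(heap, (-depth, next(tiebreak), url))

def next_url():
    return heapq.heappop(heap)[2]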

wilfre commented 8 years ago

Hi @sibiryakov, yes, I think you're right if we needed to be prepared for an arbitrary distance metric, or if distances were somehow updated. But if we take the distance to be the number of hops from the root (please correct me if I'm wrong), we could use queues and stacks, which are basically FIFO and LIFO data structures, like:

from collections import deque

def bfs(G, s):
    P, Q = {s: None}, deque([s])    # Parents and FIFO queue
    while Q:
        u = Q.popleft()             # Constant-time for deque
        for v in G[u]:
            if v in P: continue     # Already has a parent
            P[v] = u                # Reached from u: u is parent
            Q.append(v)
        # yield u
    return P                        # Parents tree

def dfs(G, s):
    U, S = set(), [s]               # Visited set and list used as stack
    while S:
        u = S.pop()                 # LIFO pop
        if u in U: continue
        U.add(u)                    # Mark it as visited
        S.extend(G[u])              # Could also be S.extend(G[u][::-1])
        yield u                     # Report u as visited

That's why I thought that FIFO/LIFO and BFS/DFS might be closely related. Please let me know if I'm mistaken.

sibiryakov commented 8 years ago

@wilfre Thanks for your suggestion; it looks interesting. Two points:

  1. We don't have a link graph until the documents are crawled, so G would be updated on every iteration in our case.
  2. Your dfs implementation seems wrong: the crawl should proceed from root to leaves, but your implementation behaves more like bfs, iterating first over pages within 1-hop distance.

wilfre commented 8 years ago

@sibiryakov No problem, maybe a simple crawl example could help us validate the differences. About the two points:

  1. There's no need to have the complete graph G beforehand: it's discovered by the DFS/BFS traversal itself. That doesn't change how BFS or DFS should behave, because at each node (page) you can find out who the node's neighbors are (or maybe I'm missing something; see the sketch at the end of this comment).
  2. It's actually this author's implementation (http://www.apress.com/9781430232377, p. 102, iter_dfs), but you can test it with the graph below to get a feel for it (you should see one complete branch visited before the other):
G = {1: [2, 3],
     2: [4], 4: [5], 5: [6], 6: [7], 7: [],
     3: [8], 8: [9], 9: [10], 10: [11], 11: [12], 12: []}
for elem in dfs(G, 1):
    print(elem)
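
And to illustrate the first point, here is a minimal sketch of lazy discovery, where fetch_links is a hypothetical stand-in for downloading a page and extracting its outlinks:

from collections import deque

# Sketch of the first point: BFS does not need the whole graph up front.
# fetch_links is a hypothetical callable that downloads a page and
# returns its outlinks, so neighbors are discovered on demand.
def bfs_crawl(fetch_links, seed):
    seen, queue = {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        for link in fetch_links(url):   # neighbors discovered on demand
            if link not in seen:
                seen.add(link)
                queue.append(link)
        yield url

# e.g. for elem in bfs_crawl(lambda u: G[u], 1): print(elem)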