shaneaevans opened this issue 9 years ago
IMO, it's a premature optimization. Most of the time people use the memory backend just to check that the scrapy+frontera+spider mix is working, and to debug issues in other backends. No one expects performance from the memory backend, and it's not used in production. Any other use cases?
It's currently being used internally at Scrapinghub, @sibiryakov, and people are worried about CPU and memory usage. If these are toy or test implementations, that should be very clearly marked. I favour making some small changes so that they perform as you'd expect. I'm not talking about low-level optimizations, but things like using a list (instead of a heap) for a LIFO.
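For illustration, here is a minimal sketch (the request names are made up) of the kind of change being suggested: a plain list already gives constant-time LIFO behaviour, while emulating the same order with heapq needs an artificial counter key and pays O(log n) per operation:

```python
import heapq

# A plain Python list already gives O(1) LIFO (stack) semantics.
stack = []
stack.append("req1")
stack.append("req2")
assert stack.pop() == "req2"  # last in, first out

# Emulating the same LIFO with heapq needs a decreasing counter as the
# priority key, and each push/pop costs O(log n) instead of O(1).
heap, counter = [], 0
for req in ("req1", "req2"):
    counter -= 1
    heapq.heappush(heap, (counter, req))
assert heapq.heappop(heap)[1] == "req2"
```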
It is also the superclass of the hcf backend, which is really a hybrid (it does not always store requests in hcf; it can also send requests directly to memory).
I rewrote the memory LIFO and FIFO backends using deque in https://github.com/scrapinghub/frontera/pull/81. Any ideas on what could be more efficient than heapq for DFS and BFS?
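Assuming the PR uses deque in the obvious way, a short sketch of why it fits both orderings (both ends of a deque support O(1) append/pop):

```python
from collections import deque

# FIFO queue: append on the right, pop on the left -- both O(1)
q = deque()
q.append("a")
q.append("b")
assert q.popleft() == "a"  # first in, first out

# LIFO stack: append and pop on the right -- both O(1)
s = deque()
s.append("a")
s.append("b")
assert s.pop() == "b"      # last in, first out
```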
Hi @sibiryakov, I think that lists and sets/dicts (depending on the implementation) should be fine (or any other constant-time-access collection), but -- maybe this is a very silly question -- given that efficient implementations of DFS/BFS use stacks/queues, what's the main difference between FIFO/LIFO and DFS/BFS?
@wilfre DFS stands for depth-first and BFS for breadth-first. If you imagine a link graph, then DFS crawls the links with the biggest distance from the root document (the seeds, in our case) first, and BFS the opposite, the smallest. An efficient data structure for this problem is a heap arranged by distance, which is what Frontera currently implements by means of the heapq module.
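A minimal sketch of the heap-arranged-by-distance idea (the depths and URLs here are illustrative, not Frontera's actual scoring):

```python
import heapq

# Hypothetical frontier entries: (depth_from_seed, url). heapq keeps the
# smallest depth at the top, giving breadth-first order; pushing negated
# depths instead would pop the deepest page first, i.e. depth-first order.
frontier = []
heapq.heappush(frontier, (2, "http://example.com/a/b"))
heapq.heappush(frontier, (0, "http://example.com/"))
heapq.heappush(frontier, (1, "http://example.com/a"))
assert heapq.heappop(frontier)[1] == "http://example.com/"
```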
Hi @sibiryakov, yes, I think you're right if we need to be prepared for an arbitrary distance metric, or if the distance can be updated. But if we consider the distance to be the number of hops from the root (please correct me if I'm wrong), we can use queues and stacks, which are basically FIFO and LIFO data structures, like:
```python
from collections import deque

def bfs(G, s):
    P, Q = {s: None}, deque([s])  # Parents and a FIFO queue
    while Q:
        u = Q.popleft()           # Constant-time for deque
        for v in G[u]:
            if v in P:
                continue          # Already has a parent
            P[v] = u              # Reached from u: u is its parent
            Q.append(v)
        # yield u
    return P                      # Parents tree
```
```python
def dfs(G, s):
    U, S = set(), [s]      # Visited set and a list used as a stack
    while S:
        u = S.pop()        # LIFO pop
        if u in U:
            continue
        U.add(u)           # Mark u as visited
        S.extend(G[u])     # Could also be S.extend(G[u][::-1])
        yield u            # Report u as visited
```
That's why I thought that maybe they were similar or shared some similarity. Please let me know if I'm mistaken.
@wilfre Thanks for your suggestion, it looks interesting. Two points:
@sibiryakov No problem; maybe a simple crawl example could help us validate the differences. About the other points: (1) there's no need to have the complete graph G beforehand (it is discovered by the DFS/BFS traversal); that doesn't change how BFS or DFS should behave, because at each node (page) you can know which nodes are its neighbors (or maybe I'm missing something).
```python
G = {1: [2, 3],
     2: [4], 4: [5], 5: [6], 6: [7], 7: [],
     3: [8], 8: [9], 9: [10], 10: [11], 11: [12], 12: []}

for elem in dfs(G, 1):
    print(elem)
```
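For contrast, running bfs on the same graph (the function and graph are repeated here so the snippet runs standalone) builds a parent tree where each node points one hop back toward the seed:

```python
from collections import deque

def bfs(G, s):
    P, Q = {s: None}, deque([s])  # Parents and a FIFO queue
    while Q:
        u = Q.popleft()
        for v in G[u]:
            if v in P:
                continue
            P[v] = u
            Q.append(v)
    return P

G = {1: [2, 3],
     2: [4], 4: [5], 5: [6], 6: [7], 7: [],
     3: [8], 8: [9], 9: [10], 10: [11], 11: [12], 12: []}

P = bfs(G, 1)
# The seed has no parent; every other node's parent is one hop closer to it
assert P[1] is None and P[2] == 1 and P[3] == 1 and P[4] == 2
```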
The memory backends are all implemented using heapq. This allows for succinct code when switching between crawl orderings, but it's less efficient than choosing a more appropriate data structure for each specific ordering (LIFO, FIFO, etc.).
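To make the trade-off concrete, here is a small sketch showing that a FIFO emulated with heapq needs an artificial counter priority and O(log n) operations, while a deque produces the same order with O(1) operations:

```python
import heapq
from collections import deque

items = ["a", "b", "c"]

# FIFO via heapq: needs a monotonically increasing counter as priority,
# and every push/pop costs O(log n).
heap = []
for i, item in enumerate(items):
    heapq.heappush(heap, (i, item))
heap_order = [heapq.heappop(heap)[1] for _ in items]

# FIFO via deque: O(1) per operation and no priority bookkeeping.
q = deque(items)
deque_order = [q.popleft() for _ in items]

assert heap_order == deque_order == ["a", "b", "c"]
```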