PriorityQueue implementation is quadratic for our purposes - __contains__ does linear search

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

1. Run best_first_graph_search() which implements frontier as a  PriorityQueue 
as defined in utils.py

uniform_cost_search() also uses best_first_graph_search() so it demonstrates 
the same problem.

What is the expected output? What do you see instead?

I expect the "in" operator of PriorityQueue to operate in O(n) time, as 
explained on AIMA page 84: "The data structure for frontier needs to support 
efficient membership testing, so it should combine the capabilities of a 
priority queue and a hash table."

It is used e.g. in best_first_graph_search: "child not in frontier".

Instead, when the frontier gets big (thousands of Nodes) it runs like a dog.  
ipython's %prun is good for testing this, pointing out over ten million 
executions of three functions.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 14715773   19.519    0.000   38.611    0.000 utils.py:747(<lambda>)
 17112459   18.401    0.000   21.727    0.000 search.py:105(__eq__)
    13780    9.214    0.001   47.825    0.003 utils.py:338(some)
 17121581    3.334    0.000    3.334    0.000 {isinstance}
     4660    2.314    0.000    4.942    0.001 utils.py:748(__getitem__)

The reason is clear when we look at the code and note this member function of 
class PriorityQueue:

    def __contains__(self, item):
        return some(lambda (_, x): x == item, self.A)

What version of the product are you using? On what operating system?
svn revision r201

Please provide any additional information below.

The Python Queue.PriorityQueue class might be a good thing to build on.

Original issue reported on code.google.com by neal...@gmail.com on 19 Jul 2012 at 5:07

GoogleCodeExporter commented 9 years ago

the Queue.PriorityQueue class is appropriate when you have to handle parallel 
access to the queue (in multi-threads programs). For our purpose, I think the 
heapq module is preferable.

Original comment by glewe...@gmail.com on 20 Jul 2012 at 3:40

GoogleCodeExporter commented 9 years ago

I've written a fix to make the function PriorityQueue.__contains__() run in 
constant time, rather than being linear in the number of items on the queue.  
It simply adds a hash table for membership testing.   I've pushed it to a 
branch on github:

 https://github.com/nealmcb/aima-python/tree/fasterPriorityQueue

It sped up my solution for the second problem in ai-class.org unit 22 "Optional 
NLP Programming" by a huge factor.

Note that __getitem__() and __delitem__() are still linear time, but could 
easily be improved also if necessary.  But in my current usage they aren't 
hotspots.

I didn't switch to use the standard hashq or Queue.PriorityQueue modules since 
they don't provide a constant-time membership test, but it still would seem 
better to use one of them.

My fix also includes another commit which improves the doctests a bit.  They 
now test membership after pops.

A patch is attached.

Original comment by neal...@gmail.com on 30 Jul 2012 at 4:25

Attachments:

fasterPriorityQueue.patch

GoogleCodeExporter commented 9 years ago

Looking up the hash of a key is incorrect because different keys may have the 
same hash, right? I think this could be fixed to store the actual keys, though.

Original comment by wit...@gmail.com on 20 May 2013 at 11:07

mkowoods / aima-python

PriorityQueue implementation is quadratic for our purposes - contains does linear search #31

mkowoods / aima-python

PriorityQueue implementation is quadratic for our purposes - __contains__ does linear search #31

PriorityQueue implementation is quadratic for our purposes - contains does linear search #31