Inefficient Hash Table implementation

trekhleb / javascript-algorithms

📝 Algorithms and data structures implemented in JavaScript with explanations and links to further readings

MIT License

185.78k stars 29.96k forks source link

Inefficient Hash Table implementation #529

Open David263 opened 3 years ago

David263 commented 3 years ago

There is no good reason to publish poorly-performing algorithms, in my opinion. Hash Tables are frequently used in inner loops, so it is important that they run fast.

The hash function used here has much published criticism. It could be replaced by a variant of Pearson Hashing, which runs very quickly and has good statistical properties for avoiding collisions. Note also that careful use of JavaScript datatypes is needed for efficient and fast data storage.

JackyYin commented 3 years ago

I am trying to implement the Pearson Hashing. Please check PR #596

JackyYin commented 3 years ago

The test coverage failed because test case is not covering a local variable inside a function. I am not sure if it is a good idea to test the line coverage in this case. Any suggestion to fix this problem?

David263 commented 3 years ago

If you are asking me, the author of this suggestion, I have no answer. I am not familiar with the testing system that you are using. If a test case it missing, why not add it? I can answer only questions about the Pearson Hashing algorithm, such as how to extend it beyond its natural limit of 8 bits of key data. I have used Pearson Hashing in real products and find that it performs quite randomly, as desired, producing fewer collisions than more common types of hashing functions.

lazarljubenovic commented 3 years ago

There is no good reason to publish poorly-performing algorithms, in my opinion.

The nature of the repo is to provide implementations for learning. They're not production-ready. That said, I agree that bad solutions shouldn't be here. If a solution is sub-optimal but provides insight into a technique, I think it can remain here under “Basic”, with a clear banner explaining why it's not optimal and where to look further to gain deeper understanding.

JackyYin commented 3 years ago

@David263 Thanks, Because it seems like you have a rich experience in Pearson Hashing algorithm, could you give me some advise about my implementation? I will generate a 32-bits signed int hash, which is generated by ORing four 8-bits hash, and right shift 1 bit to make sure I get a positive number(the left most bit should be 0). I am not sure whether it's a good idea to do the right shift, or I should just use an absolute function?

David263 commented 3 years ago

Right shifting can do different things on different hardware. Absolute value produces different bit results on 1's-complement and 2's-complement hardware. For these reasons, it's safer to mask the value using an AND bit operator to clear the sign bit. As to constructing a 31-bit hash, you should use shift and OR operators to move each byte to the proper place in the word. You will probably need a final mask to get the key in the domain of your table, since you probably do not want to devote a full 31 bits of contiguous memory space to the hash buckets (unless your hardware supports sparse arrays). You can find implementations of Pearson hashing for longer word lengths on the Web, when in doubt.

ilyas-wise commented 3 years ago

@David263 , excellent you are gave me a good point thanks I got it some of before a mistake and today memorized where else's has to possible! Thanks