street-address-rb / street-address

Detect, and dissect, US Street Addresses in strings.
MIT License

Optimized for performance #3

Closed mattruggio closed 9 years ago

mattruggio commented 11 years ago

Thank you for this class, it was extremely helpful in understanding address parsing and normalization within Ruby. For our case, we needed to parse datasets that have 6,000,000+ addresses. In order to use your class, we found some very slight modifications could be made that would make it extremely fast.

Here are some benchmarks using the addresses and intersections supplied in the tests:

Before Optimization

(Time in milliseconds) - Address
(1.0882) - 2730 S Veitch St Apt 207, Arlington, VA 22206
(1.1823) - 44 Canal Center Plaza Suite 500, Alexandria, VA 22314
(1.1187) - 1600 Pennsylvania Ave Washington DC
(1.2637) - 1005 Gravenstein Hwy N, Sebastopol CA 95472
(1.2633) - PO BOX 450, Chicago IL 60657
(1.179) - 2730 S Veitch St #207, Arlington, VA 22206
(2.1652) - Hollywood & Vine, Los Angeles, CA
(2.3919) - Hollywood Blvd and Vine St, Los Angeles, CA
(2.3417) - Mission Street at Valencia Street, San Francisco, CA

After Optimization

(Time in milliseconds) - Address
(0.0129) - 2730 S Veitch St Apt 207, Arlington, VA 22206
(0.0073) - 44 Canal Center Plaza Suite 500, Alexandria, VA 22314
(0.0063) - 1600 Pennsylvania Ave Washington DC
(0.0053) - 1005 Gravenstein Hwy N, Sebastopol CA 95472
(0.0006) - PO BOX 450, Chicago IL 60657
(0.0086) - 2730 S Veitch St #207, Arlington, VA 22206
(0.0143) - Hollywood & Vine, Los Angeles, CA
(0.009) - Hollywood Blvd and Vine St, Los Angeles, CA
(0.0087) - Mission Street at Valencia Street, San Francisco, CA
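
For anyone who wants to produce per-address timings in this shape, here is a minimal sketch using Ruby's standard Benchmark module. It is not the benchmarking script from the commits; the `average_ms` helper and the regex workload are made up for illustration, and the real parser call would go inside the block.

```ruby
require "benchmark"

# Hypothetical helper: average wall-clock time per call, in milliseconds.
def average_ms(iterations = 1_000)
  seconds = Benchmark.realtime { iterations.times { yield } }
  (seconds * 1000.0) / iterations
end

address = "2730 S Veitch St Apt 207, Arlington, VA 22206"

# Stand-in workload (an assumption); swap in the parser call being measured.
ms = average_ms { address =~ /\b\d{5}\b/ }

# Prints one line in the "(time) - address" layout used above.
puts format("(%.4f) - %s", ms, address)
```

Averaging over many iterations smooths out timer noise, which matters when a single call takes only a few microseconds.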

I have supplied 3 commits for you:

  1. Added a simple benchmarking script so you can replicate these tests easily.
  2. Forced the encoding of the address/intersection to parse to US-ASCII.
  3. Made the regular expressions class-level and not instance/method level.

All tests ran successfully during and after refactoring. Let me know your thoughts!
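
To illustrate the idea behind commits 2 and 3, here is a rough sketch, not the actual diff. `TinyAddressParser` and its two patterns are hypothetical and far simpler than the gem's real expressions. The point is only that the regular expressions are built once as class-level constants, and the input is relabeled as US-ASCII before matching.

```ruby
# Hypothetical parser for illustration; not the gem's actual code.
class TinyAddressParser
  # Built once when the class is loaded, not on every parse call.
  ZIP_REGEXP    = /\b(\d{5})(?:-(\d{4}))?\b/
  NUMBER_REGEXP = /\A\s*(\d+)\b/

  def self.parse(text)
    # dup leaves the caller's string untouched; force_encoding only
    # relabels the bytes, it does not transcode them.
    ascii = text.dup.force_encoding(Encoding::US_ASCII)
    {
      number: ascii[NUMBER_REGEXP, 1],
      zip: ascii[ZIP_REGEXP, 1]
    }
  end
end

result = TinyAddressParser.parse("2730 S Veitch St Apt 207, Arlington, VA 22206")
```

Here `result[:number]` is "2730" and `result[:zip]` is "22206". This sketch assumes the input really is plain ASCII; bytes outside that range would leave the relabeled string with an invalid encoding.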

derrek commented 11 years ago

I have to study your change a bit! I've not messed with forcing encodings on a string-by-string basis. My instinct is that, since this class only pertains to US addresses, the change you made is OK, but I'd like to read up a bit more.

mattruggio commented 11 years ago

That's a good point. For our use case, we only needed US address parsing. Since the class is geared toward US addresses, the input seems to fit perfectly fine in the ASCII table (http://www.ascii.cl/htmlcodes.htm). But it does limit the use cases for the US class. I could see it being passed in as an option (something like :ascii or :force_ascii); if set, it will force the encoding. A quick comment header on the function could explain the option and let others know that, if their addresses fit in this encoding, using the option could boost performance. Thoughts?
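
A minimal sketch of what such an opt-in could look like. The `AddressInput` module and its `prepare` method are hypothetical, and `force_ascii` is just the option name floated above, not an existing API. As an extra assumption, this version only relabels input that is already pure ASCII, so other strings pass through unchanged.

```ruby
# Hypothetical helper; nothing here exists in the gem.
module AddressInput
  # Returns a copy of text prepared for parsing.
  # With force_ascii: true, the copy is relabeled as US-ASCII, but only
  # when every character is already ASCII. Callers whose addresses fit
  # in this encoding may see faster matching.
  def self.prepare(text, force_ascii: false)
    copy = text.dup
    copy.force_encoding(Encoding::US_ASCII) if force_ascii && copy.ascii_only?
    copy
  end
end

prepared = AddressInput.prepare("PO BOX 450, Chicago IL 60657", force_ascii: true)
```

Leaving the option off by default keeps the current behavior for anyone passing non-ASCII input.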

derrek commented 9 years ago

Closing on two accounts.

  1. The change is a bit niche. If more people need speed let me know. Some new updates speed it up by about 3X.
  2. I've ignored this for too long and the code has drifted substantially.