Fix is_ascii function - Githubissues

Validark commented 1 year ago

I tried a couple of methods, and found this to be the fastest version.

travisstaloch commented 1 year ago

Just to add some context. This is a fix for a mis-translation of the c++ code. The original is here and here

from discussion on discord:

The old is_ascii was only returning true for chunks of all 0's

_mm_movemask_epi8(*this) == 0; This function does this: "Create mask from the most significant bit of each 8-bit element in a, and store the result in dst."

Thanks for finding and fixing this :+1: . Just going to give it a quick spin on my machine. Planning to merge afterward.

travisstaloch commented 1 year ago

ran a quick benchmark on test/twitter.json which contains a lot of unicode. seems like a slight speedup. i ran a few times and there were runs that showed a 5% slowdown. but this seems like a solid win given that it improves correctness and doesn't hurt or perhaps improves performance.

$ sudo ../poop/zig-out/bin/poop -d 10000 "zig-out/bin/simdjzonOld $testfile" "zig-out/bin/simdjzon $testfile"
Benchmark 1 (7455 runs): zig-out/bin/simdjzonOld test/twitter.json
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.30ms ±  354us     788us … 2.70ms         13 ( 0%)        0%
  peak_rss           5.18MB ± 1.67MB    3.03MB … 7.22MB          0 ( 0%)        0%
  cpu_cycles         1.52M  ± 71.0K     1.45M  … 2.46M         970 (13%)        0%
  instructions       4.79M  ± 23.0      4.79M  … 4.79M        1685 (23%)        0%
  cache_references   78.1K  ± 2.11K     74.0K  …  144K         353 ( 5%)        0%
  cache_misses       2.10K  ±  124      1.74K  … 3.36K          36 ( 0%)        0%
  branch_misses      3.47K  ±  249      2.98K  … 5.26K         448 ( 6%)        0%
Benchmark 2 (7659 runs): zig-out/bin/simdjzon test/twitter.json
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.27ms ±  362us     803us … 2.59ms          3 ( 0%)        ⚡-  2.5% ±  0.9%
  peak_rss           5.17MB ± 1.66MB    3.03MB … 7.22MB          0 ( 0%)          -  0.2% ±  1.0%
  cpu_cycles         1.49M  ± 58.2K     1.42M  … 2.31M         831 (11%)        ⚡-  2.2% ±  0.1%
  instructions       4.40M  ± 22.9      4.40M  … 4.41M        1723 (22%)        ⚡-  8.0% ±  0.0%
  cache_references   78.1K  ± 1.41K     74.1K  …  129K         301 ( 4%)          -  0.0% ±  0.1%
  cache_misses       2.15K  ±  127      1.72K  … 3.08K          40 ( 1%)        💩+  2.1% ±  0.2%
  branch_misses      3.70K  ±  259      3.17K  … 5.44K         388 ( 5%)        💩+  6.8% ±  0.2%

Thanks again! :heart:

Validark commented 1 year ago

One thing you could try is removing the is_ascii check and unconditionally validating the entire document as utf8. That might get you a speedup when there are a lot of non-ascii characters. The speedup given by this change is when the is_ascii check is almost always true.

travisstaloch / simdjzon

Fix is_ascii function #16