lemire / despacer

C library to remove white space from strings as fast as possible
BSD 3-Clause "New" or "Revised" License

Find indices of white spaces within string #12

Open mingzian opened 3 years ago

mingzian commented 3 years ago

Amazing, amazing work!

Just wanted to mention that I would love to see this tweaked a bit so that, with the help of your simdprune library, one could find the indices of the white-space characters within a given char string instead of removing them. That would be of tremendous help in numerous string-cleaning tasks.
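To make the request concrete, here is a minimal sketch of what such a routine might look like. It is not part of despacer and does not use simdprune: it swaps in a plain SSE2 compare, `_mm_movemask_epi8`, and a bit scan, with a scalar fallback when SSE2 is unavailable. The function name `find_space_indices`, the choice of `' '`, `'\n'` and `'\r'` as white space, and the `uint32_t` output format are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper (not a despacer function): stores the index of every
 * ' ', '\n' or '\r' found in bytes[0..len) into out, in increasing order,
 * and returns how many were found. out must have room for len entries. */
size_t find_space_indices(const char *bytes, size_t len, uint32_t *out) {
    size_t count = 0;
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i space = _mm_set1_epi8(' ');
    const __m128i newline = _mm_set1_epi8('\n');
    const __m128i carriage = _mm_set1_epi8('\r');
    /* Process 16 bytes per iteration. */
    for (; i + 16 <= len; i += 16) {
        __m128i x = _mm_loadu_si128((const __m128i *)(bytes + i));
        __m128i is_ws = _mm_or_si128(
            _mm_cmpeq_epi8(x, space),
            _mm_or_si128(_mm_cmpeq_epi8(x, newline),
                         _mm_cmpeq_epi8(x, carriage)));
        /* One bit per byte: bit j is set when bytes[i + j] is white space. */
        unsigned mask = (unsigned)_mm_movemask_epi8(is_ws);
        while (mask != 0) {
            /* __builtin_ctz assumes a GCC- or Clang-compatible compiler. */
            unsigned bit = (unsigned)__builtin_ctz(mask);
            out[count++] = (uint32_t)(i + bit);
            mask &= mask - 1; /* clear the lowest set bit */
        }
    }
#endif
    /* Scalar loop for the tail, or for the whole input without SSE2. */
    for (; i < len; i++) {
        char c = bytes[i];
        if (c == ' ' || c == '\n' || c == '\r') {
            out[count++] = (uint32_t)i;
        }
    }
    return count;
}
```

Whether this beats a well-optimized scalar loop depends on how dense the white space is, since the inner bit-scan loop runs once per match; it would need benchmarking before drawing conclusions.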

lemire commented 3 years ago

@mingzian Can you elaborate on the applications you have in mind?

mingzian commented 3 years ago

@lemire Absolutely! I have a few in mind, actually. One is extracting or locating words within sentences. It is often wasteful to scan the entire sentence, because one knows either the index of the word needed (word starts nearly always coincide with index 0 or with the position right after a white space), or at least some heuristic for it (the word is in the first or second half of the sentence, etc.). Also, in the algorithms I deal with at work, knowing the char index of each white space helps us search for words within sentences much faster. Another is sentence comparison: if you can efficiently find the locations of the white spaces, you get a lightning-fast first rejection of non-matches, since two sentences whose white-space positions differ cannot be identical.

I think that in any of those tasks, finding the locations of the white spaces faster than the standard scalar loop over each char would be a significant improvement.
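As a rough illustration of the word-lookup use case, the sketch below shows how a precomputed array of white-space indices could give the bounds of the k-th word without rescanning the sentence. The function `word_span` and its parameters are hypothetical, and the sketch assumes words are separated by single white-space characters (consecutive white space would produce empty words).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: given the sorted indices of the white-space
 * characters in a string of length len, computes the half-open range
 * [*start, *end) of the k-th word (0-based).
 * Returns 1 on success, 0 if there is no k-th word. */
int word_span(const uint32_t *spaces, size_t nspaces, size_t len,
              size_t k, size_t *start, size_t *end) {
    if (k > nspaces) {
        return 0; /* n separators delimit at most n + 1 words */
    }
    /* Word 0 starts at index 0; word k starts right after separator k - 1. */
    *start = (k == 0) ? 0 : (size_t)spaces[k - 1] + 1;
    /* The last word runs to the end of the string. */
    *end = (k == nspaces) ? len : (size_t)spaces[k];
    return 1;
}
```

For "the quick brown" (length 15, white space at indices 3 and 9), word 1 spans indices 4 to 9 (exclusive), found in constant time once the index array exists.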

lemire commented 3 years ago

Thanks.