lemire / despacer

C library to remove white space from strings as fast as possible
BSD 3-Clause "New" or "Revised" License
151 stars 14 forks source link

Stip tabs as well #8

Open thedrow opened 6 years ago

thedrow commented 6 years ago

Because they are whitespaces just like \n and \r.

lemire commented 6 years ago

Right. Actually, ASCII characters with code points less or equal to 32 could be pruned. It would probably be faster and easier than what is currently done.

thedrow commented 6 years ago

Why would any code point below or equal 32 should be stripped?

lemire commented 6 years ago

Can you give me such a character you’d want to preserve in a context where you want to remove white spaces?

thedrow commented 6 years ago

We're looking into adapting the code here to PyPy. The code in this repository would be used by 'my string'.strip() This is the definition of whitespace in PyPy: https://bitbucket.org/pypy/pypy/src/4a06e23782cf5422ad811c7c70eb7d2159961892/rpython/rlib/unicodedata/unicodedb_9_0_0.py#lines-20285 CPython's definition is even harder to understand.

In any case, tabs are stripped for sure.

cfbolz commented 6 years ago

FWIW, that's the definition for unicode strings in pypy, where you need to look up in the unicode db to find out what the whitespace chars are. I doubt that's SIMDable. The str.isspace definition looks like this:

    def ll_char_isspace(ch):
        c = ord(ch)
        return c == 32 or (9 <= c <= 13)
lemire commented 6 years ago

That’s interesting.

Yes, there are extended white space characters, but we probably can deal with them efficiently.

I definitively think some good results are possible but extra work is needed.

If someone wants to help, great... if not I’ll try to find time soon to get early results.

thedrow commented 6 years ago

The purpose of this project is for me to learn to use SIMD. If you are willing to mentor, I can code this.

lemire commented 6 years ago

Of course, I am!

We can chat about the problem if you’d like.

thedrow commented 6 years ago

Sure. I just sent you a message on Skype.

thedrow commented 6 years ago

@lemire If you prefer IRC, I'm on #pypy on freenode.

lemire commented 6 years ago

Can you reach out to me by email so we can fix something up?

email: lemire@gmail.com

aqrit commented 6 years ago

fwiw, https://github.com/lemire/despacer/blob/695c349015587518268b0090fbec46ff9935a959/include/despacer.h#L611 shows how to strip tab, space, cr, & lf.

lemire commented 6 years ago

For the record, if you have UTF-8 strings, the definition of what is a "white space" is a tad complicated... https://en.wikipedia.org/wiki/Whitespace_character

In ASCII, I think @cfbolz's definition is probably the correct one:

def ll_char_isspace(ch):
        c = ord(ch)
        return c == 32 or (9 <= c <= 13)
lemire commented 6 years ago

@aqrit

It would be more correct to use code like this:

__m128i mask_20 = _mm_set1_epi8( 0x20 );// c==32
__m128i mask_70 = _mm_set1_epi8( 0x70 );// adding 0x70 does not check low 4-bits
 // but moves any value >= 16 above 128

  //for 9 <= c <= 13:
 __m128i lut_cntrl = _mm_setr_epi8(
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00);

__m128i v = ... // your data

__m128i bytemask = _mm_or_si128(_mm_cmpeq_epi8(mask_20, v),
            _mm_shuffle_epi8(lut_cntrl,  _mm_adds_epu8(mask_70, v)));
// bytemask has 0xFF on ASCII white space
lemire commented 6 years ago

@aqrit Though it depends, looking at the XML and JSON specs, white space is defined as

              %x20 /              ; Space
              %x09 /              ; Horizontal tab
              %x0A /              ; Line feed or New line
              %x0D )              ; Carriage return

And thus the code we have now is correct...

    const __m128i lut_cntrl = _mm_setr_epi8(
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0xFF, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00);

So I guess there is an issue of semantic regarding how white space is defined.

aqrit commented 6 years ago

ofc, setting the sign bits "if in range" can be done with saturation.

__m128i bytemask = _mm_or_si128(_mm_cmpeq_epi8(mask_20, v),
            _mm_subs_epu8(_mm_add_epi8(mask_F2, v), mask_7B));
lemire commented 6 years ago

@aqrit

I guess your point is that saturated arithmetic can be more efficient than using comparison instructions, right?

aqrit commented 6 years ago

set1/subtract is probably better than load128/shuffle, where applicable.

lemire commented 6 years ago

My understanding is that PyPy uses (currently) UTF-32. For UTF-32, here is a vectorizable scalar function that removes all unicode white space as per the specification:

uint8_t nibble1[16] = {156, 16, 16, 16, 16, 18, 16, 16, 48, 49, 17, 1, 1, 1, 0, 96};
uint8_t nibble2[16] = {145, 0, 36, 0, 0, 64, 0, 0, 10, 0, 4, 0, 0, 0, 0, 0};
uint8_t nibble3[16] = {247, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0};
uint8_t nibble4[16] = {7, 8, 112, 128, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};

// less naive, but still naive
size_t lu_removewhite(uint32_t * charlist, size_t length) {
  size_t pos = 0;
  for(size_t i = 0; i < length; i++) {
    uint32_t char32 = charlist[i];
    uint8_t n1 = nibble1[(char32 >> (32 - 8)) & 0xF];
    uint8_t n2 = nibble2[char32 >> (32 - 4)];
    uint8_t n3 = nibble3[(char32 >> (32 - 16)) & 0xF];
    uint8_t n4 = nibble4[(char32 >> (32 - 12)) & 0xF];

    bool is_white_space = ((n1 & n2 & n3 & n4) != 0) && ((char32 &0xFFFF) == 0);
    if(!is_white_space){
      charlist[pos++] = char32;
    }
  }
  return pos;
}

Recall that the white space unicode code points are, by definition, 0x9, 0xA, 0xB, 0xC, 0xD, 0x20, 0x85, 0xA0, 0x1680, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,0x2009, 0x200A, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000.

@cfbolz seems to hint at a different definition:

    def ll_char_isspace(ch):
        c = ord(ch)
        return c == 32 or (9 <= c <= 13)

But I am not sure how that can be right. In Python 3, the following should return true:

u'\u2029'.isspace()

Anyhow, going back to my code above, the computation of n1, n2, n3, n4 can be vectorized with _mm_shuffle_epi8 so that the four loads are replaced by a few pshufb instructions.

The trickier part is to shuffle the nibble around.

Of course, I am not certain that PyPy does use UTF-32. That's a guess right now after looking at the code quickly.

lemire commented 6 years ago

@thedrow I'm interested in hearing your thoughts as to how this could be adapted to PyPy in principle given that PyPy is written in what is effectively Python. Is there support for SIMD instructions within Python? I am not aware of any such thing...

Of course, we could use extensions, but then PyPy seems like exactly the wrong platform.

lemire commented 6 years ago

@thedrow Ah. I see that you have some experience interfacing C/C++ code with PyPy... http://omerkatz.com/blog/2015/09/11/using-protobuf-with-pypy

So there might be more hope there than I thought... still, one should be worried about the call overhead. If you need to copy data back and forth, that's not great.

cfbolz commented 6 years ago

It's possible to pass the underlying buffer of a PyPy string to a C function without copying it, so that wouldn't be a problem.

thedrow commented 6 years ago

My intention was to make this the default implementation whenever possible. We'll create an RPython module that will use cpuid to detect if the SSE version we're using is available first. After that, we will use the most optimal implementation by invoking the C code from RPython.

I see no reason why PyPy, which aims to be the fastest Python runtime shouldn't include this code.

lemire commented 6 years ago

@thedrow

And PyPy uses UTF-32, correct?

thedrow commented 6 years ago

Internally PyPy uses UTF-8 as far as I can tell. See http://doc.pypy.org/en/latest/objspace.html#newutf8 I'm asking in the PyPy IRC channel now.

thedrow commented 6 years ago

@lemire We're both correct. PyPy uses UTF-32 internally but they want to move to UTF-8 as soon as possible.