Open thedrow opened 6 years ago
Right. Actually, ASCII characters with code points less or equal to 32 could be pruned. It would probably be faster and easier than what is currently done.
Why would any code point below or equal 32 should be stripped?
Can you give me such a character you’d want to preserve in a context where you want to remove white spaces?
We're looking into adapting the code here to PyPy. The code in this repository would be used by 'my string'.strip()
This is the definition of whitespace in PyPy: https://bitbucket.org/pypy/pypy/src/4a06e23782cf5422ad811c7c70eb7d2159961892/rpython/rlib/unicodedata/unicodedb_9_0_0.py#lines-20285
CPython's definition is even harder to understand.
In any case, tabs are stripped for sure.
FWIW, that's the definition for unicode
strings in pypy, where you need to look up in the unicode db to find out what the whitespace chars are. I doubt that's SIMDable. The str.isspace
definition looks like this:
def ll_char_isspace(ch):
c = ord(ch)
return c == 32 or (9 <= c <= 13)
That’s interesting.
Yes, there are extended white space characters, but we probably can deal with them efficiently.
I definitively think some good results are possible but extra work is needed.
If someone wants to help, great... if not I’ll try to find time soon to get early results.
The purpose of this project is for me to learn to use SIMD. If you are willing to mentor, I can code this.
Of course, I am!
We can chat about the problem if you’d like.
Sure. I just sent you a message on Skype.
@lemire If you prefer IRC, I'm on #pypy
on freenode.
Can you reach out to me by email so we can fix something up?
email: lemire@gmail.com
fwiw, https://github.com/lemire/despacer/blob/695c349015587518268b0090fbec46ff9935a959/include/despacer.h#L611 shows how to strip tab, space, cr, & lf.
For the record, if you have UTF-8 strings, the definition of what is a "white space" is a tad complicated... https://en.wikipedia.org/wiki/Whitespace_character
In ASCII, I think @cfbolz's definition is probably the correct one:
def ll_char_isspace(ch):
c = ord(ch)
return c == 32 or (9 <= c <= 13)
@aqrit
It would be more correct to use code like this:
__m128i mask_20 = _mm_set1_epi8( 0x20 );// c==32
__m128i mask_70 = _mm_set1_epi8( 0x70 );// adding 0x70 does not check low 4-bits
// but moves any value >= 16 above 128
//for 9 <= c <= 13:
__m128i lut_cntrl = _mm_setr_epi8(
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00);
__m128i v = ... // your data
__m128i bytemask = _mm_or_si128(_mm_cmpeq_epi8(mask_20, v),
_mm_shuffle_epi8(lut_cntrl, _mm_adds_epu8(mask_70, v)));
// bytemask has 0xFF on ASCII white space
@aqrit Though it depends, looking at the XML and JSON specs, white space is defined as
%x20 / ; Space
%x09 / ; Horizontal tab
%x0A / ; Line feed or New line
%x0D ) ; Carriage return
And thus the code we have now is correct...
const __m128i lut_cntrl = _mm_setr_epi8(
0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
0x00, 0xFF, 0xFF, 0x00, 0x00, 0xFF, 0x00, 0x00);
So I guess there is an issue of semantic regarding how white space is defined.
ofc, setting the sign bits "if in range" can be done with saturation.
__m128i bytemask = _mm_or_si128(_mm_cmpeq_epi8(mask_20, v),
_mm_subs_epu8(_mm_add_epi8(mask_F2, v), mask_7B));
@aqrit
I guess your point is that saturated arithmetic can be more efficient than using comparison instructions, right?
set1/subtract
is probably better than load128/shuffle
, where applicable.
My understanding is that PyPy uses (currently) UTF-32. For UTF-32, here is a vectorizable scalar function that removes all unicode white space as per the specification:
uint8_t nibble1[16] = {156, 16, 16, 16, 16, 18, 16, 16, 48, 49, 17, 1, 1, 1, 0, 96};
uint8_t nibble2[16] = {145, 0, 36, 0, 0, 64, 0, 0, 10, 0, 4, 0, 0, 0, 0, 0};
uint8_t nibble3[16] = {247, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0};
uint8_t nibble4[16] = {7, 8, 112, 128, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
// less naive, but still naive
size_t lu_removewhite(uint32_t * charlist, size_t length) {
size_t pos = 0;
for(size_t i = 0; i < length; i++) {
uint32_t char32 = charlist[i];
uint8_t n1 = nibble1[(char32 >> (32 - 8)) & 0xF];
uint8_t n2 = nibble2[char32 >> (32 - 4)];
uint8_t n3 = nibble3[(char32 >> (32 - 16)) & 0xF];
uint8_t n4 = nibble4[(char32 >> (32 - 12)) & 0xF];
bool is_white_space = ((n1 & n2 & n3 & n4) != 0) && ((char32 &0xFFFF) == 0);
if(!is_white_space){
charlist[pos++] = char32;
}
}
return pos;
}
Recall that the white space unicode code points are, by definition, 0x9, 0xA, 0xB, 0xC, 0xD, 0x20, 0x85, 0xA0, 0x1680, 0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,0x2009, 0x200A, 0x2028, 0x2029, 0x202F, 0x205F, 0x3000
.
@cfbolz seems to hint at a different definition:
def ll_char_isspace(ch):
c = ord(ch)
return c == 32 or (9 <= c <= 13)
But I am not sure how that can be right. In Python 3, the following should return true:
u'\u2029'.isspace()
Anyhow, going back to my code above, the computation of n1, n2, n3, n4
can be vectorized with _mm_shuffle_epi8
so that the four loads are replaced by a few pshufb
instructions.
The trickier part is to shuffle the nibble around.
Of course, I am not certain that PyPy does use UTF-32. That's a guess right now after looking at the code quickly.
@thedrow I'm interested in hearing your thoughts as to how this could be adapted to PyPy in principle given that PyPy is written in what is effectively Python. Is there support for SIMD instructions within Python? I am not aware of any such thing...
Of course, we could use extensions, but then PyPy seems like exactly the wrong platform.
@thedrow Ah. I see that you have some experience interfacing C/C++ code with PyPy... http://omerkatz.com/blog/2015/09/11/using-protobuf-with-pypy
So there might be more hope there than I thought... still, one should be worried about the call overhead. If you need to copy data back and forth, that's not great.
It's possible to pass the underlying buffer of a PyPy string to a C function without copying it, so that wouldn't be a problem.
My intention was to make this the default implementation whenever possible. We'll create an RPython module that will use cpuid to detect if the SSE version we're using is available first. After that, we will use the most optimal implementation by invoking the C code from RPython.
I see no reason why PyPy, which aims to be the fastest Python runtime shouldn't include this code.
@thedrow
And PyPy uses UTF-32, correct?
Internally PyPy uses UTF-8 as far as I can tell. See http://doc.pypy.org/en/latest/objspace.html#newutf8 I'm asking in the PyPy IRC channel now.
@lemire We're both correct. PyPy uses UTF-32 internally but they want to move to UTF-8 as soon as possible.
Because they are whitespaces just like \n and \r.