php / php-src

The PHP Interpreter
https://www.php.net
Other
38.14k stars 7.74k forks source link

Segfaults, inconsistency and confusion when adding (+) `int` to typecasted FFI\CData #8874

Open exikyut opened 2 years ago

exikyut commented 2 years ago

Description

The following code:

<?php

$ffi = FFI::cdef('

  typedef long unsigned int size_t;
  void *malloc(size_t size);

');

$ptr = $ffi->cast('uint8_t *', $ffi->malloc(8));

print "A: "; var_dump($ptr);
print "B: "; var_dump($ptr + 1);
print "C: "; var_dump($ptr);
print "D: "; var_dump($ptr + 1);

Resulted in this output:

A: object(FFI\CData:uint8_t*)#4 (1) {
  [0]=>
  int(0)
}
B: object(FFI\CData:uint8_t*)#5 (1) {
  [0]=>
  int(0)
}
C: Segmentation fault

But I expected this output instead: Not sure what would be canonically correct

I get the following backtrace:

#0  0x00007ffff43a7915 in ?? () from /usr/lib/php/20210902/ffi.so
#1  0x00005555558be86d in zend_std_get_properties_for (obj=<optimized out>, purpose=<optimized out>) at ./Zend/zend_object_handlers.c:1880
#2  0x00005555557a77e1 in php_var_dump (struc=0x7ffff5413530, level=1) at ./ext/standard/var.c:163
#3  0x00005555557a7ce1 in zif_var_dump (execute_data=<optimized out>, return_value=<optimized out>) at ./ext/standard/var.c:228
#4  0x000055555589797b in ZEND_DO_ICALL_SPEC_RETVAL_UNUSED_HANDLER () at ./Zend/zend_vm_execute.h:1235
#5  execute_ex (ex=0x7ffff5455e40) at ./Zend/zend_vm_execute.h:55752
#6  0x000055555589eced in zend_execute (op_array=0x7ffff548c000, return_value=0x0) at ./Zend/zend_vm_execute.h:60123
#7  0x00005555558311cd in zend_execute_scripts (type=type@entry=8, retval=retval@entry=0x0, file_count=file_count@entry=3) at ./Zend/zend.c:1790
#8  0x00005555557ccc71 in php_execute_script (primary_file=primary_file@entry=0x7fffffffd760) at ./main/main.c:2538
#9  0x0000555555915673 in do_cli (argc=2, argv=0x555555ac5390) at ./sapi/cli/php_cli.c:969
#10 0x0000555555677f11 in main (argc=2, argv=0x555555ac5390) at ./sapi/cli/php_cli.c:1371

Here's line 1880:

https://github.com/php/php-src/blob/PHP-8.1.7/Zend/zend_object_handlers.c#L1880

Additionally,

This code (which swaps var_dump() for var_export(), to take the alternate path through zend_std_get_properties_for()):

<?php

$ffi = FFI::cdef('

  typedef long unsigned int size_t;
  void *malloc(size_t size);

');

$ptr = $ffi->cast('uint8_t *', $ffi->malloc(8));

print "A: "; var_export($ptr);
print "\nB: "; var_export($ptr + 1);
print "\nC: "; var_export($ptr);
print "\nD: "; var_export($ptr + 1);

Results in this output:

A: FFI\CData::__set_state(array(
))
B: FFI\CData::__set_state(array(
))
C: FFI\CData::__set_state(array(
))
D: PHP Fatal error:  Uncaught TypeError: Unsupported operand types: FFI\CData + int

But I expected this output instead: Not sure what would be canonically correct

hOwEvEr,

ThIs code:

<?php
$ffi = FFI::cdef('

  typedef long unsigned int size_t;
  void *malloc(size_t size);

');

$ptr = $ffi->cast('uint8_t *', $ffi->malloc(27));

// Fill memory with "A..Z" for presentational purposes
for ($i = 0; $i < 26; $i++) {
  $ptr[$i] = $i + 65; // 'A'
}

var_dump(FFI::string($ptr, 16));
var_dump(FFI::string($ptr + 5, 16));
var_dump(FFI::string($ptr + 5, 16));

Produces this output!!

string(16) "ABCDEFGHIJKLMNOP"
string(16) "FGHIJKLMNOPQRSTU"
PHP Fatal error:  Uncaught TypeError: Unsupported operand types: FFI\CData + int

Commentary:

I'm currently using the FFI to mmap() some data into memory (for both read and write). Reading and writing integer values (typecasted to uint16_t/uint32_t) is working perfectly fine (thanks for plumbing everything out so this is as simple as assigning to a PHP array :D), but now I needed to read and write sequences of characters (binary data I want written to specific offsets).

Naturally that's done in C by just assigning to ptr + offset (where ptr might be uint8_t * and offset is int), so I wondered if the same worked here. It... *does*, I think...? Except it crashes. This is why I said I don't know what the canonically-correct outcome here is: this is currently Schrodinger's architecture, where I'm equally observing "yes that's the right way to do it" and *crash* and I don't know where to put the consensus :)

I was reading the values I wanted to mmap() using (PHP-side) fread() so I'll either just stick with that or play with FFI::memcpy() for now. Thanks again for FFI, it's really awesome.

PHP Version

PHP 8.1.7 from sury.org

Operating System

Debian 11.3

exikyut commented 2 years ago

Welp, I thought FFI::memcpy() accepted an independent offset parameter - I hadn't checked. I might not be able to use the FFI for this :(

My attempts to get fancy with FFI:addr() produced Interesting™ results though; extending the last example above...

<?php

$ffi = FFI::cdef('

  typedef long unsigned int size_t; 
  void *malloc(size_t size);

');

$ptr = $ffi->cast('uint8_t *', $ffi->malloc(27));

for ($i = 0; $i < 26; $i++) {
  $ptr[$i] = $i + 65;
}

for ($i = 0; $i < 10; $i++) {
  print "$i: "; var_dump(FFI::addr($ptr) + $i);
}

...Produces (with the exact same backtrace)...

0: object(FFI\CData:uint8_t**)#4 (1) {
  [0]=>
  object(FFI\CData:uint8_t*)#1 (1) {
    [0]=>
    int(65)
  }
}
1: object(FFI\CData:uint8_t**)#1 (1) {
  [0]=>
  NULL
}
2: object(FFI\CData:uint8_t**)#4 (1) {
  [0]=>
  NULL
}
3: object(FFI\CData:uint8_t**)#1 (1) {
  [0]=>
  Segmentation fault
exikyut commented 2 years ago

image

<?php

$ffi = FFI::cdef('

  typedef long unsigned int size_t;
  void *malloc(size_t size);

');

$ptr = $ffi->cast('uint8_t *', $ffi->malloc(27));

for ($i = 0; $i < 26; $i++) {
  $ptr[$i] = $i + 65;
}

var_dump(FFI::string($ptr, 10));

$ptr += 5;

var_dump(FFI::string($ptr, 10));

$ptr += 5;

var_dump(FFI::string($ptr, 10));

string(10) "ABCDEFGHIJ"
string(10) "FGHIJKLMNO"
string(10) "KLMNOPQRST"
cmb69 commented 2 years ago

Naturally that's done in C by just assigning to ptr + offset (where ptr might be uint8_t * and offset is int), so I wondered if the same worked here.

From the docs:

C array elements can be accessed as regular PHP array elements, e.g. $cdata[$offset]

So the example from your second post should be:

<?php

$ffi = FFI::cdef('

  typedef long unsigned int size_t; 
  void *malloc(size_t size);

');

$ptr = $ffi->cast('uint8_t *', $ffi->malloc(27));

for ($i = 0; $i < 26; $i++) {
  $ptr[$i] = $i + 65;
}

for ($i = 0; $i < 10; $i++) {
  print "$i: "; var_dump(FFI::addr($ptr[$i]));
}

However, the docs also state:

C pointers can be incremented and decremented using regular +/-/ ++/–- operations, e.g. $cdata += 5

So print "$i: "; var_dump($ptr++); would also work.

Still, it should be examinated why normal addition doesn't work, and apparently causes segfaults.

cmb69 commented 2 years ago

Still, it should be examinated why normal addition doesn't work, and apparently causes segfaults.

When an integer is added to an FFI pointer, and the pointer has a refcount of 1 (what is the case in the first example), the type ownership is transferred; that causes the segfault. We could remember the type, but I assume that is not done to avoid unnecessary memory usage for the likely more common case, where the pointer itself is increased (or decreased).

Given that array dereferencing is the prefered method to do this, and that it is also possible to work-around this by increasing the refcount (by assigning to another variable), I tend to leave this as is. We should probably improve the documentation regarding this issue.

exikyut commented 2 years ago

Wow I see. Yeah I wouldn't have figured that out too easily :)

Hmm. I'm basically trying to do FFI substr(): here's a ptr, give length bytes starting at offset. That was both my original goal and what I happen to be trying to do once again in a separate context I'm working on right now.

I'm currently using the FFI so I can a) control memory allocation, b) use mmap() (yay!) and c) speed up hot loops where possible. While it's easily argued PHP's overhead makes the last idea moot to some extent, I understand FFI *does* JIT certain operations, and I was hoping everything would align optimally and I'd be able to get my hands on some 90%-outcome/10%-effort racing stripes without having to pull out C :)

Array dereferencing means I have to write for loops inside PHP userspace, which basically slows everything down across the board by O(1*(PHP overhead)) per length, producing ever-larger "this is not a design win, this is wrong, this is wrong" headdesks as length grows. I see FFI::string() as the logical way to ask for longer runs of data. IMHO, I argue this pattern and context needs revisiting - just to loosen it up a little, not necessarily judge it out of the room - in order for the FFI to scale to real-world use cases.

I could see adding an offset parameter to FFI::string(), but that feels somewhat short-sighted. At the end of the day, IMO FFI users are exceedingly likely to have ether personal experience or project-level exposure to ptr + offset notation, so making FFI\CData + int implicitly do C-style pointer offset calculation makes 1000% sense to me. It's not magic behavior if it's documented! :P

Edit: But the offset parameter idea has some merit, on second thought - I think the FFI\CData + int behavior I'm describing would require the creation of a new FFI\CData object on each invocation, whereas FFI::string($ptr, $start, $length); is just an additional parameter to a function I need to call in both cases. And then there's the fact that the only other real way I can access bytes through the FFI is through array dereferencing in any case. Hmm. Perhaps this is a good idea... and perhaps also FFI::memcpy(), FFI::memcmp(), FFI::memset() etc could have optional to_offset and from_offset parameters added for completeness...? :v

The motivation here is to move the for loop from PHP to C. That's about it.


I couldn't help but become very curious at your mention of refcounts, so I wondered what would happen if I added a $hack = &$ptr; just after the malloc() in the very first example - I'm not sure exactly how PHP works in this area but I was thinking this might increase the refcount. I'm not sure what to make of the fact that PHP segfaulted exactly the same.

But while playing around with this I also found another piece of interestingness:

$ffi = FFI::cdef('

  typedef long unsigned int size_t;
  void *malloc(size_t size);

');

$ptr = $ffi->cast('uint8_t *', $ffi->malloc(8192));

for ($i = 0; $i < 26; $i++) {
  $ptr[$i] = $i + 97;
}

var_dump(FFI::string($ptr, 10));

var_dump(FFI::string($ptr + 5, 10));

var_dump(FFI::string($ptr, 9));

This produces:

$ php8.1 xx3.php
string(10) "abcdefghij"
string(10) "fghijklmno"
PHP Fatal error:  Uncaught FFI\Exception: attempt to read over data boundary

This is with allocating 8K. I initially had malloc(27) in there.

However, on my system (fingers crossed this repros elsewhere, heh), if I lower the 9 to an 8 in the last line, I get:

$ php8.1 xx3.php
string(10) "abcdefghij"
string(10) "fghijklmno"
string(8) " ×EõÆU"

Eeek. Interestingly the last character in what falls out is almost always a U (I saw a V once) so I'm probably reading out a struct.

cmb69 commented 2 years ago

But while playing around with this I also found another piece of interestingness:

That's exactly the same issue as before. $ptr + 5 transfers the type ownership. You can prevent that from happening by assigning e.g. $keep = $ptr.