run4flat / C-Blocks

Embedding a fast C compiler directly into your Perl parser

Wishlist: XS/typemap equivalent automatic function wrapping/type conversion #17

Open tsee opened 7 years ago

tsee commented 7 years ago

This is obviously just a wishlist item. :)

Looking at csub blocks, I can't help but feel that it's the "wrong" API. Writing XSUBs by hand, dealing with the Perl stack and type conversions manually, is something that is exceedingly difficult and OH MY GOD error prone. I dare say that even most XS module authors would struggle some. Having the ability to do it makes perfect sense, but it clearly is a measure of last resort.

I also realize that XS is, ugh, kludgy, and that the typemap system is pretty awful. But it's pretty much what we have at our disposal to solve the problem I described above (unless one were to reinvent it entirely).

As you, David, know, XS::TCC implements something like this. For bystanders' context: XS::TCC exports a function that takes a code string as its argument. That code is scanned (by scary regexes wired together into a makeshift C parser, courtesy of Inline::C) for C function signatures. It then extracts the parameter and return types, tries to locate typemaps for them (either user-supplied or the defaults that XS::TCC ships), and generates C code for the parameter and return value handling.
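
From memory, the user-facing side looks roughly like this (a sketch, not copied from the docs):

use XS::TCC qw(tcc_inline);

tcc_inline q{
    int add_up(int a, int b) {
        return a + b;
    }
};

print add_up(2, 3), "\n";   # prints 5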

So essentially, it does the SANE part of the job of ExtUtils::ParseXS.

Now, a bunch of the code to do this in XS::TCC is a bit meh, but it works and seems reasonably straightforward to port to a use case in C::Blocks. But C::Blocks has the opportunity to have a much better interface! The scan-code-with-regexes thing should really be replaced.

Imagine you had instead:

use C::Blocks;

csub double my_exp(double x, double exponent) {
    return pow(x, exponent);
}

print my_exp(3, 4), "\n";

What this would require is building a C function signature parser using the Perl lexer API. That's not truly hard, but it's likely a fair amount of work (you have more recent experience working with it than I do, so maybe you feel it's not a huge deal).

I think if that bit were implemented and gave a structured parse of the C function signature, I could whip up the code gen reasonably easily...

The API bit that would be missing is that this does not allow for providing custom typemaps. Having a global registry isn't good enough for this (unless it offers local()-like semantics), because that would introduce action at a distance between unrelated libraries. I'm not sure what the cleanest way to do that would be. Maybe some kind of attribute syntax (ugh).

--Steffen

tsee commented 7 years ago

Okay, I made this work in a branch. Please see the commit message. This is proof of principle only.

https://github.com/tsee/C-Blocks/tree/tsee/nasty_automatic_xs_hack

run4flat commented 7 years ago

Oh wow, thanks! I had a different solution in mind, but I suspect I'll need to use a number of things you've done in your commits.

Currently you can declare the type of a scalar variable:

my Some::Type $thing;

C::Blocks then checks Some::Type for code to unpack and repack the variable when it's used in a cblock. The API for all of this is not too bad, but it still needs some refinement.
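
Roughly, the usage looks like this (hand-waving over details):

use strict;
use warnings;
use C::Blocks;
use C::Blocks::Types qw(double);

my double $x = 3;
cblock {
    /* $x is unpacked to a C double here and repacked afterwards */
    $x = $x * $x;
}
print "$x\n";   # 9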

My plan was twofold:

  1. Make a simple way to use typemaps to create a valid C::Blocks::Type package.
  2. Make it possible to indicate a function signature using the C::Blocks type system, which is simple enough (i.e. single-word type names only) that it can be parsed by the keyword handler.

For example:

use C::Blocks::Types qw(double);
csub double my_thing(Some::Type input) {
    ....
}

That's what I was thinking. Do you think it's a good idea? I'll have to look closely at what you have done in your branch to see how much of it maps well onto my approach.

But as it is, I like that you've used a separate keyword. This means I could pull your work and still play with csub.

run4flat commented 7 years ago

I've been thinking more about this over the evening. I realized that I could support two different keywords, csub and cfun. Both would support signatures.

csub would only add a function to the current Perl package. Based on the signature and return value, it would add code to the top and bottom of the function call to perform the necessary stack manipulations.

cfun would both add a function to the local C scope and add a function to the current Perl package. It would translate into a C function almost verbatim, but it would also add a second wrapper function that would handle the data marshaling.
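
To sketch the idea (hand-waving; not literally the code that would be generated), a cfun like square would expand into two pieces, roughly XS-style:

/* 1. The C function itself, dropped nearly verbatim into the local C scope. */
double square(double x) {
    return x * x;
}

/* 2. A wrapper that marshals between the Perl stack and the C function. */
XS(XS_main_square) {
    dXSARGS;
    double x = SvNV(ST(0));                /* unpack the Perl-side argument */
    double result = square(x);             /* call the real C function */
    ST(0) = sv_2mortal(newSVnv(result));   /* repack the return value */
    XSRETURN(1);
}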

An annoying difference would be the availability of the return statement in cfun but not csub. It occurs to me that I could support the return statement in csub. I would handle it by creating a C macro that would set a previously allocated return variable and goto the end of the function, where the cleanup and real return would take place.
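
For instance (names invented, and the real thing would need more care), the macro might be spelled something like this:

/* Set the pre-declared RETVAL, then jump to the cleanup/return block at the end.
   The macro would be #undef'd before the wrapper's own real return statement. */
#define return for (int _pass = 0; _pass < 2; _pass++) \
                   if (_pass) goto _csub_finish; else RETVAL =

/* With that in place, a user-level `return x * x;` assigns x * x to RETVAL on
   the first pass through the loop and jumps to _csub_finish on the second. */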

I'm going to let this percolate a bit more. My current priority is releasing new Alien::TinyCC and Alien::TinyCCx with the latest section alignments that we just figured out.

tsee commented 7 years ago

Sorry for the slowness - life's been busy. My initial response was going to be "I have concerns about these types being global, since they're keyed on Perl namespaces". That would mean, for example, that if one module wants to implement different semantics for arrays of doubles than another does, there'd be clashes.

But then I realized that the "exported constant" trick actually works for these type names, i.e. what C::B::Types already does:

package C::B::Types;
use constant double => 'C::B::Types::NV';
our @EXPORT_OK = qw(double);

package main;
use C::B::Types qw(double); # this could instead import a "double" definition from Foo::Types!
my double $foo;

So that's neat! And potentially this could mean much cleaner reusability than typemaps.

Now, I think the biggest challenge is defining how types map syntactically between C and Perl identifiers. You allude to the fact that this scheme will work for simple, non-composite (i.e. single-word, no-pointer) C types.

In practice, it's only useful if it works for arbitrary C types. AND if the mapping is at least somewhat intuitive. To wit, in the double example, the reader of the code can deduce rather intuitively what the intent is:

use C::Blocks::Types;
cfun double square(double x) {
  double squared = x * x;
  return squared;
}

Seems like magic to a casual reader, but it would work very intuitively!

But let's assume that we have to pick some odd workaround to bridge the syntactic differences between C types and Perl identifiers:

use C::Blocks::Types;
cfun unsigned_int square(unsigned_int x) {
  unsigned int squared = x * x;
  return squared;
}

That's already moderately awful in that it's absolutely not obvious to a casual (but otherwise competent) reader why the syntax differs. In fact, it's very much a false friend! Why can't I use the same type identifier in the C-looking signature and in the actual function body?

Obviously, this can get much worse:

use C::Blocks::Types;
cfun c_string err_msg() {
  char *err_msg = get_error();
  return err_msg;
}

That reads like a total turd to a C programmer, I think. :) (Example chosen for effect.)

The alternative seems to be to actually have REAL C types in the signature and then maintain a mapping from those C types to whatever Perl package implements the mapping. Conceptually, that would require a different way of declaring new type mappings and likely some kind of global-ish registry. Which brings back my originally discarded concern: global registries for this sort of thing can cause collisions between unrelated modules using C::Blocks.

I don't have a great idea for a solution to that last conundrum right now. With typemaps, I purposefully refactored ExtUtils::ParseXS to have classes/objects representing typemaps so that they could be swapped around; XS::TCC supports passing in a custom typemap object for exactly that reason. C::Blocks is different in that it tries to make such information more contextual (as with the type imports above). I just don't see an obvious way to combine the two benefits.

run4flat commented 7 years ago

Lots of thoughts here. The most important one being: There is a lexically-scoped solution to the global registry conundrum. We just use the hints hash! Type libraries could use their import to add entries to the hints hash. For example, suppose you wanted to use char* to refer to C::Blocks::Types::char_array. I could alter C::Blocks::Types to add something like this to the import method:

sub import {
    ...
    $^H{"C::Blocks/type/char*"} = 'C::Blocks::Types::char_array';
}

Notice how the tail end of the hints hash key is char*. Then, when C::Blocks is parsing a cfun, it would look for something roughly like (.*?)\s+(\w+). It would then check the hints hash to see whether the key "C::Blocks/type/$1" exists, and if so, use the associated package's methods to retrieve the type handling.
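
In pseudo-Perl (purely illustrative, not real C::Blocks internals), the compile-time lookup would be something like:

# While scanning a cfun signature at compile time:
my ($c_type, $var_name) = $signature_piece =~ /(.*?)\s+(\w+)/;
my $type_package = $^H{"C::Blocks/type/$c_type"};
die "No C::Blocks type mapping in scope for '$c_type'" if not defined $type_package;
# ...then use $type_package's methods to generate the pack/unpack glue...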

It's worth asking why I couldn't get something similar to work with my declarations. For example, it would be pretty cool if I could say my char* $thing. As you know, my only allows package names and constants whose values are package names.

BUT---and this occurred to me while I was writing my response to your ideas---I could override my with the keyword handler. (I tested something similar on Advent Day 12, where I disabled the BEGIN keyword.) It could look for the exact same sort of type specifications and inject the correct package name before Perl's parser ever encountered anything suspicious. This would make my char* $thing valid!
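
In other words, the keyword handler would quietly rewrite

my char* $thing;

into something Perl's parser already accepts, using the hints-hash mapping above:

my C::Blocks::Types::char_array $thing;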

tsee commented 7 years ago

Hmm. Before I comment further, I'll do some more "have you seen this" showing off. ;)

I think you might find some other prior art on this somewhat interesting. You've independently invented most of what's there already, but I think the Perl JIT compiler prototype isn't well known and there are some reasonably clever things in there (most of them courtesy of Mattia Barbon). In this context:

https://github.com/tsee/jit-experiments/blob/master/Perl-JIT/src/pj_keyword_plugin.h
https://github.com/tsee/jit-experiments/blob/master/Perl-JIT/src/pj_keyword_plugin.cc

In a nutshell, this introduces a keyword "typed" which does something similar to what you're proposing. We didn't have the chutzpah to define a keyword handler for "my", simply out of the belief that that would be asking for trouble (even if it worked for now). You can see it in action here:

https://github.com/tsee/jit-experiments/blob/65cdf192e30a2fc883fb3e7ffe8a13c7f0453077/Perl-JIT/t/900_perf_trivial.t

      typed Double $x = 0.0;
      for (typed Int $i = 1; $i < 1e5; ++$i) {
        $x += $i + 1./$i;
      }

The special keyword for such declarations did always seem a bit icky, I'll admit. I'd still be a bit paranoid about overriding "my".

In the case of the JIT module, we had several passes, more or less: during normal Perl parsing, we'd have the keyword handler. That would attach magic to the current CV(!), which held a registry of any PAD entries with a type. Later, we'd scan the OP tree and convert it to our own AST (undoing some of the execution-focused things the Perl compiler does in the process), and then try to JIT-compile chunks of the AST, referring back to the type annotations in the CV. This had the upside (compared to what you're doing) of not being rigidly attached to the odd "my Foo $foo" syntax that's been dormant in Perl for decades, while having the downside of inventing our own syntax.

Anyway, options. Back to the matter at hand: I still have to think about the hints hash thing a bit. It's clever, but it also conjures up a concern: %^H is strictly lexical, while normal exports are package scoped. That means use C::Blocks::Types qw(double); would have both a lexical and a package-scoped effect, and depending on which C::Blocks construct you use, you'd get one or the other. I'm not sure it's a big concern, but it's at least a bit of a smell, right?

Open question for me: what's the syntax for saying "make these type mappings available in my scope"? Having use C::Blocks::Types qw(Int); make "int" mappings available, for example, is meh because the Perl-side and C-side spellings don't match...

Putting the scoping concern aside, more comments:

tsee commented 7 years ago

Forgot one bit of commentary:

A variation on the hints hash trick could be to use it to store $^H{"C::Blocks/types"} = $id and then use $id to look up a typemap object in a global registry. Does that description make sense? In a nutshell, you'd avoid filling up hints hashes with lots and lots of entries (one per type) and instead install a single "use this set of type maps" id in the current lexical hints hash. The $id-plus-global-registry indirection is only necessary because hints hashes can only store strings (side note: have you read about how they work? The way hints "hashes" are implemented is really rather clever, I think).
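
A very rough sketch of what I mean (all names invented):

# Global registry of typemap sets; the hints hash stores only a string id.
our %TYPE_SET_REGISTRY;
my $next_id = 0;

sub install_type_set {    # called from a type library's import(), i.e. at compile time
    my ($type_set) = @_;
    my $id = $next_id++;
    $TYPE_SET_REGISTRY{$id} = $type_set;
    $^H{"C::Blocks/types"} = $id;    # one lexically scoped entry per scope
}

# Later, while compiling a cfun in some scope:
my $type_set = $TYPE_SET_REGISTRY{ $^H{"C::Blocks/types"} };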

I haven't thought through how this could give us the desired semantics, but it has the possibility of combining some of the upsides of both extreme approaches. Or just combining all the downsides, I don't know.

run4flat commented 7 years ago

I see I have lots of reading to do. :-)

Getting back to function signatures, I propose something else to mull over. I wrote about optional named arguments with defaults on Advent Day 13. It would be really, really nice to include that in any function signature work.

Also, how would any function signature scheme handle arrays? Specifically, in C you need to pass a pointer to the array and an integer with the length. Both of these would come from a single Perl variable, so the Perl-side call would have one fewer argument than the C-side one. One idea that occurs to me is to somehow "join" two C arguments, something like this:

cfun double sum ({double* values, STRLEN N}) {
    double to_return = 0;
    for (STRLEN i = 0; i < N; i++) {
        to_return += values[i];
    }
    return to_return;
}

# In perl
my $data = pack ('d*', ...);
my $sum = sum($data);

# in C
cblock {
    int N = 100;
    double * data = malloc(sizeof(double)*N);
    ... fill data ...
    double total = sum(data, N);
}

Notice the curly brackets around the pointer and length variables in the function signature, and how the Perl call only has a single argument while the C call has two.

Can this be made prettier?

tsee commented 7 years ago

Regarding the reading: Apologies if that sounded patronizing. Not meant that way at all! You just seem to be into that sort of curious stuff, so I figured I'd share. :)

You bring up a really good point regarding the "arity mismatch" between Perl and C types. I don't have a great idea on syntax, to be honest. XS has an obscure way to do some of this stuff, but I don't even remember how it works, because I've always found it a bit of a false economy in my use cases. For arrays it arguably makes some sense, but the even more common use case is strings. And there, such a shortcut is an actively harmful thing: a Perl string has the actual byte string, a length, and a flag that indicates how to interpret the char data (UTF8ness). Depending on what you intend to do with the string in C, you want some variation of SvPV (plus schlepping around the UTF8 flag), SvPVbyte, or SvPVutf8. It's IMO of utmost importance for people to get this choice right, and it's a bit of a case-by-case choice.
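
For reference, the three flavors in C are roughly:

STRLEN len;
char *raw   = SvPV(sv, len);       /* bytes as stored; you also need to consult SvUTF8(sv) */
char *bytes = SvPVbyte(sv, len);   /* force byte (downgraded) semantics */
char *utf8  = SvPVutf8(sv, len);   /* force UTF-8 encoded semantics */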

So in practice, I've often felt it was best to make it explicit by accepting the SV and doing the conversion myself in the C/XS code.

</personal feelings>

Either way, the syntactic and conceptual mapping challenge is interesting but I have no immediate ideas.

tsee commented 7 years ago

I thought a bit more about the "one SV magically gets transformed to several C types/variables" use case.

Essentially, with the current conceptual approach (including the {typeA varA, typeB varB, ...} idea of yours) we're now trying to map from an input space of a list of C types to a bunch of code instead of mapping from a single C type.

We could make all the weird ambiguities work by disambiguating using typemaps. Let me illustrate what I mean using the string example.

Ambiguous incantation:

cfun void something_w_string ({char* str, STRLEN len})

Not clear if this is a UTF8 or a byte string...

Introducing typedefs:

cfun void something_w_byte_string ({PByteChar* str, STRLEN len})
cfun void something_w_utf8_string ({PUTF8Char* str, STRLEN len})

# and for the flexible version, we already have an unambiguous incantation:
cfun void something_w_any_string ({char* str, STRLEN len, int utf8})

The obvious typedefs we'd have to inject somewhere are:

typedef char PByteChar;
typedef char PUTF8Char;

I hope you're cringing as you're reading this. In general, this shows that using a type tuple to indicate mapping preferences is fraught with cognitive overhead. But more specifically, I think it shows that it's really easy to commit conceptual sins: PUTF8Char does not actually represent a UTF8 character. It represents one byte/one C char which in UTF8 can be a whole character or just a part of one. Maybe I could have come up with a better name for the typedef, but I think the alternatives would have left similar conceptual issues on the table.

On top of that, this doesn't really express what the author means very well.

So what about a different syntax (which I think is also invalid C signature syntax, so it should be usable for our purpose):

cfun void something_w_byte_string ( bytes_from_perl_string(str, len) )
cfun void something_w_utf8_string ( utf8_from_perl_string(str, len) )

cfun void something_w_any_string ( string_from_perl_string(str, len, utf8ness) )

And then we'd add a way of defining new such mapping functions just like defining new C type mapping functions.
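
So, for instance, bytes_from_perl_string(str, len) would make the generated wrapper do something like this (with incoming_sv standing in for the argument SV):

/* One incoming SV becomes two C locals for the real function call. */
STRLEN len;
char *str = SvPVbyte(incoming_sv, len);   /* byte semantics, as the mapping's name implies */
something_w_byte_string(str, len);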

It's not beautiful in that it doesn't include the C types, but including those would mean duplicating them (because the mapping implementation would almost certainly be dictating the C types to be used).

The other bit that's a bit irritating to the eye, compared to the {} syntax, is that the foo(varA, varB, ...) syntax looks more out of place in the function signature than adding {} does. Maybe using a stronger visual cue to make these stand out as something "special" would help a bit.

On a side note about implementation: if we manage to avoid using any { characters in the syntax, we could simplify the implementation drastically. In a nutshell, I don't think the currently considered syntax (the {} proposal notwithstanding) allows { to appear in the function signature. That would mean we can just use the perl lexer to scan until it hits the first {, which reliably marks the beginning of the actual function body. Then we can use a bunch of Perl code with regexes to parse the function signature. That's about 100x less effort than building it by hand with the perl lexer, and in fact, ignoring the syntax extension we're discussing, we already have such a parser at hand in XS::TCC. I pushed a slightly fixed-up copy of it to a branch for reference (it doesn't do anything now).

https://github.com/tsee/C-Blocks/tree/tsee/signature_parsing
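
To illustrate what I mean: once the lexer has handed us everything up to the first {, a naive pass over the signature could be as crude as this (the real parser borrowed from Inline::C is more careful):

# Split e.g. "double sum(double* values, STRLEN N)" into its pieces.
my ($ret_type, $name, $args) =
    $sig =~ /^\s*(.+?)\s*\b(\w+)\s*\(([^)]*)\)\s*$/
    or die "Could not parse C function signature: $sig";
my @params = map {
    my ($type, $var) = /^\s*(.+?)\s*\b(\w+)\s*$/;
    +{ type => $type, name => $var };
} grep { /\S/ } split /,/, $args;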

Am I making any sense?

run4flat commented 7 years ago

I am being interrupted every minute while trying to write my response, so I'm going to send a gazillion small messages.

The more I think about it, the more I feel that the arity mismatch should not be a design target. The only time I think it would occur is when we're dealing with arrays, such as strings or packed scalars. I agree that anyone wanting to work with strings should be using Perl's SV*. But what about arrays of numbers? We could try to add syntax support for this kind of thing, but I don't think it's the best use of our time.

run4flat commented 7 years ago

I would rather insist that anybody who wants to support functions with low-level arrays should be required to supply the length in their Perl-side function call. This is in part to keep our lives easy, and to encourage the use of C structs to pass around meaningful information.

run4flat commented 7 years ago

Furthermore, I have worked out a design for a parallel C/Perl object system that supports single inheritance and roles. This would greatly simplify our design, because the C type would then have the exact same name as the Perl package for the same class. I would rather focus our efforts on supporting this kind of object system than on supporting arbitrarily complex C type notation.

tsee commented 7 years ago

Can't wait to hear about it! (Or read the code.)

run4flat commented 7 years ago

See #25