tinyplasticgreyknight / modern-docs


finish listing all the C library functions and types #1

Open tinyplasticgreyknight opened 9 years ago

IreneKnapp commented 9 years ago

Right - and don't forget that in addition to things that are exported as functions, there are several structs of function pointers which should be documented, too. It might be possible to figure out, but let me give you the rationales for why each exist, informally, and you can word this however you want:

As usual, this got really long. I'm afraid I got impatient by the end, but at least it's a starting point! Explaining this did serve to remind me of the weaknesses of the current design, which is definitely a plus.

void * client_state

Probably obvious, but everywhere this appears, it's an opaque value that's passed by client code alongside callback functions anywhere that it's providing them, and is in turn passed through to the callbacks when they're later invoked. The docs should be as clear as possible about which client_state is used for any given callback invocation.
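As a rough illustration of that pass-through, here is a minimal sketch. Every name in it (sketch_registration, sketch_invoke, sketch_counter, count_call) is hypothetical and not part of the library; it only shows the library side handing the same opaque pointer back to the callback it was registered with.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: a callback paired with the client_state it was registered with. */
struct sketch_registration {
    void (*callback)(void *client_state, int event);
    void *client_state;
};

/* Library side: never looks inside client_state, only hands it back. */
static void sketch_invoke(const struct sketch_registration *registration,
                          int event)
{
    assert(registration != NULL && registration->callback != NULL);
    registration->callback(registration->client_state, event);
}

/* Client side: the only code that knows what client_state points at. */
struct sketch_counter {
    int calls;
    int last_event;
};

static void count_call(void *client_state, int event)
{
    struct sketch_counter *counter = client_state;
    counter->calls++;
    counter->last_event = event;
}
```

A client would fill in a sketch_registration with count_call and a pointer to its own sketch_counter, and every later sketch_invoke updates that same counter.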

Also I'm not thrilled at the word "callback" because, as you'll see, there are a few places where it might be a call-forward in the sense that the client code is passing a function pointer to be called later, but the function pointer could actually be implemented by either client code or library code depending on the usage scenario. Feel free to get creative with the wording.

typedef void modern;

This is the type of a node. It's completely opaque, and would normally be used as modern * since of course you can't have a void but can have a pointer to it. Note that it's opaque not just to client code, but also everywhere in the library except for the implementation of the default modern_node_representation.... that's part of a feature whereby language bindings that would prefer to use their native objects rather than the C heap, can do so by providing their own node representation. See more on this with the node-representation stuff below.

typedef void modern_context;

The docs you read probably explained this, but a context is notionally a hash table keyed by the structural hash of every type that's been introduced. The idea here is that it should always be valid to concatenate two schemas, as long as the magic number and the "... okay, now here's the data part" stream event are omitted. So high-level types that are provided as part of the system, but don't require their own node types or builtins, are just magically already part of the initial context. These will include helpers that build tagged record types, for example.

typedef void modern_library;

This is the opaque object that's passed to just about everything in the system. It has no mutable state, and in fact the test suite checks this. It exists as a container for the basic callbacks that make the system work; the idea is that having an opaque container means there's no need to deal with bizarre conditions that could arise as a result of swapping a callback for another halfway through.

It would be useful for the docs to walk through the steps to getting one of these set up...

extern modern_library *modern_library_initialize
  (struct modern_error_handler *error_handler,
   struct modern_allocator *allocator,
   struct modern_node_representation *node_representation,
   void (*finalizer)(void *client_state),
   void *client_state);

The error handler and the allocator structures need to be created by client code, since there's no library function that gives them. They're separated from each other both on general principles and because the allocator, but not the error handler, is also needed as part of constructing the default node representation, which needs to be done prior to initializing the library since it's used here. :)

The node representation may be created by either client or library code; see my remarks on it specifically, further down.

struct modern_error_handler

The error handler is actually a whole list of callbacks, one for each condition that can arise; a polymorphic error object would be more idiomatic in many ways, but constructing that object would require allocation, which makes it pretty useless when the condition being reported is an out-of-memory. I know most software doesn't treat OOM as a condition that could actually happen anymore, but...
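To make the "one callback per condition, no allocation" idea concrete, here is a small sketch. The struct layout, the callback signature, and the helper sketch_checked_result are all assumptions made for illustration; only the callback name memory comes from this discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-entry handler; the real struct has more callbacks. */
struct sketch_error_handler {
    void (*memory)(void *client_state, size_t requested_size);
    void (*other)(void *client_state);
};

/* Client-owned record of what went wrong. It already exists before any
   failure, so reporting an error never needs to allocate. */
struct sketch_error_log {
    int out_of_memory;
    size_t failed_size;
};

static void on_memory(void *client_state, size_t requested_size)
{
    struct sketch_error_log *log = client_state;
    log->out_of_memory = 1;
    log->failed_size = requested_size;
}

/* Library-side helper (hypothetical): checks an allocation result and
   reports failure by calling the matching callback directly. */
static void *sketch_checked_result(const struct sketch_error_handler *handler,
                                   void *client_state,
                                   void *allocation, size_t size)
{
    assert(handler != NULL && handler->memory != NULL);
    if (allocation == NULL)
        handler->memory(client_state, size);
    return allocation;
}
```

The point of the design is visible in sketch_checked_result: on the out-of-memory path nothing is constructed, so the report itself cannot fail for lack of memory.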

For a few of these callbacks, I've perhaps been overly cute with their names. memory is the only memory condition that the library is in a position to detect; that is, it's called when alloc or realloc return NULL. I suppose I'll have to poke into the code to see what most of the rest of these mean; that's unfortunate... I believe these are all conditions that I actually encountered while writing the library, and couldn't structure things to make impossible. It's probably not the final complete set of them though.

struct modern_allocator

These type signatures aren't quite the same as the standard malloc(), free(), and realloc(); they have a client_state in them. That means C-based client code will need to provide its own wrappers. Probably very few language bindings will need to use client_state here, since allocator state is among the things that language runtimes have already figured out how to share, but it's there just in case.
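The wrappers a C client would need might look roughly like the following. The struct layout and the field names (alloc, release, resize) are guesses for the sake of a compilable sketch; check the real header for the actual names and signatures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical layout: like malloc/free/realloc, plus a client_state. */
struct sketch_allocator {
    void *(*alloc)(void *client_state, size_t size);
    void (*release)(void *client_state, void *memory);
    void *(*resize)(void *client_state, void *memory, size_t size);
};

static void *wrap_alloc(void *client_state, size_t size)
{
    (void) client_state;  /* unused: the C heap needs no extra state */
    return malloc(size);
}

static void wrap_release(void *client_state, void *memory)
{
    (void) client_state;
    free(memory);
}

static void *wrap_resize(void *client_state, void *memory, size_t size)
{
    (void) client_state;
    assert(size > 0);  /* realloc with size 0 is not portable */
    return realloc(memory, size);
}

static const struct sketch_allocator sketch_c_heap_allocator = {
    wrap_alloc, wrap_release, wrap_resize
};
```

A plain C client can pass NULL as the client_state here, since the wrappers ignore it.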

struct modern_process

Not to be confused with modern_processor... In the docs I'd definitely describe this after modern_stream. It just needed to be declared first.

So there's a tangle of objects here. Be sure to look at the two diagrams in Documentation/, I like to think they make this a lot clearer.

Notionally, in the case of deserializing from a binary input stream, the processor is responsible for the overall control flow. The process struct is owned by the processor, and passed by it to each stream function the processor calls. The process struct contains callbacks abort and flush which can be called by the stream functions to control the overall flow.

For the abort case, this could have been implemented as a return value from the stream functions. I felt like this was a nicer design, mostly because it's more uniform since it works the same way as the flush case. It sets a flag and causes the processor to break out of its run if it's in one (that's meaningless if it's a step instead), and in any event take no other action once the stream function returns, except to clean up.

The flush case is a no-op for an input processor; for an output processor it makes sure that everything that's been requested has actually been sent outward via the vfile callbacks. That's useful if you're constructing an isolated snippet and need to explicitly make sure it's complete. ... Or it might be, at least; I can't quite convince myself either way. It seemed necessary at some point.
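The abort behaviour described above can be sketched as follows. This is an illustration under assumptions, not the library's code: the struct layouts, the stream function signature, and all sketch_ names are invented, and only the callback names abort and flush come from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process struct: callbacks a stream function may call. */
struct sketch_process {
    void (*abort)(void *process_state);
    void (*flush)(void *process_state);
};

/* Hypothetical processor-owned state behind the process callbacks. */
struct sketch_processor_state {
    struct sketch_process process;
    int aborted;
};

static void processor_abort(void *process_state)
{
    struct sketch_processor_state *state = process_state;
    state->aborted = 1;  /* only sets a flag; the run loop checks it */
}

static void processor_flush(void *process_state)
{
    (void) process_state;  /* no-op for an input processor */
}

typedef void (*sketch_stream_fn)(struct sketch_process *process,
                                 void *process_state,
                                 void *stream_state, int item);

/* Runs until the input is exhausted or a stream function aborts.
   Returns how many items were handed to the stream function. */
static size_t sketch_run(struct sketch_processor_state *state,
                         const int *items, size_t count,
                         sketch_stream_fn on_item, void *stream_state)
{
    size_t delivered = 0;
    assert(state != NULL && on_item != NULL);
    state->process.abort = processor_abort;
    state->process.flush = processor_flush;
    state->aborted = 0;
    while (delivered < count && !state->aborted) {
        on_item(&state->process, state, stream_state, items[delivered]);
        delivered++;
    }
    return delivered;
}

/* Example stream function: sums items, aborts on a negative one. */
static void stop_on_negative(struct sketch_process *process,
                             void *process_state,
                             void *stream_state, int item)
{
    int *sum = stream_state;
    if (item < 0)
        process->abort(process_state);
    else
        *sum += item;
}
```

Note that the stream function returns normally after calling abort; the processor notices the flag only once control comes back to it, which matches the "take no other action once the stream function returns" description.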

struct modern_stream

So, as a schema is deserialized, the callbacks in this struct are responsible for statefully building the node representation. This could also be thought of as analogous to a SAX-style API for parsing XML... but it's become clear to me that unless the task at hand doesn't care at all about the node structure, it would be a bad idea for client code to do that. Like the XML scenario, the callbacks here are responsible for keeping track of and understanding containment relationships. Unlike XML, the stream functions also have to handle type checking... right now the library doesn't have an explicit way to call the type checker or evaluator, in part because neither of those rather critical components is actually finished. :) I'm pretty sure that having the library do that very fiddly task is most of the value of the library, meaning that you'd almost always actually want the node-based interface.

There is definitely at least one use-case for making the stream callbacks swappable, though... bear with me as I set up more background. :)

The case described above is input (deserialization), and in that scenario, the library provides the processor, which understands either the binary format or the text-based "explicatory" format depending on how it was set up, then invokes the appropriate stream events. The stream callbacks are likely to also have been provided by the library, of course, but client code is responsible for passing them in, and in theory could use its own.

For output (serialization), it's similar; there's no function to create one right now, but the library should certainly provide a processor which traverses a passed-in node tree and outputs the appropriate stream events. The set of possible events is the same, so the mechanism of output is that the processor invokes stream callbacks which, instead of building a node tree, write output in the low-level format.

The utility of swapping the callbacks is that it needs to be possible to write code that translates between the binary and textual formats.
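A toy version of that swap, with everything simplified: the two-event sketch_stream, the sketch_emit stand-in for a processor, and both stream implementations are hypothetical and far smaller than the real event set. The same emitter drives either a stream that accumulates an in-memory result or one that writes a textual form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical two-event stream; the real struct has many more events. */
struct sketch_stream {
    void (*start)(void *stream_state);
    void (*integer)(void *stream_state, int value);
};

/* Stand-in for a processor: walks some input and emits stream events. */
static void sketch_emit(const struct sketch_stream *stream,
                        void *stream_state,
                        const int *values, size_t count)
{
    assert(stream != NULL);
    stream->start(stream_state);
    for (size_t i = 0; i < count; i++)
        stream->integer(stream_state, values[i]);
}

/* Stream A: accumulates an in-memory summary (a node-tree stand-in). */
struct sketch_summary { size_t count; int total; };

static void summary_start(void *stream_state)
{
    struct sketch_summary *summary = stream_state;
    summary->count = 0;
    summary->total = 0;
}

static void summary_integer(void *stream_state, int value)
{
    struct sketch_summary *summary = stream_state;
    summary->count++;
    summary->total += value;
}

/* Stream B: writes a made-up textual form into a fixed buffer. */
struct sketch_text { char buffer[64]; size_t used; };

static void text_start(void *stream_state)
{
    struct sketch_text *text = stream_state;
    text->used = 0;
    text->buffer[0] = '\0';
}

static void text_integer(void *stream_state, int value)
{
    struct sketch_text *text = stream_state;
    size_t room = sizeof text->buffer - text->used;
    int written = snprintf(text->buffer + text->used, room,
                           "(int %d)", value);
    if (written > 0 && (size_t) written < room)
        text->used += (size_t) written;
    else
        text->buffer[text->used] = '\0';  /* drop a truncated item */
}

static const struct sketch_stream sketch_summary_stream = {
    summary_start, summary_integer
};
static const struct sketch_stream sketch_text_stream = {
    text_start, text_integer
};
```

Calling sketch_emit with sketch_summary_stream or with sketch_text_stream changes what happens to the events without touching the emitter, which is the property a format translator would rely on.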

... I'm not totally convinced by this anymore though; it feels as though the roles of the processor and the stream overlap a fair amount and should be combined into one. Maybe it's just fatigue from writing this long answer? I'll give it more thought, and let me know if you have any.

struct modern_vfile

Relatively standard trick. It abstracts the notion of a file so that the use of stdio can be avoided entirely if necessary; also, this lets the "file" actually be a memory buffer.
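For instance, a read callback backed by a memory buffer could look like this. The struct layout and callback signatures are assumptions for illustration (only the names read and write come from this thread), and sketch_memory_file is a hypothetical client-side type.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout; the real signatures may differ. */
struct sketch_vfile {
    size_t (*read)(void *vfile_state, unsigned char *buffer, size_t length);
    size_t (*write)(void *vfile_state, const unsigned char *buffer,
                    size_t length);
};

/* Client-side state: a read cursor over bytes that are already in memory. */
struct sketch_memory_file {
    const unsigned char *data;
    size_t size;
    size_t offset;
};

/* Copies up to length bytes and advances the cursor.
   Returns the number of bytes copied; 0 means no data is left. */
static size_t memory_read(void *vfile_state, unsigned char *buffer,
                          size_t length)
{
    struct sketch_memory_file *file = vfile_state;
    size_t remaining;
    size_t count;
    assert(file->offset <= file->size);
    remaining = file->size - file->offset;
    count = length < remaining ? length : remaining;
    memcpy(buffer, file->data + file->offset, count);
    file->offset += count;
    return count;
}

/* Read-only "file": no write callback is provided in this sketch. */
static const struct sketch_vfile sketch_memory_vfile = { memory_read, NULL };
```

Nothing here touches stdio, and the consumer only ever sees sequential reads, which is all this interface is meant to offer.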

There's deliberately no seek, tell, flush, open, or close functionality. Whether those are possible and how they work are specific to the nature of the backend, and this code isn't supposed to need them. This is not going to be sufficiently versatile for situations where one only wants to read a portion of a file into memory, or where one wants to modify it in-place, and I'm unclear how to enable either of those. More work is definitely needed, but I'm not sure it'll be practical to play around with that until the basic functionality is in place.

I'm pretty sure there are no scenarios where both read and write are needed, and it might make a lot more sense to pass the applicable one instead of having this struct. Thoughts?

struct modern_processor

Described in great detail above. The idea of including initialize and finalize is that they manage the state of any given processing attempt, and you might have several going in parallel on different data streams. Meanwhile, step and run apply to a particular attempt, determined when they're called.
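To illustrate the split between per-attempt state and the operations on it, here is a toy processor that just counts characters. The signatures and every sketch_ name are invented for this example; only the four callback names (initialize, finalize, step, run) come from the description above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical layout; the real callbacks take different parameters. */
struct sketch_processor {
    void *(*initialize)(const char *input);
    int (*step)(void *attempt);   /* returns 1 if it consumed something */
    void (*run)(void *attempt);   /* steps until nothing remains */
    void (*finalize)(void *attempt);
};

/* State of one processing attempt; each attempt gets its own copy. */
struct sketch_attempt {
    const char *input;
    size_t consumed;
};

static void *attempt_initialize(const char *input)
{
    struct sketch_attempt *attempt = malloc(sizeof *attempt);
    if (attempt != NULL) {
        attempt->input = input;
        attempt->consumed = 0;
    }
    return attempt;
}

static int attempt_step(void *state)
{
    struct sketch_attempt *attempt = state;
    assert(attempt != NULL);
    if (attempt->input[attempt->consumed] == '\0')
        return 0;
    attempt->consumed++;
    return 1;
}

static void attempt_run(void *state)
{
    while (attempt_step(state)) {
        /* keep stepping until the input is exhausted */
    }
}

static void attempt_finalize(void *state)
{
    free(state);
}

static const struct sketch_processor sketch_counting_processor = {
    attempt_initialize, attempt_step, attempt_run, attempt_finalize
};
```

Two attempts created with initialize can be interleaved freely, for example stepping one while running the other to completion, because step and run only touch the attempt they are given.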

struct modern_node_representation

This provides all the functions that look inside an opaque modern * and access the node-type enum and whatever fields each particular type might have. If you're in doubt, a good reference to which node types have which fields is struct modern in internal.h, but you've said stuff that leads me to believe you already found that. It's worth noting that the serialized schema format almost certainly won't pad these to all be the same size, though they will be in memory because of the way C unions work.

Notice that this is one of the parameters to modern_library_initialize. As alluded to above, a language binding for Modern Data might want to provide its own implementation. It definitely doesn't make sense to mix these within a client, so a reference to this becomes part of the opaque modern_library which needs to be passed to pretty much all the real functionality. For client code that's written directly in C, or using a binding that doesn't care, the default implementation is acquired with:

extern struct modern_node_representation
  *modern_node_representation_default_make
  (struct modern_allocator *allocator,
   void *client_state);

Most data structures that the library uses provide their own finalizers, but since this one is needed so early in the flow, I didn't see a way to do that and there's a specific cleanup function for it:

extern void
  modern_node_representation_default_finalize
  (struct modern_allocator *allocator,
   void *client_state,
   struct modern_node_representation *node_representation);

Notice that modern_node_representation_default_make takes no modern_error_handler parameter, so in the event of failure, it just returns NULL. The only way it can fail is if the allocator fails, though, so there's nothing to distinguish.

This is set up as an accessor function rather than a const struct, because quite a few languages have difficulty accessing constant data structures via their FFIs, but function calls are supported everywhere.
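Putting the setup order from this thread together (allocator and error handler first, then the default node representation, then the library), a sketch might read as follows. The three function signatures are the ones quoted in this thread, but their bodies here are stubs so the example compiles without the library, the struct definitions are placeholders, and sketch_setup is a hypothetical helper. Passing NULL as the finalizer is also an assumption; the real library may require one.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder definitions; the real structs hold callbacks. */
struct modern_error_handler { int placeholder; };
struct modern_allocator { int placeholder; };
struct modern_node_representation { int placeholder; };
typedef void modern_library;

static int sketch_representation_live = 0;

/* STUB with the quoted signature; the real one uses the allocator. */
struct modern_node_representation *modern_node_representation_default_make
  (struct modern_allocator *allocator, void *client_state)
{
    static struct modern_node_representation representation;
    (void) client_state;
    if (allocator == NULL)
        return NULL;
    sketch_representation_live = 1;
    return &representation;
}

/* STUB with the quoted signature. */
void modern_node_representation_default_finalize
  (struct modern_allocator *allocator, void *client_state,
   struct modern_node_representation *node_representation)
{
    (void) allocator;
    (void) client_state;
    (void) node_representation;
    sketch_representation_live = 0;
}

/* STUB with the quoted signature. */
modern_library *modern_library_initialize
  (struct modern_error_handler *error_handler,
   struct modern_allocator *allocator,
   struct modern_node_representation *node_representation,
   void (*finalizer)(void *client_state),
   void *client_state)
{
    static int library;
    (void) finalizer;
    (void) client_state;
    if (!error_handler || !allocator || !node_representation)
        return NULL;
    return &library;
}

/* Hypothetical helper showing the order of the calls. */
static modern_library *sketch_setup
  (struct modern_error_handler *error_handler,
   struct modern_allocator *allocator,
   void *client_state,
   struct modern_node_representation **representation_out)
{
    assert(representation_out != NULL);
    /* Step 1: the node representation needs only the allocator. */
    *representation_out =
        modern_node_representation_default_make(allocator, client_state);
    if (*representation_out == NULL)
        return NULL;  /* no error handler involved: NULL is the signal */
    /* Step 2: the library takes all three pieces. */
    return modern_library_initialize(error_handler, allocator,
                                     *representation_out,
                                     NULL, client_state);
}
```

The caller keeps the node representation pointer so that it can later hand it to modern_node_representation_default_finalize, since that cleanup is separate from the library's own.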

enum modern_node_type

This is probably obvious to you by now. :)

enum modern_builtin_identifier

Ditto.

tinyplasticgreyknight commented 9 years ago

I've added automatic type signature documentation for the various callback structs; no semantics yet though. The typedefs are basically all opaque so I haven't written anything for them other than the headers.

IreneKnapp commented 9 years ago

So I see! Exciting. :)

tinyplasticgreyknight commented 9 years ago

All the C functions are now at least listed as of commit https://github.com/tinyplasticgreyknight/modern-docs/commit/73c3c5a5e5ce1a278fc388145da9583871c07658