vectordotdev / vector

chore(config, docs): handle initial integration of newly auto-generated source documentation into existing Cue #14998

Closed tobz closed 1 year ago

tobz commented 2 years ago

Once #14814 is completed and we have a way to generate (near-)equivalent Cue documentation from the configuration schema itself, we need to tweak the existing Cue documentation to utilize it.

Main Outcomes

List of sources to convert

(This is naturally a follow-on of #14815)

tobz commented 2 years ago

Conversion guide!

The process of "converting" a component to use the machine-generated Cue documentation essentially boils down to removing the configuration data in the "existing" Cue file, and replacing it with a link to merge in the machine-generated "base" Cue file. There are also some additional, context-specific changes you'll likely have to make to bring the machine-generated Cue documentation up to parity with what already exists.

We'll walk through this as if we were converting the file source.

Ensuring the machine-generated Cue documentation is up-to-date

Before starting, and after every change to the Rust source, you'll need to run make generate-component-docs. This will re-compile Vector, use the resulting debug binary to generate the configuration schema output (which gets written to /tmp/vector-config-schema.json), and then run scripts/generate-components-docs.rb against that generated configuration schema. All of the machine-generated Cue files will be regenerated, and you should see changes to a given component reflected in the corresponding Cue file.

Finding the relevant files

All of the Cue files for the user-facing documentation live under website/cue/reference. For components specifically, they're nested one level down, in website/cue/reference/components. You'll want to start by opening both the "existing" and the "base" Cue files. The "existing" Cue file is the one currently powering the user-facing documentation, and the "base" one is the one that's machine-generated and only deals with configuration fields, and none of the other stuff: status labels, "How It Works", etc.

For the file source, the existing file is website/cue/reference/components/sources/file.cue and the base file is website/cue/reference/components/sources/base/file.cue. This pattern holds true for all components, if you substitute source for whatever the component type is: source, transform, or sink.
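
The path pattern can be sketched as a small helper. This is only an illustration: ComponentKind and cue_paths are hypothetical names that do not exist in Vector, and the transforms/sinks directory names are assumed to follow the same pluralized pattern as sources.

```rust
/// Component kinds that have Cue documentation (hypothetical helper type).
#[derive(Clone, Copy, Debug)]
pub enum ComponentKind {
    Source,
    Transform,
    Sink,
}

impl ComponentKind {
    /// Directory name under `website/cue/reference/components`.
    /// Assumption: transforms and sinks are pluralized the same way as sources.
    fn dir(self) -> &'static str {
        match self {
            ComponentKind::Source => "sources",
            ComponentKind::Transform => "transforms",
            ComponentKind::Sink => "sinks",
        }
    }
}

/// Returns the `(existing, base)` Cue file paths for a component.
pub fn cue_paths(kind: ComponentKind, name: &str) -> (String, String) {
    let root = format!("website/cue/reference/components/{}", kind.dir());
    let existing = format!("{}/{}.cue", root, name);
    let base = format!("{}/base/{}.cue", root, name);
    (existing, base)
}
```

Calling cue_paths(ComponentKind::Source, "file") yields the two file source paths named above.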

I find opening the files up side-by-side works well for cross-referencing between the two, but it's up to you, obviously.

Merging the base file into the existing file

Now, we're going to show how to merge the base file into the existing file. Please note: you should follow the "truing up the machine-generated output" section (below this one) before doing the merging, because you'll want the existing Cue file untouched while you compare the two and see how close or far off the base Cue file is from it. With that said...

In the existing file, you'll want to delete everything under the configuration section, and replace it with the following:

configuration: base.components.sources.file.configuration

This tells Cue to pull in all of the values under base.components.sources.file.configuration -- which points it to the machine-generated "base" file -- and use them directly.

Additionally, you'll need to disable the "autogeneration" of configuration fields that happens based on the data defined under features. The features section is used to not only drive sections like "How It Works", but in some cases, the addition of specific configuration fields themselves.

You'll want to add auto_generated: true to the top-level of features (see an example here), which disables any of the Cue logic that automatically adds configuration fields based on whatever is set under features, while allowing the other logic (generating "How It Works", etc.) to continue functioning.

Merging the two files is literally that simple. You can go into the website directory and run make cue-build to actually test the Cue-to-big-JSON-file step that ultimately processes all of the Cue data and turns it into the JSON blob used to render the user-facing documentation.

At this point, you will likely encounter Cue issues. Most of the existing Cue serves as both a data source and a data definition (i.e. a schema). This means that as we start using the merged configuration fields, existing schema definitions in the Cue may no longer validate against the merged-together Cue. We should be able to iron this out during the first few conversions: once we fix one source, the rest should go smoothly. Suffice it to say, we'll be working together closely to get Cue issues sorted so that people aren't languishing, or expected to figure out all of this themselves.

Now, let's talk about the hard part...

Truing up the machine-generated output

Before we actually do the conversion to merge in the base Cue file, we'll want to "true up" the machine-generated output itself. With the files open side-by-side, you'll find the configuration section of the Cue data, and you should see similar (identical, if you're lucky!) fields listed out. What we want to do here is get these as close as we reasonably can, unless the machine-generated output is actually superior.

Let's walk through some examples below.

Common discrepancies and how to approach them

Field titles/description with different formatting

This one is pretty straightforward. During the original work to add the configuration schema support to the source code, much of the existing Cue documentation was ported, in terms of field titles/descriptions, back into the source. For example, the file source has a field called data_dir which in the current documentation reads as:

The directory used to persist file checkpoint positions. By default, the global data_dir option is used. Please make sure the Vector project has write permissions to this dir.

In the machine-generated documentation for the same field, we have:

The directory used to persist file checkpoint positions.

By default, the global data_dir option is used. Please make sure the user Vector is running as has write permissions to this directory.

This is nearly identical: apart from some light rewording, the main difference is the formatting/line breaks. In general, we should prefer to break up field titles/descriptions, since it makes them easier to read. That's why the doc comments in the source code are written this way.

If you see any existing doc comments in the source that are more like the first example, don't be afraid to break them up a little. There's usually some intuitive rhyme and reason to how/when to split, such as if there's a sentence about what the default value/behavior is, or what happens if a field is enabled/disabled, etc.

Also don't forget that all doc comments must have the triple forward slash (///) in order to be picked up for/inserted into the configuration schema. That also means that if you see TODOs or developer-only notes in the doc comments, you should change them to use the double forward slash syntax (//) instead, which will exclude them from being added to the configuration schema.
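
As a rough sketch of both points, here is the data_dir doc comment broken into two paragraphs, followed by a developer-only note that uses the double forward slash syntax. FileConfigSketch is a hypothetical type, and the configurable_component attribute that a real Vector type would carry is left out so that the snippet compiles on its own.

```rust
use std::path::PathBuf;

/// Configuration sketch for a file-like source (hypothetical type, not Vector's).
pub struct FileConfigSketch {
    /// The directory used to persist file checkpoint positions.
    ///
    /// By default, the global `data_dir` option is used. Please make sure the user Vector is
    /// running as has write permissions to this directory.
    // TODO: developer-only note. This line uses `//`, so it is not part of the doc comment.
    pub data_dir: Option<PathBuf>,
}
```

The triple-slash lines are attached to the field as documentation, while the TODO line is an ordinary comment that the compiler discards.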

Field titles/descriptions with wildly different content

Like the above example, some license was taken when porting over the existing Cue documentation in order to clarify wording and present a more consistent "voice" throughout the documentation. This also included leaning on the ability to "derive" a field's title/description from the underlying field type.

For example, the acknowledgements field derives its title/description from the implementation of SourceAcknowledgementsConfig itself. This is so that common types, such as SourceAcknowledgementsConfig or the various encoding/decoding types, TLS, and so on, can specify a single, vetted title/description that all components are able to benefit from. However, not all of these derived types have a suitable title/description.

In some cases, the doc comments may have been written with more of a developer mindset, providing details and terminology that will never be relevant to users, and even if it's technically accurate, we might end up wanting to massage the wording to better suit the user-facing documentation.

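In these scenarios, you generally have two options: rewrite the doc comments in the source so that they also work as user-facing documentation, or leave the doc comments as-is and use the configurable helper macros to override the title/description that ends up in the configuration schema.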

In some cases, the way we would word the user-facing documentation for a particular type may fit well with the developer-facing doc comments, and so writing those doc comments is an acceptable approach to take. In other cases, such as SourceAcknowledgementsConfig, the developer-facing and user-facing documentation have good reason to be meaningfully different: the type exists to paper over us deprecating setting acknowledgement behavior on sources themselves. We want the developer-facing documentation to acknowledge this, so that developers understand why it exists (even though it looks like a redundant version of AcknowledgementsConfig that should be refactored/deduplicated away). We also want the user-facing documentation to explain the intended behavior, as well as carry a message that setting acknowledgement behavior on sources is deprecated.

In order to do this, we use the aforementioned helper macros to override what the configuration schema uses for the title/description, which lets us leave the developer-facing documentation as-is. That looks roughly like this:

/// Source-specific end-to-end acknowledgements configuration.
///
/// This type exists solely to provide a source-specific description of the `acknowledgements`
/// setting, as it is deprecated, and we still need to maintain a way to expose it in the
/// documentation before it's removed while also making sure people know it shouldn't be used.
#[configurable_component]
#[configurable(title = "Controls how acknowledgements are handled by this source.")]
#[configurable(
    description = "This setting is **deprecated** in favor of enabling `acknowledgements` at the [global][global_acks] or sink level. \
Enabling or disabling acknowledgements at the source level has **no effect** on acknowledgement behavior.

See [End-to-end Acknowledgements][e2e_acks] for more information on how Vector handles event acknowledgement.

[global_acks]: https://vector.dev/docs/reference/configuration/global-options/#acknowledgements
[e2e_acks]: https://vector.dev/docs/about/under-the-hood/architecture/end-to-end-acknowledgements/"
)]
#[derive(Clone, Copy, Debug, Default, Eq, PartialEq)]
pub struct SourceAcknowledgementsConfig {
    /// Whether or not end-to-end acknowledgements are enabled for this source.
    enabled: Option<bool>,
}

It's not the prettiest thing, but you'll note how we override both the title and description, and go as far as using normal Markdown syntax to provide links and emphasis when mentioning that the setting is deprecated.

Fields with missing examples

Many of our configuration fields have "example" values that we show in-line in the user-facing documentation. Those are entirely hand-written, and so must be ported back into the source code.

We can add examples easily by using the following helper macro:

/// Proxy endpoint to use when proxying HTTP traffic.
///
/// Must be a valid URI string.
#[configurable(validation(format = "uri"))]
#[configurable(metadata(docs::examples = "http://foo.bar:3128"))]
#[serde(default)]
pub http: Option<String>,

Only string values can be passed in, but you can use raw strings, multi-line strings, whatever, so long as it's a valid Rust string literal. You can also specify several of these, in which case the examples are simply merged together:

#[configurable(metadata(
    docs::examples = "http://foo.bar:3128",
    docs::examples = "https://wee.wooo",
    docs::examples = "http://oh.noooo.com",
))]

Fields with missing/different units, or "syntax" values

We've juiced up our user-facing documentation very well with lots of metadata like the units for a given field, or "syntax" values that allow richer output, such as any field which accepts a VRL program fragment having a remap_program syntax, which adds richly-styled boilerplate to that field's documentation to demonstrate usage.

The base documentation generation script -- generate-components-docs.rb -- has a few escape hatches here to allow manually specifying some of these values just to make it easier to match the existing user-facing documentation.

Adjusting the "syntax"

You can directly specify the "syntax" of a field by using the following helper macro:

#[configurable(metadata(docs::syntax_override = "remap_program"))]
some_field: String,

This will directly adjust the resulting syntax value. However, you should be aware of the following syntax value(s) which are typically set when using the appropriate field type:

Adjusting the units

You can directly specify the "units" of a field by using the following helper macro:

#[configurable(metadata(docs::type_unit = "milliseconds"))]
some_field: u64,

This will directly adjust the resulting unit value for the given type. This only applies to numeric types. However, you should be aware of the following common cases where the unit value is typically set when using the appropriate field type:

Default value for the field is incorrect

Some component configuration types will implement Default, whether by hand or derived, and then apply #[serde(default)] in order to provide a default for all fields based on the value of the field in the default value for the overall type.

This is bad for the configuration schema because it makes it harder for us to correctly propagate all of the default values down through the fields so that they're generated/displayed correctly in the machine-generated output.

The fix, and overall more obvious approach, is to specify the default value of a field at the field itself, like so:

#[serde(default = "default_max_retries")]
pub max_retries: u32,

There are many examples of this in the codebase if you search for serde(default = " so I won't belabor the point here.
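
For completeness, here is a minimal sketch of the helper function that such an attribute names. To keep it compilable without serde, the attribute itself appears only in a comment and a hand-written Default impl calls the same function. RetryConfigSketch and the value 5 are purely illustrative and not taken from Vector.

```rust
/// Default for `max_retries` (the value 5 is illustrative only).
pub const fn default_max_retries() -> u32 {
    5
}

/// Hypothetical configuration type, not one of Vector's.
#[derive(Debug)]
pub struct RetryConfigSketch {
    // In a real component this field would also carry:
    //   #[serde(default = "default_max_retries")]
    /// The maximum number of retries to attempt.
    pub max_retries: u32,
}

impl Default for RetryConfigSketch {
    fn default() -> Self {
        // Reuse the field-level helper so both defaults stay in sync.
        Self {
            max_retries: default_max_retries(),
        }
    }
}
```

Because the field default and the type's Default both go through default_max_retries, there is a single place where the value is defined.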

In almost all cases, we should never have #[serde(default)] specified on the component's configuration type itself, although having an implementation of Default is fine. In some cases, it could be useful/helpful to actually set the field-specific default value by using Derivative to derive a Default implementation that fields can set their own defaults from, like this:

#[configurable_component(source("amqp"))]
#[derive(Clone, Debug, Derivative)]
#[derivative(Default)]
#[serde(deny_unknown_fields)]
pub struct AmqpSourceConfig {
    // Buncha other fields....

    #[configurable(derived)]
    #[serde(default = "default_framing_message_based")]
    #[derivative(Default(value = "default_framing_message_based()"))]
    pub(crate) framing: FramingConfig,
}

In the future, we might be able to come up with some better helper code/macros to get rid of some of this boilerplate, but this is just one example of doing things with serde/Derivative both in play.

Hiding a field entirely

In some cases, we intentionally omitted configuration fields from the user-facing documentation after deprecating them, or perhaps they're new options that are meant to be hidden as they're a stopgap for some particular bug, added to help out a user, who knows. The configuration schema only knows what it's told to have, and the source code by itself has no way to signal these things unless we tell it what to do.

You can completely prevent a field from showing up by using the following helper macro:

#[configurable(metadata(docs::hidden))]
some_field: String,

This simply hides it from the machine-generated documentation, but leaves it intact in the configuration schema.

Marking a field as deprecated

Related to the above, sometimes we need to mark a field as deprecated. We should avoid marking fields as deprecated solely by adjusting their doc comments, and should instead deprecate them programmatically via the configurable helper macro:

/// The Datadog region to send data to.
///
/// This option is deprecated, and the `site` field should be used instead.
#[configurable(deprecated)]
region: Option<Region>,

In the future, some of the doc comments like this will be programmatically calculated based on the serde aliases configured on the field, and so on, but for now, we should use the configurable helper macro to programmatically deprecate a field, and then adjust the doc comments as needed to provide the relevant context around what the new thing to use should be.

neuronull commented 1 year ago

Linking to the bootstrap PR that converted a couple sources, for reference:

https://github.com/vectordotdev/vector/pull/15502

spencergilbert commented 1 year ago

An additional note: some of the links from the old Cue files may not work when directly ported to the Rust comments. Example

spencergilbert commented 1 year ago

I believe this can be closed @tobz @jszwedko? The conversion guide is a nice piece of documentation if we wanted to move it somewhere more permanent.

jszwedko commented 1 year ago

Agreed, thanks for flagging! This looks like it is complete.