jrschumacher commented 5 months ago

ADR: NanoTDF Attribute Storage Optimization in 255-bytes

Context and Problem Statement

We need to store attributes in the nanotdf policy header, but it has a maximum length of 255 bytes. The current attribute value in ztdf is in the form of a Fully Qualified Name (FQN) as JSON. Even removing the JSON overhead, the FQN is still too verbose to store multiple attributes within the 255-byte limit.

Example:

https://namespace.com/attr/classification/val/topsecret

The goal is to define a syntax that will compress the data to allow for efficient storage of multiple attributes within the 255-byte limit.

Considered Options

Schema-Based Syntax with Full URLs
Index-Based Syntax
Protobuf Compression

Decision Outcome

We have decided to use the Schema-Based Syntax with Full URLs. This decision was made based on the need for a federatable and customer-friendly approach that retains full attribute names and avoids using indexes.

We also considered Protobuf Compression for further optimization. However, it makes debugging harder because the data cannot be read without a protobuf decoder.

Options

Option 1: Schema-Based Syntax with Full URLs

Format:

{schema}|{base_url}|{attribute}:{value},{value},...;{attribute}:{value},{value},...

Components:

Schema (schema): A digit representing the URL schema (0 for HTTP, 1 for HTTPS).
Base URL (base_url): The full namespace URL without the schema.
Attributes: {attribute}:{value} pairs separated by semicolons (;). Multiple values within an attribute are separated by commas (,). Entries for different namespaces go on separate lines.

Example:

1|namespace.com|classification:topsecret;relto:usa,gba,cda
1|ns.namespace.com|group:a
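A minimal sketch of how the format above could be produced from FQNs and expanded back, assuming FQNs always follow the `/attr/{name}/val/{value}` shape shown earlier and that names and values never contain the delimiters `|`, `;`, `:`, `,` or a newline. The function names `encode` and `decode` are hypothetical and are not taken from the linked playground.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Schema digits as described in Components: 0 for HTTP, 1 for HTTPS.
SCHEMES = {"http": "0", "https": "1"}
SCHEME_BY_DIGIT = {digit: name for name, digit in SCHEMES.items()}


def encode(fqns):
    """Group FQNs by (schema, base URL) and emit one compact line per namespace."""
    grouped = defaultdict(lambda: defaultdict(list))
    for fqn in fqns:
        url = urlsplit(fqn)
        # Assumed path shape: /attr/<name>/val/<value>
        _, attr_kw, name, val_kw, value = url.path.split("/")
        if (attr_kw, val_kw) != ("attr", "val"):
            raise ValueError(f"not an attribute value FQN: {fqn}")
        grouped[(SCHEMES[url.scheme], url.netloc)][name].append(value)

    lines = []
    for (digit, host), attrs in grouped.items():
        body = ";".join(f"{name}:{','.join(values)}" for name, values in attrs.items())
        lines.append(f"{digit}|{host}|{body}")
    return "\n".join(lines)


def decode(text):
    """Expand the compact form back into a list of FQNs."""
    fqns = []
    for line in text.split("\n"):
        # maxsplit keeps parsing positional: first field is always the schema digit.
        digit, host, body = line.split("|", 2)
        for pair in body.split(";"):
            name, values = pair.split(":", 1)
            for value in values.split(","):
                fqns.append(f"{SCHEME_BY_DIGIT[digit]}://{host}/attr/{name}/val/{value}")
    return fqns
```

Because the line is split positionally on `|`, the schema digit is always the first field, so an attribute name that itself starts with 0 or 1 is not mistaken for a schema indicator in this sketch.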

Advantages:

Retains full attribute names and base URLs, making it customer-friendly and federatable.
Clear and easy to parse structure.

Disadvantages:

Attributes starting with numbers (0 or 1) need careful handling to avoid confusion with schema indicators.
Slightly more verbose due to retaining full URLs.

Approximate Range of Attributes

Given the 255-byte limit, the number of attributes that can be stored depends on the length of the base URLs and attribute names. For estimation:

Average domain name (.com) length: 13 bytes
Average attribute name length: 5-15 bytes
Average value length: 1-10 bytes
Delimiters and schema indicators: 3-10 bytes

Example calculation for a single attribute set:

1|namespace.com|classification:topsecret

This example is 40 bytes.

For multiple attributes:

1|namespace.com|classification:topsecret;relto:usa,gba,cda

This example is 58 bytes, or about 60.

For multiple attributes across multiple namespaces:

1|namespace.com|classification:topsecret;relto:usa,gba,cda
1|namespace2.com|classification:topsecret;relto:usa,gba,cda
1|namespace3.com|classification:topsecret;relto:usa,gba,cda
1|namespace4.com|classification:topsecret;relto:usa,gba,cda

This example is 238 bytes including the newline separators, or about 240.

Therefore, approximately 15-20 attribute values of similar length can be stored within the 255-byte limit. The four-namespace example above carries 16 values across 8 attributes.
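The byte counts above can be checked with a few lines of Python. This is only a sketch: the `size` helper and the `LIMIT` constant are names introduced here for illustration, and the strings are the examples from this section encoded as UTF-8.

```python
LIMIT = 255  # maximum policy length discussed in this ADR


def size(text):
    """Return the encoded length in bytes."""
    return len(text.encode("utf-8"))


fqn = "https://namespace.com/attr/classification/val/topsecret"
single = "1|namespace.com|classification:topsecret"
multi = single + ";relto:usa,gba,cda"

namespaces = ["namespace.com", "namespace2.com", "namespace3.com", "namespace4.com"]
four = "\n".join(
    f"1|{ns}|classification:topsecret;relto:usa,gba,cda" for ns in namespaces
)

print(size(fqn))     # 55: one value as a full FQN
print(size(single))  # 40
print(size(multi))   # 58
print(size(four))    # 238, newlines included
print(size(four) <= LIMIT)  # True
```

For comparison, a single value written as a full FQN already takes 55 bytes, so four such FQNs would use most of the budget that the compact form spends on 16 values.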

Example

See playground https://go.dev/play/p/M9s8QOtTn4Y

Option 2: Index-Based Syntax

Format:

{schema}|{index}|{attribute_index}:{value_index};{attribute_index}:{value_index};...

Components:

Schema (schema): A digit representing the URL schema (0 for HTTP, 1 for HTTPS).
Index (index): A numeric index representing the base URL.
Attributes: {attribute_index}:{value_index} pairs separated by semicolons (;). Multiple values within an attribute are separated by commas (,).

Example:

1|1|1:1;2:2,3,4
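To make the dependency on shared lookup tables concrete, here is a hedged sketch of decoding the example above. The tables `BASE_URLS`, `ATTRIBUTES` and `VALUES` are entirely hypothetical; the ADR does not define them, and every party reading the policy would need identical copies.

```python
# Hypothetical registry: these mappings are assumptions for illustration only.
BASE_URLS = {1: "namespace.com"}
ATTRIBUTES = {1: "classification", 2: "relto"}
VALUES = {1: "topsecret", 2: "usa", 3: "gba", 4: "cda"}
SCHEME_BY_DIGIT = {"0": "http", "1": "https"}


def decode_indexed(line):
    """Resolve an index-based entry into FQNs using the shared tables."""
    digit, base_index, body = line.split("|", 2)
    base = f"{SCHEME_BY_DIGIT[digit]}://{BASE_URLS[int(base_index)]}"
    fqns = []
    for pair in body.split(";"):
        attr_index, value_indexes = pair.split(":", 1)
        name = ATTRIBUTES[int(attr_index)]
        for value_index in value_indexes.split(","):
            fqns.append(f"{base}/attr/{name}/val/{VALUES[int(value_index)]}")
    return fqns


entry = "1|1|1:1;2:2,3,4"
print(len(entry))  # 15 bytes
for fqn in decode_indexed(entry):
    print(fqn)
```

The entry is only 15 bytes, but it is meaningless without the tables, and an index missing from a reader's copy raises a `KeyError` in this sketch.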

Advantages:

Extremely compact representation.
Potentially allows storing a higher number of attributes within the 255-byte limit.

Disadvantages:

Requires a predefined mapping of indexes to base URLs and attributes, which is not federatable.
Harder to manage and less transparent to customers.

Option 3: Protobuf Compression

Protobuf can serialize the data into a compact binary format, potentially reducing the size further than ASCII or other text-based formats.

Advantages:

Compact binary format that is efficient for storage and transmission.
Strongly typed data ensures consistency and integrity.
Supports multiple programming languages and versioning.

Disadvantages:

Requires additional tooling and setup to define and compile Protobuf schemas.
Requires additional tooling to decode and read the binary data.
May not provide significant savings over the schema-based approach for small datasets.

Protobuf Example

syntax = "proto3";

enum Schema {
  HTTP = 0;
  HTTPS = 1;
}

message Attribute {
  string name = 1;
  repeated string values = 2;
}

message AttributeSet {
  Schema schema = 1;
  string base_url = 2;
  repeated Attribute attributes = 3;
}
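To get a feel for the size trade-off, the sketch below hand-encodes one AttributeSet using the standard protobuf wire format (varint tags, length-delimited strings and nested messages) and compares it with the text form. It is not a substitute for a real protobuf library; the helper names are made up here, and it assumes every string is short enough for its length to fit in the varint helper.

```python
def varint(n):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def tag(field_number, wire_type):
    return varint((field_number << 3) | wire_type)


def length_delimited(field_number, payload):
    # Wire type 2: strings and embedded messages.
    return tag(field_number, 2) + varint(len(payload)) + payload


def encode_attribute(name, values):
    out = length_delimited(1, name.encode("utf-8"))
    for value in values:
        out += length_delimited(2, value.encode("utf-8"))
    return out


def encode_attribute_set(schema, base_url, attributes):
    out = b""
    if schema != 0:  # proto3 omits scalar fields that hold the default value
        out += tag(1, 0) + varint(schema)
    out += length_delimited(2, base_url.encode("utf-8"))
    for name, values in attributes:
        out += length_delimited(3, encode_attribute(name, values))
    return out


binary = encode_attribute_set(
    1,  # HTTPS
    "namespace.com",
    [("classification", ["topsecret"]), ("relto", ["usa", "gba", "cda"])],
)
text = "1|namespace.com|classification:topsecret;relto:usa,gba,cda"
print(len(binary), len(text))  # 70 58
```

For this small input the binary form comes out larger than the text form, because every string and nested message carries its own tag and length bytes. That is consistent with the last disadvantage listed above.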

damorris25 commented 5 months ago

No concerns from my POV

sujankota commented 5 months ago

Attributes are stored in the Policy. Isn't the max size 2^16 - 1 in NanoTDF? https://github.com/virtru/nanotdf/blob/master/spec/index.md

jrschumacher commented 5 months ago

@sujankota according to https://github.com/virtru/nanotdf/blob/master/spec/index.md#342-policy the policy has a Maximum Length (B) of 255. Am I misreading this?


sujankota commented 5 months ago

We use Embedded Policy for nanoTDF.

sujankota commented 5 months ago

The encrypted policy can be up to 64 KB.

jrschumacher commented 4 months ago

This work is not needed (see comments above).
