bretg commented 3 years ago

There are so many server-side adapters now that the PBS uids cookie has grown so large that it's starting to affect what other cookie values the host company domain can receive. e.g. my uids cookie is 3900 bytes. (!)

We need to address this.

Here are some values from my cookie:

"rtbhouse":{"uid":"0wWgMF3hXfr8PHKLfY6t","expires":"2021-09-14T21:10:07.402Z"}
"gumgum":{"uid":"u_c3dfbb5b-a7ca-4e65-9a1e-af26396e1ff8","expires":"2021-09-14T21:10:09.065Z"}
"triplelift":{"uid":"9328941297032053459","expires":"2021-09-14T21:10:09.63Z"}
"smartadserver":{"uid":"1814551064657459315","expires":"2021-09-14T21:10:10.895Z"}
"trustx":{"uid":"a11dcb02-b540-4caf-9964-5e338a0809ce","expires":"2021-09-14T21:10:15.324Z"}
"adform":{"uid":"1707054018971720697","expires":"2021-09-14T21:10:16.778Z"}
"brightroll":{"uid":"y-Q2s1jsNE2oJGw_4G5H1Bl7ujyBlC.zQf1y3C_yBISlvZ8o6yJXdsVA--~A","expires":"2021-09-14T21:10:16.991Z"}
"aja":{"uid":"s2suidLVGJLBaPujvTU7axZBrQ0QShiSi4xDAlP1z3TC4fm8OYkiQVqADI1UlFQzIy36Pn9-p2G4","expires":"2021-09-08T19:08:36.147Z"}

Totally, I have 32 bidder entries my cookie, for an average of 121 bytes per bidder encoded.

(See below for a major update to the original proposal.)

Current Structure

The expires value is used to drop the value from the cookie so /cookie_sync will get an updated ID from that bidder.

Structure of the current cookie:

{
  "uids": {},
  "tempUIDs": {
    "BIDDER": {
      "uid": "AQEIu9KdNAyV4wEISYkJAQEBAQE",
      "expires": "2021-09-14T21:10:50.503Z"
    },
  },
  "bday": "2021-05-11T15:15:24.619Z"
}

Background on the current structure

Here's the comments from the code: (usersync/cookie.go) // "Legacy" cookies had UIDs without expiration dates, and recognized "0" as a legitimate UID for audienceNetwork. // "Current" cookies always include UIDs with expiration dates, and never allow "0" for audienceNetwork. // // This Unmarshal method interprets both data formats, and does some conversions on legacy data to make it current. // If you're seeing this message after March 2018, it's safe to assume that all the legacy cookies have been // updated and remove the legacy logic.

Possible new structure

Here's a straw proposal based on using a relative timestamp rather than absolute:

{
  "v": 3,   // implement a version number so /cookie_sync can parse the different formats
  "expbase": "2021-08-14T21:10:50.503Z", // expires base timestamp
  "uids": { // get rid of the tempUIDs field. 
    "BIDDER": {
      "uid": "AQEIu9KdNAyV4wEISYkJAQEBAQE",
      "exp": "22" // days from expbase
    },
  },
  // get rid of the 'bday' field.
}

bretg commented 3 years ago

Ran an experiment with an online tool to see whether the uids cookie would be helped by protobuf. In short, I found we're be better off with compressed JSON. Here's what I did.

Baseline JSON

The contents of the JSON uids file with 20 IDs:

    version: 2,
    expirationBase: 1630698267,
    uids: [
        {
            bidder: "osirjgosj",
            uid: "3487rn8wu4niuesrgiuhreuhsiuerw9uersuj",
            expirationOffset: 5
        },
        {
            bidder: "oiuiuosr",
            uid: "984579wmeg huibfsouisrhiuosrgnosriunsbru",
            expirationOffset: 8
        },
        {
            bidder: "oisoisvn",
            uid: "5878 w9urhuhw9urs ehusierhisudr9suer9ms8e5u9ser8u",
            expirationOffset: 8
        },
        {
            bidder: "jhrubuyr",
            uid: "9w4usireuiosroirusjsijroserijuhiuhuih8943eousxrf",
            expirationOffset: 8
        },
        {
            bidder: "nvxusruy",
            uid: "98w4mt9wm49chwe8orghms8erhgiosuhthidstohjdotho",
            expirationOffset: 8
        },
        {
            bidder: "ajiuneuwy",
            uid: "w8754mtc8whmg8uhsermg8uhwsr8gh8merhgoisduhtgosiuhrgoisuherg",
            expirationOffset: 8
        },
        {
            bidder: "ueusyeb",
            uid: "q23uyhfoawyuehrg8myseh8rghwe85mctg8tsehrgm89chse",
            expirationOffset: 8
        },
        {
            bidder: "hgsdguaifu",
            uid: "19u9473789a87sr8ymrc8uyrt8gdhiudth9se74984wu9s8euodisrdiohugiu",
            expirationOffset: 8
        },
        {
            bidder: "jhsdhjggr",
            uid: "287w48w73ymt8s8drhisudhrf9s9ser589e89hr9fudo089re9e84uoisjosijopwkpk",
            expirationOffset: 8
        },
        {
            bidder: "sisdishi",
            uid: "68eha84a8a9w4373f99393w89w734yw8468w64m8 syg48yesgm8 y4e8yseisyhf",
            expirationOffset: 8
        },
        {
            bidder: "lskjdhsifu",
            uid: "q873847q6mw4h9m8 h4e97c4my9chmy48se7cy48c9s7ey49s7ce4y9se4759ses9",
            expirationOffset: 8
        },
        {
            bidder: "sttysufuhsf",
            uid: "858958958989e976896e8868968989896e89yt9utw9u9ustuisgihg",
            expirationOffset: 8
        },
        {
            bidder: "yshdfbvishr",
            uid: "7857w9579h97gw97w9763673675357452457427524756426752uywriyfwhwhgui4e",
            expirationOffset: 8
        },
        {
            bidder: "vaieirausv",
            uid: "7547w56769869w4736667569000y9486565w3759090y687y636535646ujgi576twge",
            expirationOffset: 8
        },
        {
            bidder: "cusinuesinu",
            uid: "64w86q4597y68997wq486e5086e79w68479e56898w7579wru95978w580w5s79",
            expirationOffset: 8
        },
        {
            bidder: "xargsr",
            uid: "387s8745s8764t795es97s5708ws0845s68w4685w508ws97w97s5498se5987shghj",
            expirationOffset: 8
        },
        {
            bidder: "ssgdsrgs",
            uid: "0s0roaisrej9sawrga00s8es47876sw674756aq57s7e4iusui9su89u8serh",
            expirationOffset: 8
        },
        {
            bidder: "zsrfgsrg",
            uid: "78sdsuhgvuyvsotirjhsporiyaburiseung3ibaieyraiyureiaireauhiiuyiuaiuyrsibyurh",
            expirationOffset: 8
        },
        {
            bidder: "qsfrgs",
            uid: "8s9rusvjrsiuniurnvs7s78erv8srvn8shrvsiyhrv8ysrvshifhbsiruhigru9eiru",
            expirationOffset: 8
        },
        {
            bidder: "fshyfjy",
            uid: "98ae98ay4m9a8cm948ym94y9ah4eauyeh8aw7y38a73yw63476nt3486s4cgm87sc4my84m7",
            expirationOffset: 8
        }
    ]

This is 1441 bytes gzipped and base64 encoded. Note that I took care to minimize repeated strings so as not to favor gzip. When I used the same uid or bidder values, gzip did much better. :-)

Protobuf

Came up with this protobuf definition:

syntax = "proto3";
message uidCollection {
  message uidObject {
    string bidder = 1;
    string uid = 2;
    uint32 expirationOffset = 3;
  }
  uint32 version = 1;
  uint64 expirationBase = 2;
  repeated uidObject uids = 3;
}

Used the tool at https://www.sisik.eu/proto to create a binary representation of the JSON above and it came out to be 1522 bytes.

bretg commented 3 years ago

Discussed in PBS committee. @SyntaxNode has a hypothesis that protobuf would have the better CPU-performance profile, so even though the JSON approach saves a few bytes, he's planning to run an experiment to get some data.

SyntaxNode commented 3 years ago

Used the tool at https://www.sisik.eu/proto to create a binary representation of the JSON above and it came out to be 1522 bytes.

I initially thought the 1522 bytes for Protobuf looked great against the 1441 bytes for JSON+GZIP, but I realized in my testing the Protobuf result is not Base64 encoded. When we add that in we get a less exciting 2028 bytes.

Using the same Baseline JSON model @bretg shared in his earlier comment, I've performed a comparison with several different formats. The first, JSON+Base64 is basically what we do today as a point of reference (but using the relative timestamps).

Write / Encoding

Format	Size	Write Speed	Write Memory Size	Write Memory Allocs
JSON+Base64	3032 bytes	7,372 ns	8,492 bytes	4 allocations
JSON+GZIP+Base64	1340 bytes	153,944 ns	823,966 bytes	28 allocations
Protobuf+Base64	2028 bytes	4,117 ns	5,632 bytes	3 allocations
Protobuf+GZIP+Base64	1372 bytes	149,189 ns	821,795 bytes	25 allocations
Protobuf+Brotli-Level-0+Base64	1456 bytes	37,251 ns	38,890 bytes	9 allocations
Protobuf+Brotli-Level-6+Base64	1380 bytes	308,450 ns	2,177,637 bytes	21 allocations
Protobuf+Brotli-Level-11+Base64	1296 bytes	5,292,766 ns	34,635,047 byes	55 allocations
Protobuf+LZ4+Base64	2012 bytes	52,670 ns	534,785 bytes	6 allocations

Read / Decoding

Format	Read Speed	Read Memory Size	Read Memory Allocs
JSON+Base64	30,641 ns	10,704 bytes	62 allocations
JSON+GZIP+Base64	56,016 ns	56,144 bytes	80 allocations
Protobuf+Base64	7,080 ns	7,336 bytes	69 allocations
Protobuf+Brotli-Level-0+Base64	39,060 ns	75,360 bytes	87 allocations

Protobuf+Base64 is the most efficient option and while the output is 33% smaller than what we use today that compares unfavorably to the other formats. If we are optimizing for both speed and size equally, this is the best option. However, if we want to optimize more for size and a little less for speed, I propose the Protobuf+Brotli-Level-0+Base64 format which is just ~8% larger than GZIP while 400%+ faster in my write benchmarks and 30% faster in my read benchmarks.

I experimented with using GZIP and Brotli for just the UIDs, but the overhead of both compression algorithms actually increased the final size. Similarly, I tested Protobuf+Snappy+Base64 but the Snappy compression increased the size a bit from just Protobuf+Base64.

Are there any other compression libraries you'd like to see me add to this benchmark comparison? Any suggestions must have Go and Java libraries.

SyntaxNode commented 3 years ago

Versioning

We need to represent the version outside of the encoded payload to determine which decoding approach we need to use. This approach needs to be backwards compatible with the current JSON+Base64 encoded format. We'll need to use a separate character not present in the Base64 URL character set, of which I think period "." is a good choice. This is the same choice made by JWT tokens and TCF2 consent strings.

Example:

2.i_cCAICqqqrq_1SRGhY3PdnRLwVQpwJLN11EoHKBhMhDQiVARRVU1qGOtq....

We should be fine for a long time with just using 1 character for the version followed by 1 character for the separator, but in the future we could extend the number of characters proceeding the separator if need be. The algorithm for version detection would be:

Is there a separator character present? If not, classify as version 1. This check can be optimized for validating if only the second character is the separator until we need to represent a higher version.
If the version is anything other than 2, then it's invalid.
If the version is 2 and there is a valid separator character, parse as this new format.

josepowera commented 3 years ago

Protobuf is great - but first cookie values "uid" should be split based on datatype... Since protobuf real record size is based on datatype - longs would be way smaller than same data written as string. Also bidder string name could be moved to some table with uint32 ID. If needed adapters would have to declare using string/long field data (it could be also done automatically based on content).

syntax = "proto3";
message uidCollection {
  message uidObject {
    string bidder = 1;
    string uid_string = 2;   //use uid_string or uid_long but not both
    uint64 uid_long = 3;
    uint32 expirationOffset = 4;
  }
  uint32 version = 1;
  uint64 expirationBase = 2;
  repeated uidObject uids = 3;
}

SyntaxNode commented 3 years ago

first cookie values "uid" should be split based on datatype

That's a good observation. There are several possible data types we could detect and optimize. You mentioned long in your example, there are also uuid, base64 encoded bytes, and hex strings. My hope is the compression layer on top of the protobuf binary encoding will solve for these storage inefficiencies without needing to add structure complexity. Let's test it. I'll use the protobuf structure you provided.

I replaced 25% of the entries in the Baseline JSON example with long values. This seems to be a slightly generous distribution based on the real world examples. The runtime complexity of both are close enough that I'll only list the sizes.

Format	Size
ProtbufString+Base64	1772 bytes
ProtbufString+Brotli0+Base64	1304 bytes
ProtbufOptimized+Base64	1700 bytes
ProtbufOptimized+Brotli0+Base64	1356 bytes

Without compression there is a 4% size reduction. With compression, there is a 4% size increase. It seems the compression algorithm fares a bit better with string numerics than with binary encoded numerics. This might not hold true for other potential optimized types though.

Also bidder string name could be moved to some table with uint32 ID

Yes, I agree that would provide a size savings. I'm going to test it with the following protobuf definition:

syntax = "proto3";
message uidCollection {
  message uidObject {
    uint32 bidder = 1;
    string uid = 2;
    uint32 expirationOffset = 3;
  }
  uint64 expirationBase = 1;
  repeated uidObject uids = 2;
}

Format	Size
Protobuf+StringBidders+Base64	2028 bytes
Protobuf+StringBidders+Brotli0+Base64	1456 bytes
Protobuf+IntBidders+Base64	1800 bytes
Protobuf+IntBidders+Brotli0+Base64	1320 bytes

Without compression there is a 11% size reduction. With compression, there is a 9% size reduction. I'd like opinions if this is worth the complexity of maintaining a list of bidder ids. This cannot be as simple as an alphabetical index since we need to account for added bidders, removed bidders, and different bidders between PBS-Go, PBS-Java, and forks.

bretg commented 3 years ago

Discussed in PBS committee. We're leaning towards the 'Protobuf+StringBidders+Brotli0-Base64' solution, with external version number.

bretg commented 2 years ago

Java results are at https://github.com/snahornyi/uids-java-tests

The team suggests using 'Protobuf+StringBidders+Brotli0+Base64' -- i.e. brotli and base64.

@SyntaxNode - please confirm that your table above meant to include Base64... i.e. was the minus sign a typo?

SyntaxNode commented 2 years ago

@SyntaxNode - please confirm that your table above meant to include Base64... i.e. was the minus sign a typo?

Confirmed. That is a typo. I'll fix it in my other comment.

bretg commented 2 years ago

There's a detail here I don't think we've ironed out: how the UID itself is represented.

I don't like the idea of having to configure for each adapter the datatype of their ID. Looking at the list of IDs in my current cookie, the pattern I see is that most are strings, and the ones that are ints are generally shorter (~20chars rather than ~40).

I suppose we could manage this in the /setuid code that creates the values:

scan the uid value
if it's all decimals and would fit within uint64, then place the value in uid_long
else place it in uid_string

It's easy enough on the read side to deal with this, but we'd have to agree to prefer one over the other in case somehow both of them get set.

SyntaxNode commented 2 years ago

@bretg Please review the conversation in this issue. The data type specific UID storage was discussed, explored, and ultimately rejected. We will be storing all UID values as strings.

SyntaxNode commented 2 years ago

This is the updated protobuf definition that I am proposing. I've followed the best practices from the protobuf style guide.

cookie2.proto

// Definition of Prebid Server's version 2 user sync cookie value encoding.

syntax = "proto3";

import "google/protobuf/timestamp.proto";

option go_package = "github.com/prebid/prebid-server/usersync";

message Cookie2 {
  google.protobuf.Timestamp expiration_base = 1;
  message UID {
    string bidder = 1;
    string value = 2;
    uint32 expiration_offset_days = 3;
  }
  repeated UID uids = 2;
}

I've added the go package option required for the code generator. Java would need to add their own Java options as well. Alternatively, we could provide the options directly to the code generator, but the best practices from Google on the matter are to include them in the file. I'm ok with making it as easy as possible for us to generate code for use in both PBS implementations.

This produces the following generated code structs for Go:

type Cookie2 struct {
  ExpirationBase *timestamppb.Timestamp `protobuf:"bytes,1,opt,name=expiration_base,json=expirationBase,proto3" json:"expiration_base,omitempty"`
  Uids           []*Cookie2_UID         `protobuf:"bytes,2,rep,name=uids,proto3" json:"uids,omitempty"`
}

type Cookie2_UID struct {
  Bidder               string `protobuf:"bytes,1,opt,name=bidder,proto3" json:"bidder,omitempty"`
  Value                string `protobuf:"bytes,2,opt,name=value,proto3" json:"value,omitempty"`
  ExpirationOffsetDays uint32 `protobuf:"varint,3,opt,name=expiration_offset_days,json=expirationOffsetDays,proto3" json:"expiration_offset_days,omitempty"`
}

bretg commented 2 years ago

Thanks @SyntaxNode - just to make sure we're on the same page on the /cookie_sync endpoint expectations... when it receives a cookie, it's going to have to first see if it's base64 encoded JSON. If not, then it passes the body through the brotli and protobuf decoders, right?

SyntaxNode commented 2 years ago

That's not what I had in mind. I proposed using a version prefix to make it easier / quicker for us to determine the correct code paths to use for decoding. In short, if it starts with "2." then remove those characters and decode the rest as a base64 encoded brotli compressed protobuf message. Else, consider it version 1 and error if there is a decoding problem.

I think the progression of ideas in this issue thread has muddled the intended proposal. I'll create a separate Google doc with the full proposed specs on Monday.

prebid / prebid-server

Shrink the PBS uids cookie #1985

Current Structure

Background on the current structure

Possible new structure

Baseline JSON

Protobuf

Write / Encoding

Read / Decoding

Versioning