bretg opened 3 years ago
Ran an experiment with an online tool to see whether the uids cookie would be helped by protobuf. In short, I found we'd be better off with compressed JSON. Here's what I did.
The contents of the JSON uids file with 20 IDs:
{
version: 2,
expirationBase: 1630698267,
uids: [
{
bidder: "osirjgosj",
uid: "3487rn8wu4niuesrgiuhreuhsiuerw9uersuj",
expirationOffset: 5
},
{
bidder: "oiuiuosr",
uid: "984579wmeg huibfsouisrhiuosrgnosriunsbru",
expirationOffset: 8
},
{
bidder: "oisoisvn",
uid: "5878 w9urhuhw9urs ehusierhisudr9suer9ms8e5u9ser8u",
expirationOffset: 8
},
{
bidder: "jhrubuyr",
uid: "9w4usireuiosroirusjsijroserijuhiuhuih8943eousxrf",
expirationOffset: 8
},
{
bidder: "nvxusruy",
uid: "98w4mt9wm49chwe8orghms8erhgiosuhthidstohjdotho",
expirationOffset: 8
},
{
bidder: "ajiuneuwy",
uid: "w8754mtc8whmg8uhsermg8uhwsr8gh8merhgoisduhtgosiuhrgoisuherg",
expirationOffset: 8
},
{
bidder: "ueusyeb",
uid: "q23uyhfoawyuehrg8myseh8rghwe85mctg8tsehrgm89chse",
expirationOffset: 8
},
{
bidder: "hgsdguaifu",
uid: "19u9473789a87sr8ymrc8uyrt8gdhiudth9se74984wu9s8euodisrdiohugiu",
expirationOffset: 8
},
{
bidder: "jhsdhjggr",
uid: "287w48w73ymt8s8drhisudhrf9s9ser589e89hr9fudo089re9e84uoisjosijopwkpk",
expirationOffset: 8
},
{
bidder: "sisdishi",
uid: "68eha84a8a9w4373f99393w89w734yw8468w64m8 syg48yesgm8 y4e8yseisyhf",
expirationOffset: 8
},
{
bidder: "lskjdhsifu",
uid: "q873847q6mw4h9m8 h4e97c4my9chmy48se7cy48c9s7ey49s7ce4y9se4759ses9",
expirationOffset: 8
},
{
bidder: "sttysufuhsf",
uid: "858958958989e976896e8868968989896e89yt9utw9u9ustuisgihg",
expirationOffset: 8
},
{
bidder: "yshdfbvishr",
uid: "7857w9579h97gw97w9763673675357452457427524756426752uywriyfwhwhgui4e",
expirationOffset: 8
},
{
bidder: "vaieirausv",
uid: "7547w56769869w4736667569000y9486565w3759090y687y636535646ujgi576twge",
expirationOffset: 8
},
{
bidder: "cusinuesinu",
uid: "64w86q4597y68997wq486e5086e79w68479e56898w7579wru95978w580w5s79",
expirationOffset: 8
},
{
bidder: "xargsr",
uid: "387s8745s8764t795es97s5708ws0845s68w4685w508ws97w97s5498se5987shghj",
expirationOffset: 8
},
{
bidder: "ssgdsrgs",
uid: "0s0roaisrej9sawrga00s8es47876sw674756aq57s7e4iusui9su89u8serh",
expirationOffset: 8
},
{
bidder: "zsrfgsrg",
uid: "78sdsuhgvuyvsotirjhsporiyaburiseung3ibaieyraiyureiaireauhiiuyiuaiuyrsibyurh",
expirationOffset: 8
},
{
bidder: "qsfrgs",
uid: "8s9rusvjrsiuniurnvs7s78erv8srvn8shrvsiyhrv8ysrvshifhbsiruhigru9eiru",
expirationOffset: 8
},
{
bidder: "fshyfjy",
uid: "98ae98ay4m9a8cm948ym94y9ah4eauyeh8aw7y38a73yw63476nt3486s4cgm87sc4my84m7",
expirationOffset: 8
}
]
}
This is 1441 bytes gzipped and base64 encoded. Note that I took care to minimize repeated strings so as not to favor gzip. When I used the same uid or bidder values, gzip did much better. :-)
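For reference, that measurement can be reproduced with something like the following Go snippet. The uids.json file name and the use of unpadded URL-safe Base64 are assumptions here; exact byte counts vary slightly with the gzip level and Base64 variant.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"fmt"
	"os"
)

func main() {
	uidsJSON, err := os.ReadFile("uids.json") // the 20-entry JSON shown above
	if err != nil {
		panic(err)
	}

	// gzip the raw JSON bytes.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(uidsJSON); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}

	// Base64-encode the compressed bytes, as the cookie value would be.
	encoded := base64.RawURLEncoding.EncodeToString(buf.Bytes())
	fmt.Printf("gzip+base64: %d bytes\n", len(encoded))
}
```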
Came up with this protobuf definition:
syntax = "proto3";
message uidCollection {
message uidObject {
string bidder = 1;
string uid = 2;
uint32 expirationOffset = 3;
}
uint32 version = 1;
uint64 expirationBase = 2;
repeated uidObject uids = 3;
}
Used the tool at https://www.sisik.eu/proto to create a binary representation of the JSON above and it came out to be 1522 bytes.
Discussed in PBS committee. @SyntaxNode has a hypothesis that protobuf would have the better CPU-performance profile, so even though the JSON approach saves a few bytes, he's planning to run an experiment to get some data.
> Used the tool at https://www.sisik.eu/proto to create a binary representation of the JSON above and it came out to be 1522 bytes.
I initially thought the 1522 bytes for Protobuf looked great against the 1441 bytes for JSON+GZIP, but I realized in my testing the Protobuf result is not Base64 encoded. When we add that in we get a less exciting 2028 bytes.
Using the same Baseline JSON model @bretg shared in his earlier comment, I've performed a comparison with several different formats. The first, JSON+Base64, is basically what we do today and serves as a point of reference (but using the relative timestamps).
Format | Size | Write Speed | Write Memory Size | Write Memory Allocs |
---|---|---|---|---|
JSON+Base64 | 3032 bytes | 7,372 ns | 8,492 bytes | 4 allocations |
JSON+GZIP+Base64 | 1340 bytes | 153,944 ns | 823,966 bytes | 28 allocations |
Protobuf+Base64 | 2028 bytes | 4,117 ns | 5,632 bytes | 3 allocations |
Protobuf+GZIP+Base64 | 1372 bytes | 149,189 ns | 821,795 bytes | 25 allocations |
Protobuf+Brotli-Level-0+Base64 | 1456 bytes | 37,251 ns | 38,890 bytes | 9 allocations |
Protobuf+Brotli-Level-6+Base64 | 1380 bytes | 308,450 ns | 2,177,637 bytes | 21 allocations |
Protobuf+Brotli-Level-11+Base64 | 1296 bytes | 5,292,766 ns | 34,635,047 bytes | 55 allocations |
Protobuf+LZ4+Base64 | 2012 bytes | 52,670 ns | 534,785 bytes | 6 allocations |
Format | Read Speed | Read Memory Size | Read Memory Allocs |
---|---|---|---|
JSON+Base64 | 30,641 ns | 10,704 bytes | 62 allocations |
JSON+GZIP+Base64 | 56,016 ns | 56,144 bytes | 80 allocations |
Protobuf+Base64 | 7,080 ns | 7,336 bytes | 69 allocations |
Protobuf+Brotli-Level-0+Base64 | 39,060 ns | 75,360 bytes | 87 allocations |
Protobuf+Base64 is the most efficient option, and while its output is 33% smaller than what we use today, that still compares unfavorably in size to the other formats. If we are optimizing for speed and size equally, this is the best option. However, if we want to optimize more for size and a little less for speed, I propose the Protobuf+Brotli-Level-0+Base64 format, which is just ~8% larger than GZIP while 400%+ faster in my write benchmarks and 30% faster in my read benchmarks.
I experimented with using GZIP and Brotli for just the UIDs, but the overhead of both compression algorithms actually increased the final size. Similarly, I tested Protobuf+Snappy+Base64 but the Snappy compression increased the size a bit from just Protobuf+Base64.
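For concreteness, the proposed write path could look roughly like the following Go sketch. It uses github.com/andybalholm/brotli and google.golang.org/protobuf as stand-in libraries, and encodeCookieValue is an illustrative name rather than anything from the actual benchmark code.

```go
import (
	"bytes"
	"encoding/base64"

	"github.com/andybalholm/brotli"
	"google.golang.org/protobuf/proto"
)

// encodeCookieValue marshals the uid message to protobuf, compresses it with
// Brotli at quality level 0 (the fastest setting, per the benchmark above),
// and Base64-encodes the result for use as a cookie value.
func encodeCookieValue(msg proto.Message) (string, error) {
	pb, err := proto.Marshal(msg)
	if err != nil {
		return "", err
	}

	var buf bytes.Buffer
	bw := brotli.NewWriterLevel(&buf, 0)
	if _, err := bw.Write(pb); err != nil {
		return "", err
	}
	if err := bw.Close(); err != nil {
		return "", err
	}

	return base64.RawURLEncoding.EncodeToString(buf.Bytes()), nil
}
```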
Are there any other compression libraries you'd like to see me add to this benchmark comparison? Any suggestions must have Go and Java libraries.
We need to represent the version outside of the encoded payload to determine which decoding approach to use. This approach needs to be backwards compatible with the current JSON+Base64 encoded format. We'll need to use a separator character not present in the Base64 URL character set, for which I think period "." is a good choice. This is the same choice made by JWT tokens and TCF2 consent strings.
Example:
2.i_cCAICqqqrq_1SRGhY3PdnRLwVQpwJLN11EoHKBhMhDQiVARRVU1qGOtq....
We should be fine for a long time with just 1 character for the version followed by 1 character for the separator, but in the future we could extend the number of characters preceding the separator if need be. The algorithm for version detection would be: if the value contains the separator, parse everything before it as the version and pick the decoder accordingly; otherwise treat the value as the existing version 1 format.
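A minimal sketch of that detection in Go (detectVersion is an illustrative name, not an existing PBS function):

```go
import (
	"strconv"
	"strings"
)

// detectVersion splits a cookie value on the first "." and returns the version
// plus the remaining payload. Values without a recognizable version prefix are
// treated as the existing version 1 format, since "." never appears in the
// Base64 URL character set.
func detectVersion(value string) (version int, payload string) {
	if i := strings.Index(value, "."); i > 0 {
		if v, err := strconv.Atoi(value[:i]); err == nil {
			return v, value[i+1:]
		}
	}
	return 1, value
}
```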
Protobuf is great, but first the cookie "uid" values should be split based on datatype... Since the real protobuf record size depends on the datatype, longs would be way smaller than the same data written as a string. Also, the bidder string name could be moved to some table with a uint32 ID. If needed, adapters would have to declare whether their field data is a string or a long (it could also be done automatically based on content).
syntax = "proto3";
message uidCollection {
message uidObject {
string bidder = 1;
string uid_string = 2; //use uid_string or uid_long but not both
uint64 uid_long = 3;
uint32 expirationOffset = 4;
}
uint32 version = 1;
uint64 expirationBase = 2;
repeated uidObject uids = 3;
}
> first the cookie "uid" values should be split based on datatype
That's a good observation. There are several possible data types we could detect and optimize. You mentioned long in your example; there are also UUIDs, base64-encoded bytes, and hex strings. My hope is that the compression layer on top of the protobuf binary encoding will solve for these storage inefficiencies without needing to add structural complexity. Let's test it. I'll use the protobuf structure you provided.
I replaced 25% of the entries in the Baseline JSON example with long values. This seems to be a slightly generous distribution based on real-world examples. The runtime complexity of both is close enough that I'll only list the sizes.
Format | Size |
---|---|
ProtobufString+Base64 | 1772 bytes |
ProtobufString+Brotli0+Base64 | 1304 bytes |
ProtobufOptimized+Base64 | 1700 bytes |
ProtobufOptimized+Brotli0+Base64 | 1356 bytes |
Without compression there is a 4% size reduction. With compression, there is a 4% size increase. It seems the compression algorithm fares a bit better with string numerics than with binary encoded numerics. This might not hold true for other potential optimized types though.
> the bidder string name could be moved to some table with a uint32 ID
Yes, I agree that would provide a size savings. I'm going to test it with the following protobuf definition:
syntax = "proto3";
message uidCollection {
message uidObject {
uint32 bidder = 1;
string uid = 2;
uint32 expirationOffset = 3;
}
uint64 expirationBase = 1;
repeated uidObject uids = 2;
}
Format | Size |
---|---|
Protobuf+StringBidders+Base64 | 2028 bytes |
Protobuf+StringBidders+Brotli0+Base64 | 1456 bytes |
Protobuf+IntBidders+Base64 | 1800 bytes |
Protobuf+IntBidders+Brotli0+Base64 | 1320 bytes |
Without compression there is an 11% size reduction. With compression, there is a 9% size reduction. I'd like opinions on whether this is worth the complexity of maintaining a list of bidder IDs. This cannot be as simple as an alphabetical index, since we need to account for added bidders, removed bidders, and different bidder lists between PBS-Go, PBS-Java, and forks.
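For illustration, such a mapping would have to be an explicit, append-only registry shared by both implementations rather than anything derived from the bidder list at build time. The bidder names and IDs below are purely illustrative:

```go
// bidderIDs is an explicit, append-only registry. IDs must never be reused or
// reassigned, even if a bidder is removed, so that old cookies keep decoding
// correctly across PBS-Go, PBS-Java, and forks.
var bidderIDs = map[string]uint32{
	"bidderA": 1, // illustrative entries only
	"bidderB": 2,
	"bidderC": 3,
}

// bidderNames is the reverse lookup used when reading a cookie.
var bidderNames = func() map[uint32]string {
	names := make(map[uint32]string, len(bidderIDs))
	for name, id := range bidderIDs {
		names[id] = name
	}
	return names
}()
```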
Discussed in PBS committee. We're leaning towards the 'Protobuf+StringBidders+Brotli0-Base64' solution, with external version number.
Java results are at https://github.com/snahornyi/uids-java-tests
The team suggests using 'Protobuf+StringBidders+Brotli0+Base64' -- i.e. brotli and base64.
@SyntaxNode - please confirm that your table above meant to include Base64... i.e. was the minus sign a typo?
> @SyntaxNode - please confirm that your table above meant to include Base64... i.e. was the minus sign a typo?
Confirmed. That is a typo. I'll fix it in my other comment.
There's a detail here I don't think we've ironed out: how the UID itself is represented.
I don't like the idea of having to configure the datatype of each adapter's ID. Looking at the list of IDs in my current cookie, the pattern I see is that most are strings, and the ones that are ints are generally shorter (~20 chars rather than ~40).
I suppose we could manage this in the /setuid code that creates the values: try to parse the incoming ID as a number and store it in the numeric field when it parses, otherwise store it as a string.
It's easy enough on the read side to deal with this, but we'd have to agree to prefer one over the other in case somehow both of them get set.
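To make that concrete, the write-side check could be as small as this sketch, assuming the uid_string/uid_long split from the earlier proto definition; setUID and uidObject here are illustrative names, not actual PBS types.

```go
import "strconv"

// setUID stores the incoming ID in UidLong when it parses cleanly as an
// unsigned integer, and falls back to UidString otherwise, so only one of the
// two fields is ever populated. The round-trip check keeps IDs with leading
// zeros as strings so they decode back unchanged.
func setUID(obj *uidObject, raw string) {
	if n, err := strconv.ParseUint(raw, 10, 64); err == nil && strconv.FormatUint(n, 10) == raw {
		obj.UidLong = n
		return
	}
	obj.UidString = raw
}
```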
@bretg Please review the conversation in this issue. The data type specific UID storage was discussed, explored, and ultimately rejected. We will be storing all UID values as strings.
This is the updated protobuf definition that I am proposing. I've followed the best practices from the protobuf style guide.
cookie2.proto
// Definition of Prebid Server's version 2 user sync cookie value encoding.
syntax = "proto3";

import "google/protobuf/timestamp.proto";

option go_package = "github.com/prebid/prebid-server/usersync";

message Cookie2 {
  google.protobuf.Timestamp expiration_base = 1;

  message UID {
    string bidder = 1;
    string value = 2;
    uint32 expiration_offset_days = 3;
  }

  repeated UID uids = 2;
}
I've added the go package option required for the code generator. Java would need to add their own Java options as well. Alternatively, we could provide the options directly to the code generator, but the best practices from Google on the matter are to include them in the file. I'm ok with making it as easy as possible for us to generate code for use in both PBS implementations.
This produces the following generated code structs for Go:
type Cookie2 struct {
    ExpirationBase *timestamppb.Timestamp `protobuf:"bytes,1,opt,name=expiration_base,json=expirationBase,proto3" json:"expiration_base,omitempty"`
    Uids           []*Cookie2_UID         `protobuf:"bytes,2,rep,name=uids,proto3" json:"uids,omitempty"`
}

type Cookie2_UID struct {
    Bidder               string `protobuf:"bytes,1,opt,name=bidder,proto3" json:"bidder,omitempty"`
    Value                string `protobuf:"bytes,2,opt,name=value,proto3" json:"value,omitempty"`
    ExpirationOffsetDays uint32 `protobuf:"varint,3,opt,name=expiration_offset_days,json=expirationOffsetDays,proto3" json:"expiration_offset_days,omitempty"`
}
Thanks @SyntaxNode - just to make sure we're on the same page on the /cookie_sync endpoint expectations... when it receives a cookie, it's going to have to first see if it's base64 encoded JSON. If not, then it passes the body through the brotli and protobuf decoders, right?
That's not what I had in mind. I proposed using a version prefix to make it easier / quicker for us to determine the correct code paths to use for decoding. In short, if it starts with "2." then remove those characters and decode the rest as a base64 encoded brotli compressed protobuf message. Else, consider it version 1 and error if there is a decoding problem.
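A sketch of that read path for version 2 values, again assuming github.com/andybalholm/brotli and google.golang.org/protobuf as the libraries; decodeCookieV2 is an illustrative name. Cookie2 is the generated type shown below.

```go
import (
	"bytes"
	"encoding/base64"
	"io"
	"strings"

	"github.com/andybalholm/brotli"
	"google.golang.org/protobuf/proto"
)

// decodeCookieV2 strips the "2." prefix, Base64-decodes the remainder,
// decompresses the Brotli payload, and unmarshals the protobuf message.
func decodeCookieV2(value string) (*Cookie2, error) {
	compressed, err := base64.RawURLEncoding.DecodeString(strings.TrimPrefix(value, "2."))
	if err != nil {
		return nil, err
	}

	decompressed, err := io.ReadAll(brotli.NewReader(bytes.NewReader(compressed)))
	if err != nil {
		return nil, err
	}

	var cookie Cookie2
	if err := proto.Unmarshal(decompressed, &cookie); err != nil {
		return nil, err
	}
	return &cookie, nil
}
```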
I think the progression of ideas in this issue thread has muddled the intended proposal. I'll create a separate Google doc with the full proposed specs on Monday.
There are so many server-side adapters now that the PBS uids cookie has grown so large that it's starting to affect what other cookie values the host company domain can receive. e.g. my uids cookie is 3900 bytes. (!)
We need to address this.
Here are some values from my cookie:
In total, I have 32 bidder entries in my cookie, for an average of 121 bytes per bidder encoded.
(See below for a major update to the original proposal.)
Current Structure
The expires value is used to drop the value from the cookie so /cookie_sync will get an updated ID from that bidder.

Structure of the current cookie:
Background on the current structure
Here are the comments from the code (usersync/cookie.go):

// "Legacy" cookies had UIDs without expiration dates, and recognized "0" as a legitimate UID for audienceNetwork.
// "Current" cookies always include UIDs with expiration dates, and never allow "0" for audienceNetwork.
//
// This Unmarshal method interprets both data formats, and does some conversions on legacy data to make it current.
// If you're seeing this message after March 2018, it's safe to assume that all the legacy cookies have been
// updated and remove the legacy logic.
Possible new structure
Here's a straw proposal based on using a relative timestamp rather than absolute: a shared expirationBase plus a small per-bidder expirationOffset, as in the JSON example at the top of this thread.
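A relative scheme only needs one absolute timestamp; each bidder's expiry is reconstructed from its small offset. A minimal illustration, assuming expirationBase is a Unix timestamp in seconds and the offset is a number of days (as the later expiration_offset_days field name suggests):

```go
import "time"

// expirationTime rebuilds a bidder's absolute expiration from the shared base
// timestamp and its per-bidder offset in days.
func expirationTime(expirationBase int64, expirationOffsetDays uint32) time.Time {
	return time.Unix(expirationBase, 0).Add(time.Duration(expirationOffsetDays) * 24 * time.Hour)
}
```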