Conversion to and from mixed-endian byte strings

ploeh commented 5 years ago

Microsoft tends to encode UUIDs in a mixed-endian format.

"Other systems, notably Microsoft's marshalling of UUIDs in their COM/OLE libraries, use a mixed-endian format, whereby the first three components of the UUID are little-endian, and the last two are big-endian."

Source: Wikipedia

There's plenty of evidence of this. Ask me how I know 😉

It'd be useful if the uuid library also provided conversions to and from this format. I created this conversion to ByteString:

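import Data.ByteString.Lazy (ByteString)  -- toByteString returns a lazy ByteString
import qualified Data.ByteString.Lazy as BS
import Data.UUID (UUID, toByteString)

-- Reverse the first three fields (4, 2 and 2 bytes); keep the last 8 bytes as-is.
toMixedEndianByteString :: UUID -> ByteString
toMixedEndianByteString uuid =
    case BS.unpack $ toByteString uuid of
      [w0,w1,w2,w3, w4,w5, w6,w7, w8,w9, wa,wb,wc,wd,we,wf] ->
        BS.pack [w3,w2,w1,w0, w5,w4, w7,w6, w8,w9, wa,wb,wc,wd,we,wf]
      _ -> BS.empty  -- unreachable: toByteString always yields 16 bytes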

I've yet to attempt the reverse conversion, but I think it'll look similar.
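
For what it's worth, here is a minimal sketch of what the reverse direction might look like. It assumes the same byte permutation as above, which is its own inverse, followed by the library's existing fromByteString. The names swapGuidFields, fromMixedEndianByteString and example are hypothetical and not part of the uuid package.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.UUID (UUID, fromByteString, fromWords, toByteString)

-- Reverse the first three fields (4, 2 and 2 bytes) and leave the last
-- eight bytes alone. Applying this twice gives back the original bytes.
-- Returns Nothing unless the input is exactly 16 bytes long.
swapGuidFields :: BL.ByteString -> Maybe BL.ByteString
swapGuidFields bs =
  case BL.unpack bs of
    [w0,w1,w2,w3, w4,w5, w6,w7, w8,w9, wa,wb,wc,wd,we,wf] ->
      Just (BL.pack [w3,w2,w1,w0, w5,w4, w7,w6, w8,w9, wa,wb,wc,wd,we,wf])
    _ -> Nothing

-- Hypothetical counterpart to toMixedEndianByteString: undo the swap,
-- then decode the network-order bytes with fromByteString.
fromMixedEndianByteString :: BL.ByteString -> Maybe UUID
fromMixedEndianByteString bs = swapGuidFields bs >>= fromByteString

-- Illustrative value only: the UUID 00112233-4455-6677-8899-aabbccddeeff.
example :: UUID
example = fromWords 0x00112233 0x44556677 0x8899aabb 0xccddeeff

-- Encoding to mixed-endian and decoding again should return the same UUID.
roundTrips :: UUID -> Bool
roundTrips u =
  (swapGuidFields (toByteString u) >>= fromMixedEndianByteString) == Just u
```

For the example value, the mixed-endian bytes come out as 33 22 11 00 55 44 77 66 88 99 aa bb cc dd ee ff. Unlike the encoder, the decoder has to return a Maybe, because an arbitrary ByteString need not be 16 bytes long.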

Is there any interest in getting this into the library? If so, I'll be happy to attempt a pull request.

hvr commented 5 years ago

I've been pointed to https://docs.microsoft.com/en-us/previous-versions/aa379358(v%3Dvs.80) which claims

typedef struct _GUID {
    unsigned long Data1;
    unsigned short Data2;
    unsigned short Data3;
    unsigned char Data4[8];
} GUID,  UUID;

so your code above would only be correct when the host order is little-endian.

What is your use-case for this serialization format? Is it for C FFI purposes or something else?

ploeh commented 5 years ago

My use case is reading column data from SQL Server. For that, I'm using the odbc package. This package has, however, no particular representation of a UUID, so instead, for UNIQUEIDENTIFIER columns, you just get a ByteString. The same applies when saving data to such a column: you must supply a ByteString value.

I've noticed that when I convert a UUID value with toByteString and save the result to the database, the bytes in each of the first three parts are reversed.

Other people have made corroborating observations.

The explanation could be that

"The first 4 parts are either 2 or 4 bytes long and are therefore probably stored as a native type (ie. WORD and DWORD) in little endian format. The last part is 6 bytes long and it therefore handled differently (probably an array)"

and

"since the last 8 bytes are stored as a byte array, I think this identifies the behaviour you are seeing."

Source: https://stackoverflow.com/q/10190817/126014

When I convert the bytes using the above toMixedEndianByteString function the value gets correctly stored in the database.

hvr commented 5 years ago

@ploeh I see; however in this case I'd advocate that it should be the database library's responsibility to know how to decode/encode the types supported by the respective database; and in fact, that's what e.g. postgresql-simple does. However, I can't bring this up myself at https://github.com/fpco/odbc/issues as I've been banned by FPComplete.

ploeh commented 5 years ago

I don't mind taking the issue to odbc instead. Ultimately, I can just keep my working solution in my own code base, where it already works. I did think that I'd ask here first, though, since this might be a problem with UUID values marshalled via any Microsoft-based system.

As the Wikipedia entry suggests, this could be an issue with any UUID you receive via COM/OLE, so the problem is likely much wider than SQL Server alone. I haven't tried, but one might run into similar problems when interacting with, say, Microsoft Office, Exchange, or other older systems of that type.

As I did spend a few hours figuring all this out, I thought I'd offer the solution at the place where it'd be most generally available to other users, thereby saving others from similarly wasted time.

hvr commented 5 years ago

If you get this encoding via OLE/COM, this means via FFI, no? In that case you'd typically not get it via a ByteString but rather as a Ptr, and then we should rather talk about the Storable API. I'd like to see more real-world use-cases beyond ODBC to better inform how to design and add this into the uuid package.
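
To make the Ptr-based alternative concrete, here is a rough sketch of reading a GUID struct through Foreign.Storable rather than through a ByteString. It assumes the Windows layout quoted earlier, with a 32-bit Data1 at offset 0, 16-bit Data2 and Data3 at offsets 4 and 6, and the eight Data4 bytes at offset 8. The names peekGuid and bigEndianWord32 are hypothetical and not part of the uuid package.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.UUID (UUID, fromWords)
import Data.Word (Word8, Word16, Word32)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff)

-- Combine four bytes, most significant first, into one Word32.
bigEndianWord32 :: [Word8] -> Word32
bigEndianWord32 = foldl (\acc b -> (acc `shiftL` 8) .|. fromIntegral b) 0

-- Hypothetical helper: read a C GUID struct located at the given pointer.
-- Assumes Data1 is 32 bits wide, as it is on Windows.
peekGuid :: Ptr a -> IO UUID
peekGuid p = do
  d1 <- peekByteOff p 0 :: IO Word32            -- Data1, host byte order
  d2 <- peekByteOff p 4 :: IO Word16            -- Data2, host byte order
  d3 <- peekByteOff p 6 :: IO Word16            -- Data3, host byte order
  d4 <- mapM (peekByteOff p) [8 .. 15] :: IO [Word8]  -- Data4, plain bytes
  let w2 = (fromIntegral d2 `shiftL` 16) .|. fromIntegral d3
      w3 = bigEndianWord32 (take 4 d4)
      w4 = bigEndianWord32 (drop 4 d4)
  pure (fromWords d1 w2 w3 w4)
```

Because the three integer fields are read as native words, this sketch does not hard-code a byte swap, so it should behave the same on little-endian and big-endian hosts.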

ploeh commented 5 years ago

That's a good point; I hadn't thought that through. It's true that when interacting with the odbc package, I take advantage of the feature that already turns SQL Server's native UNIQUEIDENTIFIER into a ByteString. The code that does that, however, does get the data via a Ptr.

haskell-hvr / uuid

Conversion to and from mixed-endian byte strings #48