nats-io / nats-architecture-and-design

Architecture and Design Docs
Apache License 2.0

Object Store Discussion #57

Closed scottf closed 2 years ago

scottf commented 2 years ago

Discussion moved to https://docs.google.com/document/d/13RF06NCzRPBOW_es1pKUqQyThMBlGGUz170Y1xuls8Q/edit#

Overview

Working up to an ADR for object store.

Feature Requests

Stream Conventions

Stream Name

???

Chunk Subject Name

$O.%s, where %s is a subject token meaningful to the file / blob, possibly the file / blob id (see Meta Data)
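A minimal Go sketch of the subject pattern above, assuming the blob id is used as the token. The function name `chunkSubject` is hypothetical; only the `$O.` prefix comes from the proposal.

```go
package main

import "fmt"

// chunkSubject builds the subject that an object's chunks would be published to.
// The "$O." prefix is from the proposal; using the blob id as the token is an
// assumption, since the discussion leaves the exact token open.
func chunkSubject(blobID string) string {
	return fmt.Sprintf("$O.%s", blobID)
}

func main() {
	fmt.Println(chunkSubject("myuid"))
}
```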

Meta Data

File / Blob Meta Data

File / Blob meta data in JSON form or as message headers.

Options being discussed for where to store this:

Nothing prevents this data from being stored elsewhere.

| Key | Example value | Description |
| --- | --- | --- |
| id | myuid | string, some unique id |
| name | myfile.txt | string, name for a file / blob |
| description | blah blah | string, description |
| date | 363823843 | int64, unix date for a file |
| contentType | text/plain | string, mime type |
| chunks | 42 | int64, count of chunks |
| chunkSize | 8192 | int64, number of bytes per chunk (uncompressed payload size) |
| lastChunkSize | 999 | int64, the last chunk is most often not exactly the same size as all the others |
| length | 987654321 | int64, total length of data |
| Digest | sha-256=base64stuff | as in http, <digest-algorithm>=<digest-value> |
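As a rough illustration of how these fields relate, here is a Go sketch that models the meta data as a struct and derives chunks and lastChunkSize from length and chunkSize. The type name `BlobMeta` and the helper `chunkLayout` are hypothetical; only the JSON keys come from the list above. The sample length of 336871 bytes is chosen so that it yields the example values 42 and 999.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BlobMeta mirrors the proposed file / blob meta data fields.
// The Go type is hypothetical; only the JSON keys come from the proposal.
type BlobMeta struct {
	ID            string `json:"id"`
	Name          string `json:"name"`
	Description   string `json:"description,omitempty"`
	Date          int64  `json:"date"`
	ContentType   string `json:"contentType,omitempty"`
	Chunks        int64  `json:"chunks"`
	ChunkSize     int64  `json:"chunkSize"`
	LastChunkSize int64  `json:"lastChunkSize"`
	Length        int64  `json:"length"`
	Digest        string `json:"Digest,omitempty"`
}

// chunkLayout derives the chunk count and the size of the final chunk
// from the total length and the fixed chunk size.
func chunkLayout(length, chunkSize int64) (chunks, lastChunkSize int64) {
	if length <= 0 || chunkSize <= 0 {
		return 0, 0
	}
	chunks = (length + chunkSize - 1) / chunkSize // ceiling division
	lastChunkSize = length - (chunks-1)*chunkSize
	return chunks, lastChunkSize
}

func main() {
	chunks, last := chunkLayout(336871, 8192)
	meta := BlobMeta{
		ID: "myuid", Name: "myfile.txt", ContentType: "text/plain",
		Chunks: chunks, ChunkSize: 8192, LastChunkSize: last, Length: 336871,
	}
	out, err := json.Marshal(meta)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

When the length is an exact multiple of chunkSize, this sketch reports lastChunkSize equal to chunkSize rather than 0; the proposal does not say which is intended.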

Chunk Meta Data

Chunk meta should be included as headers on each chunk.

| Key | Example value | Description |
| --- | --- | --- |
| id | CHNK1 | string, some unique id for this chunk. Maybe not necessary? |
| blob-id | BLB1 | string, id of the file or blob meta this chunk belongs to if blob meta is used |
| length | 987654321 | int64, length of unencoded chunk |
| chunk-number | 42 | int64, 1 based chunk number. should match sequence. Redundant maybe. |
| start | 344064 | int64, 0 based offset, first byte in chunk in full blob. Useful for random access |
| encoded-length | 12345 | int64, length of data when encoded, this is what the payload length will be. An extra check. |
| Content-Encoding | gzip | as in http |
| Digest | sha-256=base64stuff | as in http, <digest-algorithm>=<digest-value> |
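The following Go sketch shows one way the per-chunk headers could be assembled for an unencoded chunk. It assumes start = (chunk-number - 1) × chunkSize and uses a plain map in place of real message headers; the function name `chunkHeaders` is hypothetical. Content-Encoding and encoded-length are left out because the sketch does not compress the payload.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strconv"
)

// chunkHeaders builds the proposed per-chunk headers for an unencoded chunk.
// A plain map stands in for message headers (an assumption of this sketch).
// chunkNumber is 1-based, so the 0-based start offset is (chunkNumber-1)*chunkSize.
func chunkHeaders(blobID string, chunkNumber, chunkSize int64, data []byte) map[string]string {
	sum := sha256.Sum256(data)
	return map[string]string{
		"blob-id":      blobID,
		"chunk-number": strconv.FormatInt(chunkNumber, 10),
		"start":        strconv.FormatInt((chunkNumber-1)*chunkSize, 10),
		"length":       strconv.Itoa(len(data)),
		"Digest":       "sha-256=" + base64.StdEncoding.EncodeToString(sum[:]),
	}
}

func main() {
	h := chunkHeaders("BLB1", 43, 8192, []byte("hello"))
	fmt.Println(h["start"], h["length"], h["Digest"])
}
```

Under that assumption, the example offset 344064 corresponds to 1-based chunk 43 with 8192-byte chunks, since 42 × 8192 = 344064.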

Other Considerations

Pre-chunked data

Consider streamed video, which is already broken up into individual chunks that can be retrieved in a random-access fashion. A similar storage mechanism can be used, but there needs to be a way to know what each specific record (message) is. There might be an index piece of data that stores the timestamp of each chunk along with its sequence number. Alternatively, you could extend the subject to $O.<subject>.<chunkIdentifier>, giving the ability to subscribe specifically to that chunk. It is unclear whether having that many subjects is efficient, or whether it is better to just deal with the sequence. Either approach has tradeoffs when creating subscriptions / consumers to retrieve a specific part.
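To make the two options concrete, here is a small Go sketch under stated assumptions: a sorted in-memory index from chunk timestamp to stream sequence, searched with a binary search, plus a helper that builds the per-chunk subject. The names `indexEntry`, `seekSequence`, and `perChunkSubject`, and the millisecond timestamps, are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// indexEntry maps the timestamp at which a chunk starts to its stream sequence.
// Both the type and the millisecond unit are assumptions of this sketch.
type indexEntry struct {
	TimestampMs int64
	Sequence    uint64
}

// seekSequence returns the sequence of the last chunk that starts at or before ts.
// The index must be sorted by TimestampMs in ascending order.
func seekSequence(index []indexEntry, ts int64) (uint64, bool) {
	// Find the first entry that starts strictly after ts, then step back one.
	i := sort.Search(len(index), func(i int) bool { return index[i].TimestampMs > ts })
	if i == 0 {
		return 0, false
	}
	return index[i-1].Sequence, true
}

// perChunkSubject builds the extended subject $O.<subject>.<chunkIdentifier>.
func perChunkSubject(subject, chunkID string) string {
	return fmt.Sprintf("$O.%s.%s", subject, chunkID)
}

func main() {
	index := []indexEntry{{0, 1}, {2000, 2}, {4000, 3}}
	seq, ok := seekSequence(index, 2500)
	fmt.Println(seq, ok)
	fmt.Println(perChunkSubject("video", "000003"))
}
```

The index approach keeps a single subject and resolves position on the reader side; the per-chunk subject pushes that lookup into subject matching instead.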

aricart commented 2 years ago

@derekcollison The requirement to generate a digest is problematic. In some languages, the entire data must be available for the digest to be calculated (Go can do it without buffering the entire contents, but for example none of the web crypto APIs work that way). This would mean the digest shouldn't be required.

Digest inclusion should be an option, calculated by the writer if desired. Also, instead of decorating the hash with the digest algorithm, the algorithm should be a field in the ObjectInfo, with the hash value stored in base64 URL encoding. Alternatively, each of the chunks could carry a digest as a header entry; presumably a client that can read a message chunk will have the chunk's data in memory while handing it off to the application.

derekcollison commented 2 years ago

I think that functionality is pretty important.

Is this not a solution in the TS/JS world?

https://stackoverflow.com/questions/18658612/obtaining-the-hash-of-a-file-using-the-stream-capabilities-of-crypto-module-ie

aricart commented 2 years ago

While Node may have a solution, things like browsers won't. So if it is a requirement, the client will have to implement a streamed version. Streaming the data is fine, but if the object is large, non-streamed crypto operations will spike memory or OOM the process.

https://developer.mozilla.org/en-US/docs/Web/API/SubtleCrypto/digest

scottf commented 2 years ago

Closing. See ADR PR https://github.com/nats-io/nats-architecture-and-design/pull/66