bakkot commented 2 years ago

I have a proposal for a spec for metadata, laying out goals and a formal spec.

I'm happy to implement this if there's buy-in.

Thoughts?

RFC: Structured metadata

Currently when generating images from the CLI (but not the web), metadata for that is stored as a string kind-of corresponding to the prompt. That metadata is enough to reproduce the original image... sometimes.

I'd like to:

be more precise about the metadata which gets stored
allow reproducing any output just from the metadata and necessary input files
- necessary input files meaning the model weights, the image for img2img, and the embeddings if using embeddings
- metadata should allow you to confirm that you have the right inputs, by storing hashes of all of those files
- "any" output includes outputs from seed fuzzing and interpolations (which I haven't written yet, in part because I wanted to work out the metadata format first)
store it in a structured format, namely JSON
expand the metadata so it works with grids
expand the metadata so it works with stuff like variations and interpolations

To that end, I'd like to propose the following spec for metadata.

In this doc, "hash" means "the first 8 characters of the hex-encoded sha-256".

Data location

Metadata is a JSON string following the "top-level data" schema, stored in an uncompressed PNG tEXt or iTXt chunk named "sd-metadata". (This corresponds to what PIL does already when adding text data - it will choose tEXt or iTXt depending on whether it contains non-latin-1 characters. I just figure it's worth writing this down.)

Top-level data

The top-level metadata should have the following fields:

model: "stable diffusion"
model_id: string identifying the model. must by the model_id field of a Model card. Optional; there is no default value, but consuming applications may infer a value from model_hash if they recognize that value.
model_url: a string giving a URL where the model can be downloaded (if public) or read about (if not). Optional, does not have a default.
model_hash: hash of the weights [precise format TBD depending on implementation feasibility]; see the "model information" section below
app_id: a string identifying the application consuming the model. It is recommended, but not required, that applications hosted on GitHub use the username/repo_name of the repository in this field; for example, the fork we're on would use lstein/stable-diffusion.
app_version: a string giving the version of the app from app_id. It is recommended, but not required, that projects with numbered versions use a string of the form v1.0, and that projects built from git repos use the short-form git hash of the commit. Optional, defaults to "unknown".
app_url: a string giving the canonical location of the application on the web. Optional, does not have a default.
embeddings_hashes: an an array of the hashes of any textual-inversion embeddings in use. Optional, defaults to an empty array.
arch: "cuda", "MPS", or another helpful value indicating the GPU architecture. Optional, defaults to "unknown".
grid: a boolean, whether this was a grid. Optional, defaults to false.
metadata_version: the string "1.0". Optional, defaults to "1.0". Breaking changes to this metadata format should update this field.

and then also one of the following two fields, depending on whether this is a grid:

image: an object in one of the formats specified below
images: an array of such objects

Image data

Every image has the following fields:

type: either "txt2img" or "img2img"
postprocessing: either null, indicating no postprocessing was done, or an arbitrary object representing the postprocessing performed. Spec for this will depend on individual postprocessors, but I'll write something up for the ones we support. Optional, defaults to null.
sampler: one of these samplers
prompt: a nonempty array of { prompt: string, weight: number } pairs. The single-prompt case is [{ prompt: prompt, wieght: 1 }]
seed: a seed
variations: an array of { seed: number, weight: number } pairs used to generate variations. Optional, defaults to an empty array.
steps: the number of steps configured to be taken
cfg_scale: the unconditional guidance scale
step_number: the number of steps actually taken. Normally this will be the full number of steps, but for intermediate images it may be less. Optional, defaults to steps (or strength_steps in the case of img2img).
width: the specified width (as a number of pixels). Optional only when this metadata is embedded in am image whose width is the same as this value would be, in which case it defaults to that image's width.
height: the specified height (as a number of pixels). Optional only when this metadata is embedded in am image whose height is the same as this value would be, in which case it defaults to that image's height.
extra: an object containing any necessary additional information to generate this image. Not to be used for other data, like contact information. Optional, defaults to the empty object.

Images of type img2img also have the following fields:

orig_hash: hash of the input image
strength_steps: the configured strength for running img2img (as an integer; as discussed here, that's what it actually is).

Height/width are not stored since you can infer those from the file.

Thoughts on storing the model information

I am proposing to store a hash of the loaded model, which is a lot faster than reading the file from disk a second time, but the hash correspond to the file on disk. Better than nothing, though.

Is it worth also storing a hash of the model config? I don't think so, since you're always going to need the original config for a given model weights file.

fat-tire commented 2 years ago

Perhaps processing instead of postprocessing with an ordered set of steps...? and include every step, including initial generation, if it's there. Agreed on the versioning.

Kyle0654 commented 2 years ago

As I've been refactoring things I've naturally ended up with something like this: [models.py](https://github.com/psychedelicious/stable-diffusion/blob/fba09fec82df9c440ce4bdacc7a463096faaba64/server/models.py#:~:text=class%20DreamBase()%3A,time%3A%20int).

I had wanted to break it into a list of processing steps, but that would require a lot more code to handle, so I held off.

I think it'd be nice if the processing steps were somewhat flexible. I could imagine even image load/preparation could be processing steps. E.g.:

steps:
- type: load_image
  inputs:
    filename: "xyz.png"
  outputs:
    image: "xyz"
- type: gfpgan
  amount: 0.7
  inputs:
    image: "xyz"
  outputs:
    image: "xyz-g"
- type: upscale
  amount: 2.0
  inputs:
    image: "xyz-g"
  outputs:
    image: "xyz-g-u"
- type: save-result
  inputs:
    image: "xyz-g-u"
  outputs:
    filename: "xyz-g-u.png"

So very roughly: take image xyz.png and expose it to the pipeline as xyz for this job. Then run GFPGAN on image xyz with strength 0.7, and expose the result as image xyz-g in the pipeline. Next, run upscale at 2x on image xyz-g, and expose the resulting image as xyz-g-u. Finally, save image xyz-g-u to a file named xyz-g-u.png (which would also likely signal a result being available).

Then elsewhere in the manifest it could include identifiers for each type of processor used (might want to vary types here), which would enable extensibility (even potentially pluggable extensibility, if there was enough flexibility).

This may get to be a lot of data to include in the metadata on a png though - I don't know how well it packs that field.

lstein commented 2 years ago

I am proposing to store a hash of the loaded model, which is a lot faster than reading the file from disk a second time, but the hash correspond to the file on disk. Better than nothing, though.

I'm almost done with an implementation that's integrated into this fork. However, I have questions about the hashes. First, are you sure you want only the first 8 characters of the sha256 hash? It seems a waste to calculate the thing and then throw away most of its digits.

Second, once the model weights are loaded into the torch object, how do you get them back out in order to calculate the hash? I am caching the hash to a file on disk adjacent to the weights file, so the long wait only happens the first time the hash is needed, but that first wait is long....

keturn commented 2 years ago

I appreciate the effort at standardization! I'm reading this with the thought of implementing it for :firecracker:diffusers. It looks like things map over pretty well.

One suggestion for the seed: have a way to specify the noise generation function. I guess that could either be in the same field, something like pytorch.randn(2364), or two separate fields. I've written one noise generation function so far, but I expect it to be more of a thing soon as we learn more about how the initial conditions influence the results. (Of course, if this is an extensible spec, we can always wait-and-see instead of throwing in the kitchen sink in version 1.0.)

Super nerdy bikeshedding suggestion: if you're going to be hashing large assets, consider something other than sha-256, e.g. meowhash, metrohash, or blake3. (This is totally optional and is probably only useful if sha-256 execution time is a problem or you're already using one of those other hashes in your application.)

keturn commented 2 years ago

Is there a place to distinguish between 16 and 32 bit versions of the model? Does that fit in to the model URL?

Kyle0654 commented 2 years ago

If we can somehow validate that the model matches what's at the URL, could we just use the URL and not include any other information? I know that may not be trivial given the file sizes, but it might be okay to validate once and save validation state (and file size/modification time, to check for file changes) so it's a one-time thing at initial run.

I've been considering splitting off a ModelProvider in my API anyway, to support multiple models at runtime (I don't know the performance implications or feasibility of that), or at least provide an interface to this sort of information.

lstein commented 2 years ago

After thinking about it a bit, I'm going to change the model loading code to read the weights into CPU memory, compute the sha256 hash, cache the results on disk (so that it doesn't have to be done each and every time), and then load into torch. Does that sound right?

UPDATE: It takes a while to calculate the hash, even after reading it from disk. About 10s on a 40 CPU 2.2 GHz HPC node. It might take a bit longer on a desktop PC, and I'm curious whether anybody will notice the slow down when I role out the change in a day or so. Fortunately it only happens the first time you load the model.

Kyle0654 commented 2 years ago

This actually gives me a good idea for restructuring the new code. If everything was a Processor, and processors could just be modules, it could open up development a lot (everything would be decoupled, you'd still be able to get access to Models/files/etc. by using dependency injection, etc.)...

psychedelicious commented 2 years ago

I think it'd be nice if the processing steps were somewhat flexible. I could imagine even image load/preparation could be processing steps. E.g.:

Agreed. To keep this from getting out of hand, should we allow only a single instance of each processing step?

If everything was a Processor,

This is how I'm thinking as well. Each module is structured such that it accepts image data and previous pipeline steps, and returns the new image data with its own metadata appended to the pipeline steps.

Processing modules have their own metadata format defining its own needed parameters and the formats it accepts and returns:

{
    "module": {
        "name": "GFPGAN",
        "version": 1.0,
        "home": "" // git repo? website? whatever
    },
    "parameters": {
        "strength": 0.7
    },
    "input_format": "", // one of "base64" | "PIL Image" | "file path" | ...
    "output_format": "", // one of "base64" | "PIL Image" | "file path" | ...
}

The module manager handles translating between these formats and throws a descriptive error if the version mismatches what is installed. Wonderfully decoupled, extensible, open...

edit: This repo provides the "Generation" module as well.

Kyle0654 commented 2 years ago

To keep this from getting out of hand, should we allow only a single instance of each processing step?

I don't know that I'd want to restrict it. I've already seen some clever implementations like one of the hd ones that does this process: generate, upscale, split, img2img each split, combine. I think I've seen other instances of multiple img2img runs as well, and grid generation and such.

I'd keep the inputs and outputs extensible though. I could imagine a scenario where you have two image inputs, for example, or multiple image outputs.

As long as each module can be reliably deterministic, it would make things pretty extensible. I can also probably wire it up so dependency injection works, so there'd be no manual wiring of things either (e.g. you could just say "I need access to files" in __init__ and it would work).

The challenge would be in the UI forming the pipeline to execute, unfortunately. I doubt you'll want a UI that's a straight representation of the pipeline in most cases (though maybe that'd be a nice UI to provide as an option).

psychedelicious commented 2 years ago

This is what I'm thinking for UI, just a flowchart really - is that what you mean as a straight representation?

Untitled_Artwork 8

Kyle0654 commented 2 years ago

Yah, something like that. Effectively a node system where you wire inputs and outputs - e.g. Unreal, Substance Designer, etc.

I think with all the major modules (generate, img2img, etc.) available, and a handful of transformation modules (e.g. image slicing, resizing, etc.) you might see people create some really unique stuff that we're not even thinking about here. =)

Of course, I don't know how to handle things like foreach, combine, etc. very cleanly in metadata =/. Maybe we just force things to static initially?

psychedelicious commented 2 years ago

Let's continue this broader architecture in a more visible discussion thread #597

Kyle0654 commented 2 years ago

One thing I wasn't certain about (that belongs here): what if you use a generated image that has metadata as the base image for something like img2img? Since we detect that metadata, should we include it in the new image? Is there some point where the metadata becomes too large?

psychedelicious commented 2 years ago

One thing I wasn't certain about (that belongs here): what if you use a generated image that has metadata as the base image for something like img2img? Since we detect that metadata, should we include it in the new image? Is there some point where the metadata becomes too large?

I don't think so. What if that initial image itself was made via img2img? We might be forced to recreate the big bang!

I say do not give any special handling to an init img, even if it has generation metadata - consider init images as atoms.

Kyle0654 commented 2 years ago

Maybe it could just be a url to it and a date if we have one? (or a filename and date).

psychedelicious commented 2 years ago

@bakkot did not include the filename for privacy reasons and instead opted for a hash: https://github.com/lstein/stable-diffusion/issues/266#issuecomment-1238755829

But that does not allow for metadata to be used to reproduce img2img images. You'd have to figure out which image was hashed. Maybe I'm missing something but that makes metadata worthless for img2img unless you remember which image you used.

IMO a filename and hash should be used. Privacy concerns are handled by the implementation e.g.t the back end copies init images to wherever they need to be and, if privacy mode is enabled, changes the filename to a UUID and strips metadata from them.

fat-tire commented 2 years ago

IMO a filename and hash should be used

Maybe a URL instead of a filename or path? Even if it's file:// would be nice to reference an image resource accessible via http:// or ssh:// or whatever.

psychedelicious commented 2 years ago

Perhaps we have filename (e.g. my_init_image.png), location (e.g. path/to/my_init_image.png or https://www.website.com/my_init_image.png or whatever is appropriate), and hash.

fat-tire commented 2 years ago

Since file://path/to/image.jpg is a legit URL just like https://www.example.com/image.jpg, maybe just a single image_url field would work, which could cover both "filename" and "location" and tell you how to get it. In the case of file:// it might prefer relative over absolute paths (to avoid inclusion of user or account names). There is of course a difference between posix-style paths and windows-style paths, but that's easily translated, especially for relative paths.

I was also just thinking that a nice thing about the node-based pipeline description of how to generate an image is that if img2img is used with an image which in turn contains metadata about how it was generated, that image could theoretically be imported and hooked into the graph to be recreated from its metadata too.

psychedelicious commented 2 years ago

I'd like to question grid being relevant. Why do we need that? Yes, grids were in the initial scripts, but they aren't an inherent part of SD or other generation technology. Grid belongs to the presentation, not the image itself.

codedealer commented 2 years ago

I'd like to question grid being relevant.

I agree. Also not clear on the purpose of variations in image_data, why does it matter which variations were created from that image or is it necessary for the reproduction?

The prompt field currently doesn't account for so called "negative prompts" which is different from prompts with negative weights, see: https://github.com/sd-webui/stable-diffusion-webui/discussions/999

Lastly I'd like to voice my concerns about putting file paths of any nature into metadata. Privacy aside this is too unreliable a feature in my opinion. If we allow (and I think we should) these images to be shared across the community the paths (both relative and absolute) can change arbitrarily but it shouldn't impede reproducibility of an image in any case. The original spec already has orig_hash and that should be enough to verify that the image supplied is init_image. Yes it falls onto a user to remember which image was the original and to supply it with the shared one. An alternative to that would only be embedding the entire init_image into metadata itself.

Kyle0654 commented 2 years ago

We may want to consider including a handful of fields in the file metadata (separately from this) to indicate metadata spec version and format. I can imagine we'll eventually want to gzip or otherwise compress the metadata (if not come up with a binary format).

bakkot commented 2 years ago

The PNG format where we're sticking this metadata already supports gzip'd text in tEXt and iTXt chunks, marked by a bit in the header for that chunk, so that's already future-proof.

(We almost certainly do not want to come up with a binary format.)

psychedelicious commented 2 years ago

Ok, so suppose the client has the responsibility of keeping track of init images. Aren't we, in practice, deferring a part of the spec (association of init image to result image) to client implementation?

Scenario 1: I have generated a lot of images via SD img2img. My computer crashes but thankfully I had a data backup. I reinstall whatever software I used to create the images. How does my software figure out which init images go with which results?

Works

Embedding init image

non-cryptographic file identifer (filename) + including the files

database keeping track of everything (if it wasn't lost in the crash)

Scenario 2: I have generated a lot of images via SD img2img. A new, vastly improved UI is created and I want to migrate to it. How does that happen?

Works

Embedding init image

non-cryptographic file identifer (filename) + including the files

Doesn't work

database keeping track of everything (new software has had to roll their own way of keeping track)

Scenario 3: I have generated a lot of images via SD img2img. My friend wants to iterate on my work. They don't use the same software I use. How do I send them my best results and include the init images?

Works

Embedding init image

non-cryptographic file identifer (filename) + including the files

Doesn't work

database keeping track of everything

Scenario 4: A new img2img method is invented in which an arbitrary number of init images are provided and you get an cool mix of all of them. How do we indicate which images are used?

Works

array of non-cryptographic file identifers (filenames) + including the files

database keeping track of everything

Doesn't work

Embedding (arbitrary number of files)

I understand that not including a reference to the init image besides a hash may be "correct", but I don't think it's functional. We're not building a metadata spec to be correct, we're building it to be used in the real world, right?

psychedelicious commented 2 years ago

After making a tea and doing some testing, I think embedding the init images as base64 is probably a good enough solution. I embedded 50x base64 images in a PNG's metadata without issues writing or reading the data back. The PNG is now 34 MB but well that comes with the territory.

Edit: according to this official-looking website, the max chunk length is a Very Large Number™️. So we can almost fit an abritrary number of init images. http://www.libpng.org/pub/png/spec/1.2/PNG-Structure.html

fat-tire commented 2 years ago

To save space, perhaps only include the images in leaf/terminal nodes as any intermediate images (flipped, rotated, combined, tiled, etc.) should be able to be derived from those, right?

codedealer commented 2 years ago

@psychedelicious I'm all for embedding an arbitrary number of images inside metadata as base64 if they all are needed to regenerate the image. You are going to need them anyway, whether they come packed into one file or several. Sharing/uploading just one file is easier in terms of general UX.

As a side note: why was this marked as completed? Doesn't feel like a conclusive solution was achieved?

Kyle0654 commented 2 years ago

It may be a good idea to provide a way to get the images with the metadata stripped out, especially if they're significantly larger because of it.

psychedelicious commented 2 years ago

The client could handle exporting an image without metadata e.g. "Share Image" vs "Share Image with SD Metadata"

So when init images (or masks or anything else that is invented) get embedded, we will need to strip them of their metadata, else when you chain img2img's, you end up with massive metadata. This goes back to considering init images and any other input to the current working image as atoms which come with no context of their own. Hope that makes sense

psychedelicious commented 2 years ago

I appear to have a "reopen" button that works. I have used it. This must have been closed by mistake, @lstein was doing some out of season spring cleaning.

fat-tire commented 2 years ago

I was thinking-- so we're going to embed required images but not, say, embed the actual weights file, right? (of course not)

Since multiple images can share a single weights file, it's reasonable that maybe one init image will be shared between images too. There's a value in including that init file as a "standalone" image, but if you're grabbing 50 images that all share the same init image, you don't want the size of that image repeated 50x.

So maybe the notion of "static" vs "shared" (as in libraries) might be applicable... just to make things simpler (or maybe more confusing).

Maybe to manage such scenarios, have something like;

resources: Array of resource-- can be 0-N of them

A resource would have: resource_id : string : this is the reference "handle" that will be used by the pipeline nodes in referencing this resource. Required and must be unique. Could be a hash of the file, but then we won't need the next bit. resource_hash: string : Assuming this is a sha512 hash of the final binary-form of the resource, they can be tracked to see if it is locally available (as a file or even within other files), and if not, retrieved "on demand". resource_mime: string: A description of the type of resource ("image/jpeg") : optional, default would be maybe just "image/*" resource_url: string: A URL pointing to the resource. Can use any scheme-- ssh://, file://, http://, etc. Optional. resource_content: base64 encoded resource binary. I presume this gets compressed. Very optional.

I'm probably missing something, and as always it's important to consider security implications of throwing a big blob in there.

I figured call this resource instead of image as to not confuse it with the final image(s) that are produced from the pipeline and for expandability as perhaps someday such resources will include more than just plain images.

In fact, the weights file itself can be seen as a resource-- not that you'd shove that 10G file in an image-- but instead of doing this:

model: "stable diffusion" model_id: string identifying the model. must by the model_id field of a Model card. Optional; there is no default value, but consuming applications may infer a value from model_hash if they recognize that value. model_url: a string giving a URL where the model can be downloaded (if public) or read about (if not). Optional, does not have a default. model_hash: hash of the weights [precise format TBD depending on implementation feasibility]; see the "model information" section below

You could just make it a resource :

resource_id : CompVis/stable-diffusion-v1-4" from model card resource_hash : [hash of stable-diffusion-v4 weights file] resource_mime: model/pytorch (or something. I can't find any MIME types for model checkpoints) resource_url: https://huggingface.co/CompVis/stable-diffusion-v1-4 or a file://path/to/model.ckpt (?) resource_content: null [you get it separately. It's not in the image.]

One advantage to this is that for a pipeline with nodes, you might be using several models-- stable-diffusion to generate the image, then ESRGAN or something else to do more processing.

Anyway, this is just typing out loud, so maybe none of this is good... dunno. Maybe we're trying to do too much all at once. But it can't hurt to think a few steps ahead about what may be possible so that it's not THAT hard to redo later.

codedealer commented 2 years ago

So when init images (or masks or anything else that is invented) get embedded, we will need to strip them of their metadata, else when you chain img2img's, you end up with massive metadata. This goes back to considering init images and any other input to the current working image as atoms which come with no context of their own. Hope that makes sense

I can't know what will be invented in the future but at least in regards to img2img I don't expect the size of the generated PNG to be sufficiently large (unless upscaled). An image that is the result of a chain of 100 img2img generations still needs to embed only the 99th image because that is the only one that is needed to regenerate it.

I don't propose to store all of the chain in the init_image field, only the previous one, even if that one itself embeds an init_image it should be stripped out. Only the actual image data of the initial image's PNG is relevant for the regeneration not how said image was produced (it could have been generated with SD or downloaded from a hosting or it's a photo from a phone it shouldn't matter).

psychedelicious commented 2 years ago

Yeah we are suggesting the same thing here. I brought it up in reference to a past conversation somewhere on this repo in wihch this same question was raised i.e. if we embed/store an init image as metadata for a result image, should we store that init image's metadata.

lstein commented 2 years ago

I am going crazy. I cannot see this discussion in GitHub GUI. The only way I can find it is to manually type the full URL. I've also tried to pin it, but it doesn't show up. Does someone understand what's going on here?

invoke-ai / InvokeAI