invoke-ai / InvokeAI

InvokeAI is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, supports terminal use through a CLI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

RFC: Structured metadata #266

Closed bakkot closed 1 year ago

bakkot commented 1 year ago

I have a proposal for a spec for metadata, laying out goals and a formal spec.

I'm happy to implement this if there's buy-in.

Thoughts?


RFC: Structured metadata

Currently when generating images from the CLI (but not the web), metadata for that is stored as a string kind-of corresponding to the prompt. That metadata is enough to reproduce the original image... sometimes.

I'd like to:

To that end, I'd like to propose the following spec for metadata.

In this doc, "hash" means "the first 8 characters of the hex-encoded sha-256".
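As a minimal sketch of that definition (the function name `short_hash` is only illustrative, not something from the repo):

```python
import hashlib


def short_hash(data: bytes) -> str:
    """Return the first 8 characters of the hex-encoded SHA-256 of data."""
    return hashlib.sha256(data).hexdigest()[:8]


print(short_hash(b""))  # e3b0c442
```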

Data location

Metadata is a JSON string following the "top-level data" schema, stored in an uncompressed PNG tEXt or iTXt chunk named "sd-metadata". (This corresponds to what PIL does already when adding text data - it will choose tEXt or iTXt depending on whether it contains non-latin-1 characters. I just figure it's worth writing this down.)
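To make the chunk layout concrete, here is a rough standard-library sketch that writes and reads an "sd-metadata" chunk by hand. It follows the same rule the paragraph above attributes to PIL (tEXt when the text fits in Latin-1, otherwise uncompressed iTXt), but it is not PIL's code, and the helper names (`make_chunk`, `text_chunk`, `read_metadata`, `tiny_png`) are made up for illustration.

```python
import json
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
KEYWORD = "sd-metadata"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def text_chunk(keyword: str, text: str) -> bytes:
    """Use tEXt when the text fits in Latin-1, otherwise uncompressed iTXt."""
    key = keyword.encode("latin-1")
    try:
        return make_chunk(b"tEXt", key + b"\x00" + text.encode("latin-1"))
    except UnicodeEncodeError:
        # iTXt body: keyword, NUL, compression flag 0, compression method 0,
        # empty language tag, NUL, empty translated keyword, NUL, UTF-8 text.
        body = key + b"\x00" + b"\x00\x00" + b"\x00" + b"\x00" + text.encode("utf-8")
        return make_chunk(b"iTXt", body)


def read_metadata(png: bytes):
    """Return the parsed "sd-metadata" JSON, or None if the chunk is absent."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC
        key, _, rest = data.partition(b"\x00")
        if key.decode("latin-1") != KEYWORD:
            continue
        if chunk_type == b"tEXt":
            return json.loads(rest.decode("latin-1"))
        if chunk_type == b"iTXt" and rest[:1] == b"\x00":  # uncompressed only
            # Skip flag + method, then the language tag and translated keyword.
            _, _, rest = rest[2:].partition(b"\x00")
            _, _, value = rest.partition(b"\x00")
            return json.loads(value.decode("utf-8"))
    return None


def tiny_png(metadata: dict) -> bytes:
    """Build a 1x1 grayscale PNG carrying the metadata, for demonstration."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # one filter byte + one pixel
    # ensure_ascii=False keeps non-ASCII characters literal; with the default,
    # json.dumps escapes them and the text would always fit in tEXt.
    text = json.dumps(metadata, ensure_ascii=False)
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr) + text_chunk(KEYWORD, text)
            + make_chunk(b"IDAT", idat) + make_chunk(b"IEND", b""))


print(read_metadata(tiny_png({"prompt": "a horse"})))
```

One thing this makes visible: whether iTXt is ever chosen depends on how the JSON is serialized, since ASCII-escaped JSON always fits in a tEXt chunk.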

Top-level data

The top-level metadata should have the following fields:

and then also one of the following two fields, depending on whether this is a grid:

Image data

Every image has the following fields:

Images of type img2img also have the following fields:

Height/width are not stored since you can infer those from the file.

Thoughts on storing the model information

I am proposing to store a hash of the loaded model, which is a lot faster than reading the file from disk a second time, but the hash won't correspond to the file on disk. Better than nothing, though.

Is it worth also storing a hash of the model config? I don't think so, since you're always going to need the original config for a given model weights file.

psychedelicious commented 1 year ago

Perhaps fork could be a commit reference rather than a repo reference - if a repo changes its implementation, you may not be able to recreate the output.

bakkot commented 1 year ago

I'd love to include the current commit, but unfortunately a lot of users get the source by downloading the zip file from github, which does not include the hash anywhere, as far as I can tell (in particular it doesn't include the .git directory).

I guess it could be an optional field to be included if the code generating the metadata can reasonably figure it out, though.
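Something along these lines is what I have in mind for "reasonably figure it out": ask git, and quietly give up if that fails. This is only a sketch; the function name and the "commit" field name are placeholders, not part of the spec.

```python
import subprocess
from typing import Optional


def current_commit(repo_dir: str = ".") -> Optional[str]:
    """Return the checked-out commit hash, or None when it cannot be determined."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, the directory is missing, or this is not a
        # git checkout (for example, a zip download without a .git directory).
        return None
    return result.stdout.strip() or None


metadata = {"fork": "lstein"}  # other fields omitted for brevity
commit = current_commit()
if commit is not None:
    metadata["commit"] = commit  # hypothetical field name
```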

fat-tire commented 1 year ago

Thought: Could this be coordinated with the URI definition standard request I proposed here? So that attribute names match, etc.? And how might they work together? Would a URI point to the image and a second argument produce this structured metadata?

fat-tire commented 1 year ago

Also, fwiw, the single-file, cross-platform PyQt5-based SD GUI I was contributing to already saves its settings in json, but I was realizing that the json settings actually define the image itself in a way.... Dunno if this is something to build on for a proof-of-concept as it's super easy to add to.


bakkot commented 1 year ago

@fat-tire URIs are inherently somewhat unsuited for structured data, and a full specification for a SD image (when you take into account stuff like variations and grids) is inherently structured. So if you want to do a URI, I think it would be best to have only a single key-value pair, where the value is JSON in the format of the spec proposed here. Then you don't need to try to coordinate two different formats for this specification.

psychedelicious commented 1 year ago

I'd love to include the current commit, but unfortunately a lot of users get the source by downloading the zip file from github, which does not include the hash anywhere, as far as I can tell (in particular it doesn't include the .git directory).

I guess it could be an optional field to be included if the code generating the metadata can reasonably figure it out, though.

Ah, right, didn't think about that. Optional field sounds good - and now that I think about it, if we are including a commit reference, we ought to include a branch reference as well (or does a commit imply a specific branch? I don't know).

fat-tire commented 1 year ago

@fat-tire URIs are inherently somewhat unsuited for structured data, and a full specification for a SD image (when you take into account stuff like variations and grids) is inherently structured. So if you want to do a URI, I think it would be best to have only a single key-value pair, where the value is JSON in the format of the spec proposed here. Then you don't need to try to coordinate two different formats for this specification.

This is good, except it makes the URI super long (is the json value in your keypair further encoded/compressed in some way?) and is effectively a wrapper around the json. I'm wondering if there's an abbreviated but human-readable and easily-edited format that could be used to reference an image(s) resource?

Like imagine an image browser for SD-- if the complete json would show up in View Source, what would go in the URL/Address bar? On a reddit post featuring some cool approach to generating images, you wouldn't attach a json file, but what if you could put a short self-contained link that could be copy/pasted or tweaked by hand by any non-technical person-- what would it ideally look like? Or say I wanted a single text file or google doc full of accumulated copy/pasted reference images, say, for some art project-- what's the smallest, one-line-per-image (for example) way to collect them? What might fit in a small QR code?

These are the types of use cases I'm imagining. I feel like a structured json file, even though it was designed to be interpreted by humans, may be too large and unreadable for non-programmers to easily understand and make changes simply and quickly and that a compressed URI with &param=value and sane defaults is more familiar. Maybe I'm talkin' crazy tho, dunno.

bakkot commented 1 year ago

Eh, they don't get that long unless you have a lot of data, and then, well, it is actually long. The data isn't compressed but there's simply not that much of it.

If you're really worried about length we could make some of the fields optional and specify default values - e.g. variations is assumed to be empty if omitted, etc. If we do that you end up with something like

{"ai-type":"stable diffusion","fork":"lstein","weights-hash":123456,"image":{"type":"txt2img","sampler":"k_euler","prompt":"the whole prompt goes here","seed":1234567890,"steps":50,"cfg_scale":0.7}}

which is pretty much as human-readable as a URL would be. More, arguably. There's not really a lot of overhead from the JSON format itself relative to URI-style k=v&k2=v2 style, just a few extra quotation marks (but fewer characters wasted on %20). And this really is the minimum information you need to unambiguously refer to a specific image.

I think copy-pasting things like the above is at least as easy as copy-pasting a URI, and has the benefit of not confusing people; URIs look like you should be able to point your browser at them, and that's not the case here. Plus URI encoding for spaces - which will come up a lot, because every prompt has spaces in it - is a huge pain for humans to deal with.
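If you want to compare the two forms yourself, here's a quick sketch that serializes the example above both ways. The "image." prefix used to flatten the nested object into query parameters is my own ad hoc convention for the comparison, not part of any proposal.

```python
import json
from urllib.parse import parse_qs, quote, urlencode

record = {
    "ai-type": "stable diffusion",
    "fork": "lstein",
    "weights-hash": 123456,
    "image": {
        "type": "txt2img",
        "sampler": "k_euler",
        "prompt": "the whole prompt goes here",
        "seed": 1234567890,
        "steps": 50,
        "cfg_scale": 0.7,
    },
}

# Compact JSON, with no whitespace between tokens.
as_json = json.dumps(record, separators=(",", ":"))

# A query string has no nesting, so the image fields are flattened by hand.
flat = {k: v for k, v in record.items() if k != "image"}
flat.update({f"image.{k}": v for k, v in record["image"].items()})
as_query = urlencode(flat, quote_via=quote)  # quote() encodes spaces as %20

print(len(as_json), as_json)
print(len(as_query), as_query)
```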

fat-tire commented 1 year ago

What you've said in principle makes sense to me, especially w/regard to %20 throwing people, but yes, sensible fallback defaults for omitted fields should be part of this standard to make it compact and easy for normal use by non-techies. However, as I think more about this, any defaults would need to be standard to a particular scenario-- i.e., they'd have to be "known" for consistency, so that all apps supporting this metadata structure fall back to the same defaults, and in the same way. But how would everyone know that say with the v4 model, "512x512 is the agreed-upon size default, so always assume that"? Is there some rule, like "the default size for any model is always the shape of the training images"? That seems too rigid. Who decides the defaults, and where would "sensible defaults" be published that anyone writing an application would know to find them?

Along the lines of defaults, I wonder if some fields should be deemed "required" vs "optional" or something-- the required ones would contain the bare minimum needed to generate an image-- people can then build out from there with greater specificity.

Also, you mention that height/width could be determined from the image (I assume, as this json is metadata embedded within an image) or when offered alongside it-- but if the json metadata alone is defined as sufficient to produce an image from scratch, I'd think you need to include width/height in the metadata. ie, don't assume anything should be inferred from the image, as an image may not be included or may have been resized or screenshotted or whatever (I acknowledge, I've expanded the use case beyond metadata in the file as originally conceived, but if this is a format also meant for copy/pasting in forums or via other human-means, it needs to be complete)

Also-- aside from the hash of the weights file, don't you want to have something to indicate about which version of a model is to be (or was) used, which at the very least would be helpful in providing feedback to the end user when they don't match or when a model is missing? Example-- SD is about to release v1-5 right? The model card already contains this metadata in the form of the model_id, such as "CompVis/stable-diffusion-v1-5"

Other random unformed ideas/thoughts:

Sorry there's a lot here and apologies if some of these don't quite relate to the lstein fork-- I'm only just getting familiar with the main repo. I don't wanna get too big and unwieldy either, but I think some basics like H&W and model name/version have to be there if it's going to be used as a simple text-based generate-from-scratch "trading card", useful for learning, sharing, research, etc.

And I guess once everything is hammered out, if this turns out to be too heavy a way to do this, someone can always "URI-ize" it, especially once the field names are deemed stable and a good standard.

Anyway, this is pretty exciting stuff! Thanks.

ft

bakkot commented 1 year ago

Along the lines of defaults, I wonder if some fields should be deemed "required" vs "optional" or something-- the required ones would contain the bare minimum needed to generate an image-- people can then build out from there with greater specificity.

Yeah, I was imagining only certain fields could be omitted. As to how people would know how to interpret missing fields, we'd write it down. So for example you could omit grid, and that would be defined to mean false.
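Roughly, a reader would fill in the written-down defaults before using the data. A sketch, assuming just the two defaults mentioned so far in this thread (grid is false, variations is empty); where exactly "variations" lives and the helper name are assumptions for illustration:

```python
import copy

# Assumed defaults: only these two have been mentioned in the thread so far.
TOP_LEVEL_DEFAULTS = {"grid": False}
IMAGE_DEFAULTS = {"variations": []}


def with_defaults(metadata: dict) -> dict:
    """Return a copy of metadata with omitted optional fields filled in."""
    full = {**TOP_LEVEL_DEFAULTS, **copy.deepcopy(metadata)}
    if "image" in full:
        # Explicit values in the metadata always win over the defaults.
        full["image"] = {**copy.deepcopy(IMAGE_DEFAULTS), **full["image"]}
    return full


print(with_defaults({"image": {"prompt": "the whole prompt goes here"}}))
```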

Also, you mention that height/width could be determined from the image (I assume, as this json is metadata embedded within an image) or when offered alongside it-- but if the json metadata alone is defined as sufficient to produce an image from scratch, I'd think you need to include width/height in the metadata

Yeah that's a good point. I'll update the OP to add those as fields which are optional only when the metadata is embedded in an image file from which it is possible to derive those values.

Also-- aside from the hash of the weights file, don't you want to have something to indicate about which version of a model is to be (or was) used, which at the very least would be helpful in providing feedback to the end user when they don't match or when a model is missing?

The hash of the weights file is sufficient to uniquely determine the model.

For the sake of giving users helpful feedback, rather than just "the weights file you have doesn't match", it would be kind of nice to also have a version to report, but we don't necessarily know that - for example, the installation instructions for this repo suggest putting the weights at models/ldm/stable-diffusion-v1/model.ckpt. There's nothing there to indicate what version of the model is in use, as far as I'm aware.

That said, I do think that having a list of known hashes would be helpful. That wouldn't be part of the metadata spec per se, though it might live along side it, and tools could hardcode that list to give more useful feedback when they see a hash they know.
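As a sketch of how a tool might use such a list: the single entry below is a made-up placeholder (neither the hash nor the label is a real value), and the function names are illustrative.

```python
from typing import Optional

# Hypothetical entry: not a real hash or a real model name.
KNOWN_WEIGHTS = {
    "0123abcd": "example-model v1 (placeholder)",
}


def describe_weights(weights_hash: str) -> str:
    """Describe a weights hash, using a friendly name when the hash is known."""
    name = KNOWN_WEIGHTS.get(weights_hash)
    if name is None:
        return f"an unrecognized weights file (hash {weights_hash})"
    return f"{name} (hash {weights_hash})"


def mismatch_message(expected: str, loaded: str) -> Optional[str]:
    """Return a message when the loaded weights differ from the expected ones."""
    if expected == loaded:
        return None
    return (f"This image was generated with {describe_weights(expected)}, "
            f"but the loaded weights are {describe_weights(loaded)}.")


print(mismatch_message("0123abcd", "89abcdef"))
```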

"fork" -- maybe "variant"?

I'm fine with either name. I don't just want to use the model card ID because the point of this field is to distinguish this repo from others. Including the commit hash or branch name would be nice, but as discussed in the comments above it's often not possible.

When grid=True, is it the grid image that has this metadata?

Yes.

Do the individual output images contain the tEXt chunks as well?

At least in this repo, there are not any individual output images. There's just the grid.

Does the grid image contain references to the names of the child images in /samples, and do the child images know about which grid image they belong to?

Per above answer, there are no such images. But if there were, the answer to both of these would be no.

Timestamp, contact_info, image_name, Description, Citation, Licensing, NSFW

For all of this stuff, I don't think it belongs in this spec - this is a specification specifically for the metadata for images to tell you the settings used to generate the image. If you want to include other information alongside the image, put it somewhere else. E.g., wrap the data from this spec: so { contact_info: 'whatever', name: 'whatever', metadata: { [this spec] } }.

That way we don't have to keep a registry of additional optional fields, which in my experience never works, and all of the fields in this spec can be automatically derived, which is important.
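For instance, a tool that wants the extra fields could wrap and unwrap like this. It's a sketch only: the wrapper keys are whatever that tool chooses, and `wrap`/`unwrap` are illustrative names.

```python
import json


def wrap(generation_metadata: dict, **extra) -> str:
    """Put this spec's metadata under "metadata", next to tool-specific fields."""
    if "metadata" in extra:
        raise ValueError('"metadata" is reserved for the generation metadata')
    return json.dumps({**extra, "metadata": generation_metadata})


def unwrap(wrapped: str) -> dict:
    """Recover the generation metadata, ignoring the surrounding fields."""
    return json.loads(wrapped)["metadata"]


wrapped = wrap({"fork": "lstein"}, contact_info="whatever", name="whatever")
print(unwrap(wrapped))  # {'fork': 'lstein'}
```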

fat-tire commented 1 year ago

Yeah, I was imagining only certain field could be omitted. As to how people would know how to interpret missing fields, we'd write it down. So for examples you could omit grid, and that would be defined to mean false.

That makes sense, so long as everyone who is implementing this standard can agree on what the fallback defaults are.

Also-- aside from the hash of the weights file, don't you want to have something to indicate about which version of a model is to be (or was) used, which at the very least would be helpful in providing feedback to the end user when they don't match or when a model is missing?

The hash of the weights file is sufficient to uniquely determine the model.

Yeah, but it's only practical for confirming that the model you have is valid, which can be determined via other methods (usually whoever provides the model in the first place will offer a hash). Wouldn't it make more sense to name the model, version, and creator (the model_id from the card)? I mean, that's what it's for, it would make it easy to find the model or even the latest version of the model, and you wouldn't have to maintain a giant table of hashes and their names, versions, and sources.

That said, I do think that having a list of known hashes would be helpful. That wouldn't be part of the metadata spec per se, though it might live along side it, and tools could hardcode that list to give more useful feedback when they see a hash they know.

To me, that seems backwards-- the metadata should point to the model used, not the hash of the model- then you don't need any table that has to be maintained and updated (see below re generic model.ckpt). Validating the file's integrity seems outside the scope of what you would expect-- a regular image file doesn't offer the hash of the binary of photoshop that created it-- though it may have a string that indicates the tool and version used..

At the very least I'd expect a model_id should be included in addition to the model hash (for verifying integrity, I guess). I believe that's what the model_id is specifically intended for-- as a unique id meant to identify the model's name, origin, and version.

For the sake of giving users helpful feedback, rather than just "the weights file you have doesn't match", it would be kind of nice to also have a version to report, but we don't necessarily know that - for example, the installation instructions for this repo suggest putting the weights at models/ldm/stable-diffusion-v1/model.ckpt. There's nothing there to indicate what version of the model is in use, as far as I'm aware.

That's why I'm suggesting using model_id to indicate the version of the model. It seems to me to be a responsibility for the sd application implementing this spec to know what model it's using (by name), not just offer a hash of whatever's there and say "good luck figuring out what this actually is!". Sure, maybe you can use the hash as a checksum to verify you have the right model once you know which model to use, when you're about to recreate the image.

To me, only providing the model hash is a bit like saying "to bake this cake, go to the store and buy the one ingredient they have in a red box that weighs 12 oz and costs exactly $13.54". Okay, I guess if I had a list of the store's inventory with associated prices I could find it, but it would have to be an always maintained, up-to-date list. Why not just tell me the exact ingredient I need, so I can ask for it directly? (Yes, as a safety, I can verify the product with the price and weight, but that's not how I want to look for it.) And going back to the error message for the user, it's a lot clearer to say "To bake this cake you need a 12oz bag of Whitman's Quality Flour." vs "Sorry, you are missing a product in a 12oz red box that costs $13.54." and cross-check that with a hopefully current list of all possible ingredients.

"fork" -- maybe "variant"?

I'm fine with either name. I don't just want to use the model card ID because the point of this field is to distinguish this repo from others. Including the commit hash or branch name would be nice, but as discussed in the comments above it's often not possible.

But wouldn't a variant called lstein/stable-diffusion do exactly that? That is, it distinguishes this repo from fat-tire/stable-diffusion or others-- it tells you where to get it, and what it is-- or maybe I'm missing something?

Do the individual output images contain the tEXt chunks as well?

At least in this repo, there are not any individual output images. There's just the grid.

I've not used lstein, but upstream when you create a grid with say, 4 images, you also specify the number of rows (which has a default) and you get 5 images back-- a "grid" image containing the four images arranged in, well, a grid, and the 4 individual images in the /samples folder contained within the outputs folder. So I think of the grid as more of a "preview" image, and if you want any image independently you can grab it from the /samples. It would be nice at one point to have the grid do smaller versions of the originals-- right now they are full-sized.

Does the grid image contain references to the names of the child images in /samples, and do the child images know about which grid image they belong to?

Per above answer, there are no such images. But if there were, the answer to both of these would be no.

Okay- sounds like someone took out the child images in the lstein fork... fwiw, if it's to be compatible with the upstream repository- might want the child images to look like any other generated individual image. Having an --n_iter of 1 and grid of true, I believe, will give you BOTH a grid of one image and a child image in /samples, but don't hold me to that.

Timestamp, contact_info, image_name, Description, Citation, Licensing, NSFW

For all of this stuff, I don't think it belongs in this spec - this is a specification specifically for the metadata for images to tell you the settings used to generate the image. If you want to include other information alongside the image, put it somewhere else. E.g., wrap the data from this spec: so { contact_info: 'whatever', name: 'whatever', metadata: { [this spec] } }.

That way we don't have to keep a registry of additional optional fields, which in my experience never works, and all of the fields in this spec can be automatically derived, which is important.

Two more thoughts then, for possible expansion at a later time:

spec_version : int or string -- a way to identify version 1.0 of this spec to a version 10 years from now, for future backward compatibility
future : object -- a place to attach more data that might become important later or to be used for "whatever" someone wants to stick in there.

bakkot commented 1 year ago

Wouldn't it make more sense to name the model, version, and creator (the model_id from the card)?

That information simply isn't available for most users, is the problem. Users just download a weights file and use it. So the hash of the weights file is literally the only information we can use here.

But wouldn't a variant called lstein/stable-diffusion do exactly that?

Yes, like I said I'm fine with calling this field "variant" instead of "fork". If you are asking for some other difference from what I've proposed, I don't know what it is you're asking for.

(We can't use the commit because that information often isn't available; I'd include it if I could, and I think I probably will add an optional field to store that.)

spec_version

Interesting thought. I'm fine with it because I'm not worried about extra space, but if we're trying to minimize fields it's not strictly necessary - we could just say that any new versions will add a new field. (I'd call it metadata_version, though.)

future: object a place to attach more data that might become important later or to be used for "whatever" someone wants to stick in there

I actually just modified the spec to include an extra field! But it's explicitly not supposed to be used for arbitrary other data, only things necessary to reproduce the image; I do feel that data not related to that goal doesn't belong in this metadata and should live somewhere else.

fat-tire commented 1 year ago

Wouldn't it make more sense to name the model, version, and creator (the model_id from the card)?

That information simply isn't available for most users, is the problem. Users just download a weights file and use it. So the hash of the weights file is literally the only information we can use here.

For that use case, a hash makes sense if you somehow need to reference an otherwise unknown model file. But looking forward-- there is a near-universal, industry-standard way of uniquely identifying a specific model, adopted across all disciplines of ui-- the model_id. To not include this string, even as an optional field, would be a major omission, IMO.

Any retrained model stemming from a single architecture would have its own name in most cases. If for some reason it doesn't, then fall back to the hash.

Also, when you say "users just download a weights file and use it"... how so? Presently, people are sophisticated enough to download stable-diffusion or a GAN or whatever but somehow have no idea what model they are using or where it originated? I get that right now sd has a single place to put a model named model.ckpt or whatever, but that model got put there by someone who had to understand where it came from and what the license was, etc. Using unknown binaries (even as a model) isn't a very good idea, generally speaking.

But wouldn't a variant called lstein/stable-diffusion do exactly that?

Yes, like I said I'm fine with calling this field "variant" instead of "fork". If you are asking for some other difference from what I've proposed, I don't know what it is you're asking for.

I'd suggest calling it model_id to be consistent with the customary nomenclature and to know you are referring to the model and not the codebase. And define it to correspond to a model_id from a model card. Make it optional, in case this info is unknown, in which case the hash would be a secondary way to try to identify it and a primary way to validate that the hash is the expected one.

For a specific codebase, maybe use variant or something more helpful like a pointer to where the application or its source can be found

This schema needs to account for both the codebase and the model, both of which have (a) a name, (b), a source/author, (c) a version/tag

Additionally, both can have hashes/signatures and I guess you want to track it for the model as you're accounting for scenarios in which a model's (a), (b), and (c) are all unknown but you still want to build the image from "scratch"

how about:

model_id -- string -- from the model card (generally contains a/b/c)
model_hash -- to satisfy your use case of the unknown weights file

app_id (formerly app_variant) -- string -- this can be a description "lstein 1.2" or whatever human readable identifier for the code that you need to use the model
app_uri -- the remote repository used to build this. Could contain a URL which would include a/b/c, including the branch name. The same type of URL used, say, with git clone-- could be git:// https:// cvs:// svn:// file:// etc. If it's a closed-source program, this can link to a .zip file, an .exe, a .deb, an installer, a home page, a finger://, a deep link for mobile apps, or a web front end for a cloud service at huggingface, etc., or a wasm application.

(We can't use the commit because that information often isn't available; I'd include it if I could, and I think I probably will add an optional field to store that.)

Well a commit is only a single change to the repository, and a single commit might be on multiple branches anyway. But a tagged branch is usually not meant to change-- adding branch or tag as an optional field might make sense.

spec_version

Interesting thought. I'm fine with it because I'm not worried about extra space, but if we're trying to minimize fields it's not strictly necessary - we could just say that any new versions will add a new field. (I'd call it metadata_version, though.)

Yeah I didn't mean literally spec_version as that's super ambiguous.

future: object a place to attach more data that might become important later or to be used for "whatever" someone wants to stick in there

I actually just modified the spec to include an extra field! But it's explicitly not supposed to be used for arbitrary other data, only things necessary to reproduce the image; I do feel that data not related to that goal doesn't belong in this metadata and should live somewhere else.

Sounds good-- you're underlining my primary goal-- to have something short and sweet to paste in something like reddit or wherever.

Imagine the stringified json object had a name, and for lack of a better term I'll just call it ailink for now.

So imagine a post that says "See this cool picture? if you paste this ailink in your GUI too you'll get the image I made"..

It would parse the data, and with a tap of a button start popping out the image(s) in as reproducible a way as possible.

Then maybe you just made the image but think it could be better. So you make a tweak to the prompt and get a great result. You'd hit the "Copy ailink" button and then paste THAT into your reddit thread. Or maybe just hit the copy to QRCode and paste that along with your beautiful image so people can see how you did it.

Similarly, someone sends you a cool image and you want to know what exactly went into making it, you'd load in that image, hit the Get info shortcut key and see the same formatted, editable ailink info, exactly as it appeared in the above example. You make some more tweaks, copy the ailink (again, this probably needs a way better name), paste it in a post, and everyone can see your improvements and how you did it.

But as time passes, it may be the case that you grab someone's ailink and paste it and it goes-- "oh dang, for this you need a specific upgraded model-- here's where you can get it." Or it may say "this image can only be built with version 2.1 of this app-- here's where to get that".
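In code, I'm picturing something like the sketch below. It assumes the model_id and model_hash fields suggested above; the function name and the hash values are made up, and the model_id is just the example string from earlier in the thread.

```python
from typing import Optional


def model_problem(metadata: dict, loaded_model_hash: str) -> Optional[str]:
    """Return a user-facing message if the loaded model is not the one needed."""
    expected = metadata["model_hash"]
    if expected == loaded_model_hash:
        return None  # the hash doubles as a check that the right model is loaded
    model_id = metadata.get("model_id")  # optional, so it may be absent
    if model_id is not None:
        return (f"This image needs the model {model_id} (hash {expected}); "
                f"the loaded model has hash {loaded_model_hash}.")
    return (f"This image needs a model with hash {expected}; "
            f"the loaded model has hash {loaded_model_hash}.")


ailink = {"model_id": "CompVis/stable-diffusion-v1-5", "model_hash": "0123abcd"}
print(model_problem(ailink, "89abcdef"))
```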

Again, hope this is all making sense.

bakkot commented 1 year ago

For that use case, a hash makes sense if you somehow need to reference an otherwise unknown model file. But looking forward-- there is a near-universal, industry-standard way of uniquely identifying a specific model, adopted across all disciplines of ui-- the model_id. To not include this string, even as an optional field, would be a major omission, IMO.

Sure, I'm happy to have this as an additional, optional field, though no existing repository will be able to use it because no existing repository has the model card. (I guess I could, and probably should, hardcode specific hashes and their corresponding model cards, though.)

Also, when you say "users just download a weights file and use it"... how so?

This is an empirical description of the way people are currently using stable diffusion. They download the weights and use that. They are not currently in the habit of additionally downloading a model card, nor would I want to complicate the setup instructions by requiring that they do so.

It's not that people couldn't download this information alongside the weights, it's that right now they don't, and I really don't want to add additional setup burden. Were it me I'd've embedded this information in the checkpoint file, but as far as I'm aware this is not currently done.

model_id -- string -- from the model card (generally contains a/b/c)
model_hash -- to satisfy your use case of the unknown weights file
app_id (formerly app_variant) -- string -- this can be a description "lstein 1.2" or whatever human readable identifier for the code that you need to use the model
app_uri -- the remote repository used to build this. Could contain a URL which would include a/b/c, including the branch name. The same type of URL used, say, with git clone-- could be git:// https:// cvs:// svn:// file:// etc. If it's a closed-source program, this can link to a .zip file, an .exe, a .deb, an installer, a home page, a finger://, a deep link for mobile apps, or a web front end for a cloud service at huggingface, etc., or a wasm application.

That mostly sounds reasonable - with the caveat that model_id, but not model_hash, would be optional - but I don't think having "app id" and also "app uri" is necessary; I think just saying "this is a string which identifies the app used" is sufficient, and I would leave it up to the app to decide how to do that. For repositories on GitHub, as almost all projects are, I think username/repo_name would be sufficient - that's what I'd use here - but it would be up to each codebase to decide the appropriate way to identify itself. And just one field is enough for that.

Well a commit is only a single change to the repository, and a single commit might be on multiple branches anyway. But a tagged branch is usually not meant to change-- adding branch or tag as an optional field might make sense.

A commit hash unambiguously refers to a single state of the code, which is the important part. Neither branches nor tags have that property, so I don't much care about them. But sure, I am fine adding repo_tag as additional, optional field, alongside repo_commit (or whatever).
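To summarize where that leaves the identifying fields, here's a sketch as a Python dataclass. The field names follow this discussion (with repo_commit and repo_tag being the "or whatever" names), the class itself is hypothetical, and the hash value is a placeholder.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Provenance:
    model_hash: str                    # required: always derivable from the weights
    app_id: str                        # required: e.g. "username/repo_name"
    model_id: Optional[str] = None     # optional: only if the model card ID is known
    repo_commit: Optional[str] = None  # optional: only if the commit can be determined
    repo_tag: Optional[str] = None     # optional

    def to_fields(self) -> dict:
        """Return the fields to serialize, leaving out unset optional ones."""
        return {k: v for k, v in asdict(self).items() if v is not None}


print(Provenance(model_hash="0123abcd", app_id="lstein/stable-diffusion").to_fields())
```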

Again, hope this is all making sense.

Yup, I think we're mostly on the same page.

I'll update the OP later today.

bakkot commented 1 year ago

OK, updated the spec in the OP. @fat-tire want to take another look?

fat-tire commented 1 year ago

Sure, I'm happy to have this as an additional, optional field, though no existing repository will be able to use it because no existing repository has the model card. (I guess I could, and probably should, hardcode specific hashes and their corresponding model cards, though.)

Sorry I'm a little confused-- I meant that the model card is just for humans to read to associate the model with an id-- so either it's bundled with an sd application, or the user has explicitly installed it, or the app has noted a specific model as a named requirement for recreating an image from its metadata.

Also, when you say "users just download a weights file and use it"... how so?

This is an empirical description of the way people are currently using stable diffusion. They download the weights and use that. They are not currently in the habit of additionally downloading a model card, nor would I want to complicate the setup instructions by requiring that they do so.

Oh of course not. They wouldn't need the actual card- but like you said, if there's a mystery .ckpt there, they could verify they have the "right one" for a-- (sigh) ailink by its hash as you suggest. But if they don't have it, and need to get it, that's where the id/uri come in.

It's not that people couldn't download this information alongside the weights, it's that right now they don't, and I really don't want to add additional setup burden. Were it me I'd've embedded this information in the checkpoint file, but as far as I'm aware this is not currently done.

Oh I never meant to suggest they would have to...

That mostly sounds reasonable - with the caveat that model_id, but not model_hash, would be optional - but I don't think having "app id" and also "app uri" is necessary; I think just saying "this is a string which identifies the app used" is sufficient, and I would leave it up to the app to decide how to do that.

But an app would have to know that it's incapable of supporting a particular model. I would suggest that model_hash would be optional if model_id is provided- but I guess it can't hurt to have a hash of the model for verification that it's the right one.

For repositories on GitHub, as almost all projects are, I think username/repo_name would be sufficient - that's what I'd use here - but it would be up to each codebase to decide the appropriate way to identify itself. And just one field is enough for that.

You mean for model_id right? The URL though would be needed to distinguish between "bob/mymodel" on github and "bob/mymodel" on gitlab or "bob/mymodel, tag release-2" on bitbucket or huggingface, or wherever else... I don't see how a URI wouldn't be indispensable. It even directs you to the specific tag and would be a virtual requirement for someone building from scratch who doesn't know the model or the ai community to know where to find the specific model required. (A pointer to a commit hash would be good as well.)

Again, hope this is all making sense.

Yup, I think we're mostly on the same page.

I'll update the OP later today.

Cool thanks-- yeah it seems we're going in the same direction. Hopefully others will chime in as there may be use cases or scenarios neither of us have contemplated.

Update-- took a look at the spec now-- looks great! My only holdouts are about the URI and possible confusion between github/gitlab/gitea/huggingface/etc. A URI pointing to a hash would clear up any ambiguity and offer clear direction for a user or automated process searching for the correct model or app w/o having to download and then check hashes.

Also to throw a wrench in this-- we're assuming cross platform support for the apps. I think we SHOULD care that we have the correct app, say some experimental new feature is supported here, but what if this won't work on my platform? What should happen in this case? Or maybe there's a Mac version of this windows program that WOULD work... what then? Maybe you're SOL- same as if you don't have enough memory or the right graphics card or whatever... or no?

again, nice work. Don't hate me-- I'm just playing devil's advocate here :)

bakkot commented 1 year ago

Sorry I'm a little confused-- I meant that the model card is just for humans to read to associate the model with an id-- so either it's bundled with an sd application, or the user has explicitly installed it, or the app has noted a specific model as a named requirement for recreating an image from its metadata.

I'm thinking about how an application like this one would populate the metadata, not how a user would consume it. Right now, there is no reasonable way for an application like this to populate model_id, in general, without adding an additional step to the installation instructions which requires the person using the application to input that ID. And I don't want to add an additional step to the installation instructions. So we can't require model_id in the metadata.

I'm fine with having model_id as an optional field, and I've added it to the current draft. I just don't think it will get much use, because I don't see how applications could possibly populate it except for a few known model weights unless they ask users of the application to provide that information as an additional step.

I would suggest that model_hash would be optional if model_id is provided- but I guess it can't hurt to have a hash of the model for verification that it's the right one.

Making model_hash optional is only sensible if we can actually trust users' manually input model_id, and we definitely can't - I absolutely guarantee users will copy-paste wrong or forget to update when switching out weights. The hash is something the application derives for itself, so it's trustworthy. It can't be optional.
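To illustrate what "derives for itself" could look like, here is a minimal sketch using the definition from the OP (the first 8 characters of the hex-encoded sha-256). The helper name `short_hash` and the choice to hash a file on disk in chunks are assumptions for illustration only, not what any particular codebase does.

```python
import hashlib
from pathlib import Path


def short_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the first 8 hex characters of the file's SHA-256 digest."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # Read in chunks so a multi-gigabyte weights file need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:8]
```

Because the value comes from the bytes of the weights rather than from anything the user typed, it stays correct when someone swaps out the checkpoint and forgets to update a hand-entered ID.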

You mean for model_id right?

No, I mean the app_id. I agree that in theory "username/repo_name" could be ambiguous, but having a convention that "username/repo_name" means specifically that username/repo on GitHub is fine. This is a convention other specifications use without issue. Requiring that you prefix the common case with https://github.com/ doesn't add any benefit except making the ID larger and harder for humans to distinguish at a glance, given such a convention.

I guess I am OK with having an extra app_url field, but it just seems like needless overhead to me.

Of course if your project is hosted somewhere other than GitHub you can put a full URL in the app_id. But many projects won't have a URL to use; many people have private forks they're tinkering with, and it's still useful to uniquely identify those. So I don't want to require a URL.
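A rough sketch of how a consumer might interpret app_id under that convention. The function name `app_id_to_url` and the exact rules (anything containing `://` is a URL, a bare two-part `a/b` is a GitHub repository, everything else has no URL) are assumptions made for this example, not part of the draft.

```python
from typing import Optional


def app_id_to_url(app_id: str) -> Optional[str]:
    """Best-effort mapping from an app_id to a URL, or None if there is none."""
    if "://" in app_id:
        # Projects hosted elsewhere may put a full URL in app_id.
        return app_id
    parts = app_id.split("/")
    if len(parts) == 2 and all(parts):
        # Convention: a bare "username/repo_name" means that repo on GitHub.
        return f"https://github.com/{app_id}"
    # Private forks and other free-form identifiers have no URL to offer.
    return None
```

Returning `None` for free-form identifiers keeps private forks usable: the ID still distinguishes them even though there is nothing to link to.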

Also to throw a wrench in this-- we're assuming cross platform support for the apps. I think we SHOULD care that we have the correct app, say some experimental new feature is supported here, but what if this won't work on my platform? What should happen in this case? Or maybe there's a Mac version of this windows program that WOULD work... what then? Maybe you're SOL- same as if you don't have enough memory or the right graphics card or whatever... or no?

I don't think we're assuming cross platform support, really? We're just saying "here's how this was generated". Nothing is stopping you from inputting the same settings into a different application; you're just not guaranteed to get the same output.

again, nice work. Don't hate me-- I'm just playing devil's advocate here :)

Not to worry: I work on a standards committee; I am extremely used to working with this kind of feedback. And it's helpful to getting the best version of the spec. Doing it before finalizing the spec is the best time for that!

fat-tire commented 1 year ago

I'm thinking about how an application like this one would populate the metadata, not how a user would consume it. Right now, there is no reasonable way for an application like this to populate model_id, in general, without adding an additional step to the installation instructions which requires the person using the application to input that ID. And I don't want to add an additional step to the installation instructions. So we can't require model_id in the metadata.

Well, for me this started when I was contributing to this qt-based GUI repo, and when looking at the settings file I was like- wait everything currently in the settings would be all you needed to define the image. It was already being saved as json and I was like-- this- this in some way IS the image and the user could effectively swap out settings files as different images and trade them...

I thought a URL-encoded version might be simpler, but regardless, that's where I started.

But if the-- and I'm going to stop calling it an ailink right now, but if this whatever-text is universal enough between programs, it has to be able to direct a user that loads the json into an app how to get anything that's missing. This appears to be the opposite of your scenario, where you're more concerned with how the application would create the metadata. I want to know how it can be consumed, and then how any missing components like the model (let alone a post-processing step or whatever) can most clearly be addressed.

I'm fine with having model_id as an optional field, and I've added it to the current draft. I just don't think it will get much use, because I don't see how applications could possibly populate it except for a few known model weights unless they ask users of the application to provide that information as an additional step.

The first thing I plan to do once this is settled as a working standard is to implement it in that GUI, so I'll be using the model_id probably always, in concert with the hash. It can be populated by hand or as you suggest from a lookup table, or derived from checking the URI, etc.

I would suggest that model_hash would be optional if model_id is provided- but I guess it can't hurt to have a hash of the model for verification that it's the right one.

Making model_hash optional is only sensible if we can actually trust users' manually input model_id, and we definitely can't - I absolutely guarantee users will copy-paste wrong or forget to update when switching out weights. The hash is something the application derives for itself, so it's trustworthy. It can't be optional.

Okay.

You mean for model_id right?

No, I mean the app_id. I agree that in theory "username/repo_name" could be ambiguous, but having a convention that "username/repo_name" means specifically that username/repo on GitHub is fine. This is a convention other specifications use without issue. Requiring that you prefix the common case with https://github.com/ doesn't add any benefit except making the ID larger and harder for humans to distinguish at a glance, given such a convention.

As long as there is a URI to clarify the actual source, to make it easy to find or download, that's fine.

I guess I am OK with having an extra app_url field, but it just seems like needless overhead to me.

Of course if your project is hosted somewhere other than GitHub you can put a full URL in the app_id. But many projects won't have a URL to use; many people have private forks they're tinkering with, and it's still useful to uniquely identify those. So I don't want to require a URL.

Okay, don't require it, but I do think it will prove to be extremely useful, especially as models splinter and are retrained, etc.

I don't think we're assuming cross platform support, really? We're just saying "here's how this was generated". Nothing is stopping you from inputting the same settings into a different application; you're just not guaranteed to get the same output.

That's fair enough.

Not to worry: I work on a standards committee; I am extremely used to working with this kind of feedback. And it's helpful to getting the best version of the spec. Doing it before finalizing the spec is the best time for that!

I agree!

lstein commented 1 year ago

I turn my back for a few days and this thread has grown to 18 comments!

Is the current RFC still the first posting, or is it an external file somewhere? It might be good to put it into the repository so that we can track version changes. Or even a Google doc.

tildebyte commented 1 year ago

Or even a Google doc.

Agreed, given the (seeming; I don't have time to read it all 😁) length and breadth of the discussion in here.

bakkot commented 1 year ago

I've been editing the original message; it's current with what I am proposing. You can look at the revision history if you want but it's not very exciting.

I might need to tweak it a couple more times - once to accommodate a richer variations format, for interpolations, and once to add a field for seamless, plus to note that individual applications may add additional fields if there is required information to generate an image. But other than those changes I'm happy with it, and I'll probably start implementing it (starting with a refactoring of the existing metadata code, probably in its own PR) later this weekend.

When I submit the PR implementing this format for metadata I'll include the spec as a markdown file and link to it from the readme. Don't want to formalize it before then because very few specs survive contact with first implementation, in my experience; I want to make sure it's at least possible to implement before I declare it good.

tildebyte commented 1 year ago

@bakkot; I'm sure that you know this, but if we also have a JSON schema, it makes everyone's life much easier (if nothing else, the schema IS the spec)

bakkot commented 1 year ago

I'll be sure to provide a JSON schema also.

tildebyte commented 1 year ago

@fat-tire; I hope I don't come across as rude, but "See this cool picture? if you paste this ailink in your GUI too you'll get the image I made" is pixie dust.

This fork is several hundred commits ahead of upstream (which btw for all intents and purposes is dead: in the last 2-ish weeks, it's had - README updates - fixes to deps - safety tweaks (ew) - a license change) - there are two other forks which are "leading" forks. This one is No. 2 in terms of stars and forks, but has twice the commits of No. 1.

I say all that to say: no WAY is someone going to be able to take unedited generation parameters straight from here, and run them through one of the other forks and get anything like the same image (this repo itself has issues with reproducibility on macOS!).

I don't mean to say that reproducibility isn't a goal, or should be ignored; more that it doesn't really exist now (across forks), and I don't believe that anyone is coordinating anything like it (across forks)

fat-tire commented 1 year ago

I guess I am OK with having an extra app_url field, but it just seems like needless overhead to me.

@bakkot you're still going to add a model_uri or model_url, as an optional field right? I see it as critical. Once the spec (or json schema) is good to go I plan to implement it in that quick qt-based GUI, and use the field to direct the user to download the model if model.ckpt is not there or does not have the correct hash. Ideally, the app would auto-download & install it, but it needs to know where to find it. Relying on an up-to-date hash-lookup I think is not practical. If the model_url is missing (since it's optional) then it can try a table of hashes either in the app or some central place or just say "here's the hash of the model that's missing, good luck finding it!"
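The fallback order described here could look roughly like the sketch below. Everything in it is hypothetical: the function name `locate_model`, the `known_models` lookup table, and the returned messages are stand-ins, and a real app would download or prompt rather than return a string.

```python
from typing import Mapping, Optional


def locate_model(
    metadata: Mapping[str, str],
    local_hash: Optional[str],
    known_models: Mapping[str, str],
) -> str:
    """Describe what to do about the model named in an image's metadata.

    local_hash is the hash of the installed model.ckpt, or None if absent.
    known_models maps model_hash -> download URL (a hypothetical lookup table).
    """
    wanted = metadata["model_hash"]
    if local_hash == wanted:
        return "use local model"
    # Prefer the optional model_url from the metadata, then the lookup table.
    url = metadata.get("model_url") or known_models.get(wanted)
    if url:
        return f"download from {url}"
    return f"model with hash {wanted} is missing and no source is known"
```

The hash is still what decides whether the local file is the right one; the URL only tells the app where to look when it is not.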

I say all that to say: no WAY is someone going to be able to take unedited generation parameters straight from here, and run them through one of the other forks and get anything like the same image (this repo itself has issues with reproducibility on macOS!).

@tildebyte Of course no one is expecting you to be able to do that-- that is the very reason for providing the optional app_id field and app_version -- to tell you specifically how you CAN reproduce the image-- or at least what was used to create the image you're looking at. If you load it into the wrong program it says "No, that won't work. you need version 2.1 of XYZ app to create the image". I was hoping for an app_url too to save you from googling and so you'd know specifically which fork/commit/web site to get it from.

Incidentally, was trying to come up with a good name to replace 'ailink':

Meh. I'll keep thinking.

bakkot commented 1 year ago

I say all that to say: no WAY is someone going to be able to take unedited generation parameters straight from here, and run them through one of the other forks and get anything like the same image

I actually think it's feasible, believe it or not! This fork does have tons of features, but most of them are optional, and we've been pretty good about not making "breaking changes" in the sense of "changing the output for a previously-working prompt"; if you aren't doing anything "fancy" like upscaling or variations, you probably actually can get images to reproduce on other forks. Indeed part of the point of this spec was the hope that other forks would implement the same format and support loading metadata from other applications, when the metadata indicated the image was generated from the subset of features which they support.

bakkot commented 1 year ago

@fat-tire

you're still going to add a model_uri or model_url, as an optional field right?

Ah, right. I've just added model_url and app_url as optional fields.

magnusviri commented 1 year ago

I worked on the Mac seed problem and in the process I found this. Unless PyTorch promises reproducibility, there's no way anyone else can. The story of torch on the MPS (Metal Performance Shaders, Apple's GPU acceleration) is that it can't reproduce randomness, either because of bugs or because it just hasn't been coded yet. So we have the Mac switch to the CPU for randomness because it can reproduce results. But that is slower, so there's no way you'd want to do that on CUDA. So until torch and MPS can reproduce randomness, there's no way to have a Mac produce the same thing as CUDA. I am pretty sure that ARM, Intel, and AMD CPUs will produce different results.

Until PyTorch and the CPU/GPU vendors come up with a way to reproduce results, I think we're just going to have to consider this "ai art" rather than "ai science", because it's something that can't be easily or exactly reproduced. I'm not trying to blunt your efforts, I think what you're doing is great. I just don't think you should get ahead of the ones who actually are in control of this.

tildebyte commented 1 year ago

@bakkot;

Indeed part of the point of this spec was the hope that other forks would implement the same format and support loading metadata from other applications

I absolutely agree, and I think someone (NOT IT) should try to reach out to basujindal & hlky to coordinate making this a cross-fork effort.

@bakkot, @fat-tire ;

This: "Unless PyTorch promises reproducibility, there's no way anyone else can". It's already a miracle that Windows users with NVIDIA H/W (and I presume Linux users as well) can repro previous generations...

fat-tire commented 1 year ago

:shrug: Even without complete reproducibility across or in all platforms (for now), it seems to me that the spec reflecting as much info as possible for regenerating the image (or something conceptually similar to the image) seems like the right thing to include and would certainly still be useful right now for quickly looking up "how did they make that" and "how can I make something like that?" I can anticipate a future where this implementation issue is solved (NOT IT either), so having planned for it makes sense to me.

@bakkot -- you have model_hash in there twice :)

bakkot commented 1 year ago

It's a fair point that I need to capture at least some information about the GPU in the metadata at the moment, though. At least CUDA vs MPS.

you have model_hash in there twice

Ah geeze, thanks for pointing that out. Fixed.

tildebyte commented 1 year ago

@fat-tire;

Even without complete reproducibility across or in all platforms (for now), it seems to me that the spec reflecting as much info as possible for regenerating the image (or something conceptually similar to the image) seems like the right thing to include

Absolutely agree

fat-tire commented 1 year ago

I started for fun just roughing in a GUI for implementing/editing this metadata, implementing img2img, etc. and have a few questions/observations--

is type really a function of an image object?? Wouldn't it be part of the main top-level structure? That is, you set the app to "txt2img" when creating the image, it's not a property of the image itself, right? I can see an argument that this indicates how the image was created, but all this will be embedded in the image metadata already-- and it's not a quality of the image itself, it's more of a quality of the mode the app was in when it made the image, no? Put another way, if you have a grid with multiple objects, they will all be generated identically, right? Share the same "type"?

Also, perhaps "type" is the wrong term. I think of an image "type" as perhaps whether it's a png, jpg, gif, etc. Ideas: script_type, flavor, generator... uh, that's all I got.

Also, for the grid field, does each grid image contain full metadata for each of the images it contains? (As opposed to each sample image from the grid containing all the information of all the other images.) Assuming the array of images is intended to be used for the grid image only, might some data indicating WHERE each image is located be appropriate so they can be selected/extracted? Perhaps an optional field called origin or something indicating the top left pixel of the image relative to the grid image? Otherwise, maybe some kind of pointer to the /sample/image.png where the original can be found...?

For the actual grid boolean, could this true/falseness not be inferred from the number of image objects? Anything more than 1 would indicate a grid, correct? As there's nothing currently in the spec indicating whether n_rows was set, the origin + W&H would effectively give you this info. (incidentally n_rows is a little screwy it seems... but anyway)

Hope this makes some sense. Cheers!

bakkot commented 1 year ago

is type really a function of an image object??

It seems to me that it is, yes.

Another way of putting it is that the top-level structure holds values which are about the environment which generated the image, whereas the nested structure holds values that are about this particular image. And "was this txt2img or img2img" is about the particular image. (grid is a special case because it has to be top-level to make any sense at all.)

Put another way, if you have a grid with multiple objects, they will all be generated identically, right? Share the same "type"?

That happens to be true of the current codebase, but there's no reason it would have to be true, and I don't want to special-case the current codebase too much.

Also, perhaps "type" is the wrong term. I think of an image "type" as perhaps whether it's a png, jpg, gif

That's a "format", not a "type". I'm open to other terms, but of the suggestions you have, I still like type the best. I would be ok with switching to "kind" if you have a strong preference for that, though.

Also, for the grid field, does each grid image contain full metadata for each of the images it contains?

I'm not totally sure I'm understanding the question, so let me answer with a statement rather than yes/no: each entry in the images array contains a full Image object, such that if you replaced the top-level grid value with false and the images field with image containing that specific image's value from the images array, you would get the image in question.

Assuming the array of images is intended to be used for the grid image only, might some data indicating WHERE each image is located be appropriate so they can be selected/extracted?

Ehhhhhhhh. I think saying that the array corresponds to the individual images in left-to-right, top-to-bottom order is sufficient.

For the actual grid boolean, could this true/falseness not be inferred from the number of image objects?

It could, but I am generally happier having explicit booleans for this sort of thing, unless there's a particular reason not to.
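One thing an explicit boolean buys is a cheap consistency check on the reader's side. The sketch below assumes the top-level field names `grid`, `image`, and `images` from the draft; the function name `check_grid_flag` and the error messages are made up for the example.

```python
def check_grid_flag(meta: dict) -> None:
    """Raise ValueError if `grid` disagrees with the image/images fields."""
    grid = meta.get("grid")
    if not isinstance(grid, bool):
        raise ValueError("grid must be an explicit boolean")
    # grid=True goes with `images`; grid=False goes with `image`.
    present, absent = ("images", "image") if grid else ("image", "images")
    if present not in meta:
        raise ValueError(f"grid={grid} requires the '{present}' field")
    if absent in meta:
        raise ValueError(f"grid={grid} forbids the '{absent}' field")
```

With inference from the array length alone, a one-image grid and a plain image would be indistinguishable, and a malformed file would be silently accepted.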

fat-tire commented 1 year ago

Another way of putting it is that the top-level structure holds values which are about the environment which generated the image, whereas the nested structure holds values that are about this particular image. And "was this txt2img or img2img" is about the particular image. (grid is a special case because it has to be top-level to make any sense at all.)

Ok.

Also, perhaps "type" is the wrong term. I think of an image "type" as perhaps whether it's a png, jpg, gif

That's a "format", not a "type". I'm open to other terms, but of the suggestions you have, I still like type the best. I would be ok with switching to "kind" if you have a strong preference for that, though.

Kind/type are about the same to me and have other uses in describing types of images. They're not intuitive, at least to me.... I think there's still perhaps a better-suited word. "method"? "operation"? I dunno. There will be more of these little scripts that create images in different ways so it should be as inclusive as I guess it can be.

Also, for the grid field, does each grid image contain full metadata for each of the images it contains?

I'm not totally sure I'm understanding the question, so let me answer with a statement rather than yes/no: each entry in the images array contains a full Image object, such that if you replaced the top-level grid value with false and the images field with image containing that specific image's value from the images array, you would get the image in question.

But this is meant to be included in a grid image-- that is, a large image with X rows and Y columns such as is currently output by sd, correct? If so, I am confirming that the individual images, currently produced in /samples, would not have the grid flag set and would also not have the full array-- only the data about that particular image? If it does NOT have the grid flag set and yet was produced by --grid, there is no way to reproduce that specific image as it was the output of a --grid argument and we have no way to know this. If it DOES have the grid boolean set in its "environment" flags-- which I thought might be reserved for grid images only-- do we rely on the grid image metadata containing an array of image metadata to know that it is a composite grid image?

Assuming the array of images is intended to be used for the grid image only, might some data indicating WHERE each image is located be appropriate so they can be selected/extracted?

Ehhhhhhhh. I think saying that the array corresponds to the individual images in left-to-right, top-to-bottom order is sufficient.

But you can tell txt2img to use n_rows in generating the grid image and that's not explicitly in the metadata spec.... Hmm. I suppose you could calculate the # of rows and columns from the dimensions: grid_height/image_height is the number of rows... grid_width/image_width gives you columns. Of course there are some grid images that have black empty spaces since they don't fit nicely, but I guess you could account for that.
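As a sketch of that calculation: rows and columns come from integer division, and the left-to-right, top-to-bottom ordering gives each entry's top-left pixel. This assumes every tile has the same size and there is no padding between tiles, which may not hold for every grid; the function name `tile_origins` is illustrative.

```python
from typing import List, Tuple


def tile_origins(
    grid_w: int, grid_h: int, tile_w: int, tile_h: int, n_images: int
) -> Tuple[int, int, List[Tuple[int, int]]]:
    """Return (rows, cols, origins) for a grid of equally sized tiles.

    origins[i] is the (x, y) of the top-left pixel of the i-th entry,
    assuming entries are laid out left-to-right, top-to-bottom.
    """
    cols = grid_w // tile_w
    rows = grid_h // tile_h
    if n_images > rows * cols:
        raise ValueError("more images than grid cells")
    origins = [((i % cols) * tile_w, (i // cols) * tile_h) for i in range(n_images)]
    return rows, cols, origins
```

Trailing empty cells are handled by passing the length of the images array as `n_images`: a 3x2 grid with five images simply leaves the last cell without an origin.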

For the actual grid boolean, could this true/falseness not be inferred from the number of image objects?

It could, but I am generally happier having explicit booleans for this sort of thing, unless there's a particular reason not to.

Well for the reason above-- if the grid needs to be set true BOTH for the grid image AND the component /sample images (which a sample image would need to know how to produce it), then the easiest way to tell which is the grid is just to count the number of image objects. If it's more than 1, it's a grid.

OTOH I am getting the feeling that grid itself is intended to suggest the grid image. In which case, individual images wouldn't have it, and if they are part of a grid then how do we know?

Thought-- is this whole grid image thing, which is part of SD currently, the best way to organize images generated together?

And a final thought-- I know that there's a new reduced GPU memory method for generating large images by breaking them up into pieces and rendering independently then I believe stitching them back together- I have no idea how this would affect anything in this issue, but it's a fun exercise to consider- would it be considered just an implementation detail that wouldn't be reflected in the metadata or is it the kind of thing to include in a generated image's metadata, and where in the current design would be the right place to include it?

Idea: Instead of type, method?

bakkot commented 1 year ago

If so, I am confirming that the individual images, currently produced in /samples, would not have the grid flag set and would also not have the full array-- only the data about that particular image?

Yes, that's right.

If it does NOT have the grid flag set and yet was produced by --grid, there is no way to reproduce that specific image as it was the output of a --grid argument and we have no way to know this.

You don't need to know that it was the output of a --grid argument to reproduce the specific image. Perhaps I'm not understanding the question.

Let me give a concrete example: let's say you run prompt --grid --iterations=4, generating a 2x2 grid G.png containing the four images A, B, C, D, which are also written to A.png, B.png, C.png, D.png. The grid-image G.png would have metadata with grid:true, and its images array would have four entries. Each of the four A/B/C/D.png files would have their own metadata with grid:false. Then the metadata from G.png would be sufficient to reproduce G.png, and the metadata from A.png would be sufficient to reproduce A.png. You don't need to know that A.png was originally generated by a command including --grid in order to reproduce A.png.
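The relationship between G.png's metadata and A.png's metadata can be sketched as a small transformation. The top-level names `grid`, `image`, and `images` follow the draft; the helper name `image_metadata_from_grid` and the `seed` key in the usage example are assumptions for illustration.

```python
import copy


def image_metadata_from_grid(grid_meta: dict, index: int) -> dict:
    """Build the standalone metadata for one entry of a grid's images array."""
    if grid_meta.get("grid") is not True:
        raise ValueError("expected metadata with grid: true")
    # Keep every top-level (environment) field except the images array.
    single = {k: copy.deepcopy(v) for k, v in grid_meta.items() if k != "images"}
    single["grid"] = False
    single["image"] = copy.deepcopy(grid_meta["images"][index])
    return single


# Hypothetical metadata for G.png with four entries (A, B, C, D).
g_meta = {
    "model_hash": "ba7816bf",
    "grid": True,
    "images": [{"type": "txt2img", "seed": s} for s in (1, 2, 3, 4)],
}
a_meta = image_metadata_from_grid(g_meta, 0)
```

Nothing in `a_meta` records that A was once part of a grid, which matches the point above: that fact is not needed to reproduce A.png.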

if they are part of a grid then how do we know?

That information is not available. Why would it be? It's no more relevant than "this image was produced after trying a bunch of other prompts". It's not necessary to reproduce that specific image.

know that there's a new reduced GPU memory method for generating large images by breaking them up into pieces and rendering independently [...] would it be considered just an implementation detail that wouldn't be reflected in the metadata

It depends on whether it produces the same result as not doing that optimization. If it produces the same result either way, no reason to put it in the metadata. If it doesn't, a new field in the image structure would be appropriate.

Idea: Instead of type, method?

I still prefer type. I really do not expect there to be confusion here; once you see the value of the field it's obvious what it means.

fat-tire commented 1 year ago

If it does NOT have the grid flag set and yet was produced by --grid, there is no way to reproduce that specific image as it was the output of a --grid argument and we have no way to know this.

You don't need to know that it was the output of a --grid argument to reproduce the specific image. Perhaps I'm not understanding the question.

I was under the impression that images created as a grid had some kind of interdependency-- I would have to check the source to see how the images are generated, but when I tried to reproduce an image that was not part of a grid I got a different image. Maybe it's because I was under the assumption that the seed passed to the grid, when used for an individual image, did not result in any of the grid-generated images. I guess I should go back and see how the 1 seed passed to grid is used to create 4 images...

Ah--- the answer seems to be PyTorch Lightning's seed_everything. So I guess the "global seed" is used to create individual image seeds somewhere, which would have to be collected and then written into the individual images. Does this sound right?

Let me give a concrete example: let's say you run prompt --grid --iterations=4, generating a 2x2 grid G.png containing the four images A, B, C, D, which are also written to A.png, B.png, C.png, D.png. The grid-image G.png would have metadata with grid:true, and its images array would have four entries. Each of the four A/B/C/D.png files would have their own metadata with grid:false. Then the metadata from G.png would be sufficient to reproduce G.png, and the metadata from A.png would be sufficient to reproduce A.png. You don't need to know that A.png was originally generated by a command including --grid in order to reproduce A.png.

If you have that sub-seed or whatever you call it, yeah I guess this makes sense...

if they are part of a grid then how do we know?

That information is not available. Why would it be? It's no more relevant than "this image was produced after trying a bunch of other prompts". It's not necessary to reproduce that specific image.

Okay, I thought there was a dependency on the grid image, as I tried passing an identical seed to a grid and an image and got different results (which is to be expected)... didn't realize the master seed was being used to predictably generate seeds for the child images.

It depends on whether it produces the same result as not doing that optimization. If it produces the same result either way, no reason to put it in the metadata. If it doesn't, a new field in the image structure would be appropriate.

Idea: Instead of type, method?

I still prefer type. I really do not expect there to be confusion here; once you see the value of the field it's obvious what it means.

:shrug: okay...

bakkot commented 1 year ago

So I guess the "global seed" is used to create individual image seeds somewhere, which would have to be collected and then written into the individual images. Does this sound right?

Yup. The "global seed" is just the seed used for the first image, which is then used to generate a seed used for the second image after finishing the first image, and so on. So you don't need to keep track of two different seeds in the metadata, just the seed that was actually used for each image.
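To make the chaining concrete, here is a minimal sketch. It assumes a hypothetical `next_seed` function standing in for however the scripts actually derive the following seed (the real derivation may differ); the point is only that each image's own recorded seed is enough to replay from that image onward.

```python
from __future__ import annotations

import random


def next_seed(seed: int) -> int:
    """Hypothetical stand-in: derive the next image's seed from the current one."""
    return random.Random(seed).randrange(0, 2**32)


def seeds_for_run(first_seed: int, iterations: int) -> list[int]:
    """Return the seed actually used for each image in a run."""
    seeds = [first_seed]
    for _ in range(iterations - 1):
        seeds.append(next_seed(seeds[-1]))
    return seeds


run = seeds_for_run(1234567, 4)
# Each image's metadata stores only its own entry from `run`.
# Starting again from the third image's seed replays the rest of the chain,
# without ever needing the "global" seed.
assert seeds_for_run(run[2], 2) == run[2:]
```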

psychedelicious commented 1 year ago

If an initial image is used, we ought to include the filename and hash of that initial image.

bakkot commented 1 year ago

orig_hash is already included; I intentionally omitted the filename because it can be sensitive (e.g. a portrait's filename might name its subject), and I don't want to risk subtly embedding sensitive information into files.
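For reference, a minimal sketch of computing such a value, using the proposal's definition of "hash" (first 8 characters of the hex-encoded sha-256). Hashing the raw file bytes, rather than the decoded pixels, is an assumption here, and both function names are hypothetical.

```python
import hashlib


def short_hash(data: bytes) -> str:
    """First 8 characters of the hex-encoded SHA-256 digest."""
    return hashlib.sha256(data).hexdigest()[:8]


def orig_hash_for(path: str) -> str:
    """Hash the init image's bytes; the path itself is never written to metadata."""
    with open(path, "rb") as f:
        return short_hash(f.read())
```

Only the 8-character digest ends up in the metadata, so nothing about the original filename leaks.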

codedealer commented 1 year ago

Hello, we at https://github.com/sd-webui/stable-diffusion-webui (a repository formerly known as hlky) are also interested in standardizing the metadata of generated images.

I keep periodically checking this thread to see if there's any progress.

bakkot commented 1 year ago

@codedealer Good to hear! I am unfortunately going to be offline for the next few days and didn't manage to get this into this repo. However, I believe the OP is in a good state and is what I intend to implement here as soon as I get back; it should be good enough to be going on with. If you find during your implementation that it's underspecified or missing something you need, please do leave a comment and I'll update when I return.

lstein commented 1 year ago

Thanks for checking in! I am very supportive of adopting a common format. Please let Kevin and me know if the OP is adequate, and feel free to suggest improvements.

fat-tire commented 1 year ago

To keep the ball moving, I've tried to distill the above spec to a JSON schema. This is my first attempt at creating a schema, so I'm making no claim that this is usable or good. But for the sake of momentum, here it is anyway.

Caveats:

Proposed schema (based largely on the description at the top):

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://github.com/lstein/blob/main/stable-diffusion/airim.schema.json",
  "title": "AIRIM --- DRAFT PROPOSAL 9-8-22",
  "description": "AI-Rendered Image Metadata (AIRIM)",
  "type": "object",
  "properties": {
    "model": {
      "description": "General model category, usually 'stable diffusion'",
      "type": "string",
      "default": "stable diffusion"
    },
    "model_id": {
      "description": "String identifying the model. Must be the model_id field of a Model Card",
      "type": "string"
    },
    "model_url": {
      "description": "URL where the model can be downloaded (if public) or read about (if not)",
      "type": "string"
    },
    "model_hash": {
      "description": "SHA256 Hash of the weights",
      "type": "string"
    },
    "app_id": {
      "description": "Identifies the application consuming the model. It is recommended, but not required, that applications hosted on GitHub use the username/repo_name of the repository",
      "type": "string"
    },
    "app_version": {
      "description": "Projects with numbered versions are recommended to use a semantic versioned string of the form <major>.<minor>.<patch>[.<build number>].  If built from git repo, you may use the short-form git hash of the commit",
      "type": "string",
      "default": "unknown"
    },
    "app_url": {
      "description": "The canonical location of the application on the web",
      "type": "string"
    },
    "embeddings_hashes": {
      "description": "An array of the hashes of any textual-inversion embeddings in use",
      "type": "array",
      "default": []
    },
    "arch": {
      "description": "'cuda', 'MPS', or another helpful value indicating the GPU architecture",
      "type": "string",
      "default": "unknown"
    },
    "grid": {
      "description": "False if a single image, true if a grid",
      "type": "boolean",
      "default": false
    },
    "metadata_version": {
      "description": "Version of this metadata scheme",
      "type": "string",
      "default": "1.0"
    },
    "images": {
      "description": "An array containing a single (if grid is false) or more (if grid is true) image objects",
      "type": "array",
      "minItems": 1,
      "items": {
        "type": "object",
        "properties": {
          "method": {
            "description": "Either `txt2img` or `img2img`",
            "type": "string"
          },
          "postprocessing": {
            "description": "Either null, indicating no postprocessing was done, or an arbitrary object representing the postprocessing performed",
            "type": ["object", "null"],
            "default": null
          },
          "sampler": {
            "description": "String defining the sampler. Examples at https://github.com/lstein/stable-diffusion/blob/ed513397b255868a9c0afe6dd7e580005b5d32bb/scripts/dream.py#L302-L311",
            "type": "string"
          },
          "seed": {
            "description": "Seed used to generate this image",
            "type": "number"
          },
          "variations": {
            "description": "Pairs used to generate variations",
            "type": "array",
            "default": [],
            "items": {
              "type": "object",
              "properties": {
                "seed": {
                  "type": "number"
                },
                "weight": {
                  "type": "number"
                }
              }
            }
          },
          "steps": {
            "description": "Number of steps/iterations to generate image",
            "type": "number"
          },
          "cfg_scale": {
            "description": "Unconditional guidance scale: eps = eps(x, empty) + scale * (eps(x, cond) - eps(x, empty))",
            "type": "number"
          },
          "step_number": {
            "description": "Normally this will be the full number of steps, but for intermediate images it may be less",
            "type": "number"
          },
          "width": {
            "description": "Image width in pixels",
            "type": "number"
          },
          "height": {
            "description": "Image height in pixels",
            "type": "number"
          },
          "extra": {
            "description": "Object containing any necessary additional information to generate this image. Not to be used for other data, like contact information.",
            "type": ["object", "null"],
            "default": null
          },
          "orig_hash": {
            "description": "[img2img only] Hash of the input image",
            "type": "string"
          },
          "strength_steps": {
            "description": "[img2img only] Strength for running img2img",
            "type": "number"
          }
        },
        "required": [
          "method",
          "sampler",
          "prompt",
          "seed",
          "steps",
          "cfg_scale",
          "width",
          "height"
        ]
      }
    }
  },
  "required": [
    "model",
    "model_url",
    "model_hash",
    "app_id",
    "images"
  ]
}

Near-minimal JSON example:

{
    "model": "stable diffusion",
    "model_id": "stable-diffusion-v1",
    "model_url": "https://github.com/CompVis/stable-diffusion/blob/main/Stable_Diffusion_v1_Model_Card.md",
    "model_hash": "fe4efff1e174c627256e44ec2991ba279b3816e364b49f9be2abc0b3ff3f8556",
    "app_id": "lstein/stable-diffusion",
    "app_url": "https://github.com/lstein/stable-diffusion",
    "app_version": "049ea02",
    "grid": false,
    "images": [{
        "method": "txt2img",
        "sampler": "ddim",
        "prompt": "a picture of an astronaut riding a horse",
        "width": 512,
        "height": 512,
        "steps": 50,
        "cfg_scale": 5.0,
        "seed": 1234567
    }]
}

fat-tire commented 1 year ago

Was bored, so I'm working on a very basic proof-of-concept that currently only outputs json for a single image. The json generated by this GUI front end DOES validate against the above in-progress schema, and FWIW the image I generated was reproduced in Linux using the CompVis upstream repo.
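The required-field part of that check can be sketched with the standard library alone. This is only an illustration, not the validator used by the GUI: `missing_required` is a hypothetical helper, the field lists are copied from the draft schema, and types are not checked.

```python
import json

# Required fields, copied from the draft schema in this thread.
TOP_LEVEL_REQUIRED = ["model", "model_url", "model_hash", "app_id", "images"]
IMAGE_REQUIRED = ["method", "sampler", "prompt", "seed",
                  "steps", "cfg_scale", "width", "height"]


def missing_required(metadata: dict) -> list:
    """Return the names of required fields that are absent (hypothetical helper)."""
    missing = [key for key in TOP_LEVEL_REQUIRED if key not in metadata]
    for index, image in enumerate(metadata.get("images", [])):
        missing += [f"images[{index}].{key}"
                    for key in IMAGE_REQUIRED if key not in image]
    return missing


# The image entry below deliberately omits "seed".
metadata = json.loads("""
{
    "model": "stable diffusion",
    "model_url": "https://github.com/CompVis/stable-diffusion/blob/main/Stable_Diffusion_v1_Model_Card.md",
    "model_hash": "fe4efff1e174c627256e44ec2991ba279b3816e364b49f9be2abc0b3ff3f8556",
    "app_id": "lstein/stable-diffusion",
    "images": [{"method": "txt2img", "sampler": "ddim",
                "prompt": "a picture of an astronaut riding a horse",
                "width": 512, "height": 512, "steps": 50, "cfg_scale": 5.0}]
}
""")
print(missing_required(metadata))  # ['images[0].seed']
```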

I haven't pushed this latest source with all the json/metadata stuff (and a couple UI/design fixes) pending people telling me I'm an idiot, but anyway the basic concept seems to work at least and that json is generated on the fly as the UI is messed with.

This should also detect lstein as opposed to CompVis and set all the app_id, app_url, etc. stuff correctly and run the scripts from the scripts/orig_scripts directory, though there is not yet UI support for the new lstein stuff like variations and extra samples, etc.

I know I'm getting way way ahead of myself here so I'm holding the source pending "what the hell are you even doing?", but it would seem the next steps would be to (1) add a "parse-this-json" button to go the other way, from the json -> the UI form. And of course (2) to actually save the metadata into the generated files. Oh yeah, I should probably (3) make the img2img fields work too.

One thing I learned from actually trying to implement this-- you can generate multiple images with the GUI, but I just can't yet include each of the images' data in the json-- I'm not really sure how to grab the individual seeds from the upstream repo in the case of a grid, which the spec demands since they somehow have to be reproducible per image, rather than per grid. Maybe the solution is to re-play the seeding and selection of new seeds for each image? Or this may mean having to change the spec.... @bakkot ?

image

psychedelicious commented 1 year ago

@fat-tire If I understand your issue with seeds correctly, we can fix it very easily. The seeds are available to the scripts, but the image outputs only include one of the seeds used in making a grid. Not sure why it was coded this way, but it can be sorted on our end.

fat-tire commented 1 year ago

@psychedelicious but then the question is what to do with the upstream CompVis branch- is this "universal" schema just not compatible with the original parent repository that spawned 'em all?

fat-tire commented 1 year ago

Also, a quick reminder to anyone thinking of implementing something like this: even though the json is human-readable and all, you always want to validate input as much as you can, so that no field can escape, say, a string and change or add arguments to the scripts that are run in order to do malicious stuff. (That said, I only check for quotes within the prompt as of now.)
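One way to approach this, sketched under assumptions: type-check each field and build an argument list rather than a shell string, so nothing in the prompt is ever interpreted by a shell. The script name `render.py` and its flag names are hypothetical placeholders, not the real CLI of any repo mentioned here.

```python
def build_argv(image: dict) -> list:
    """Validate field types, then return an argument list for subprocess.run (no shell)."""
    prompt = image["prompt"]
    if not isinstance(prompt, str):
        raise ValueError("prompt must be a string")
    # The --flag=value form keeps a prompt that starts with "--" from being
    # parsed as a separate option by argparse-style parsers.
    argv = ["python", "render.py", f"--prompt={prompt}"]
    for key in ("seed", "steps", "width", "height"):
        value = image[key]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer")
        argv.append(f"--{key}={value}")
    return argv


# Usage (left commented out because render.py is a placeholder):
# import subprocess
# subprocess.run(build_argv(image), check=True)
```

Because the list is passed without a shell, quotes and semicolons in the prompt stay inside a single argument.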

fat-tire commented 1 year ago

Not sure this issue is still active, but fwiw, here's a proof-of-concept for single-image txt2img renders using the proposed-for-discussion json "AI-Rendered Image Metadata" (AIRIM) schema above.

Here you can see the image settings being set with the json being updated in real time. After a test render, I copy the json, set the seed & prompt to something different, then re-render a different image. Then I paste back the json, and the UI is updated with the correct values. Next, I screw up and forget to set the seed-randomizer. But once I turn it off and re-paste... boom, the original image is re-generated.

AIRIM.webm

Lordie it took me 2x longer to get this stupid screencast into a format that github would take than it did to implement the json and I'm not even kidding..

Caveats:

However...

Again, this is just a proof-of-concept and I'm not anticipating this GUI really being used anywhere besides my own messing around. I do like that it's cross-platform tho. Cheers, ft

Kyle0654 commented 1 year ago

I've noticed a few things I think might be missing here: