jhc13 / taggui

Tag manager and captioner for image datasets
GNU General Public License v3.0

image preprocessing functions #185

Open geroldmeisinger opened 1 month ago

geroldmeisinger commented 1 month ago

168

implemented three functions (prepare_img_stretch_and_squish, prepare_img_scale_and_centercrop, prepare_img_scale_and_fill) that prepare square images for the model. the functions are not yet integrated into the UI.
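A minimal sketch of the geometry behind the three strategies, for readers who want to see what each one does to a non-square image. The names SquarePlan and plan_* are hypothetical and are not the PR's actual prepare_img_* implementations. The sketch only computes resize dimensions, crop boxes and padding, so it runs without any image library.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


@dataclass(frozen=True)
class SquarePlan:
    resize_to: Tuple[int, int]  # (width, height) after scaling
    crop_box: Optional[Box]     # region to keep after scaling, if any
    padding: Box                # fill pixels to add on each side


def plan_stretch_and_squish(width: int, height: int, size: int) -> SquarePlan:
    # Ignore the aspect ratio: both sides are forced to `size`.
    return SquarePlan((size, size), None, (0, 0, 0, 0))


def plan_scale_and_centercrop(width: int, height: int, size: int) -> SquarePlan:
    # Scale so the SHORTER side equals `size`, then cut the centre square.
    scale = size / min(width, height)
    new_w = max(size, round(width * scale))
    new_h = max(size, round(height * scale))
    left = (new_w - size) // 2
    top = (new_h - size) // 2
    return SquarePlan((new_w, new_h), (left, top, left + size, top + size), (0, 0, 0, 0))


def plan_scale_and_fill(width: int, height: int, size: int) -> SquarePlan:
    # Scale so the LONGER side equals `size`, then pad the shorter side.
    scale = size / max(width, height)
    new_w = min(size, max(1, round(width * scale)))
    new_h = min(size, max(1, round(height * scale)))
    pad_w = size - new_w
    pad_h = size - new_h
    left = pad_w // 2
    top = pad_h // 2
    return SquarePlan((new_w, new_h), None, (left, top, pad_w - left, pad_h - top))


if __name__ == "__main__":
    for plan in (plan_stretch_and_squish, plan_scale_and_centercrop, plan_scale_and_fill):
        print(plan.__name__, plan(2000, 1000, 500))
```

For an image with an aspect ratio of 2, center crop discards half of the longer side, fill makes half of the square padding, and stretch and squish changes proportions by a factor of 2.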

added unit tests for the prepare_img functions. run them with python taggui/run_tests.py, which applies the functions to images/people_landscape.webp and images/people_portrait.webp. the images are CC-licensed; attributions are in attributions.txt. I chose images with an aspect ratio of 2 and people as content so that any strange preparation results are more obvious ("vertically flipped faces?", "oval faces?", "double face reflections?", "extended feet?")

added opencv-contrib-python as a requirement (used in scale_and_fill). sooner or later we will need it anyway. I strongly argue for using opencv-contrib-python (contrib + non-headless): there is a known issue where the different opencv packages conflict with each other when installed in parallel, and this version is the superset of all opencv modules.

TODO: add examples with abstract objects (like diagrams)

Landscape example

original: people_landscape

stretch and squish: people_landscape_stretch_and_squish

center crop: people_landscape_scale_and_centercrop

fill gray: people_landscape_gray

fill noise: people_landscape_noise

fill reflect: people_landscape_reflect

fill replicate: people_landscape_replicate

Portrait example

original: people_portrait

stretch and squish: people_portrait_stretch_and_squish

center crop: people_portrait_scale_and_centercrop

fill gray: people_portrait_gray

fill noise: people_portrait_noise

fill reflect: people_portrait_reflect

fill replicate: people_portrait_replicate
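To make the four fill modes concrete, here is a small sketch that pads a single row of pixel values. It is an illustration in plain Python, not the PR's OpenCV-based scale_and_fill. The function name pad_row, the gray value 128 and the fixed random seed are assumptions. The reflect variant repeats the edge pixel, which is the convention OpenCV documents for BORDER_REFLECT; which border flag the PR actually uses is not stated.

```python
import random


def pad_row(row, left, right, mode, fill=128, seed=0):
    """Pad a list of pixel values with `left` and `right` extra pixels."""
    if mode == "gray":
        # Constant color: every padded pixel has the same value.
        return [fill] * left + list(row) + [fill] * right
    if mode == "noise":
        # Random values, unrelated to the image content.
        rng = random.Random(seed)
        before = [rng.randrange(256) for _ in range(left)]
        after = [rng.randrange(256) for _ in range(right)]
        return before + list(row) + after
    if mode == "replicate":
        # Repeat the outermost pixel, which smears edges into stripes.
        return [row[0]] * left + list(row) + [row[-1]] * right
    if mode == "reflect":
        # Mirror the content at the edge, edge pixel included.
        if left > len(row) or right > len(row):
            raise ValueError("padding wider than the row is not handled here")
        before = [row[i] for i in range(left - 1, -1, -1)]
        after = [row[len(row) - 1 - i] for i in range(right)]
        return before + list(row) + after
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    row = [10, 20, 30, 40]
    for mode in ("gray", "noise", "replicate", "reflect"):
        print(mode, pad_row(row, 2, 2, mode))
```

Only the gray mode adds pixels that carry no structure of their own; replicate and reflect both derive the padding from image content, and noise adds random texture.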

geroldmeisinger commented 1 month ago

CogVLM2-4bit, 512 tokens, no sampling. prompts:

  1. please make a technical description of this image. mention any artifacts you see.
  2. please make a technical description of this photo of people. what are the shapes of their faces and bodies? do you notices anything strange on their faces and bodies? mention any weird occurences or artifacts in the image! please describe the quality, texture and color of the image in a technical manner.

outputs.zip

notable mentions:

landscape_noise: "The image has a grainy texture, which could be due to the quality of the camera used or the resolution of the image. There are no visible artifacts in the image that would indicate corruption or damage to the photo."

landscape_reflect: "The symmetry created by the reflection adds a dynamic element to the photograph, making it visually appealing.", "The image depicts a group of individuals sitting on a sandy beach. The photo is taken from a slightly elevated angle, capturing the group's reflection in the water below. The reflection in both the sand's surface and the calm water below is clear and undisturbed, indicating a lack of wind or other disturbances at the time the photo was taken. The symmetry of the reflection adds a sense of balance and harmony to the composition."

landscape_replicate: "focusing on the subjects in the foreground while softly blurring the background elements." (does it mean the depth-of-field or the replication?)

portrait_noise: "5. Artifacts: There is a noticeable artifact in the form of a grainy texture overlaying the image. This could be a result of the image being compressed or saved at a lower resolution, which can affect the clarity and detail of the photograph.", "9. Details: Despite the grainy artifact, ..."

portrait_reflect: "This image is a mirrored, symmetrical representation of a crowd of people in an urban setting. The image appears to be a mirrored composite of two different photographs, creating a mirrored effect across the center..."

portrait_replicate: "The image has been heavily distorted with colorful, blurred, and overlapping stripes, which appear to be digital artifacts. These artifacts obscure the details of the image and make it difficult to discern the original content."

=> don't use noise, reflect or replicate. plain colors seem to be ignored, which is good.

extra prompt on black:

prompt: is there a border in the image and what does it look like? what color is it?

response: The image does not have a traditional border that frames the photo. Instead, it appears to be a part of a larger image or a screenshot from a digital platform, as indicated by the visible interface elements at the top and bottom of the image. The top shows icons that resemble social media sharing options and a timestamp, while the bottom has a navigation bar with icons for going back, home, and accessing options or settings. The color of these interface elements is primarily white, with some black and gray accents.

hmm...

extra prompt and stretch_and_squish:

prompt: what is the shape of their heads? does the shape of their faces look natural?

landscape: The heads of the individuals in the image are mostly rounded with a slight forward-facing curve, which is typical for natural human heads. The shape of the faces appears natural, with no noticeable elongation or distortion.

portrait: The heads in the image are mostly rounded with a variety of hair lengths and styles. The faces appear natural, with a mix of different facial structures and features, including a range of nose shapes, mouth sizes, and chin lengths. There doesn't seem to be any noticeable alteration or manipulation to the facial features, maintaining a realistic appearance.

CogVLM2 seems to be trained on squished heads?

landscape_extra_squishy: landscape_extra_squishy

The shapes of the faces and the bodies of the individuals in the image are quite regular and natural. There are no noticeable abnormalities or unusual features on their bodies or faces. They appear to be young adults, and their features are typical of that age group.

portrait_extra_squishy: portrait_extra_squishy

The shapes of the faces and the bodies of the individuals in the image are quite varied. Some faces are more angular, while others are more rounded. The body types range from slim to more robust. There are no immediately noticeable strange features on the faces or bodies; however, the image is quite crowded, and individual features can be obscured due to the overlapping nature of the people.

geroldmeisinger commented 4 weeks ago

added preprocessing functions to settings but only those which don't change content:

settings

geroldmeisinger commented 4 weeks ago

@jhc13 I'm done. I only integrated it for CogVLM2 because it's the only model I use and know that uses square images. I cannot integrate it for other models.

geroldmeisinger commented 2 weeks ago

I have to admit it is rather difficult for me to come up with real adversarial examples:

  1. image manipulation of faces:

people_landscape_adversarial

I'm so sorry children but it's for science!

  2. diagrams where a circle versus an ellipse, or a square versus a rectangle, have different meanings

diagram

from https://commons.wikimedia.org/wiki/File:EPK_komplexes_Beispiel.png

geroldmeisinger commented 6 days ago

I just noticed some models are using preprocessor_config.json
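A rough sketch of how such a file could be inspected to see what resizing or cropping a model's own image processor already applies. The keys below (do_resize, size, do_center_crop, crop_size) are ones commonly found in Hugging Face preprocessor_config.json files, but they vary by model. The sample JSON is illustrative and not taken from any specific model, and the function name is hypothetical.

```python
import json

# Illustrative sample only; real files differ from model to model.
SAMPLE_CONFIG = """
{
  "do_resize": true,
  "size": {"shortest_edge": 224},
  "do_center_crop": true,
  "crop_size": {"height": 224, "width": 224}
}
"""


def summarize_preprocessor_config(text):
    """Extract the resize and crop settings from a config's JSON text."""
    config = json.loads(text)
    return {
        "resizes": bool(config.get("do_resize", False)),
        "size": config.get("size"),
        "center_crops": bool(config.get("do_center_crop", False)),
        "crop_size": config.get("crop_size"),
    }


if __name__ == "__main__":
    print(summarize_preprocessor_config(SAMPLE_CONFIG))
```

If a model's processor already center-crops, images prepared with a different strategy may still be cropped again downstream, so the two settings would need to be reconciled.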