nagadomi / waifu2x

Image Super-Resolution for Anime-Style Art
http://waifu2x.udp.jp/
MIT License

[Idea] Training using vector graphics #121

Open Anthony-Gaudino opened 8 years ago

Anthony-Gaudino commented 8 years ago

From what I could understand, training happens using bitmap images, like PNG.

I imagine that while training, each image has its dimensions changed, either reduced or increased. The problem is that resizing a bitmap image is never perfect, no matter the algorithm (I believe waifu2x uses Lanczos3).

With this in mind, I propose the idea of using vector graphics for training. Vectors can be scaled without quality loss, so from one vector file you can generate high-resolution images and smaller ones for training, and neither will have the problems of scaling bitmap images, because they will be scaled in vector software and then saved as PNG.

Using vector software, you may also be able to create random images, images with text, squares, circles, gradients, etc for training.

Of course, this is limited to artistic-style images; photos are always bitmaps and will still suffer from bitmap resizing.

leilei- commented 8 years ago

Vector strokes are often way too uniform though

nagadomi commented 8 years ago

Using vector graphics for training might work better for simple graphics. But the typical input image is a bitmap, and it contains noise or detailed effects. I think vector graphics are too clean/simple for real-world problems.

Here is my opinion on the current state of waifu2x for this topic (and sorry for my poor English): waifu2x uses lanczos/sinc/box downscaling algorithms to generate low-resolution images. Typically, a downscaling algorithm provides better quality than an upscaling algorithm, but it is not perfect; you are right. Therefore, the current waifu2x solves an inverse-downscaling problem instead of a pure super-resolution problem. Obviously, inverse-downscaling strongly depends on the downscaling algorithm used during training, which is not very good. One of my solutions is to reduce the dependence on the downscaling method by randomly choosing filter types (e.g. lanczos, sinc, box, ...) and filter parameters (e.g. blur). Additionally, I think most images on the Internet have already been resized, which is why inverse-downscaling is not so bad.
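
A minimal sketch of this random-filter idea (my own illustration in Python/Pillow, not the actual train.lua code; the filter pool and blur range are assumptions):

```python
# Minimal sketch (not train.lua): generate a low-resolution training image with
# a randomly chosen filter and blur, so the model does not over-fit to a single
# downscaling method.
import random
from PIL import Image, ImageFilter

FILTERS = [Image.BOX, Image.BILINEAR, Image.BICUBIC, Image.LANCZOS]  # assumed filter pool

def random_downscale(hr, scale=2):
    img = hr
    if random.random() < 0.5:  # randomized blur parameter
        img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.0)))
    resample = random.choice(FILTERS)  # randomized filter type
    return img.resize((hr.width // scale, hr.height // scale), resample=resample)

# lr = random_downscale(Image.open("hr_sample.png"))  # hypothetical input file
```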

Anthony-Gaudino commented 8 years ago

Yes, vectors will never have the fine details of photos, but some are quite detailed, like this one.

One can also convert a photo to vector graphics using software like AutoTrace, Potrace, or Delineate, but the details will be lost.

If someone wants a detailed image that can be resized without problems, there's the option of 3D rendering at different resolutions, but I think this would take way too long to process.

Also, fractals come to mind: they can be very detailed, and resizing doesn't introduce problems.

nagadomi commented 8 years ago

First of all, I really need a true downscaling algorithm. It would improve the super-resolution task and its benchmarks. (Unfortunately, the results of super-resolution benchmarks also depend on the downscaling algorithm used to generate the benchmark dataset.) But I am still unable to obtain one.

3D rendering is a good idea, but it still depends on the shaders and models. I heard that a user tried training with computer-generated art. It looks good, but I am guessing it cannot beat bitmap training in benchmarks.

Anthony-Gaudino commented 8 years ago

I couldn't understand what you meant by "true downscaling algorithm". Do you mean a perfect downscaling algorithm? Does such an algorithm exist?

What are you still unable to obtain? The "true downscaling algorithm"?


The good thing about 3D is that from a single scene you can generate unlimited images just by zooming, moving, and rotating the camera. You can also change parameters like colors. So you don't really need a different scene for every image you want.

I believe you can save a lot of time if, instead of rendering the image, you just use a preview. Mitsuba, for example, has an OpenGL preview that is quite good and takes just a few seconds to render. Example below:

Preview at 1024x1024: 1024x1024

Preview at 512x512: 512x512

1024x1024 preview downscaled to 512x512 in GIMP with Lanczos3: 512x512_gimp

If you open the 512x512 images and zoom, you can see the small changes between them.

I don't know whether OpenGL previews are a 100% safe method (i.e. whether they generate equivalent images at different resolutions); someone who understands 3D graphics better would know.

I also tried using Mandelbulber, which can render 3D voxels, but I noticed that changing the render dimensions has side effects on lighting; I will check if there's a fix for this.

nagadomi commented 8 years ago

I couldn't understand what you meant by "true downscaling algorithm". Do you mean a perfect downscaling algorithm? Does such an algorithm exist?

I meant a "perfect downscaling algorithm". Here are the downscaling results: out. From left to right: Box filter, Triangle (aka bilinear) filter, Catrom (aka bicubic) filter, Lanczos filter. Box is sharp. Triangle is blurry. Catrom produces a little ringing (a white line around the mouth). Lanczos produces ringing. So neural networks learn how to repair this. For example, when using the box filter in training, upscaling results will be a little blurry (the inverse of sharp); when using the lanczos filter in training, upscaling results will be noisy and sharp (strange results)... :disappointed:
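
For anyone who wants to reproduce this kind of side-by-side comparison, here is a small Pillow sketch (my own; Pillow's BICUBIC stands in for Catrom, and input.png is a hypothetical file name):

```python
# Downscale one image with each filter and paste the results side by side
# for visual inspection of sharpness, blur, and ringing.
from PIL import Image

FILTERS = {
    "box": Image.BOX,
    "triangle": Image.BILINEAR,
    "catrom": Image.BICUBIC,   # Pillow's bicubic; Catrom is one bicubic variant
    "lanczos": Image.LANCZOS,
}

hr = Image.open("input.png").convert("RGB")
lr_w, lr_h = hr.width // 2, hr.height // 2
strip = Image.new("RGB", (lr_w * len(FILTERS), lr_h))
for i, resample in enumerate(FILTERS.values()):
    strip.paste(hr.resize((lr_w, lr_h), resample=resample), (i * lr_w, 0))
strip.save("downscale_comparison.png")
```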

3D

I think MMD (MikuMikuDance) is a useful tool for this; lots of models and motions are available. But when training with 3DCG, the upscaling model will specialize in 3DCG.

Anthony-Gaudino commented 8 years ago

I don't know if there will ever be a perfect downscaling algorithm; I believe there's a limitation.


I understand the problem better now... Since someone may want to upscale an image that was previously downscaled by an algorithm like box or lanczos, the neural network should know how to upscale those.

I thought that training only with perfectly downscaled images, like vector or 3D-rendered ones, would produce better results in all cases.

Specialized training

Would it be better to have specialized training?

Let's assume that you have images that were not previously downscaled (original photos, vector graphics, 3D-rendered images) and that those will be used for training. Then:

The user would have the option to state whether the image he wants to upscale was previously downscaled; if yes, the software asks which filter was used. If the user doesn't know the filter, or doesn't know whether it was downscaled at all, the software could guess the best model to use or fall back to the ALL IN ONE model. If the image was not downscaled, the program would use the perfect-downscaling model.
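
A hypothetical sketch of that selection logic (the model names and the select_model() helper are my own illustration, not waifu2x's API):

```python
# Hypothetical model-selection logic for the workflow described above.
from typing import Optional

def select_model(was_downscaled: Optional[bool], filter_name: Optional[str]) -> str:
    if was_downscaled is None:      # user doesn't know -> ALL IN ONE fallback
        return "all_in_one"
    if not was_downscaled:          # pristine source (vector, 3D render, original photo)
        return "perfect_downscale"
    if filter_name in ("box", "bilinear", "bicubic", "lanczos"):
        return "inverse_" + filter_name   # specialized model per downscaling filter
    return "all_in_one"             # downscaled, but filter unknown

# select_model(True, "lanczos") -> "inverse_lanczos"
# select_model(False, None)     -> "perfect_downscale"
```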

I guess that using the specialized models would provide better results than the ALL IN ONE for their specific situation.

3D

MMD seems to produce graphics more similar to vector art, while Mitsuba renders photo-like images. Mitsuba scenes are stored in XML files, so it should be easy to change them with a script and render using the fast Virtual Point Light integrator.

nagadomi commented 8 years ago

I guess that using the specialized models would provide better results than the ALL IN ONE for their specific situation.

This is right. A learning-based upscaling method strongly depends on its training dataset. A training dataset is defined by an image domain (anime-style art, photo, 3DCG, ..., etc.) and downscaling methods (box, lanczos, vector graphics, ..., etc.). The current waifu2x uses an all-in-one downscaling-method approach for each image domain, so waifu2x has pretrained models for anime_style_art and photo. And we could add a pretrained model for Mandelbulber.

If I publish more specialized pretrained models, it might be useful for user-specific situations. A problem is training time: training one model takes around 16 hours. `models/anime_style_art` contains scale, noise1, noise2, noise3, noise1_scale, noise2_scale, noise3_scale (7 models in total), and it takes around 5 days. I will add more detailed documentation for training (train.lua has lots of heuristic parameters), and I will add a feature that makes it possible to use user-specific pairwise (highres/lowres) data.

Anthony-Gaudino commented 8 years ago

I would like to help with training, but unfortunately I don't have CUDA-capable hardware. If I did, I would train the more specialized scenarios I described.


I opened issue buddhi1980/mandelbulber2#101 to ask about the Mandelbulber lighting problem I noticed; it seems that fixing it would make rendering times much longer.


I still think that, if you want faster training, perfect-downscaling training would be better done with 2D vector graphics and 2D fractals. Vectors would likely be better suited to art-style images, and fractals to photos.

3D-rendered images take more than one minute to render in Mandelbulber and about 15 seconds in Mitsuba (Virtual Point Light integrator and 10000 SPP); of course, this depends on resolution. Both could be used for photo and 3DCG.

I'm now almost certain that Mitsuba renderings are equivalent; the only problem is that light-emissive surfaces don't get anti-aliasing, so maybe they should be hidden behind an object.

buddhi1980 commented 8 years ago

In general, you have to be careful about using raytracers to generate different sizes of images. If they use anti-aliasing algorithms, those are always selective and mostly focus on object edges, so different parts of the image will be filtered in different ways. What is more interesting, the parts that are anti-aliased are just calculated as an average over a locally increased resolution for a set of pixels (very similar to downscaling). Even if you turn on full-image anti-aliasing, it will behave exactly the same as downscaling the image in GIMP. So, knowing how 3DCG is generated, you will not get any benefit from rendering images at different resolutions compared with downscaling images.

Anthony-Gaudino commented 8 years ago

@buddhi1980, my understanding of 3DCG is limited, but I don't agree 100% with you. 3D rendering software like Mitsuba can simulate real-world lighting very well, and when you render an image at a lower resolution, it is not limited to the pixel data available to image-downscaling software.

I'm using the simplest rendering method available in Mitsuba, but still, I can see that rendering at a lower resolution can produce better results than downscaling.

You are correct that this method is not perfect: virtual lights are created randomly, which is why I used a huge number of them (10000), so that different renderings of the same scene produce almost identical images. For better results, one can increase the number of VPLs even more. This is similar to path tracing, where using a huge number of SPP increases the chances of getting more similar renderings of the same scene.

You are also correct about the use of anti-aliased images, but the Virtual Point Light integrator can generate anti-aliased images without using an anti-aliasing filter; in fact, the available anti-aliasing filters are not used even if activated. The Virtual Point Light integrator renders the image once per virtual light and combines the results, which creates a kind of automatic anti-aliasing; it seems that only light-emitting surfaces don't benefit from this.

The problem I want to solve is, when 3D rendering, to produce the best possible image equivalence between two resolutions without a huge computation time; for now, this is the best method I could find.



I was thinking: when I open an image on my computer and look at the screen, if I move away from the screen and look at the image, I will perceive a "perfectly" downscaled image.

So again, imagining that 3D rendering simulates the real world, I thought: why not display an image in a scene and render it? I got some interesting results.

The first image I used was a rendered Cornell Box, the same scene I had provided before, but this time rendered in Mitsuba with the following settings:

- Integrator:                         Virtual Point Light
- Sampler:                            Independent Sampler (NOT USED)
- Resolution:                         1024x1024
- Reconstruction filter:              Box filter          (NOT USED)
- Shadow map resolution:              1024
- Maximum depth (light bounces):      -1
- Clamping factor:                    0.1
- Samples per pixel (virtual lights): 100000

Notice that it doesn't use any anti-aliasing. I increased the number of Virtual Point Lights to 100000 to reduce image differences between renderings at different resolutions.

I got this 1024x1024 image: 1024

Then I rendered the same scene with the same settings, but setting resolution to 512x512: 512_rendered

I visually compared both images and they looked pretty much the same, so I continued...

I created a Mitsuba scene where I could render a texture, and I used it to render the output of the 1024x1024 3D-rendered Cornell Box at 512x512: 512_rendered_texture

To my surprise, the resulting 512x512 image is almost identical to the 512x512 3D-rendered Cornell Box, except for the light-emitting plane, which again is not anti-aliased.

I looked for a way to check image differences and found https://huddle.github.io/Resemble.js, which tells me the images are 0.05% different; if the light-emissive surface were removed, they would be a 100% match. This percentage uses the "Ignore less" option and doesn't take very small color differences into account.

I tried slightly changing the scene by replacing the small box with a 3D fruit basket I downloaded. I followed the same process and obtained the exact same result: 0.05% difference.
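
For reference, a rough numpy analogue of such a pixel-difference percentage (this is not Resemble.js and its numbers won't match exactly; the threshold is my stand-in for ignoring very small color differences):

```python
# Count the fraction of pixels whose RGB values differ by more than a threshold.
import numpy as np
from PIL import Image

def diff_percent(path_a, path_b, threshold=16):
    a = np.asarray(Image.open(path_a).convert("RGB"), dtype=np.int16)
    b = np.asarray(Image.open(path_b).convert("RGB"), dtype=np.int16)
    assert a.shape == b.shape, "images must have the same dimensions"
    changed = (np.abs(a - b) > threshold).any(axis=-1)  # per-pixel "differs" mask
    return 100.0 * changed.mean()

# print(diff_percent("512_rendered.png", "512_rendered_texture.png"))  # hypothetical file names
```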

//////////////

Then I tried this 1024x1024 vector graphics: 1024

Resized on Inkscape to 512x512: 512

Resized on GIMP to 512x512 with Bilinear filter: 512_bilinear

And the 512x512 rendered texture: 512_rendered_texture

I get a 0.66% difference on this one, while resizing to 512x512 with Bilinear gives a 0.36% difference between it and the original downscaled vector.

It's interesting that the 3D rendered texture and Bilinear downscaling are very similar on both examples and differences are mainly on edges:

| Example | Bilinear | 3D texture |
| --- | --- | --- |
| Cornellbox 1 | 0.07% | 0.05% |
| Cornellbox 2 | 0.06% | 0.05% |
| Car SVG | 0.36% | 0.66% |

One important thing I observed is that the 3D-rendered texture produces much smoother (anti-aliased) edges; look at the front of the car, for example.

//////////////

Another example of the difference in edges. Resized from 1024x1024 to 512x512 with Lanczos in GIMP: lanczos

With Bilinear on GIMP: bilinear

With 3D rendered texture method: texture

Another important thing is that jagged edges don't exist in the real world!

//////////////

I'm providing the Mitsuba scene file; it's just an XML file, and I believe it will only run on Mitsuba 0.5.0. The XML contains comments explaining how to use it.

Scene file: mitsuscaler.zip

Please feel free to use and test it.

Unfortunately, I won't have much time to test this myself for almost two weeks; I will be busy :sweat:

HybridDog commented 8 years ago

I was thinking: when I open an image on my computer and look at the screen, if I move away from the screen and look at the image, I will perceive a "perfectly" downscaled image.

You could try the following: use a camera with a long exposure time and take a photo of some 2D thing, then take another photo of it from a different distance. The real scaling can be seen when comparing the pictures, can't it?

Note that this is very inaccurate, because the lighting differs after changing the distance (due to your shadow, etc.), the camera automatically adjusts colours, the image is distorted by the FOV, the pictures are saved as JPG (although there are much more efficient formats, such as BPG or Daala: https://people.xiph.org/~jm/daala/revisiting/compare.html), and so on.

Anthony-Gaudino commented 8 years ago

@HybridDog, I did many different tests with this idea in mind, with some help from others, but as you said, it doesn't give coherent results: each image set has distortions due to lens zooming, lighting changes, etc. Even a test made in a lab in Switzerland didn't provide accurate results.

A test like this would require an extremely well-prepared environment with very tight tolerances.


My tests showed that the renderer produces images with significantly less aliasing. It also seems that it can produce images that are more similar to how I perceive the originals (with my own eyes) when I look at them from afar. Yes, I did many simple tests just using my vision, which is not scientific :).

I also noticed that the better anti-aliasing occurs because edges become brighter, as if they were reflecting more light. I did a simple test in GIMP in which I increased the image contrast/brightness and got results similar to what happens at the edges of the rendered textures. So, while rendering, it seems that lightly shaded areas of the texture interfere with darker ones, making the latter a little brighter.


Here are some simple tests I created:

lines

checkerboard

zone_plate

drawings

In those tests, the image on the left is the original image, which was not changed in any way; the image on the right was first downscaled to half its size (using the method written on the right) and then upscaled back using nearest neighbor. The final nearest-neighbor upscaling doesn't introduce any artifacts and makes the image the same size as the original.
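
A small Pillow sketch of this test procedure (my own; the file names are hypothetical):

```python
# Downscale by half with a given filter, then upscale back with nearest
# neighbor for side-by-side comparison with the untouched original.
from PIL import Image

def half_then_nearest(path, resample=Image.LANCZOS):
    img = Image.open(path)
    small = img.resize((img.width // 2, img.height // 2), resample=resample)
    return small.resize(img.size, resample=Image.NEAREST)

# half_then_nearest("zone_plate.png").save("zone_plate_lanczos.png")
```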

To compare the downscaled image with how the original would look if downscaled "in real life", one must stand at double the optimal distance from both images, so that the pixels are no longer noticeable and the images appear at half their size.

The optimal distance depends on the pixel pitch of the monitor; it's the distance that makes your screen equivalent to an Apple Retina screen, and it can easily be calculated online: isthisretina.com, stari.co/tv-monitor-viewing-distance-calculator.

For example, a 21.5" monitor with a 1920x1080 resolution has an optimal viewing distance of about 86 cm. Since we want to compare images downscaled by half, the viewer must double this optimal distance, so instead of 86 cm he must move 172 cm away from the screen.
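
A rough sanity check of the 86 cm figure, assuming the common "retina" criterion of one pixel per arcminute of visual angle (the linked calculators may use slightly different assumptions, so the result is approximate):

```python
# Distance at which one pixel of a 21.5" 1920x1080 monitor subtends one arcminute.
import math

diag_in, w_px, h_px = 21.5, 1920, 1080
ppi = math.hypot(w_px, h_px) / diag_in                 # ~102.5 pixels per inch
pixel_pitch_cm = 2.54 / ppi                            # ~0.025 cm per pixel
one_arcmin = math.radians(1 / 60)                      # one arcminute in radians
distance_cm = pixel_pitch_cm / math.tan(one_arcmin)    # ~85 cm
print(round(distance_cm), "cm, doubled:", round(2 * distance_cm), "cm")
```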

At this distance, looking at the test images, the rendered-texture image set should look much more similar to the original than the other image sets do.


I contacted Dr. Jakob, who created the Mitsuba renderer, told him about the tests I was doing, and sent them to him. Here's what he said:

Unfortunately I have to tell you that rendering a quad with a texture in Mitsuba doesn’t do anything special as far as light transport goes. Mitsuba will sample the texture at random locations and use an EWA filter for lookups (similar to anisotropic texture filtering on GPUs). Everything will be convolved with a reconstruction filter, which is usually a Gaussian by default. The sequence of these steps will slightly blur your input image — that’s it! ;). Everything can be described in terms of signal processing operations that are likely vastly inferior to what Photoshop and friends can do, and no light transport/detailed physical simulation is involved.

But then I told him I had disabled all filters, and he answered that unfortunately he has no time to help me. I understand he's busy; also, as I made clear to him, my knowledge of CG is small and all this was my own unscientific research, so I understand that he wasn't interested in spending his time on all the "mumbo jumbo" I sent him.

Since I disabled all filters, maybe there are other internal filters that I can't disable. Also, since I did tests with two different rendering techniques, the Virtual Point Light integrator and the Path Tracer, and both gave the same results, I think Dr. Jakob is right in saying that light transport doesn't play any role in this.

If light transport doesn't play any role, then which steps does the renderer use to downscale the image? That is what I would like to find out.


Dr. Jakob also pointed me to this research: https://graphics.ethz.ch/~cengizo/imageDownscaling.htm, which looks great. Unfortunately, I didn't find any implementation, though the algorithm is provided in the paper.

Unfortunately, I don't have time to implement this myself. I thought it would be nice to have it implemented in ImageMagick, and then I could also compare it against what the renderer produces.

HybridDog commented 5 years ago

Unfortunately, I didn't find any implementation, though the algorithm is provided in the paper.

I have implemented it; the algorithm is great, in my opinion: https://gist.github.com/HybridDog/dd95a99d411972f030fb18543280ad60#file-ssim_perceptual_downscaling-c It even scales down dithering noise, which is sometimes not desired. I suspect that vector graphics are less sharp than the results of this downscaling algorithm, so if you want to use it for training, you could downscale the original image by a factor of 1/2 or less with SSIM-based downscaling, use that as the reference, and then downscale this reference with a bicubic filter for training.
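
A sketch of that training-pair pipeline (my own; ssim_downscale() below is only a placeholder for the gist's C downscaler, not its real interface):

```python
# Build a (reference HR, bicubic LR) training pair as proposed above.
from PIL import Image

def ssim_downscale(img, factor=0.5):
    # Placeholder: plug in the SSIM-based perceptual downscaler from the gist here.
    raise NotImplementedError

def make_training_pair(original_path):
    original = Image.open(original_path)
    reference = ssim_downscale(original, 0.5)   # perceptually downscaled HR reference
    lr = reference.resize((reference.width // 2, reference.height // 2),
                          resample=Image.BICUBIC)   # bicubic LR input for training
    return reference, lr
```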

HybridDog commented 5 years ago

bicubic: cubic

nearest: none

ssim based: ssim