glevyhas / pix-plot

A WebGL viewer for UMAP or TSNE-clustered images
MIT License

Handling the same image filename in multiple folders #186

Open vdet opened 3 years ago

vdet commented 3 years ago

Hi again,

I tried to display datasets of up to ~600,000 images (pre-computed UMAP). Here is some feedback:

That being said, I could already get insight from the subset of images I could view. Thanks!

Vincent

duhaime commented 3 years ago

Many thanks for this @vdet! Now this is interesting. We haven't tested with plots this large in a little while!

We don't have much guidance on the cell_size argument yet. One of the relevant factors in that consideration would be how long you can wait for the plots to load--if you don't mind waiting a bit, you could afford to use a larger cell size, but you may end up downloading a few hundred MB while fetching the atlas files, aka the large images that each contain many small images, stored in ./output/data/atlases/{{ plot id }}/atlas-{{ atlas index }}.jpg. If you want others to access your plots, I'd aim for something smaller than the default. Possibly 8 or 10 px? The original Google Arts & Culture project used 16px of width as the constraining dimension, and they squeezed ~400,000 images into their viewer!
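For a rough sense of how the cell size drives the download, here's a quick back-of-the-envelope sketch (my own, not part of pix-plot). It assumes square cells packed into 2048 × 2048 JPEG atlases and roughly 0.5 bytes per JPEG pixel; real cells keep their aspect ratio, so the numbers are only indicative:

```python
# Back-of-the-envelope estimate of atlas count and total atlas download size
# for a given cell size. Assumptions: square cells packed into 2048 x 2048
# atlases and ~0.5 bytes per JPEG pixel; real cells keep their aspect ratio,
# so treat the output as a rough indication only.

def estimate_atlas_cost(n_images, cell_size, atlas_px=2048, jpeg_bytes_per_px=0.5):
    """Return (number of atlases, approximate total download in MB)."""
    cells_per_atlas = (atlas_px // cell_size) ** 2   # cells that fit per atlas
    n_atlases = -(-n_images // cells_per_atlas)      # ceiling division
    total_mb = n_atlases * atlas_px * atlas_px * jpeg_bytes_per_px / 1e6
    return n_atlases, total_mb

for size in (8, 10, 16, 32):
    atlases, mb = estimate_atlas_cost(600_000, size)
    print(f"cell_size={size:>2}: ~{atlases} atlases, ~{mb:.0f} MB of atlas JPEGs")
```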

I'm curious as to the details of your images. All of the images should have been plotted in the scene. But some images may have been filtered out during the data prep stage. As you may know, when the user provides their list of images to be processed, we iterate through them and resize them into little images that have height cell_size and the width required to keep the image's aspect ratio.

If an image has 0 width after being resized (or the image has a height or width that's larger than the atlas size, which is 2048 px by 2048 px), it won't be retained in the atlases or the plot. I have a hunch that this filter is what's causing the missing images. You can check the full list of images that were processed in ./output/data/imagelists/imagelist-{{ plot_id }}.json
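Here's a minimal sketch of that resize-and-filter logic (illustrative only, not the exact code in pixplot.py), assuming Pillow is available:

```python
# Illustrative sketch (not pix-plot's actual code) of the resize-and-filter
# step described above: images are scaled to a fixed cell height, and any
# image that ends up with zero width or exceeds the atlas dimensions is dropped.
from PIL import Image

ATLAS_SIZE = 2048  # px, per the atlas size mentioned above

def keep_image(path, cell_size):
    """Return True if `path` would survive the resize filter sketched here."""
    with Image.open(path) as im:
        w, h = im.size
    new_h = cell_size
    new_w = int(w * cell_size / h)   # preserve the aspect ratio
    if new_w == 0:                   # extremely narrow images collapse to 0 px
        return False
    if new_w > ATLAS_SIZE or new_h > ATLAS_SIZE:
        return False                 # too large to fit in a single atlas
    return True
```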

The idea of using non-integer cell sizes is interesting! I'm actually not sure what would happen in that case. But I would probably use a cell size of 6 or 8 at a minimum...

duhaime commented 3 years ago

Just thinking a little more about this, @vdet: there are some other considerations that might be at play. If the number of images in the output image list is greater than the number of displayed images, it could be because of limitations in the GPU card of the host that's running the visualization.

You can get some information on your system's GPU card here. If the atlases are too big, one can bump up against the Max Texture Image Units value. Or possibly we are packing more into a single draw call than your GPU can handle (there are a few GL parameters that influence the max draw call size).

To alleviate concern over any of these lower-level issues, I'd start by counting the images in the generated image list (alternatively, you could check data.cells.length from your browser console). If that number is reduced, then it's case closed. Else we may need a little more information to figure out why some images are not appearing!
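If it's easier than counting by hand, a quick script like this can count the entries (the top-level "images" key is an assumption about the JSON layout, so adjust it to match what your imagelist file actually contains):

```python
# Quick check of how many images made it into the generated image list.
# The "images" key below is an assumption about the JSON layout; adjust it
# to match the structure of your imagelist file.
import json, glob

for path in glob.glob("./output/data/imagelists/imagelist-*.json"):
    with open(path) as f:
        data = json.load(f)
    names = data["images"] if isinstance(data, dict) else data
    print(path, len(names), "images")
```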

vdet commented 3 years ago

The images all have the same size: 224 × 224 px. I actually tried setting --cell_size to 8 when I saw that images were discarded with 16, but it made no difference as far as I could tell from visual inspection. The file ./output/data/imagelists/imagelist-{{ plot_id }}.json reports all the images (592,106) I provided. Now, I am no GPU expert. The WebGL report indicates:

Max Texture Size: 8192
Max Cube Map Texture Size: 8192
Max Combined Texture Image Units: 80
Max Anisotropy: 16

The GPU is an AMD Radeon R9 M395 in a late 2015 27" retina iMac.

All the best, Vincent

vdet commented 3 years ago

Hello. The issue of images not being displayed was caused by a basic name-conflict problem: if my images are organized like this

dir_A/img.png
dir_B/img.png #not the same image as dir_A/img.png

and pixplot.py is invoked with --image "*/img.png", only one img.png, not both, will end up in the relevant data subdirectories. Those subdirectories have a flat structure, so the different img.png files overwrite one another. This would be avoided if pix-plot's image storage directory structure mirrored the one provided by the user.
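A tiny illustration of the collision (the output directory is just an example, not pix-plot's exact layout):

```python
# Two different source images with the same basename map to one output path,
# so the second thumbnail overwrites the first. The output directory here is
# illustrative, not pix-plot's exact layout.
import os

inputs = ["dir_A/img.png", "dir_B/img.png"]   # different images, same basename
targets = {os.path.join("output/data/thumbs", os.path.basename(p)) for p in inputs}
print(targets)   # {'output/data/thumbs/img.png'} -- a single path for both inputs
```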

All the best, Vincent

duhaime commented 3 years ago

@vdet many thanks for your note. Yes, this is something we've thought about. The only challenge is that the filename serves as the foreign key that lets us connect images to metadata rows. This means that if we keyed on the full or relative image path when processing images, the user would have to include that path in their metadata, which could be quite challenging.

What if we displayed a little warning indicating that there are duplicate filenames in the input dataset--would that be sufficient? I'm open to other ideas instead!

vdet commented 3 years ago

Hi Douglas,

This means that if we process the full or relative image path when processing images, the user would have to include that path in their metadata, which could be quite challenging.

The user does provide a path already: to be able to use an image glob of the form --image "*/img.png", I specified the full path to each image (dir_A/img.png, dir_B/img.png, etc.) in the filename column of my metadata file. Otherwise, there would be no way for pixplot to match the images and the metadata in the first place. Any user who wishes to use a non-flat directory structure must of course be able to establish that mapping. At some point pixplot.py strips the paths from the metadata's filename column. Why not leave that filename column intact and mirror the user's directory hierarchy for the image-specific files?

Note that non-flat dir structures naturally occur in many contexts, when

Well, I of course don't understand the ins and outs of your code, and I can find an upstream fix for my application. Anyhow, a warning would surely help. In my case the effect was so massive that I could not miss it. In other cases only a few images will be missing, while others will have the wrong metadata and land in the wrong spot in the geographic view (which is what helped me pinpoint the problem).

Thanks,

Vincent

duhaime commented 3 years ago

@vdet aha! Your note is very helpful. Right now, we process just the "basename" of each image and map that to the basename of the image specified in the metadata inputs (source).

The motivation for using the file basenames was to allow users to create a static representation of their metadata that doesn't include relative or fully-qualified paths, as either of the latter would be cumbersome if one were to move the data.

Perhaps we should check to see whether the images attribute in the metadata contains path demarcators and if so, join on the paths specified? How does that sound?
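Something along these lines, as a sketch rather than the actual implementation:

```python
# Sketch of the proposed check, not pix-plot's actual code: if the metadata
# filenames contain path separators, join images to metadata on their
# relative paths; otherwise keep the current basename-based join.
import os

def join_key(metadata_filenames, image_path):
    """Choose the key used to match `image_path` against the metadata rows."""
    uses_paths = any("/" in name or os.sep in name for name in metadata_filenames)
    return image_path if uses_paths else os.path.basename(image_path)
```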

vdet commented 3 years ago

Hi Douglas,

The motivation for using the file basenames was to allow users to create a static representation of their metadata that doesn't include relative or fully-qualified paths, as either of the latter would be cumbersome if one were to move the data.

They can put all their directories in one big directory if they need to move all the images at once. This is actually what I did in my real-life application, with a root image directory containing subdirectories.

Perhaps we should check to see whether the images attribute in the metadata contains path demarcators and if so, join on the paths specified? How does that sound?

Not sure I understand what you mean. I'd suggest leaving the metadata filename column intact, using it as the image path, and mirroring the user's directory structure in data/thumbs, etc.
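For instance, something like this sketch (paths are just examples, not pix-plot's actual layout) would keep dir_A/img.png and dir_B/img.png distinct:

```python
# Sketch of the mirrored layout suggested above: derive each output path from
# the image's path relative to the user's root image directory, so files with
# the same basename in different folders land in distinct thumbnails.
# The directory names here are illustrative only.
import os

def thumb_path(image_path, image_root, thumb_root="output/data/thumbs"):
    rel = os.path.relpath(image_path, image_root)     # e.g. dir_A/img.png
    out = os.path.join(thumb_root, rel)
    os.makedirs(os.path.dirname(out), exist_ok=True)  # recreate the sub-directory
    return out

print(thumb_path("images/dir_A/img.png", "images"))   # output/data/thumbs/dir_A/img.png
print(thumb_path("images/dir_B/img.png", "images"))   # output/data/thumbs/dir_B/img.png
```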

All the best,

Vincent

duhaime commented 3 years ago

I changed the title of this issue to better reflect the open task...