plazi / Biodiversity-Literature-Repository

covers the creation, maintenance, and upload of content to the BLR

BLR content analysis: comparing images #29

Open myrmoteras opened 6 years ago

myrmoteras commented 6 years ago

via Miroslav Valan

As promised, the exploratory analysis is now complete and the results are congruent with expectations. We used ca. 8,000 randomly chosen samples. The results are attached. Please note there are several clusters (circles), of which one is nicely separated (red), representing mostly corrupted images. The other two bigger clusters are line drawings (blue) and regular plates (green); the magenta cluster contains maps, trees, bar charts, etc. (rectangles). There are quite a few interesting sub-clusters, such as crabs and fishes (two upper green), sponges(?) and Lepidoptera (small orange in the middle), and someone has plates with a distinct pattern visible in the lower green rectangle.

There is much more to be seen, so please follow the link below and download the big image, where instead of data points I plotted resized images, so you can get a better understanding of why and how the model clusters the images.

We were not able to use all the samples. Some we could not download and others seem to be corrupted, so I skipped them and ran the analysis without them.

I can do the same with the whole dataset, but it will take a little more time. The takeaways are:

  1. We can easily fish out and handle corrupted images.
  2. Plates considered harmful: plates are not a machine-readable format, since extracting an individual object from a plate requires a specific skill set. Putting images into plates is an extra step for authors that makes things worse for machine readability. I assume that having the individual objects separate, but linked to the same "figure group", would open this dataset up to various machine learning / statistical approaches.
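The first takeaway can be sketched programmatically. A minimal sketch, assuming Pillow is installed and the sampled images have been downloaded to a local directory (the function name and directory layout are illustrative, not part of any existing pipeline):

```python
from pathlib import Path

from PIL import Image  # assumes Pillow is installed


def find_corrupted(image_dir):
    """Return the paths of files that cannot be fully decoded as images."""
    bad = []
    for path in sorted(Path(image_dir).iterdir()):
        try:
            with Image.open(path) as im:
                im.load()  # force a full decode, not just a header check
        except Exception:
            bad.append(path)
    return bad
```

Files that fail the full decode (truncated downloads, broken encodings) end up in the returned list and can then be re-fetched or reported back for fixing.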

Links: https://drive.google.com/file/d/1HJWmTQqt_DvEXJbdmEob-CIq7M_z4j7-/view?usp=sharing

myrmoteras commented 6 years ago

Just a question regarding the few broken images: is it possible to get the links to those images so we can fix them asap? It looks as if quite a few groups are mixed bags: beetles/ants, beetles/spiders, ants/frogs. This confirms nicely what we have been discussing about individual figures vs. plates.

myrmoteras commented 6 years ago

via @punkish

Nice work.

One thing I’d love is to tag each image with tags such as “line drawings”, the colors such as “blue”, “red”, names such as “crabs”, “fishes”, “beetles”, etc. But, I’d like to build an inverted index, so perhaps the best is to simply create a CSV file like so

imageuri: tag, tag, tag
imageuri: tag, tag, tag
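A minimal sketch of how such a file could be turned into the inverted index, assuming one `imageuri: tag, tag, tag` entry per line (the URIs and tags below are made up for illustration):

```python
from collections import defaultdict


def build_inverted_index(lines):
    """Map each tag to the set of image URIs carrying it."""
    index = defaultdict(set)
    for line in lines:
        # split on the first ": " so the colon inside the URI scheme survives
        uri, _, tags = line.partition(": ")
        for tag in tags.split(","):
            tag = tag.strip()
            if tag:
                index[tag].add(uri)
    return index


# hypothetical entries in the suggested format
entries = [
    "https://zenodo.org/record/1001/files/fig1.png: line drawing, blue, beetles",
    "https://zenodo.org/record/1002/files/fig2.png: plate, blue, crabs",
]
index = build_inverted_index(entries)
print(sorted(index["blue"]))  # both URIs carry the "blue" tag
```

With the index inverted like this, a query such as "blue beetles" is just the intersection of `index["blue"]` and `index["beetles"]`.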

and then I can create the metadata tables I need that I can then plug into Zenodeo (http://zenodeo.punkish.org/v1) which powers Ocellus (http://ocellus.punkish.org/). That way Ocellus will get these magical powers to see all “blue beetles” and so on.

Let me know if you can generate the above file for all the images.

Many thanks,

Puneet

myrmoteras commented 6 years ago

via Miroslav

Hi Puneet,

I understand. This could be really interesting and fun to have and I think it is doable. However, to annotate 160 000 images is not an easy task.

First of all, plates make this task extremely challenging. I am quite confident the separation would be much nicer if each image were a single object. One way to tackle such a task is to inspect and annotate some of these sub-clusters manually and feed that information back to the model iteratively. Another would be to cut the individual objects out of the images and run the analysis on them, which could make much more sense and therefore simplify the task. Both are time-consuming and would require a lot of manual work. This could easily require months of effort, which I cannot commit. To annotate such datasets with multiple labels I would recommend using Amazon Mechanical Turk (https://www.mturk.com/mturk/welcome).

What I can generate instead is x,y coordinates for each of the images, and you can then display them in 2D space so everyone can zoom in and look at them. Even better would be to allow visitors to select multiple images (say, clusters) and annotate them with some predefined labels, but also give them the freedom to define new ones.
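A minimal sketch of that coordinates file, assuming the 2D embedding has already been computed; the `imageuri,x,y` column layout is illustrative, not an agreed format:

```python
import csv


def write_coordinates(uris, embedding, out_path):
    """Write one 'imageuri,x,y' row per image to a CSV file.

    `embedding` is any sequence of (x, y) pairs, one per URI,
    e.g. the rows of a t-SNE result.
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["imageuri", "x", "y"])
        for uri, (x, y) in zip(uris, embedding):
            writer.writerow([uri, x, y])
```

A viewer could then load this file, plot the points, and let visitors zoom into and select regions of the embedding.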

myrmoteras commented 6 years ago

via @gsautter: first of all, this is a really cool visualization. Thanks for creating it, even though I'd love to know what the X and Y axes are.

Regarding the broken images: while they do still link to the deposition whose file is the PDF they were extracted from, does said PDF deposition also link back to them? If not, these images are "orphaned" depositions that were replaced with newer ones holding the repaired images; we just could not delete the broken ones ... On a different note, what distinguishes the latter cluster? It might well be another bunch of criteria helpful in detecting image decoding problems.

Regarding plates: unfortunately, a plate is the finest granularity at which a caption (which becomes the description in the Zenodo deposition) can be assigned without ambiguity. It is quite possible to also make Zenodo depositions for the individual bitmap images that make up a plate (if the author(s) didn't merge them into one large bitmap, that is; just say the word, guys ;-), but those depositions could inevitably only come with a description (the plate caption) that partially applies to them.

myrmoteras commented 6 years ago

via @punkish

Miroslav,

I appreciate the difficulty of the task you describe, but it applies only if one sets out to do everything you describe. I am proposing something far less ambitious: tag programmatically what you can, and leave the rest alone. This is one of those cases where even an incompletely tagged set, with proper disclaimers, will be more fun and instructive to play with than not doing it at all. I certainly would not bother with the plates. As Guido suggests, we have what we have.

I have no interest in using Mechanical Turk for now, and I can't really see the use case for displaying x,y coords in 2D space. From what I imagine you are describing, that is not what I want to add to Ocellus, at least for now. My focus for Ocellus is to keep it very simple and excellent at what it does, that is, a viewer for the images: extremely simple and mobile-first.

myrmoteras commented 6 years ago

@valanm do you have a brief introduction to the approach and technology you use in this analysis?

myrmoteras commented 6 years ago

Hi,

The plot is made with t-SNE, so the axes are not easily interpretable (see https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf). Unlike PCA, which is linear, t-SNE is a non-linear dimensionality reduction technique that prioritizes keeping the low-dimensional representations of very similar data points close together. In short, it is just a visualization technique that attempts to preserve local structure, and we cannot interpret it quantitatively.
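The technique can be sketched with scikit-learn; the random vectors below stand in for per-image feature vectors (the thread does not say which features were actually extracted, so this is only an illustration of the projection step):

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is installed

# Placeholder for per-image feature vectors, e.g. CNN activations.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))

# t-SNE projects the 64-dimensional features to 2D, trying to keep
# very similar points close together; unlike PCA components, the
# resulting axes carry no quantitative meaning.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
print(embedding.shape)  # one (x, y) point per image
```

Fixing `random_state` makes the layout reproducible; since t-SNE's cost function is non-convex, different seeds give different but equally valid arrangements.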

Regarding the broken images: I can provide you with the list of corrupted images after I process all images. We also had issues with downloading images both using zenodeo and the script that Viktor wrote. I assume he will write you about it soon.

Regarding plates: here is an example with the BDJ images https://drive.google.com/file/d/108gU6mtXfJIVtXV8L48LSaqzXglFlW3H/view

Note that the model did a significantly better job and that more than half of the dataset is nicely separated.

Cheers, M


myrmoteras commented 6 years ago

The comparison of the BDJ single-figure analysis vs. the hotchpotch of plates and figures at BLR speaks for itself. This already makes the case for publishing single figures, not composite figures (plates).

We should think of writing this up, first to show this stunning analysis, then as a suggested (best) practice for publishers.

Also, keep in mind that many of those images are derivatives of images from ongoing digitization projects, e.g. iDigBio, Moore Foundation plant types/JSTOR, ICEDIG/DiSSCo, and thus having links (related items) to the original source files will be important and probably helpful for further analyses.

I am sure a publication in the right place will attract nerds to start this sort of image analysis and clustering, and eventually to integrate it into a system that allows one to identify organisms.

There are also funny groupings

I assume these analyses are without training?! If training were involved, how much would the resolution improve, e.g. to get the histograms into one group?