mediachain / mediachain-indexer

search, dedupe, and media ingestion for mediachain

Survey of image dedupe datasets #1

Closed: autoencoder closed this issue 7 years ago

autoencoder commented 8 years ago

Key requirement: each dataset must contain at least 2 different images of each high-level concept, e.g. multiple augmented versions of each image, or multiple images of each real-world object.
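A minimal sketch of how this requirement could be checked once a dataset's ground truth has been parsed. The `labels` mapping (image filename to concept id) and the function name are hypothetical; each dataset below ships its own annotation format, which would have to be converted first.

```python
from collections import defaultdict


def groups_missing_pairs(labels, min_images=2):
    """Return the concept ids that have fewer than `min_images` distinct images.

    `labels` maps image filename -> concept id. This layout is an assumption
    for illustration, not the format of any particular dataset.
    """
    groups = defaultdict(set)
    for image, concept in labels.items():
        groups[concept].add(image)
    return sorted(c for c, imgs in groups.items() if len(imgs) < min_images)


# Toy ground truth: concept "a" has two images, concept "b" only one.
example = {
    "a_orig.jpg": "a",
    "a_blur.jpg": "a",
    "b_orig.jpg": "b",
}
print(groups_missing_pairs(example))
```

An empty result means every concept has at least two images, so the dataset can supply positive pairs for dedupe evaluation.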

California-ND: LINK: http://vintage.winklerbros.net/californiaND.html DESC: An Annotated Dataset For Near-Duplicate Detection In Personal Photo Collections. STATS: 701 photos taken directly from a real user’s personal photo collection.

Copydays Dataset: LINK: http://lear.inrialpes.fr/~jegou/data.php#copydays DESC (this text and the stats describe the INRIA Holidays dataset, which is hosted on the same page as Copydays): The Holidays dataset is a set of images which mainly contains some of our personal holidays photos. The remaining ones were taken on purpose to test the robustness to various attacks: rotations, viewpoint and illumination changes, blurring, etc. STATS: Dataset size: 1491 images in total: 500 queries and 991 corresponding relevant images
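For datasets with query / relevant-image ground truth like the one above, a common way to score a dedupe or retrieval system is mean average precision. The sketch below is a plain, uninterpolated AP; it is not claimed to match any dataset's official evaluation script, and the function names and input layout are assumptions.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked result list against a set of relevant ids."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(results, ground_truth):
    """Mean AP over all queries.

    `results` maps query id -> ranked list of returned image ids;
    `ground_truth` maps query id -> iterable of relevant image ids.
    Both layouts are hypothetical.
    """
    aps = [average_precision(results.get(q, []), rel)
           for q, rel in ground_truth.items()]
    return sum(aps) / len(aps) if aps else 0.0
```

A query whose relevant images all appear at the top of the ranking scores 1.0; relevant images that are missed entirely pull the score down because the sum is divided by the total number of relevant images.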

The Oxford Buildings Dataset: LINK: http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/ DESC: Consists of 5062 images collected from Flickr by searching for particular Oxford landmarks. The collection has been manually annotated to generate a comprehensive ground truth for 11 different landmarks, each represented by 5 possible queries.

Image Manipulation Dataset: LINK: http://www5.cs.fau.de/research/data/image-manipulation/ DESC: The Image Manipulation Dataset is a ground truth database for benchmarking the detection of image tampering artifacts. The idea is to "replay" copy-move forgeries by copying, scaling and rotating semantically meaningful image regions. Additionally, Gaussian noise and JPEG compression artifacts can be added, both on the snippets and on the final tampered images. STATS: 1GB, 48 base images and manipulated versions.

CASIA V2.0: LINK: http://forensics.idealtest.org/ DESC: Focused on splicing detection evaluation. Image splicing is defined as a simple cut-and-paste operation of image regions from one image onto the same or another image without performing post-processing. It is a fundamental operation of tampering. CASIA V2.0 is larger and contains more realistic and challenging fake images, produced by post-processing the tampered regions. STATS: 7491 authentic and 5123 tampered color images, ranging from 240×160 to 900×600 pixels.

IEEE IFS-TC Image Forensics Challenge Dataset: LINK: http://ifc.recod.ic.unicamp.br/fc.website/index.py?sec=5 DESC: The forged images created with copy/pasting operations are created using up-to-date image editing software such as GNU Gimp, Adobe Photoshop CS5 etc. using algorithms such as: Content-Aware Fill and PatchMatch (for copy/pasting); Content-Aware Healing (for copy/pasting and splicing); Clone-Stamp (for copy/pasting); Seam carving (image retargeting); Inpainting (image reconstruction of damaged parts – special case of copy/pasting); Alpha Matting (for splicing). STATS: 2.56GB, 1024 x 768 resolution.

Columbia Uncompressed Image Splicing Detection Evaluation Dataset: LINK: http://www.ee.columbia.edu/ln/dvmm/downloads/authsplcuncmp/ DESC: Copying-and-pasting, or image splicing, is the most common tampering seen today. Although often followed by various post processing techniques, we provide a benchmark set with only the splicing operation so that people can study its effect in a focused way. STATS: In 4cam_auth, there are 183 images, and in 4cam_splc, there are 180. The image sizes range from 757x568 to 1152x768.

Tamper Detection of JPEG Image Due to Seam Modifications Dataset: LINK: http://video.minelab.tw/DETS/ DESC: Images with various levels of seam carving / seam insertion applied. 1. Compress the images in JPEG at QF75. 2. Decompress the images before seam modifications. 3. Retarget the images using seam modifications at each tampering rate and sort them into separate datasets, namely 1%, 2%, 5%, 10%, 20%, 30%, 50%, and a mixed set. STATS: ~2000 original images, plus derivative images with different combinations of manipulations applied.

CoMoFoD - New database for copy-move forgery detection: LINK: http://www.vcl.fer.hr/comofod/comofod.html DESC: We applied several types of transformations, and grouped images in 5 categories according to applied transformation: 1. translation - a copied region is only translated to a new location without performing any transformation; 2. rotation - a copied region is rotated and translated to a new location; 3. scaling - a copied region is scaled and translated to a new location; 4. distortion - a copied region is distorted and translated to a new location; 5. combination - two or more transformations are applied on a copied region before moving it to a new location. One of six postprocessing methods is applied on each image: 1. "JC" for JPEG compression; 2. "NA" for noise adding; 3. "IB" for image blurring; 4. "BC" for brightness change; 5. "CR" for color reduction; 6. "CA" for contrast adjustments. STATS: 512 x 512, 200 image sets, 40 images per transformation type, total number of images with postprocessed images = 10400
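To make the simplest category above (translation) concrete, here is a toy sketch of a copy-move forgery on a grayscale image stored as a list of rows. It is only an illustration of the operation, not how CoMoFoD or any other dataset here was generated; the function name and arguments are hypothetical.

```python
def copy_move(image, top, left, height, width, new_top, new_left):
    """Copy a rectangular region and paste it at a new location.

    Returns (forged, mask): `forged` is a new image with the region pasted,
    `mask` marks the pasted pixels with 1. The input image is not modified.
    """
    rows, cols = len(image), len(image[0])

    def inside(t, l):
        return 0 <= t and t + height <= rows and 0 <= l and l + width <= cols

    if not inside(top, left) or not inside(new_top, new_left):
        raise ValueError("region does not fit inside the image")

    # Snapshot the source patch first so overlapping regions behave sanely.
    patch = [row[left:left + width] for row in image[top:top + height]]
    forged = [list(row) for row in image]
    mask = [[0] * cols for _ in range(rows)]
    for dy in range(height):
        for dx in range(width):
            forged[new_top + dy][new_left + dx] = patch[dy][dx]
            mask[new_top + dy][new_left + dx] = 1
    return forged, mask


# 4x4 toy image whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
forged, mask = copy_move(image, 0, 0, 2, 2, 2, 2)
```

The original and forged images form a near-duplicate pair, and the mask plays the role of the ground-truth annotation a forgery dataset would supply.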

autoencoder commented 7 years ago

Moving to research docs.