sepinf-inc / IPED

IPED Digital Forensic Tool. It is an open source software that can be used to process and analyze digital evidence, often seized at crime scenes by law enforcement or in a corporate investigation by private examiners.

Support for NIST CAID (Child Abuse Image Database) hash dataset #1127

Closed lfcnassif closed 2 years ago

lfcnassif commented 2 years ago

@herrmannpchw pointed this out to me today: https://www.nist.gov/itl/ssd/software-quality-group/national-software-reference-library-nsrl/nsrl-download/non-rds-hash

30 million hashes! I downloaded a subset and it seems to contain hashes from ProjectVIC, but the latter has about 13 million hashes, if I remember correctly, and the NIST set seems to use different classes than ProjectVIC. Maybe there are other source datasets. I'm still not sure whether all of them are alert hashes or whether some are ignorable hashes.

wladimirleite commented 2 years ago

Hi @lfcnassif!

I was not familiar with this hash set either. It looks like a great idea to use it, especially coming from a source as reliable as NIST. One thing caught my attention: the table with the number of files per type says there are 20.4M PNGs (out of 29.1M). PNG is not the most common format we find in CSAM. In our internal hash set (LED), JPEGs are about 93% of the 3M files we have, while there are only ~10K PNGs (~0.3%). Of course, it may contain hashes to be filtered out (non-CSAM), as you pointed out.

Anyway, as you are already working on many other things, I can take a closer look at this hash set if you think it is a good idea. Initially, I would like to check the intersection between this hash set and the ones we are already using (ProjectVIC, LED, NIST RDS, internal "Common Files"). Then I would try to understand its format and modify the HashDB tool to import this type of file. Finally, I would process realistic evidence to check what kind of hits we are going to see.

lfcnassif commented 2 years ago

Thank you very much @tc-wleite! I would really appreciate it if you could help with this. I just created this ticket so I don't forget, as I'm at a forensics event for the whole week, as you know.

I think the first step would be to find documentation about this hash set, such as the classification model used (alert and/or ignorable classes) and maybe the source sets. Then I would proceed with your plan. Thank you again!

lfcnassif commented 2 years ago

@tc-wleite I just saw that they also distribute MD5 hashes of the first 4k and 8k bytes of files on that same page. Did you share your idea with them :-)?

lfcnassif commented 2 years ago

Maybe we could also generalize the partial hash based carving algorithm to handle their block sizes in another ticket.
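
For reference, a hash over the first 4k or 8k bytes of a file, like the block hashes mentioned above, could be computed roughly as in the sketch below. It assumes that "4k" and "8k" mean 4096 and 8192 bytes and that files shorter than the block are hashed whole; neither assumption is confirmed by NIST documentation. The class and method names are hypothetical, and this is not IPED's carving code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of IPED: MD5 of the leading bytes of a file.
final class PartialHash {

    /** Returns the lowercase hex MD5 of the first blockSize bytes (or of the whole file if it is shorter). */
    static String md5OfFirstBytes(Path file, int blockSize) throws IOException, NoSuchAlgorithmException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = in.readNBytes(blockSize); // reads up to blockSize bytes (Java 11+)
            return toHex(MessageDigest.getInstance("MD5").digest(head));
        }
    }

    /** Converts a byte array to a lowercase hexadecimal string. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        // Assumed block sizes: 4096 and 8192 bytes.
        for (int size : new int[] {4096, 8192}) {
            System.out.println(size + "," + md5OfFirstBytes(file, size));
        }
    }
}
```

Taking the block size as a parameter is what would let a partial-hash based matcher handle more than one block size.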

dwhitenist commented 2 years ago

If you have questions about any of the NSRL sets, contact us at nsrl(at)nist.gov

wladimirleite commented 2 years ago

I couldn't find any document with a description.

Analyzing the hash set content, the first thing I noticed was that there is a "category" property, but all items have "8". Second, compared to the other hash sets we use, it doesn't seem to contain hashes of CSAM. Here are the intersection numbers:

  Hashes:     29,084,504
------------------------
NSRL RDS:      7,856,880
     LED:              0
    CFPF:        384,872
     VIC:        567,605
          --------------
          Cat 0: 567,535
          Cat 1:       1
          Cat 2:       7
          Cat 3:      62
------------------------
 Unknown:     21,036,841

I also tried to make the comparison with previous versions, in case the latest has some issue, but results were similar.
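
An intersection count like the one above can be sketched as follows. This is only an illustration, not the code used for the numbers in this comment: it assumes the hashes are available in memory as hex strings, that the reference sets are already lowercase, and the class and method names are hypothetical. A hash can appear in more than one reference set, so the per-set counts need not add up to the total.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: counts how many hashes of one list occur in each reference set.
final class HashSetOverlap {

    /** Hashes are compared case-insensitively, so they are trimmed and lowercased first. */
    static String normalize(String hash) {
        return hash.trim().toLowerCase(Locale.ROOT);
    }

    /**
     * Returns, per reference set name, the number of input hashes found in that set,
     * plus an "Unknown" entry for hashes found in none of them.
     * Reference sets are assumed to contain lowercase hex strings.
     */
    static Map<String, Integer> countOverlaps(Collection<String> hashes,
                                              Map<String, Set<String>> referenceSets) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String name : referenceSets.keySet()) {
            counts.put(name, 0);
        }
        counts.put("Unknown", 0);
        for (String raw : hashes) {
            String hash = normalize(raw);
            boolean found = false;
            for (Map.Entry<String, Set<String>> entry : referenceSets.entrySet()) {
                if (entry.getValue().contains(hash)) {
                    counts.merge(entry.getKey(), 1, Integer::sum);
                    found = true;
                }
            }
            if (!found) {
                counts.merge("Unknown", 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy data only; real sets would be loaded from the hash databases.
        Map<String, Set<String>> refs = new LinkedHashMap<>();
        refs.put("SetA", Set.of("aa", "bb"));
        refs.put("SetB", Set.of("bb"));
        System.out.println(countOverlaps(List.of("AA", "bb", "cc"), refs));
    }
}
```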

I sent a message to NSRL/NIST asking about this hash set and already got the response below (many thanks @dwhitenist!), which confirmed my observation based on our other datasets.

The NSRL CAID dataset contains the set of all multimedia file hashes in the NSRL database, which are benign in nature, with the main use case for the CAID dataset meant to be used to weed out known multimedia files while searching seized devices, which may contain child exploitation media.

wladimirleite commented 2 years ago

I just imported this hash set (made a quick external conversion to CSV and then used "iped-hashdb.jar"). I will run a test with a few pieces of forensic evidence to check whether it may be useful for ignoring multimedia files.

lfcnassif commented 2 years ago

Thank you @tc-wleite for your analysis! Could you share with me the hashes present in ProjectVIC? I'll contact them to suggest a review of the original files in their dataset. Maybe we should also send those hashes to NIST asking for a double check too.

lfcnassif commented 2 years ago

Thank you @tc-wleite for your analysis! Could you share with me the hashes present in ProjectVIC?

I mean the hashes in category != 0

wladimirleite commented 2 years ago

Thank you @tc-wleite for your analysis! Could you share with me the hashes present in ProjectVIC?

I mean the hashes in category != 0

Sure, that was my understanding.

dwhitenist commented 2 years ago

Analyzing the hash set content, the first thing I noticed was that there is a "category" property, but all items have "8".

Category "8" is used in the UK to designate benign content; category "0" is used in US for that purpose and NSRL will be addressing that with the upcoming release out in a few weeks.

wladimirleite commented 2 years ago

Thank you @tc-wleite for your analysis! Could you share with me the hashes present in ProjectVIC? I'll contact them to suggest a review of the original files in their dataset. Maybe we should also send those hashes to NIST asking for a double check too.

I think the following hashes could be moved to category 0 in ProjectVIC (I found them here and checked their content).

Current VIC Category: 2
124d94d27c159a915c1a5c0ad8911682
2f3d01cd5deec078670774e2e249fe59
41c19b1f2c4bfee8febf9000da3842cf
71ad31efd4e749a2e23b706c15db73ae

For example, 71AD31EFD4E749A2E23B706C15DB73AE is [image: 71AD31EFD4E749A2E23B706C15DB73AE]

wladimirleite commented 2 years ago

I ran a test processing 4 HDD images (each from a different case) and 4 cell phone extractions (also from different cases). Data carving was disabled. The number of hits for each hash set below considers only unique hashes (duplicate filter was enabled). The total number of unique items was ~3M.

Hash Set       Number of Hits   Exclusive Hits
----------------------------------------------
NIST CAID              33,418               21
Project VIC            95,346            1,394
NSRL RDS              179,877            6,063
CFPF                  557,307          340,922

"Exclusive Hits" means hits only in that hash set (i.e. not present in the other 3 hash sets). For ProjectVIC, only hits from category 0 were counted. "CFPF" is our new internal "Common Files" hash set.

Each hash set has a different purpose, so a direct comparison of the numbers does not make much sense. However, as we are already supporting Project VIC and NSRL RDS, I am not sure if it is worth handling NIST CAID's JSON format inside the HashDB importing tool, at least for now (we can always review this again in the future). Besides that, if some user wants to use this hash set, it is possible to convert its JSON files to a CSV and then import it (as I did for this test). What do you think, @lfcnassif?
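
The "Exclusive Hits" definition above (hits found in one hash set and in none of the others) can be expressed as in the sketch below. It is a hypothetical illustration that assumes the unique hit hashes of each set are already collected in memory; it is not how IPED computed the numbers in this comment.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: counts, per hash set, the hits that no other hash set has.
final class ExclusiveHits {

    static Map<String, Integer> exclusiveHits(Map<String, Set<String>> hitsBySet) {
        // First pass: in how many hash sets does each hit appear?
        Map<String, Integer> occurrences = new HashMap<>();
        for (Set<String> hits : hitsBySet.values()) {
            for (String hash : hits) {
                occurrences.merge(hash, 1, Integer::sum);
            }
        }
        // Second pass: a hit is exclusive when it appears in exactly one hash set.
        Map<String, Integer> exclusive = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : hitsBySet.entrySet()) {
            int count = 0;
            for (String hash : entry.getValue()) {
                if (occurrences.get(hash) == 1) {
                    count++;
                }
            }
            exclusive.put(entry.getKey(), count);
        }
        return exclusive;
    }

    public static void main(String[] args) {
        // Toy data only.
        Map<String, Set<String>> hits = new LinkedHashMap<>();
        hits.put("SetA", Set.of("h1", "h2", "h3"));
        hits.put("SetB", Set.of("h3", "h4"));
        System.out.println(exclusiveHits(hits));
    }
}
```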

wladimirleite commented 2 years ago

A random sample (sorted by hash) of NIST CAID hits in the test case.

[image]

lfcnassif commented 2 years ago

Thank you @tc-wleite for your tests!

Besides that, if some user wants to use this hash set, it is possible to convert its JSON files to a CSV and then import it (as I did for this test). What do you think, @lfcnassif?

I also thought about this approach and I think it is a reasonable solution for now. We can leave this issue open to implement this direct support in the future. @tc-wleite if you could share your conversion code somewhere and post the link here, I think it could help users interested in importing NIST CAID hashes to IPED hash database.

lfcnassif commented 2 years ago

For example, 71AD31EFD4E749A2E23B706C15DB73AE is [image: 71AD31EFD4E749A2E23B706C15DB73AE]

And could you send me the remaining hash hits in category > 0 of ProjectVIC?

wladimirleite commented 2 years ago

@tc-wleite if you could share your conversion code somewhere and post the link here, I think it could help users interested in importing NIST CAID hashes to IPED hash database.

For this test, I didn't implement a proper JSON parser. Just wrote a quick-and-dirty "string based" conversion code. Anyway, here it is:

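import java.io.*;

// Quick-and-dirty, string-based conversion of the NSRL CAID JSON files to a CSV
// that can be imported with iped-hashdb.jar. This is not a real JSON parser:
// it assumes each record (with its MediaID, MD5 and SHA1) sits on a single line.
public class NistCaidToCsv {
    public static void main(String[] args) throws Exception {
        File inputDir = new File("/path/to/NSRL-CAID-JSONs");
        File outputFile = new File(inputDir, "nsrl-caid.csv");
        File[] files = inputDir.listFiles();
        if (files == null) throw new IOException("Cannot list directory: " + inputDir);
        // Write directly to the CSV file instead of redirecting System.out.
        try (PrintStream out = new PrintStream(new BufferedOutputStream(new FileOutputStream(outputFile)))) {
            out.println("MD5,SHA1,set");
            for (File file : files) {
                if (file.getName().endsWith(".json")) convert(file, out);
            }
        }
    }

    private static void convert(File file, PrintStream out) throws IOException {
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Only lines that look like a full record are converted.
                if (!line.contains("MediaID") || !line.contains("SHA1")) continue;
                // Drop the quotes, then split on separators to get alternating keys and values.
                String[] tokens = line.replace('"', ' ').split("[,:]");
                String md5 = "", sha1 = "";
                // The i + 1 bound guarantees that a value follows each key.
                for (int i = 0; i + 1 < tokens.length; i++) {
                    String token = tokens[i].trim();
                    if (token.equals("MD5")) md5 = tokens[++i].trim();
                    else if (token.equals("SHA1")) sha1 = tokens[++i].trim();
                }
                // A missing MD5 leaves the first column empty, so the columns stay aligned.
                out.println(md5 + "," + sha1 + ",NistCAID");
            }
        }
    }
}
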
wladimirleite commented 2 years ago

And could you send me the remaining hash hits in category > 0 of ProjectVIC?

Sure! I only collected hashes for categories 1 and 2. I will run the comparison again including category 3 and will send the hashes to you privately.

EDIT: Just sent the file with NIST CAID and Project VIC (categories 1, 2 and 3) intersection.

lfcnassif commented 2 years ago

Category "8" is used in the UK to designate benign content; category "0" is used in US for that purpose and NSRL will be addressing that with the upcoming release out in a few weeks.

Hi @dwhitenist. I'm testing @tc-wleite's implementation to support importing NIST CAID (#1177). Just a quick question, do you at NIST have any plans to include non ignorable/alert hashes into CAID hash set?

lfcnassif commented 2 years ago

@tc-wleite's import tool also found several entries in latest NIST CAID 20220526 version with invalid hash lengths, e.g. for MD5 and SHA1 hashes, I manually confirmed that with some records, I think NIST should take a look at this...

lfcnassif commented 2 years ago

@tc-wleite's import tool also found several entries in latest NIST CAID 20220526 version with invalid hash lengths, e.g. for MD5 and SHA1 hashes, I manually confirmed that with some records, I think NIST should take a look at this...

Sorry, that was with version 20220301, my fault, I'll run the tool on the latest version now.

dwhitenist commented 2 years ago

Category "8" is used in the UK to designate benign content; category "0" is used in US for that purpose and NSRL will be addressing that with the upcoming release out in a few weeks.

Hi @dwhitenist. I'm testing @tc-wleite's implementation to support importing NIST CAID (#1177). Just a quick question, do you have any plans to include non ignorable/alert hashes into CAID hash set?

No, the NSRL is not in a position to collect source material that would be in an "alert" category.

lfcnassif commented 2 years ago

No, the NSRL is not in a position to collect source material that would be in an "alert" category.

Ok, thank you very much for your fast answer!

wladimirleite commented 2 years ago

@tc-wleite's import tool also found several entries in latest NIST CAID 20220526 version with invalid hash lengths, e.g. for MD5 and SHA1 hashes, I manually confirmed that with some records, I think NIST should take a look at this...

Sorry, that was with version 20220301, my fault, I'll run the tool on the latest version now.

Using the latest version, hash lengths seem to be fine. In a previous version, I noticed that SHA-1 hashes seemed to have more characters than expected.
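
A length check of this kind can be sketched as below. An MD5 hash is 32 hexadecimal characters and a SHA-1 hash is 40, so anything else can be flagged before import. The class and method names are hypothetical, and this is not necessarily how the IPED import tool performs its validation.

```java
import java.util.regex.Pattern;

// Hypothetical helper: checks that a string has the shape of an MD5 or SHA-1 hex digest.
final class HashFormat {

    // MD5 is 128 bits (32 hex characters); SHA-1 is 160 bits (40 hex characters).
    private static final Pattern MD5 = Pattern.compile("[0-9a-fA-F]{32}");
    private static final Pattern SHA1 = Pattern.compile("[0-9a-fA-F]{40}");

    static boolean isValidMd5(String value) {
        return value != null && MD5.matcher(value).matches();
    }

    static boolean isValidSha1(String value) {
        return value != null && SHA1.matcher(value).matches();
    }

    public static void main(String[] args) {
        // Each argument is reported as MD5, SHA1 or INVALID.
        for (String value : args) {
            String kind = isValidMd5(value) ? "MD5" : isValidSha1(value) ? "SHA1" : "INVALID";
            System.out.println(value + " -> " + kind);
        }
    }
}
```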

lfcnassif commented 2 years ago

Using the latest version, hash lengths seem to be fine.

Yes, I just confirmed all hashes from latest version were imported fine, sorry for the noise.

lfcnassif commented 2 years ago

Resolved by #1177