richardlehane / siegfried

signature-based file format identification
http://www.itforarchivists.com/siegfried
Apache License 2.0
223 stars 30 forks source link

Illegal control character in JSON output #165

Closed Slange-Mhath closed 2 years ago

Slange-Mhath commented 3 years ago

Hey @richardlehane,

Is it possible that control characters in the JSON output are not escaped correctly? Running Siegfried with a file that has for instance the name "/tmp/test_folder/^Test_file 1" (linux output might look like "/tmp/test_folder/?Test.log") seems to output an invalid JSON.

A way to replicate that error would be to create a new text file with a control character inside and run a scan with sf -json. The JSON output will contain the control character, which might result in an invalid JSON format.
Not sure if this might affect other output formats (like YAML) as well.

richardlehane commented 3 years ago

When I read this, I thought ... of course this can't be possible, I must be using golang's json package to write safe json output, and surely that library is thoroughly battle-tested. Well... I just checked the code and, embarrassingly, the json output is hand-crafted & indeed this is very definitely possible! There is some "sanitisation" of file names in the output (https://github.com/richardlehane/siegfried/blob/master/pkg/writer/writer.go#L227) but evidently not enough. I'll aim to fix this for the next release

Slange-Mhath commented 3 years ago

Ha! Classic (and not embarrassing at all!) Thank you so much for this. Really appreciate it!

ross-spencer commented 3 years ago

I can't recreate this with the folder structure below and command sf -json test_data | jsonlint or sf -json test_data | python -m json.tool:

ross-spencer:/tmp$ ls -la test_folder
total 92
drwxrwxr-x  2 ross-spencer ross-spencer  4096 Jun 30 17:09  .
drwxrwxrwt 26 root         root         73728 Jun 30 17:10  ..
-rw-rw-r--  1 ross-spencer ross-spencer     0 Jun 30 17:02 '\afile'
-rw-rw-r--  1 ross-spencer ross-spencer     0 Jun 30 17:05 '\atest'
-rw-rw-r--  1 ross-spencer ross-spencer     5 Jun 30 13:17 '?file'
-rw-rw-r--  1 ross-spencer ross-spencer     5 Jun 30 13:17 '^file'
-rw-rw-r--  1 ross-spencer ross-spencer     0 Jun 30 17:09 '^M'
-rw-rw-r--  1 ross-spencer ross-spencer     0 Jun 30 13:18  @test
-rw-rw-r--  1 ross-spencer ross-spencer     5 Jun 30 13:17 '^?test_file'
-rw-rw-r--  1 ross-spencer ross-spencer     0 Jun 30 17:07 '^Test_file'
-rw-rw-r--  1 ross-spencer ross-spencer     0 Jun 30 17:08 '?Test.log'

Siegfried then outputs the filenames:

    "filename": "test_folder/?Test.log",
    "filename": "test_folder/?file",
    "filename": "test_folder/@test",
    "filename": "test_folder/\\afile",
    "filename": "test_folder/\\atest",
    "filename": "test_folder/^?test_file",
    "filename": "test_folder/^M",
    "filename": "test_folder/^Test_file",
    "filename": "test_folder/^file",

That aside though I wanted to point y'all at the test here I wrote recently which might be useful to copy of modify to simulate the error during the fix/refactor:

https://github.com/richardlehane/siegfried/blob/6cd46fe8d61f7a306b09ad78dd1a43097fad0389/cmd/sf/wikidata_test.go#L260-L272

jamesmooneybodleian commented 3 years ago

I suspect how control characters are shown on the filesystem differs with operating system.

On RedHat 7, when creating test files with control characters using Control-v then Control-a (Control-m, Control-o) they appear like this:

-rw-r--r-- 1 root root 0 Jun 30 18:06 ?testing_control_character_a -rw-r--r-- 1 root root 0 Jun 30 18:09 ?testing_control_character_m -rw-r--r-- 1 root root 0 Jun 30 18:05 ?testing_control_character_o

siegfried_json_output.txt

Attached json output from siegfried which shows the included control characters.

However on Ubuntu, a control character named file looked like this: -rw-rw-r-- 1 root root 0 Jun 30 18:08 ''$'\017''testing'

siegfried_json_output_ubuntu.txt Attached ubuntu json output, also includes control character.

richardlehane commented 3 years ago

for reference, here is how the standard library escapes json strings: https://golang.org/src/encoding/json/encode.go?s=8400:8459#L1028 Key thing missing is escaping of bytes < 0x20

ross-spencer commented 3 years ago

With some time today I used those tests mentioned above to (I believe) create a test that confirms the issue generically (i.e. not contingent on OS differences) that can be used to develop from here: https://github.com/richardlehane/siegfried/pull/167 (it tests and passes for true-negatives and passes, and true-positives and passes, so it will fail when the issue is fixed - I figured a "pass" would generate less noise although I appreciate it might be less intuitive).