target / strelka

Real-time, container-based file scanning at enterprise scale

[BUG] Failed to serialize event #419

Closed andrea-matsec closed 8 months ago

andrea-matsec commented 8 months ago

Describe the bug

Strelka fails to serialize events. I believe this happens only when there's a pdf_load_error, but I'm not 100% certain.

Environment details

Steps to reproduce

Unfortunately I can't share the file that reproduces this error, but this is the event that can't be serialized:

{
    'file': {
        'depth': 0, 
        'flavors': {
            'mime': ['application/pdf']
        }, 
        'scanners': ['ScanEntropy', 'ScanExiftool', 'ScanOcr', 'ScanPdf', 'ScanYara'], 
        'size': 220710, 
        'tree': {
            'node': '6d343886-98a2-4258-8d99-9b0be8d4f63a', 
            'root': '6d343886-98a2-4258-8d99-9b0be8d4f63a'
        }
    }, 
    'scan': {
        'entropy': {
            'elapsed': 0.000211, 
            'entropy': 7.9965489035155235
        }, 
        'exiftool': {
            'elapsed': 7.299652,
            'sourcefile': '/dev/shm/tmpp5fr8i53',
            'exiftoolversion': 12.6,
            'filename': 'tmpp5fr8i53',
            'directory': '/dev/shm',
            'filesize': '221 kB',
            'filemodifydate': '2024:01:03 22:49:24+00:00',
            'fileaccessdate': '2024:01:03 22:49:24+00:00',
            'fileinodechangedate': '2024:01:03 22:49:24+00:00',
            'filepermissions': '-rw-------',
            'filetype': 'PDF',
            'filetypeextension': 'pdf',
            'mimetype': 'application/pdf',
            'pdfversion': 1.7,
            'linearized': 'Yes',
            'encryption': 'Standard V5.6 (256-bit)',
            'warning': '[minor] Decryption is very slow for encryption V5.6 or higher',
            'useraccess': 'Print, Modify, Copy, Annotate, Fill forms, Extract, Print high-res'
        },
        'ocr': {
            'elapsed': 0.026397,
            'flags': ['uncaught_exception'],
            'exception': 'Traceback (most recent call last):\\n\\n  File \\"/usr/local/lib/python3.10/dist-packages/strelka-0.0.0-py3.10.egg/strelka/strelka.py\\", line 779, in scan_wrapper\\n    self.scan(data, file, options, expire_at)\\n\\n  File \\"/usr/local/lib/python3.10/dist-packages/strelka-0.0.0-py3.10.egg/strelka/scanners/scan_ocr.py\\", line 29, in scan\\n    data = doc.get_page_pixmap(0).tobytes(\\"png\\")\\n\\n  File \\"/usr/local/lib/python3.10/dist-packages/fitz/utils.py\\", line 922, in get_page_pixmap\\n    return doc[pno].get_pixmap(\\n\\n  File \\"/usr/local/lib/python3.10/dist-packages/fitz/fitz.py\\", line 5447, in __getitem__\\n    raise IndexError(\\"page not in document\\")\\n\\nIndexError: page not in document\\n'
        },
        'pdf': {
            'elapsed': 0.025871,
            'flags': ['pdf_load_error'],
            'images': 0,
            'lines': 0,
            'words': 0,
            'xref_object': set()
        },
        'yara': {
            'elapsed': 0.000709,
            'rules_loaded': 1,
            'matches': ['test']
        }
    }
}

'xref_object': set() looks suspicious to me.
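
To illustrate why the empty set is a plausible culprit, here is a minimal sketch. It assumes the event is serialized with Python's standard json module, which this issue does not confirm. The `event` dict is a trimmed copy of the 'pdf' section above, and `encode_sets` is a hypothetical helper, not part of Strelka.

```python
import json

# Trimmed copy of the 'pdf' section from the event above.
event = {"scan": {"pdf": {"flags": ["pdf_load_error"], "xref_object": set()}}}

# The json module has no encoding for Python sets, so this raises TypeError.
try:
    json.dumps(event)
except TypeError as exc:
    print(f"serialization failed: {exc}")


def encode_sets(obj):
    """Hypothetical json.dumps fallback: emit sets as sorted lists."""
    if isinstance(obj, (set, frozenset)):
        # Sort by repr so mixed-type sets still get a stable order.
        return sorted(obj, key=repr)
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# With the fallback, the same event serializes and the set becomes [].
serialized = json.dumps(event, default=encode_sets)
print(serialized)
```

An alternative with the same effect would be for the scanner to convert the set to a list before adding it to the event. Which approach the actual fix takes is not stated here.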

Expected behavior

The event can be serialized.

Release

phutelmyer commented 8 months ago

Thanks for reporting this @andrea-matsec. I have a fix for this (to be honest, I thought I had already implemented it - must have been a dream).

I'll push it out, along with a new release, tomorrow.