results: define output format/schema

mr-tz commented 1 year ago

to store and exchange results we'll need a new output schema, likely json

the UI will render this data (or parts of it, when they become available although this should be quick)

again, likely an array of objects (combining all other keys from the databases?) should work here

ooprathamm commented 5 months ago

@mr-tz I have worked on adding a new format to parse output JSON back into capa in the past [PR_#1396]. Can I look into this?

mr-tz commented 5 months ago

Sounds great, please take a look and let's discuss if you have any questions or a design draft.

ooprathamm commented 5 months ago

"(combining all other keys from the databases?) should work here"

Could you please shed some light on this one?

williballenthin commented 5 months ago

QS uses a bunch of embedded databases to provide context about strings. Things like prevalence, library, version, etc. So all the information from each database should be merged into records about each recovered string.
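
A minimal sketch of that merge, just to make the idea concrete. The lookup functions, their names, and the sample values are hypothetical stand-ins for the embedded databases, not the actual QS code:

```python
from typing import Any, Callable, Dict, List

# Hypothetical lookups: each stands in for one embedded database and
# returns the context it knows about a string (or an empty dict).
def lookup_prevalence(s: str) -> Dict[str, Any]:
    common = {"!This program cannot be run in DOS mode."}
    return {"prevalence": "common"} if s in common else {}

def lookup_library(s: str) -> Dict[str, Any]:
    winapi = {"VirtualQuery": "kernel32.dll"}  # illustrative sample data
    return {"library": winapi[s]} if s in winapi else {}

DATABASES: List[Callable[[str], Dict[str, Any]]] = [lookup_prevalence, lookup_library]

def build_record(s: str) -> Dict[str, Any]:
    """Merge the context from every database into one record for the string."""
    record: Dict[str, Any] = {"string": s}
    for lookup in DATABASES:
        record.update(lookup(s))
    return record

records = [build_record(s) for s in ["VirtualQuery", "hello"]]
```

A string that no database knows about simply ends up with a record that holds only the string itself.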

ooprathamm commented 5 months ago

@williballenthin @mr-tz for further discussion and inputs, I have created a new PR #972 :)

mr-tz commented 5 months ago

Hi @ooprathamm, pulling the discussion to this issue.

On a higher design level we'll have to see how we want to deal with structure vs. tagged strings vs. other functionality. Ideally, we can decouple the storage and logic a bit. The current POC implementation is quite elegant but IMO combines multiple features, potentially complicating further work. On the other hand, we may keep the extraction logic and just change the resulting document.

In my head I currently have something like (based on some of your work, here, thanks!):

{
    "strings": {
        "static_strings": [
            {
                "string": {
                    "encoding": "ascii",
                    "slice": {
                        "range": {
                            "length": 40,
                            "offset": 77
                        }
                    },
                    "string": "!This program cannot be run in DOS mode."
                },
                "structure": "pe.header",
                "tags": [
                    "#common"
                ]
            },
            {
                "string": {
                    "encoding": "ascii",
                    "slice": {
                        "range": {
                            "length": 12,
                            "offset": 11644
                        }
                    },
                    "string": "VirtualQuery"
                },
                "structure": "import table",
                "tags": [
                    "#winapi",
                    "#common"
                ]
            }
        ]
    }
}

And/or we add a meta section storing the optional layout (PE, ELF) of a file.

This may require further discussion and be a larger effort but I'd be curious to hear your thoughts.
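
For illustration, the document above could be modeled with plain dataclasses and serialized with the standard library. This is only a sketch: the class names and the `to_document` helper are assumptions made for the example, not existing code.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

# Hypothetical class names; the fields mirror the JSON keys proposed above.
@dataclass
class Range:
    length: int
    offset: int

@dataclass
class Slice:
    range: Range

@dataclass
class String:
    encoding: str
    slice: Slice
    string: str

@dataclass
class StaticString:
    string: String
    structure: str
    tags: List[str] = field(default_factory=list)

def to_document(static_strings: List[StaticString]) -> dict:
    """Wrap the entries in the top-level layout of the proposed document."""
    return {"strings": {"static_strings": [asdict(s) for s in static_strings]}}

entry = StaticString(
    string=String(
        encoding="ascii",
        slice=Slice(range=Range(length=12, offset=11644)),
        string="VirtualQuery",
    ),
    structure="import table",
    tags=["#winapi", "#common"],
)
doc = to_document([entry])
print(json.dumps(doc, indent=4))
```

Keeping the model this dumb means the extraction logic only has to build `StaticString` objects, and the storage format can change without touching it.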

williballenthin commented 5 months ago

Thanks for re-sparking this discussion @mr-tz.

I think things like location, length, encoding, and content of the string are part of the definition of the (static) string and should be at the top level, or under .string exactly as @mr-tz proposes.

Other information, like structure, tags, and prevalence, is more like "context": things we assess about the string beyond its definition. I suspect each database/algorithm can provide its own context and we haven't explored all of them yet. So maybe all this context gets grouped together in an extensible way.

File layout seems orthogonal to (static) strings and probably should be stored separately from the strings. A presentation layer could stitch together all the data and make it look pretty.
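
A rough sketch of what that separation might look like. The key names (`layout`, `context`), the helper functions, and the layout offsets are assumptions for illustration only:

```python
from typing import Any, Dict, List, Optional

# Hypothetical document: layout is stored apart from the strings, and each
# string keeps its definition separate from an open-ended "context" dict.
document: Dict[str, Any] = {
    "layout": [
        {"name": "pe.header", "offset": 0, "length": 1024},  # illustrative values
    ],
    "strings": [
        {
            "string": {
                "encoding": "ascii",
                "offset": 77,
                "length": 40,
                "string": "!This program cannot be run in DOS mode.",
            },
            "context": {"tags": ["#common"]},
        },
    ],
}

def structure_for(offset: int, layout: List[Dict[str, Any]]) -> Optional[str]:
    """Return the name of the layout region containing offset, if any."""
    for region in layout:
        if region["offset"] <= offset < region["offset"] + region["length"]:
            return region["name"]
    return None

def render(doc: Dict[str, Any]) -> List[str]:
    """Presentation layer: stitch layout and context back onto each string."""
    lines = []
    for entry in doc["strings"]:
        s = entry["string"]
        structure = structure_for(s["offset"], doc["layout"]) or "-"
        tags = " ".join(entry["context"].get("tags", []))
        lines.append(f'{s["offset"]:#x} {structure} {tags} {s["string"]}')
    return lines

for line in render(document):
    print(line)
```

A new database or algorithm would then only add a key under `context`, without changing the string definition or the layout section.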

ooprathamm commented 5 months ago

Thanks for the review, @mr-tz @williballenthin. I agree the current POC restricts further work. Thanks for providing a view on the desired output structure; I appreciate your detailed explanation. I agree that decoupling the storage and logic could give us the basis for incorporating advanced features without overcomplicating things as floss does. Given the points you've raised, I'm eager to incorporate your suggestions into the pull request.

mandiant / flare-floss

results: define output format/schema #721