vz-risk / VCDB

VERIS Community Database

Single JSON database. #10191

Closed: walkerandco closed this issue 6 years ago

walkerandco commented 7 years ago

I think VCDB needs a single JSON database consisting of an array of all the JSON incidents. Quite a few graphical data analysis tools need this (surprisingly).

I have created a project here in the interim: https://github.com/walkerandco/VCDB-JSON-Merged

But it would be awesome if this could be added to the main project. I have not submitted a pull request because it seems to have been an intentional decision not to have a single JSON file at some point.

gdbassett commented 7 years ago

Hi @walkerandco,

I could make a merged JSON object part of the /data/ directory in vcdb. What is the use case that necessitates it?

We have generally not created them, as internally we use a single pipeline to conduct analysis starting from an R dataframe (the vcdb.dat file is an example, though we actually parse the VCDB JSON directly into a single dataframe with all other sources for the DBIR). We do this because it ensures we reliably produce the same answer to the same question between analysts and over time.
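(For comparison, approximating that single-dataframe idea in Python would look something like the pandas sketch below - the data/json/ path is assumed, and this is not our actual DBIR pipeline.)

```python
import glob
import json

import pandas as pd

# Read every incident file under data/json/ (path assumed for
# illustration) into a list of dicts.
records = []
for path in glob.glob("data/json/*.json"):
    with open(path) as f:
        records.append(json.load(f))

# json_normalize flattens nested VERIS structures into dotted column
# names (e.g. victim.industry), giving one row per incident.
df = pd.json_normalize(records)
print(df.shape)
```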

I do have two minor concerns with creating a list of all the JSON objects, though neither is too big an issue:

  1. It is somewhat redundant with the individual JSON files (which are necessary for the existing VERIS tooling).
  2. The output would no longer validate against the VCDB schema file (vcdb-merged.json). That said, it's not like the CSV or RData files do either.

Regardless, I'm interested in your uses, thoughts and whatever we can do to make the vcdb better serve those who use it.

walkerandco commented 7 years ago

Hi @gdbassett thank you for answering so quickly.

I am using the dataset in Tableau to produce graphics for a research paper. Tableau can take individual JSON files, but it is very bad at it. A single file containing a JSON array loads much faster (surprisingly) and is more useful to me than individual JSON incident files.

To answer your concerns:

  1. I don't think it is entirely redundant, though it could be for some use cases. Some software simply requires a single JSON file. Arguably that is a result of poor design, but it nonetheless creates extra work to write code that aggregates the JSON data.

  2. As you point out, the merged CSV and RData files already don't validate against the schema, so a merged JSON file would be no worse on that score.

I do think it would enhance the repo to have a single merged JSON file, if only to meet outlier use cases and make life easier for analysts/developers. I'm presently using the data to prepare statistics for some academic research (and it has been extremely helpful, by the way). The research relates to insider threats and their statistical relationship to control factors. I have to say it would have been impossible without the dataset so kindly prepared here :).

gdbassett commented 7 years ago

It definitely makes sense to make it easy to analyze the data in Tableau. A quick question: are you able to do the same analysis using the CSV file in Tableau?

At the very least, I can add a quick Python script that reads the JSON in and creates the single output file.
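Roughly this shape (just a sketch of the idea, not the final script - the input and output paths are placeholders):

```python
#!/usr/bin/env python
"""Merge the individual VCDB incident files into one JSON array."""
import glob
import json

# Input glob and output name are assumptions for illustration.
INPUT_GLOB = "data/json/*.json"
OUTPUT_FILE = "data/joined/vcdb.json"

incidents = []
for path in sorted(glob.glob(INPUT_GLOB)):
    with open(path) as f:
        incidents.append(json.load(f))

with open(OUTPUT_FILE, "w") as f:
    # Compact separators keep the merged file as small as possible.
    json.dump(incidents, f, separators=(",", ":"))
```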

walkerandco commented 7 years ago

That's an interesting question; the short answer is no. Tableau automatically reads in the data and organises it into its own self-derived schema. That actually got in the way of my analysis today, so I'm just modelling and analysing the data in JS instead. When the CSV is used for VCDB, it completely loses the VERIS hierarchy, and for some reason the data values don't read in correctly - this is more a Tableau design fault than a problem with the CSV.

As regards the script, that's essentially what I did in JS: a script that reads the JSON files into a JSON array and writes it to the filesystem. I think it would be very useful. If I can help, let me know.

It probably sounds silly, but this also avoids having to write complex bash commands or custom scripts to import individual JSON files into MongoDB. MongoDB can parse an array in a single upload and separate it into documents - this is many thousands of times faster. The present database is about 15 MB, which uploads in a second or so, whereas uploading each file individually takes about 8 hours non-interactively on a DigitalOcean server.
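For example, with pymongo the merged array goes in with a single insert_many call (a sketch; the connection string and database/collection names are placeholders):

```python
import json

from pymongo import MongoClient

# Connection string and names are assumptions for illustration.
client = MongoClient("mongodb://localhost:27017")
collection = client["vcdb"]["incidents"]

with open("vcdb.json") as f:
    incidents = json.load(f)  # the merged JSON array

# One bulk insert instead of thousands of per-file uploads.
collection.insert_many(incidents)
```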

gdbassett commented 7 years ago

Interesting. One of the DBIR authors used MongoDB for analysis, but we ultimately went all R for consistency and ease of maintainability in our workflow.

What language is the script you used to create the single file written in? If it's Python (or something else I have the framework for), I'll take a pull request of it and the current file, and whenever I update VCDB, I'll update it along with the other two data files. (Btw, could you compress the single JSON file, just for space?)

walkerandco commented 7 years ago

I was using Tableau, but I felt too limited by a GUI, so I'm building an isomorphic node.js server with angularjs on the front end (it's not quite finished yet). When it's done, it will pull the dataset from this repo on a regular basis, and users will be able to view and browse the data easily with visualisation. I need it for my research anyway, so I decided to bundle it up into something others can use (and that I can use). When it's finished I will publish it in a new repo.

The script is written in JavaScript (node.js); because JSON is native to JS, it's very quick and easy. I would be happy to share it and submit a pull request if you like. Of course I will compress the JSON file, and I can submit a separate pull request with that if you want.

gdbassett commented 7 years ago

It's ok, I'll rewrite it in python and generate the file as part of testing the script. I'll update here when I do so you can test the data file to make sure I produced something that works in your use case.

gdbassett commented 7 years ago

Ok, added the script and the file. If you can check it and make sure it's what you are looking for, I'd appreciate it. If it is, I'll close this issue. Please read the NOTE.txt in the 'joined' directory before doing any analysis using the joined JSON file.

walkerandco commented 7 years ago

@gdbassett This is looking fine. I'm not sure why, but mine is a little smaller than yours (15 MB-ish). I think this is because I have omitted duplicates from the skimmer directory. Having just read the NOTE.txt, I can see you have mentioned this. Therefore, I'm going to submit a pull request adding vcdb-normalised.json. I'll also submit an edit to your NOTE.txt reflecting the same.

walkerandco commented 7 years ago

@gdbassett Just as an FYI, I completed a little side project you might be interested in, which this issue was geared towards: https://github.com/walkerandco/verisdb-analyst. I think you might like it. I will integrate JSON unification into this tool, hopefully next month. The code is MapReduce, and it is a little hacky in parts because of what was raised in another issue I opened.

gdbassett commented 6 years ago

My tendency is not to incorporate the additional file. The duplicate records can be filtered out by GitHub ID, etc., without the need to save two files. That said, I think the original issue here is complete.
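Something like this would do the filtering (a sketch; it assumes each record stores its source issue number under plus.github, which may not hold for every record):

```python
import json

with open("data/joined/vcdb.json") as f:  # path assumed
    incidents = json.load(f)

# Keep the first record seen per GitHub ID; records without an ID
# pass through untouched.
seen = set()
deduped = []
for incident in incidents:
    gid = incident.get("plus", {}).get("github")
    if gid is not None and gid in seen:
        continue  # duplicate of a record we've already kept
    if gid is not None:
        seen.add(gid)
    deduped.append(incident)
```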

gdbassett commented 6 years ago

I added breaches today, and when I did I changed the joined file to remove the webapp issues, making it consistent with the other two data files.

walkerandco commented 6 years ago

Perfect, I think that resolves this issue @gdbassett