nccgroup / ScoutSuite

Multi-Cloud Security Auditing Tool
GNU General Public License v2.0
6.63k stars · 1.05k forks

Handle *very* large cloud accounts #226

Open vifor2 opened 5 years ago

vifor2 commented 5 years ago

Investigate and solve the following issue: currently, the fact that a JSON file is generated for each report and its data is then loaded into the web browser is not an issue in the majority of cases. But when Scout Suite is run against a service that is being utilized to its full potential and has, let's say, over a million resources, the generated JSON is far too big and crashes the browser once the user tries to open the report.

Possible solution ideas:

x4v13r64 commented 5 years ago

After giving this much thought, I don't think the second approach will work (so I'm leaning towards the first). If you do find a way, then I'm all ears.

What the first approach would probably look like:

zer0x64 commented 5 years ago

I have done a POC of the database and the server (available in the corresponding branch, but not yet integrated into Scout). In our case, I would say that Flask might be a little limited and "hackish". Mostly, that's because it relies a lot on the global namespace and cannot easily be put into a class.

Also, it basically forces us to create a new connection to the database for each request; avoiding that would mean rewriting half of the framework (it would imply working with a separate thread and queues, which would defeat the reason we're using a framework instead of http.server, and would probably break the server itself). That could be pretty inefficient in cases with a large amount of data involved.

I'm currently looking into other alternatives. Have other ideas?
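To make the per-request connection concern concrete, here is a stdlib-only sketch of the pattern the framework pushes us toward: every request opens its own short-lived connection. The table and column names are made up for illustration; this is not Scout Suite's actual schema or server code.

```python
import os
import sqlite3
import tempfile

# Illustrative database: a table of resources, as a large report might store.
DB_PATH = os.path.join(tempfile.mkdtemp(), "scout_poc.db")

with sqlite3.connect(DB_PATH) as conn:
    conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO resources (name) VALUES (?)",
                     [("bucket-%d" % i,) for i in range(100)])

def handle_request(resource_id):
    # A fresh connection per request is simple and thread-safe, but the
    # connection setup cost is paid on every single call -- the inefficiency
    # discussed above once many requests hit a large data set.
    conn = sqlite3.connect(DB_PATH)
    try:
        row = conn.execute("SELECT name FROM resources WHERE id = ?",
                           (resource_id,)).fetchone()
        return row[0] if row else None
    finally:
        conn.close()

print(handle_request(1))
```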

zer0x64 commented 5 years ago

@j4v From what I've seen, another good candidate would be CherryPy, as it runs directly from Python (Flask and Django theoretically need to be started using their own executables, and Flask will complain if you don't) and it already makes use of classes. http://docs.cherrypy.org/en/latest/ I'll just wait for your feedback before testing for the single-thread database connection issue, but CherryPy seems more resistant to manual threading than Flask. Flask looks like it's made to be used only as the main application, while CherryPy looks like it can be embedded inside another application.
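Independent of the framework chosen, the "embedded server" idea can be sketched with nothing but the standard library: the server is owned by an object and started/stopped from application code, the way CherryPy can be embedded. The class name, host, and response shape below are illustrative, not Scout's or CherryPy's actual API.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import urlopen

class ReportServer:
    """A server embedded inside the application, not run as the main app."""

    def __init__(self, results):
        handler = self._make_handler(results)
        # Port 0 lets the OS pick a free port; the application keeps control.
        self._httpd = ThreadingHTTPServer(("127.0.0.1", 0), handler)
        self.port = self._httpd.server_address[1]
        self._thread = threading.Thread(target=self._httpd.serve_forever,
                                        daemon=True)

    def _make_handler(self, results):
        class Handler(BaseHTTPRequestHandler):
            def do_GET(self):
                body = json.dumps(results).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

            def log_message(self, *args):  # keep the console quiet
                pass
        return Handler

    def start(self):
        self._thread.start()

    def stop(self):
        self._httpd.shutdown()

server = ReportServer({"resources": 3})
server.start()
with urlopen("http://127.0.0.1:%d/" % server.port) as resp:
    data = json.loads(resp.read())
server.stop()
print(data)
```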

zer0x64 commented 5 years ago

A CherryPy POC is also available in the branch, and it looks MUCH cleaner IMO. We're still stuck with the single-thread database problem, but I'll keep working on that.

zer0x64 commented 5 years ago

Although, for the database, everyone seems to be doing only one request per connection. Am I doing it wrong here?

x4v13r64 commented 5 years ago

> Although, for the database, everyone seems to be doing only one request per connection. Am I doing it wrong here?

In order for a relational DB to be consistent, it can only allow one write operation at a time (e.g. you can't allow two UPDATE operations, or an UPDATE and a DELETE, to happen concurrently on the same rows). Hence even with a multithreaded application, the DB itself will process those operations sequentially. This explains it well.
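A quick stdlib-only sketch of that behaviour with SQLite, using one connection per thread (the file and table names are made up): writes from concurrent threads are serialized by the database itself, and the end result stays consistent.

```python
import os
import sqlite3
import tempfile
import threading

# One shared database file, created up front.
db_path = os.path.join(tempfile.mkdtemp(), "serialized.db")
with sqlite3.connect(db_path) as conn:
    conn.execute("CREATE TABLE findings (id INTEGER PRIMARY KEY, worker INTEGER)")

def worker(worker_id, n):
    # One connection per thread (as in one-connection-per-request). sqlite3's
    # default 5-second busy timeout makes each INSERT wait for the current
    # writer instead of failing, so writes are serialized, not lost.
    conn = sqlite3.connect(db_path)
    for _ in range(n):
        with conn:  # each INSERT in its own transaction
            conn.execute("INSERT INTO findings (worker) VALUES (?)", (worker_id,))
    conn.close()

threads = [threading.Thread(target=worker, args=(i, 25)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with sqlite3.connect(db_path) as conn:
    total = conn.execute("SELECT COUNT(*) FROM findings").fetchone()[0]
print(total)
```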

I'll have a look at CherryPy.

For Flask, I think you're right that it expects a proper web server to be placed in front of it. Not sure about the classes, though: Flask (and Django) are MVC frameworks, and both allow views to be classes.

x4v13r64 commented 5 years ago

Happy to see the progress made in https://github.com/nccgroup/ScoutSuite/pull/260. Have you also taken some time to look into the first option (not using a web server / db)? While using a server is most likely the best solution, it's also the most costly in terms of workload and architectural changes.

zer0x64 commented 5 years ago

> Happy to see the progress made in #260. Have you also taken some time to look into the first option (not using a web server / db)? While using a server is most likely the best solution, it's also the most costly in terms of workload and architectural changes.

I haven't really messed around with IndexedDB code-wise, but I've seen a bunch of reasons why it might not be the best idea:

  1. From what I've seen, on Firefox, the maximum storage an IndexedDB can use for a single origin (URL) is about 10% of the available disk space (you might see 50 MB floating around, but this is not true anymore), or 50% for the totality of IndexedDB storage. For really big infrastructure, it can become pretty annoying if you need 1 TB of available space for a 100 GB report. Also, on a multi-disk system, chances are this limitation is imposed on the OS disk, which can be a problem if you have a small SSD and a large HDD.

  2. Memory leaks: if you host the report and access it via different URLs (different domain names or IPs, for example) or file locations (if you move it), it will be considered a different origin and saved in another location. The data would not be erased automatically when closing the report, and the large amount of space used by the results would be lost until manually deleted or overwritten, since IndexedDB only erases older data when full. Also, filling IndexedDB with our report would probably cause side effects on other websites by deleting their data.

  3. Paging: we already know paging is necessary when dealing with huge amounts of data on the client side, because the browser would be too slow to load everything at once and the user cannot really navigate a page containing millions of entries. I fear that using IndexedDB would delegate the sorting and limit operations to the browser, which may not be as optimised as a real DBMS for this kind of task.

  4. Partial runs: a database seems more suited to a partial run than a JSON file. With the database, you could simply capture the SIGINT event, wait for the current task to finish and commit to the database; when continuing, you don't have to load the entire result set of the first run (only the data that will be used after that). With a JSON file, however, you must reload the entire file in order to re-dump it afterwards.

  5. Integrating into a full server: this is the counterpart to the "local report" argument. If we can get a working setup for a server hosting the results, it would be easier if we eventually want to serve the entire report from a web server (there could be some use cases; I can think of running Scout in a VM periodically for auditing purposes). I don't see an easy way to do this with a large JSON file and an IndexedDB.
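Point 3, server-side paging, could look roughly like this with SQLite (the schema and page size are illustrative, not Scout's): the database does the sorting and windowing, so the browser only ever receives one page of results.

```python
import sqlite3

# Illustrative data set: many resources in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO resources (name) VALUES (?)",
                 [("instance-%04d" % i,) for i in range(10_000)])

PAGE_SIZE = 50

def get_page(page):
    # The DBMS handles ORDER BY and LIMIT/OFFSET; for very deep pages a
    # keyset query ("WHERE name > :last_seen") scales better than OFFSET.
    rows = conn.execute(
        "SELECT name FROM resources ORDER BY name LIMIT ? OFFSET ?",
        (PAGE_SIZE, page * PAGE_SIZE)).fetchall()
    return [r[0] for r in rows]

first = get_page(0)
third = get_page(2)
print(first[0], third[0])
```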

The only real reasons I see to use IndexedDB are to save the hassle of starting the server (using it would still require human interaction to load the file into the IndexedDB, right?) and to keep the report static (which is, I agree, a really valid point).
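Point 4 above can be sketched with a minimal SIGINT handler: let the task in flight finish, commit, then stop. The task names and the "commit" step are stand-ins, not Scout's actual scanner loop; the handler raises its own SIGINT to simulate the user pressing Ctrl-C.

```python
import signal

interrupted = False

def _on_sigint(signum, frame):
    # Don't abort mid-task: just flag the loop to stop after the current one.
    global interrupted
    interrupted = True

signal.signal(signal.SIGINT, _on_sigint)

completed = []
tasks = ["iam", "s3", "ec2", "rds"]
for task in tasks:
    if interrupted:
        break  # a real implementation would commit and record where to resume
    completed.append(task)  # stand-in for "fetch results + commit to the DB"
    if task == "s3":
        signal.raise_signal(signal.SIGINT)  # simulate Ctrl-C mid-run

print(completed)
```

Because each finished task is already committed, resuming only needs the list of completed tasks, not the full result set of the first run.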

x4v13r64 commented 5 years ago

All valid points :+1: . I also don't see any other options (i.e. something other than IndexedDB) which would contradict the above.

x4v13r64 commented 5 years ago

While https://github.com/nccgroup/ScoutSuite/pull/316 has been merged into develop (yay!) it's still experimental. Keeping this open for now.

x4v13r64 commented 4 years ago

Latest proposed approach: