nccgroup / ScoutSuite

Multi-Cloud Security Auditing Tool
GNU General Public License v2.0
6.63k stars · 1.05k forks

Handle *very* large cloud accounts #226

Open vifor2 opened 5 years ago

vifor2 commented 5 years ago

Investigate and solve the following issue: currently, the fact that a JSON file is generated for each report and its data is then loaded into the web browser is not an issue in the majority of cases. But when Scout Suite is run against a service that is being utilized to its full potential and has, let's say, over a million resources, the generated JSON is far too big and crashes the browser once the user tries to open the report.

Possible solution ideas:

x4v13r64 commented 5 years ago

After giving this much thought, I don't think the second approach will work (so I'm leaning towards the first). If you do find a way, then I'm all ears.

What the first approach would probably look like:

zer0x64 commented 5 years ago

I have done a POC of the database and the server (available in the corresponding branch, but not yet integrated into Scout). In our case, I would say that Flask might be a little limited and "hackish". Mostly, that's because it relies a lot on the global namespace and cannot easily be put into a class.

Also, it basically forces us to create a new connection to the database for each request; avoiding that would mean rewriting half of the framework (it would imply working with a separate thread and queues, which would defeat the reason we're using a framework instead of http.server, and would probably break the server itself). That could be pretty inefficient in cases with a large amount of data involved.

I'm currently looking into other alternatives. Have other ideas?
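To make the per-request connection concern concrete, here is a stdlib-only sketch of the pattern the framework pushes us toward: every request opens its own short-lived connection. The table and column names are made up for illustration; this is not Scout Suite's actual schema or server code.

```python
import os
import sqlite3
import tempfile

# Illustrative database: a table of resources, as a large report might store.
DB_PATH = os.path.join(tempfile.mkdtemp(), "scout_poc.db")

with sqlite3.connect(DB_PATH) as conn:
    conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO resources (name) VALUES (?)",
                     [("bucket-%d" % i,) for i in range(100)])

def handle_request(resource_id):
    # A fresh connection per request is simple and thread-safe, but the
    # connection setup cost is paid on every single call -- the inefficiency
    # discussed above once many requests hit a large data set.
    conn = sqlite3.connect(DB_PATH)
    try:
        row = conn.execute("SELECT name FROM resources WHERE id = ?",
                           (resource_id,)).fetchone()
        return row[0] if row else None
    finally:
        conn.close()

print(handle_request(1))
```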

zer0x64 commented 5 years ago

@j4v From what I've seen, another good candidate would be CherryPy, as it runs directly from Python (Flask and Django theoretically need to be started using their own executables, and Flask will complain if you don't) and it already makes use of classes. http://docs.cherrypy.org/en/latest/ I'll just wait for your feedback before testing for the single-thread database connection issue, but CherryPy seems more resistant to manual threading than Flask. Flask looks like it's made to be used only as the main application, while CherryPy looks like it can be embedded inside another application.
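Independent of the framework chosen, the "embedded server" idea can be sketched with nothing but the standard library: the server is owned by an object and started/stopped from application code, the way CherryPy can be embedded. The class name, host, and response shape below are illustrative, not Scout's or CherryPy's actual API.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import urlopen

class ReportServer:
    """A server embedded inside the application, not run as the main app."""

    def __init__(self, results):
        handler = self._make_handler(results)
        # Port 0 lets the OS pick a free port; the application keeps control.
        self._httpd = ThreadingHTTPServer(("127.0.0.1", 0), handler)
        self.port = self._httpd.server_address[1]
        self._thread = threading.Thread(target=self._httpd.serve_forever,
                                        daemon=True)

    def _make_handler(self, results):
        class Handler(BaseHTTPRequestHandler):
            def do_GET(self):
                body = json.dumps(results).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

            def log_message(self, *args):  # keep the console quiet
                pass
        return Handler

    def start(self):
        self._thread.start()

    def stop(self):
        self._httpd.shutdown()

server = ReportServer({"resources": 3})
server.start()
with urlopen("http://127.0.0.1:%d/" % server.port) as resp:
    data = json.loads(resp.read())
server.stop()
print(data)
```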

zer0x64 commented 5 years ago

A CherryPy POC is also available in the branch, and it looks MUCH cleaner IMO. We're still stuck with the single-thread database problem, but I'll keep working on that.

zer0x64 commented 5 years ago

Although, for the database, everyone seems to be doing only one request per connection. Am I doing it wrong here?

x4v13r64 commented 5 years ago

> Although, for the database, everyone seems to be doing only one request per connection. Am I doing it wrong here?

In order for a relational DB to be consistent, it can only allow one write operation at a time (e.g. you can't allow two UPDATE operations, or an UPDATE and a DELETE, to happen concurrently on the same rows). Hence even with a multithreaded application, the DB itself will process those operations sequentially. This explains it well.
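A quick stdlib-only sketch of that behaviour with SQLite, using one connection per thread (the file and table names are made up): writes from concurrent threads are serialized by the database itself, and the end result stays consistent.

```python
import os
import sqlite3
import tempfile
import threading

# One shared database file, created up front.
db_path = os.path.join(tempfile.mkdtemp(), "serialized.db")
with sqlite3.connect(db_path) as conn:
    conn.execute("CREATE TABLE findings (id INTEGER PRIMARY KEY, worker INTEGER)")

def worker(worker_id, n):
    # One connection per thread (as in one-connection-per-request). sqlite3's
    # default 5-second busy timeout makes each INSERT wait for the current
    # writer instead of failing, so writes are serialized, not lost.
    conn = sqlite3.connect(db_path)
    for _ in range(n):
        with conn:  # each INSERT in its own transaction
            conn.execute("INSERT INTO findings (worker) VALUES (?)", (worker_id,))
    conn.close()

threads = [threading.Thread(target=worker, args=(i, 25)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with sqlite3.connect(db_path) as conn:
    total = conn.execute("SELECT COUNT(*) FROM findings").fetchone()[0]
print(total)
```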

I'll have a look at CherryPy.

For Flask, I think you're right that it expects a proper web server to be placed in front of it. Not sure about the classes, though: Flask (and Django) are MVC frameworks, and both allow views to be classes.

x4v13r64 commented 5 years ago

Happy to see the progress made in https://github.com/nccgroup/ScoutSuite/pull/260. Have you also taken some time to look into the first option (not using a web server / db)? While using a server is most likely the best solution, it's also the most costly in terms of workload and architectural changes.

zer0x64 commented 5 years ago

> Happy to see the progress made in #260. Have you also taken some time to look into the first option (not using a web server / db)? While using a server is most likely the best solution, it's also the most costly in terms of workload and architectural changes.

I haven't really messed around with IndexedDB code-wise, but I've seen a bunch of reasons why it might not be the best idea:

  1. From what I've seen, on Firefox, the maximum storage an IndexedDB can use for a single origin (URL) is about 10% of the available disk space (you might see 50 MB floating around, but this is not true anymore), or 50% for the totality of IndexedDB storage. For really big infrastructure, it can become pretty annoying if you need 1 TB of available space for a 100 GB report. Also, on a multi-disk system, chances are this limitation is imposed on the OS disk, which can be a problem if you have a small SSD and a large HDD.

  2. Memory leaks: if you host the report and access it via different URLs (different domain names or IPs, for example) or file locations (if you move it), it will be considered a different origin and saved in another location. The data would not be erased automatically when closing the report, and the large amount of space used by the results would be lost until manually deleted or overwritten, since IndexedDB only erases older data when full. Also, filling IndexedDB with our report would probably cause side effects on other websites by deleting their data.

  3. Paging: we already know paging is necessary when dealing with huge amounts of data on the client side, because the browser would be too slow to load everything at once and the user cannot really navigate a page containing millions of entries. I fear that using IndexedDB would delegate the sorting and limit operations to the browser, which may not be as optimised as a real DBMS for this kind of task.

  4. Partial runs: a database seems more suited to a partial run than a JSON file. With the database, you could simply capture the SIGINT event, wait for the current task to finish and commit to the database; when continuing, you don't have to load the entire result set of the first run (only the data that will be used after that). With a JSON file, however, you must reload the entire file in order to re-dump it afterwards.

  5. Integrating into a full server: this is the counterpart to the "local report" argument. If we can get a working setup for a server hosting the results, it would be easier if we eventually want to serve the entire report from a web server (there could be some use cases; I can think of running Scout in a VM periodically for auditing purposes). I don't see an easy way to do this with a large JSON file and an IndexedDB.
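Point 3, server-side paging, could look roughly like this with SQLite (the schema and page size are illustrative, not Scout's): the database does the sorting and windowing, so the browser only ever receives one page of results.

```python
import sqlite3

# Illustrative data set: many resources in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE resources (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO resources (name) VALUES (?)",
                 [("instance-%04d" % i,) for i in range(10_000)])

PAGE_SIZE = 50

def get_page(page):
    # The DBMS handles ORDER BY and LIMIT/OFFSET; for very deep pages a
    # keyset query ("WHERE name > :last_seen") scales better than OFFSET.
    rows = conn.execute(
        "SELECT name FROM resources ORDER BY name LIMIT ? OFFSET ?",
        (PAGE_SIZE, page * PAGE_SIZE)).fetchall()
    return [r[0] for r in rows]

first = get_page(0)
third = get_page(2)
print(first[0], third[0])
```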

The only real reasons I see to use IndexedDB are to save the hassle of starting the server (using it would still require human interaction to load the file into the IndexedDB, right?) and to keep the report static (which is, I agree, a really valid point).
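Point 4 above can be sketched with a minimal SIGINT handler: let the task in flight finish, commit, then stop. The task names and the "commit" step are stand-ins, not Scout's actual scanner loop; the handler raises its own SIGINT to simulate the user pressing Ctrl-C.

```python
import signal

interrupted = False

def _on_sigint(signum, frame):
    # Don't abort mid-task: just flag the loop to stop after the current one.
    global interrupted
    interrupted = True

signal.signal(signal.SIGINT, _on_sigint)

completed = []
tasks = ["iam", "s3", "ec2", "rds"]
for task in tasks:
    if interrupted:
        break  # a real implementation would commit and record where to resume
    completed.append(task)  # stand-in for "fetch results + commit to the DB"
    if task == "s3":
        signal.raise_signal(signal.SIGINT)  # simulate Ctrl-C mid-run

print(completed)
```

Because each finished task is already committed, resuming only needs the list of completed tasks, not the full result set of the first run.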

x4v13r64 commented 5 years ago

All valid points :+1: . I also don't see any other options (i.e. something other than IndexedDB) which would contradict the above.

x4v13r64 commented 5 years ago

While https://github.com/nccgroup/ScoutSuite/pull/316 has been merged into develop (yay!) it's still experimental. Keeping this open for now.

x4v13r64 commented 4 years ago

Latest proposed approach: