vipyrsec / dragonfly

A combined C2 and malware scanning service focused on the early identification, analysis, and reporting of malicious packages on the Python Package Index
MIT License

Message queues #19

Open jonathan-d-zhang opened 11 months ago

jonathan-d-zhang commented 11 months ago

Tracking issue for implementing message queues.

import-pandas-as-numpy commented 11 months ago

Minimum project spec for RabbitMQ integration:

Rationale: The premise is to keep work off client nodes as much as possible. Sending duplicate information to the client and relying on server-side deduplication represents an enormous amount of wasted compute. Deduplication must occur before a client ever interfaces with a job, and an effort should be made to cover as many edge cases as is practical.
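
For illustration, a minimal sketch of deduplicating before anything reaches the queue. Everything here is an assumption rather than existing Dragonfly code: the `Job` and `DedupingLoader` names, the `publish` callback, and the choice of (normalized name, version) as the dedup key. Name normalization follows PEP 503, which covers one of the obvious edge cases (`Foo_Bar` vs `foo-bar`).

```python
import re
from dataclasses import dataclass


def normalize_name(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


@dataclass(frozen=True)
class Job:
    """Hypothetical scan job; (name, version) is assumed to identify a release."""
    name: str
    version: str


class DedupingLoader:
    """Hypothetical loader that drops duplicates before publishing."""

    def __init__(self, publish):
        # `publish` is any callable that puts a Job on the incoming jobs queue.
        self._publish = publish
        self._seen = set()

    def submit(self, name: str, version: str) -> bool:
        """Publish the job unless it was already submitted. Returns True if published."""
        job = Job(normalize_name(name), version)
        if job in self._seen:
            return False
        self._seen.add(job)
        self._publish(job)
        return True
```

The in-memory set is only a stand-in; a real loader would presumably need persistent state so that deduplication survives restarts.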


Robin5605 commented 10 months ago

Some thoughts on authentication: we can use RabbitMQ's built-in Authentication, Authorization, and Access Control feature to provision each client with a username/password combination.

All clients should be restricted to basic.publish on the results queue and basic.consume on the incoming jobs queue. Mainframe should also have provisioned credentials, with only basic.consume on the results queue. Loader should have only basic.publish on the incoming jobs queue.
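
A rough sketch of how those restrictions might map onto RabbitMQ's permission model, which grants configure/write/read regex patterns per user rather than per AMQP method: basic.publish needs write on the exchange, and basic.consume needs read on the queue. The queue and exchange names below are hypothetical, and the sketch assumes each queue is fed by its own pre-declared exchange, so that no role needs configure permission.

```python
import re

# Hypothetical resource names; nothing in this thread fixes them.
JOBS_QUEUE = "jobs"
RESULTS_QUEUE = "results"
JOBS_EXCHANGE = "jobs_exchange"
RESULTS_EXCHANGE = "results_exchange"

NOTHING = "^$"  # pattern that matches no named resource


def only(name: str) -> str:
    """Regex pattern matching exactly one resource name."""
    return f"^{re.escape(name)}$"


# role -> (configure, write, read) patterns
PERMISSIONS = {
    "client": (NOTHING, only(RESULTS_EXCHANGE), only(JOBS_QUEUE)),
    "mainframe": (NOTHING, NOTHING, only(RESULTS_QUEUE)),
    "loader": (NOTHING, only(JOBS_EXCHANGE), NOTHING),
}


def set_permissions_argv(user: str, role: str, vhost: str = "/") -> list:
    """Build the `rabbitmqctl set_permissions` argument list for a user."""
    configure, write, read = PERMISSIONS[role]
    return ["rabbitmqctl", "set_permissions", "-p", vhost, user,
            configure, write, read]
```

The returned list can be passed to `subprocess.run` on a host with `rabbitmqctl` available. Publishing through the default exchange instead would need write access on `amq.default`, which lets a user publish to any queue, so a dedicated exchange per queue seems the safer assumption.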

Robin5605 commented 10 months ago

On further thought - is there a need for a return queue? Can clients simply POST their results directly to the Dragonfly API? I assume that since they will have to interface with the API anyway to fetch their ruleset when they detect they're out of date, they may as well POST results directly to the API.

import-pandas-as-numpy commented 10 months ago

> On further thought - is there a need for a return queue? Can clients simply POST their results directly to the Dragonfly API? I assume that since they will have to interface with the API anyway to fetch their ruleset when they detect they're out of date, they may as well POST results directly to the API.

Having a return queue likely helps alleviate situations where many clients are scanning many packages (and the API cannot keep up), but I'm not sure that's a goal worth aspiring to right now. It's definitely an effective way to future-proof, though.

The intention when we discussed this was that clients would POST directly to the API anyway. Clients bouncing a single request for rules off the API itself wasn't something I had really considered. Typically these rules are queued with the current ruleset SHA, correct? I don't see that being an issue, but I'd be a little concerned that if we ever spin up multiple clients, and they're moving through packages quickly, we would be making potentially hundreds of requests to this endpoint per second. Not that it's serving that much, and I don't anticipate it would be an issue, but it bears mentioning nonetheless. If we put it behind any sort of rate limiting, we'll likely footgun ourselves in that regard.
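
One way to keep that request volume down, sketched under the assumption that each job carries the ruleset SHA it was queued with: the client caches its rules and only hits the API when a job's SHA differs from the cached one. `RulesetCache` and the `fetch_rules` callable are hypothetical names, not part of the current client.

```python
class RulesetCache:
    """Hypothetical client-side cache keyed on the ruleset SHA."""

    def __init__(self, fetch_rules):
        # `fetch_rules` is any callable returning (sha, rules), e.g. a
        # wrapper around a GET to the rules endpoint.
        self._fetch_rules = fetch_rules
        self._sha = None
        self._rules = None

    def rules_for(self, job_sha: str):
        """Return rules, refetching only when the job's SHA is not the cached one."""
        if job_sha != self._sha:
            self._sha, self._rules = self._fetch_rules()
        return self._rules
```

With this, a client working through many jobs queued under the same SHA makes one rules request in total. A job queued under a stale SHA would still trigger a refetch each time, so that case would need separate handling.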