Project dead? Need takeover?

merces / aleph

An Open Source Malware Analysis Pipeline System

157 stars 53 forks source link

Project dead? Need takeover? #72

Closed deadbits closed 5 years ago

deadbits commented 5 years ago

Just want to know if this is officially dead. If so, is it deprecated in favor of another project? Is it just lack of developers / time? is there anything the community can do to help?

The initial concept here has really solid potential and I'd hate to see it just disappear to time on GitHub. Lmk how I can help!

CC: @merces @jseidl @turicas

merces commented 5 years ago

Hi @deadbits!

Thank you very much for your offer. It's really appreciated!

I definitely believe in Aleph and the only reason it seems abandoned is indeed the lack of developers/time. Are you interested on leading the development here? What exactly do you have in mind?

Bringing up this discussion already helps. =)

deadbits commented 5 years ago

Well initially I have some ideas for a bit of everything. Is this the type of work you are as inline with the project? If so, I can take over leading development here or at the very least implement some new features and help with PRs and issues

General:

Migrate to Python 3 (EOL for 2.7 is about a year away)
creating tests
Pull Request and Issue templates
Tests before merging new PRs

Collectors (updated):

paste hunting via YARA signatures / regex
- Base64 PE header post, Hex encoded PE, PowerShell with suspicious keywords, etc
- these would go through decoder plugin and check if it's worth keeping the sample
Twitter monitoring for search keywords (hxxp, hashes from researchers tweets to fetch from VTI, opendir, dailyscriptlet, etc.)
VirusTotal Intelligence Hunting notifications
REST API submission endpoint for Collector
SQS, S3, and/or DigitalOcean Spaces watcher for Collector

Parsing & Enrichment (updated):

ability to parse Outlook messages for the Email monitor
Plugin to YARA scan any file type
adding more information to the PE model
- Check for commonly suspicious APIs
- Check for and verify Authenticode
adding an ELF and macho model
Enrichment of extracted IOCs (ASN, geoip, DNS resolution, etc)

Exporting:

export data to arbitrary REST endpoint
export JSON pipeline result to SQS, S3 of Digital Ocean spaces bucket so users can act on pipeline results anyway they want to
export JSON pipeline result on disk for use in systems other than Elasticsearch, or just plain old viewing the data

deadbits commented 5 years ago

Some more feature thoughts (updated) :

tokenizing emails and using those as passwords for protected attachments instead of the hard-coded list that exists now
add plug-ins to send files to different online/free or local sandbox services (hybrid, any.run, cuckoo)
submit url artifacts to URLScan.io? Idk if we'd want to send everything there
VBA extraction from Documents
- Attempted deoobfuscation
Check strings for highly suspicious keywords Living of the land binary references, etc.
Process scriptlet files for malicious indications and send through pipeline if found
- HTA, SCT, XML, WS, etc.
Reputation DB check for extracted URLs
Optional ability to allow for SOCKS5 proxy use for external requests

jseidl commented 5 years ago

Hi folks. These are all great ideas!

I am the original developer of aleph and indeed I started reworking the whole code from scratch using celery to make it more scalable. Unfortunately a bunch of stuff happened in the meantime and I had to put it on hold. Having more people working in the code is always appreciated.

I'll work with @merces to upload this code into a new branch and we can start from there. Sounds good?

Cheers

On Thu, Dec 27, 2018, 10:36 AM Adam M. Swanda <notifications@github.com wrote:

Another thought : tokenizing emails and using those as passwords for protected attachments instead of the hard-coded list that exists now

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-450206897, or mute the thread https://github.com/notifications/unsubscribe-auth/AARRobJL_ywuQY3gP8NOcuxvFhTO2lXDks5u9RNFgaJpZM4ZiM-K .

deadbits commented 5 years ago

That sounds good to me. Thanks!

On Dec 27, 2018 at 11:06 PM, <Jan Seidl (mailto:notifications@github.com)> wrote:

Hi folks. These are all great ideas!

I am the original developer of aleph and indeed I started reworking the whole code from scratch using celery to make it more scalable. Unfortunately a bunch of stuff happened in the meantime and I had to put it on hold. Having more people working in the code is always appreciated.

I'll work with @merces to upload this code into a new branch and we can start from there. Sounds good?

Cheers

On Thu, Dec 27, 2018, 10:36 AM Adam M. Swanda <notifications@github.com wrote:

Another thought : tokenizing emails and using those as passwords for protected attachments instead of the hard-coded list that exists now

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-450206897, or mute the thread https://github.com/notifications/unsubscribe-auth/AARRobJL_ywuQY3gP8NOcuxvFhTO2lXDks5u9RNFgaJpZM4ZiM-K .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub (https://github.com/merces/aleph/issues/72#issuecomment-450285755), or mute the thread (https://github.com/notifications/unsubscribe-auth/ABRWFe9ZkIbo_Fc1Rx6QL5zrOS2G82bsks5u9ZjTgaJpZM4ZiM-K).

deadbits commented 5 years ago

I know we'll likely move this to another ticket, or several, but I put all my ideas into one list so it's easier to view instead of my comments above:

General

Migrate to Python 3
Create tests
Create Pull Request and Issue templates
Github integration tests in Pull Request

Samples Object

Add filenames list
Add first seen / last seen
- If file is seen more than once just update last seen
- filenames list can be updated too if available

Collectors

Collectors service could be separate so not all collectors are stopped if user stops the service
REST API endpoint
- Submit files directly
- Submit PCAP, extract files as sample
S3 bucket monitor
SQS Collector / Polling
DigitalOcean Spaces Collector
Twitter searches Collector
- Query and hashtag search of popular researchers and tags; extract hashes and URLs
- Try to fetch hashes from VTI/Hybrid
- Check if url is alive, download sample
Paste site monitor (this may be more trouble than its worth and leaning too far away from the Aleph purpose imho, but just a thought either way)
- Keywords
- Regex
- YARA
- Only keep pastes that look to be malware samples

Plugins

PE file parser enhancements
- Certificate
- Authenicode
- Suspicious API exports
- Anti-VM & Anti-Analysis checks
Accept and parse ELF and Macho files
Subfile extraction (hachoir-subfile + dd, etc)
Extract Base64 from files
- Decode and if interesting MIME type create Sample
Extract email attachments for Collectors
Outlook email parser
Office document parser
- VBA Macro extraction & deoobfuscation
YARA scan
ELF parser
Macho parser
Strings extraction enhancements
- Find interesting patterns (emails, URLs, IPs, domains, BTC address, phone number)
- Enrich extracted interesting patterns (hosts, URLs, IPs, emails)
VirusTotal plugin enhancements:
- Get report if it exists
- Optionally submit file for analysis if no report found (Aleph might be used in sensitive envs where not all files should be uploaded)
VTI daily check (requires paid account)
- Clear notifications once completed
HTML parser enhancements
- Extract links / urls
  - If URL matches pre-defined MIME types to keep, save as sample
  - Maybe crawl found links for more files with interesting MIME types and create a child Samples
Check hosts against reputation databases
- threatexpert, FireHOL lists, VT host check, ShadowServer whitelist check (There's too many choices to list)
Submit samples on free online sandboxes or local installation python-sanboxapi
Zip and GZip enhancements
- Tokenize emails to use passwords as brute force list
- Let user define list of keywords in a config file to try as passwords
Ability to define SOCKS5 proxy for web requests
Scriptlet file parsers (HTA, SCT, XML, WS, etc)
- Either as direct submission or as a child
- If as child, feed back to Collector for pipeline processing

Decoders (subset of plugin to run under certain conditions?)

Base64
Reverse Base64
Hex to binary decoding Note: These are based on the Paste scraper finding these types of encoded files

Export Options

Send to Elasticsearch
Send to Splunk
Save JSON to disk as storage
Send to S3 bucket as storage
Send to SQS so user can integrate with other systems and workflows
Send to DigitalOcean Spaces as storage

merces commented 5 years ago

Wow. This is a lot of good ideas indeed! Thanks for that, @deadbits!

@jseidl We can leverage the "Development" branch as this is not being used by anyone. Would you be able to upload this new code there by the end of next week? I think we should leverage the energy @deadbits is willing to put on it to start it as soon as possible. 🙂

Thank you all!

deadbits commented 5 years ago

@merces Definitely a lot of ideas, indeed heh I don't know how many fit with the direction of this project and imho some would be higher priority than others. Not to mention implementing all those would take quite some time.

There's also a handful of open-source Python libraries I have in mind to lean on for some of the ideas so it's not code written from scratch. Though, I think a decent amount of them are quick-wins while others require more major work.

Regardless, I'm definitely up to help out in any way I can, and working with you and @jseidl to figure out what should be kept or scrapped, what should be prioritized, etc., etc.

jseidl commented 5 years ago

There are a lot of good ideas, some of them are actually part of the current aleph state such as keeping track of the sources (different filenames) and stuff. I have all my V2 code in a bitbucket private repo because I was playing on having the core libs as a separate project, linked to the processor nodes app and collector nodes app via GIT but I'm still not sure on this approach. I think I have a presentation I did about my ideas for V2, I'll take a look over here and add the link to this topic.

The main idea of using celery is benefiting from celery's inherent scalability and pluggability which allows you to use many different backends for messaging (inter-process) and allows you to add nodes as per need (eg. add more collector nodes, or add more processor nodes as you need). The idea of having pluggable inputs/outputs and processor plugins is the core idea behind Aleph.

IIRC I had almost everything of the core functionality already ported to the V2 scheme but it isn't polished AT ALL. I indeed crave since the beginning for having TDD (unit tests) for every single module that we can.

I'm not a Github ninja but I'm afraid we need something else to keep track of all the ideas, what is going into each work "round" and etc. Any ideas? Can we use github by itself without polluting the current release? Don't want like the issues or ideas for the dev version showing up on the main project page etc.

At the top of my head, the idea for V2 was having the core relying on celery for distributed processing with a MQ backend, which would also serve as transport channel between collectors and the processing cluster. This would allow anyone to develop a collector, simply by having it in the end, posting the collected sample to the MQ in a given channel/topic. Also with a document database gatheing all our extracted data (such as elasticsearch), enables us to further datamine the crap out of it :). V2 does have pluggable output (like storing the samples in a sample vault, or simple filesystem, or amazon S3) planned.

In the end, the idea is for aleph to be a framework, with pluggable/interchangeable parts that share the same common interface.

I'll gather all the resources I have for V2 and get back on this thread so we can schedule a hangouts or something to discuss.

Cheers!

Jan Seidl

http://wroot.org http://www.linkedin.com/in/janseidl

This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. If verification is required please request a hard-copy version.

On Wed, Jan 2, 2019 at 4:18 PM Adam M. Swanda notifications@github.com wrote:

@merces https://github.com/merces Definitely a lot of ideas, indeed heh I don't know how many fit with the direction of this project and imho some would be higher priority than others. Not to mention implementing all those would take quite some time. Though, I think a decent amount of them are quick-wins while others require more major work.

Regardless, I'm definitely up to helping out in any way I can, and working with you and @jseidl https://github.com/jseidl to figure out what should be kept or scraped, what should be prioritized, etc., etc.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451025510, or mute the thread https://github.com/notifications/unsubscribe-auth/AARRoT8sGyXmlkMkfyyhrzWu_Ec2u-3oks5u_Uw-gaJpZM4ZiM-K .

deadbits commented 5 years ago

@jseidl As far as tracking "sprints" and work progress, tasks, etc.:

We could either use Github's built-in "Projects" feature that is enabled for each repo (I think you just need to enable it in the Repo settings page)
Something like Waffle.io which does very much the same thing but is more full featured and hooks with Github's API and has some other neat integrations.

Github projects (https://help.github.com/articles/about-project-boards/) can be restricted by the repo owner to who can view/edit the Issues on the board.

On Waffle.io I believe your project board is private (don't quote me on that), and there's a free tier https://waffle.io/pricing

Celery and MQ is definitely a perfect approach for what you're talking about too. I was even going to suggest ZMQ before I finished reading your post. I like everything I hear so far :)

A hangout would be great to link up and figure out the planning / Issues workflow for sure. I'm pretty flexible with my schedule so anytime that's good for all of you. I'm in the US on Eastern time- not sure where you all are but I'm sure we can figure something out.

--
Adam M. Swanda

PGP: https://keybase.io/deadbits

On Jan 2, 2019 at 7:36 PM, <Jan Seidl (mailto:notifications@github.com)> wrote:

There are a lot of good ideas, some of them are actually part of the current aleph state such as keeping track of the sources (different filenames) and stuff. I have all my V2 code in a bitbucket private repo because I was playing on having the core libs as a separate project, linked to the processor nodes app and collector nodes app via GIT but I'm still not sure on this approach. I think I have a presentation I did about my ideas for V2, I'll take a look over here and add the link to this topic.

The main idea of using celery is benefiting from celery's inherent scalability and pluggability which allows you to use many different backends for messaging (inter-process) and allows you to add nodes as per need (eg. add more collector nodes, or add more processor nodes as you need). The idea of having pluggable inputs/outputs and processor plugins is the core idea behind Aleph.

IIRC I had almost everything of the core functionality already ported to the V2 scheme but it isn't polished AT ALL. I indeed crave since the beginning for having TDD (unit tests) for every single module that we can.

I'm not a Github ninja but I'm afraid we need something else to keep track of all the ideas, what is going into each work "round" and etc. Any ideas? Can we use github by itself without polluting the current release? Don't want like the issues or ideas for the dev version showing up on the main project page etc.

At the top of my head, the idea for V2 was having the core relying on celery for distributed processing with a MQ backend, which would also serve as transport channel between collectors and the processing cluster. This would allow anyone to develop a collector, simply by having it in the end, posting the collected sample to the MQ in a given channel/topic. Also with a document database gatheing all our extracted data (such as elasticsearch), enables us to further datamine the crap out of it :). V2 does have pluggable output (like storing the samples in a sample vault, or simple filesystem, or amazon S3) planned.

In the end, the idea is for aleph to be a framework, with pluggable/interchangeable parts that share the same common interface.

I'll gather all the resources I have for V2 and get back on this thread so we can schedule a hangouts or something to discuss.

Cheers!

Jan Seidl

http://wroot.org http://www.linkedin.com/in/janseidl

This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. If verification is required please request a hard-copy version.

On Wed, Jan 2, 2019 at 4:18 PM Adam M. Swanda notifications@github.com wrote:

@merces https://github.com/merces Definitely a lot of ideas, indeed heh I don't know how many fit with the direction of this project and imho some would be higher priority than others. Not to mention implementing all those would take quite some time. Though, I think a decent amount of them are quick-wins while others require more major work.

Regardless, I'm definitely up to helping out in any way I can, and working with you and @jseidl https://github.com/jseidl to figure out what should be kept or scraped, what should be prioritized, etc., etc.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451025510, or mute the thread https://github.com/notifications/unsubscribe-auth/AARRoT8sGyXmlkMkfyyhrzWu_Ec2u-3oks5u_Uw-gaJpZM4ZiM-K .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub (https://github.com/merces/aleph/issues/72#issuecomment-451028246), or mute the thread (https://github.com/notifications/unsubscribe-auth/ABRWFSRecpcFIJf6Lm5safPgdSqSw6glks5u_VBwgaJpZM4ZiM-K).

jseidl commented 5 years ago

Right on! Actually, it is planned to support ZMQ for standalone applications.

Thanks for the suggestions on projects and waffle, will definitely take a look.

I'm in the US on Pacific Time, have a pretty flexible schedule too. The other folks?

Jan Seidl

http://wroot.org http://www.linkedin.com/in/janseidl

On Wed, Jan 2, 2019 at 5:00 PM Adam M. Swanda notifications@github.com wrote:

@jseidl As far as tracking "sprints" and work progress, tasks, etc.:

We could either use Github's built-in "Projects" feature that is enabled for each repo (I think you just need to enable it in the Repo settings page)

Something like Waffle.io which does very much the same thing but is more full featured and hooks with Github's API and has some other neat integrations.

Github projects (https://help.github.com/articles/about-project-boards/) can be restricted by the repo owner to who can view/edit the Issues on the board.

On Waffle.io I believe your project board is private (don't quote me on that), and there's a free tier https://waffle.io/pricing

Celery and MQ is definitely a perfect approach for what you're talking about too. I was even going to suggest ZMQ before I finished reading your post. I like everything I hear so far :)

A hangout would be great to link up and figure out the planning / Issues workflow for sure. I'm pretty flexible with my schedule so anytime that's good for all of you. I'm in the US on Eastern time- not sure where you all are but I'm sure we can figure something out.

-- Adam M. Swanda

PGP: https://keybase.io/deadbits

On Jan 2, 2019 at 7:36 PM, <Jan Seidl (mailto:notifications@github.com)> wrote:

There are a lot of good ideas, some of them are actually part of the current aleph state such as keeping track of the sources (different filenames) and stuff. I have all my V2 code in a bitbucket private repo because I was playing on having the core libs as a separate project, linked to the processor nodes app and collector nodes app via GIT but I'm still not sure on this approach. I think I have a presentation I did about my ideas for V2, I'll take a look over here and add the link to this topic.

The main idea of using celery is benefiting from celery's inherent scalability and pluggability which allows you to use many different backends for messaging (inter-process) and allows you to add nodes as per need (eg. add more collector nodes, or add more processor nodes as you need). The idea of having pluggable inputs/outputs and processor plugins is the core idea behind Aleph.

IIRC I had almost everything of the core functionality already ported to the V2 scheme but it isn't polished AT ALL. I indeed crave since the beginning for having TDD (unit tests) for every single module that we can.

I'm not a Github ninja but I'm afraid we need something else to keep track of all the ideas, what is going into each work "round" and etc. Any ideas? Can we use github by itself without polluting the current release? Don't want like the issues or ideas for the dev version showing up on the main project page etc.

At the top of my head, the idea for V2 was having the core relying on celery for distributed processing with a MQ backend, which would also serve as transport channel between collectors and the processing cluster. This would allow anyone to develop a collector, simply by having it in the end, posting the collected sample to the MQ in a given channel/topic. Also with a document database gatheing all our extracted data (such as elasticsearch), enables us to further datamine the crap out of it :). V2 does have pluggable output (like storing the samples in a sample vault, or simple filesystem, or amazon S3) planned.

In the end, the idea is for aleph to be a framework, with pluggable/interchangeable parts that share the same common interface.

I'll gather all the resources I have for V2 and get back on this thread so we can schedule a hangouts or something to discuss.

Cheers!

Jan Seidl

http://wroot.org http://www.linkedin.com/in/janseidl

This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. If verification is required please request a hard-copy version.

On Wed, Jan 2, 2019 at 4:18 PM Adam M. Swanda notifications@github.com wrote:

@merces https://github.com/merces Definitely a lot of ideas, indeed heh I don't know how many fit with the direction of this project and imho some would be higher priority than others. Not to mention implementing all those would take quite some time. Though, I think a decent amount of them are quick-wins while others require more major work.

Regardless, I'm definitely up to helping out in any way I can, and working with you and @jseidl https://github.com/jseidl to figure out what should be kept or scraped, what should be prioritized, etc., etc.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451025510, or mute the thread < https://github.com/notifications/unsubscribe-auth/AARRoT8sGyXmlkMkfyyhrzWu_Ec2u-3oks5u_Uw-gaJpZM4ZiM-K

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub ( https://github.com/merces/aleph/issues/72#issuecomment-451028246), or mute the thread ( https://github.com/notifications/unsubscribe-auth/ABRWFSRecpcFIJf6Lm5safPgdSqSw6glks5u_VBwgaJpZM4ZiM-K ).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451031815, or mute the thread https://github.com/notifications/unsubscribe-auth/AARRoXMk6c4eWw32HqRzgCgzXbY8BmJ6ks5u_VZIgaJpZM4ZiM-K .

jseidl commented 5 years ago

Hi folks,

After extensive digging, I've found the presentation. I think the thought process hasn't changed from whem this document was created, except that instead of using celery just for the scheduled tasks, the processor daemon itself will be run on celery to ease deployment and worker control, possibly with auto scaling. Also I'd like to all the collectors fist save the sample locally then consume the local file into the transport to avoid losing the sample in case connection fails abruptly or something else weird happens during collection.

On the processor side, on starting up reprocess samples left in the temp dir, delete from temp dir only when making sure all data is stored on the backends.

I never attached a PDF into a github thread over email so if the attachment doesn't upload I'll host somewhere and put the link in a next entry.

Cheers!

Jan Seidl

http://wroot.org http://www.linkedin.com/in/janseidl

On Wed, Jan 2, 2019 at 5:03 PM Jan Seidl janseidl@gmail.com wrote:

Right on! Actually, it is planned to support ZMQ for standalone applications.

Thanks for the suggestions on projects and waffle, will definitely take a look.

I'm in the US on Pacific Time, have a pretty flexible schedule too. The other folks?

Jan Seidl

http://wroot.org http://www.linkedin.com/in/janseidl

This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. If verification is required please request a hard-copy version.

On Wed, Jan 2, 2019 at 5:00 PM Adam M. Swanda notifications@github.com wrote:

@jseidl As far as tracking "sprints" and work progress, tasks, etc.:

We could either use Github's built-in "Projects" feature that is enabled for each repo (I think you just need to enable it in the Repo settings page)

Something like Waffle.io which does very much the same thing but is more full featured and hooks with Github's API and has some other neat integrations.

Github projects (https://help.github.com/articles/about-project-boards/) can be restricted by the repo owner to who can view/edit the Issues on the board.

On Waffle.io I believe your project board is private (don't quote me on that), and there's a free tier https://waffle.io/pricing

Celery and MQ is definitely a perfect approach for what you're talking about too. I was even going to suggest ZMQ before I finished reading your post. I like everything I hear so far :)

A hangout would be great to link up and figure out the planning / Issues workflow for sure. I'm pretty flexible with my schedule so anytime that's good for all of you. I'm in the US on Eastern time- not sure where you all are but I'm sure we can figure something out.

-- Adam M. Swanda

PGP: https://keybase.io/deadbits

On Jan 2, 2019 at 7:36 PM, <Jan Seidl (mailto:notifications@github.com)> wrote:

There are a lot of good ideas, some of them are actually part of the current aleph state such as keeping track of the sources (different filenames) and stuff. I have all my V2 code in a bitbucket private repo because I was playing on having the core libs as a separate project, linked to the processor nodes app and collector nodes app via GIT but I'm still not sure on this approach. I think I have a presentation I did about my ideas for V2, I'll take a look over here and add the link to this topic.

The main idea of using celery is benefiting from celery's inherent scalability and pluggability which allows you to use many different backends for messaging (inter-process) and allows you to add nodes as per need (eg. add more collector nodes, or add more processor nodes as you need). The idea of having pluggable inputs/outputs and processor plugins is the core idea behind Aleph.

IIRC I had almost everything of the core functionality already ported to the V2 scheme but it isn't polished AT ALL. I indeed crave since the beginning for having TDD (unit tests) for every single module that we can.

I'm not a Github ninja but I'm afraid we need something else to keep track of all the ideas, what is going into each work "round" and etc. Any ideas? Can we use github by itself without polluting the current release? Don't want like the issues or ideas for the dev version showing up on the main project page etc.

At the top of my head, the idea for V2 was having the core relying on celery for distributed processing with a MQ backend, which would also serve as transport channel between collectors and the processing cluster. This would allow anyone to develop a collector, simply by having it in the end, posting the collected sample to the MQ in a given channel/topic. Also with a document database gatheing all our extracted data (such as elasticsearch), enables us to further datamine the crap out of it :). V2 does have pluggable output (like storing the samples in a sample vault, or simple filesystem, or amazon S3) planned.

In the end, the idea is for aleph to be a framework, with pluggable/interchangeable parts that share the same common interface.

I'll gather all the resources I have for V2 and get back on this thread so we can schedule a hangouts or something to discuss.

Cheers!

Jan Seidl

http://wroot.org http://www.linkedin.com/in/janseidl

This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately by e-mail if you have received this e-mail by mistake and delete this e-mail from your system. E-mail transmission cannot be guaranteed to be secure or error-free as information could be intercepted, corrupted, lost, destroyed, arrive late or incomplete, or contain viruses. The sender therefore does not accept liability for any errors or omissions in the contents of this message, which arise as a result of e-mail transmission. If verification is required please request a hard-copy version.

On Wed, Jan 2, 2019 at 4:18 PM Adam M. Swanda <notifications@github.com

wrote:

@merces https://github.com/merces Definitely a lot of ideas, indeed heh I don't know how many fit with the direction of this project and imho some would be higher priority than others. Not to mention implementing all those would take quite some time. Though, I think a decent amount of them are quick-wins while others require more major work.

Regardless, I'm definitely up to helping out in any way I can, and working with you and @jseidl https://github.com/jseidl to figure out what should be kept or scraped, what should be prioritized, etc., etc.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451025510, or mute the thread < https://github.com/notifications/unsubscribe-auth/AARRoT8sGyXmlkMkfyyhrzWu_Ec2u-3oks5u_Uw-gaJpZM4ZiM-K

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub ( https://github.com/merces/aleph/issues/72#issuecomment-451028246), or mute the thread ( https://github.com/notifications/unsubscribe-auth/ABRWFSRecpcFIJf6Lm5safPgdSqSw6glks5u_VBwgaJpZM4ZiM-K ).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451031815, or mute the thread https://github.com/notifications/unsubscribe-auth/AARRoXMk6c4eWw32HqRzgCgzXbY8BmJ6ks5u_VZIgaJpZM4ZiM-K .

jseidl commented 5 years ago

Ok, attaching PDF from mail didn't work. Uploaded to my Drive here: https://drive.google.com/open?id=1lvNFhJcguHfLgXHm865XXWVnfahTQcOA

deadbits commented 5 years ago

Also I'd like to all the collectors fist save the sample locally then consume the local file into the transport to avoid losing the sample in case connection fails abruptly or something else weird happens during collection. ... On the processor side, on starting up reprocess samples left in the temp dir, delete from temp dir only when making sure all data is stored on the backends.

I built a project similar to this architecture and this is definitely the best approach. I'm guessing you're already planning this but storing locally by hash is a solid way to avoid collisions (instead of uuid4 or what not)

Basically:

Receive sample from wherever
Store locally with a unique file name
Put the file into transport
- When you're sure it's stored on the backend DB (or at least accepted by the Consumer as an Object), delete it locally

@jseidl We can schedule some time to sync up next week maybe? Or this weekend even if that works for you. My weekday evenings are typically open, tomorrow I'm out most of the afternoon. Outside of that, I'm ready to get rolling 🚀

jseidl commented 5 years ago

Sure Adam, anytime after 11am pacific works for me.

Using UUID4 for ID was something I was testing and I'm quite regretful hehehe. Yes we should name the local files as per hashing.

Jan Seidl

http://wroot.org http://www.linkedin.com/in/janseidl

On Fri, Jan 4, 2019 at 5:14 PM Adam M. Swanda notifications@github.com wrote:

Also I'd like to all the collectors fist save the sample locally then consume the local file into the transport to avoid losing the sample in case connection fails abruptly or something else weird happens during collection. ... On the processor side, on starting up reprocess samples left in the temp dir, delete from temp dir only when making sure all data is stored on the backends.

I built a project similar to this architecture and this is definitely the best approach. I'm guessing you're already planning this but storing locally by hash is a solid way to avoid collisions (instead of uuid4 or what not)

Basically:

Receive sample from wherever

Store locally with a unique file name

Put the file into transport

When you're sure it's stored on the backend DB (or at least accepted by the Consumer as an Object), delete it locally

@jseidl https://github.com/jseidl We can schedule some time to sync up next week maybe? Or this weekend even if that works for you. My weekday evenings are typically open, tomorrow I'm out most of the afternoon. Outside of that, I'm ready to get rolling 🚀

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451614205, or mute the thread https://github.com/notifications/unsubscribe-auth/AARRoZwjvGFlj6VVwC2kDP9i1Rhw6lqPks5u__xugaJpZM4ZiM-K .

deadbits commented 5 years ago

Read through your presentation last night - good stuff! Overall it sounds like a really solid framework and the ideas on how to scale it, create the components separately, etc., are all awesome.

I saw you had that plugins would "run in order". I might have misread or skipped a part but is the idea for plugins to run one a a time on any given Processor, or would plugins for a MIME type run in parallel via threading/multiprocessing?

These are thoughts for way down the road but just had it on my mind after reading your PDF: Another idea could be to have the plugins have an order of execution per MIME type, so each plugin can act on the results of the last. For example, maybe a Zip file comes in so it hits the "brute_zip" plugin, inside is an executable so the "yara_scan" plugin runs; the results of "yara_scan" says that the executable is Trojan ABC- so the "malware_decoder" plugin runs, and then "extract_iocs" runs on the results on malware_decoder, and so on... That way you get the results of all the plugins still but get to provide deeper levels of context as opposed to say: If file == EXE run strings and extract_iocs, sort of thing

Basically sending files down different plugins "paths" depending on their MIME type and any useful information from the previous plugin.

Also, The malware framework FAME also has a pretty cool feature for their plugins where a plugin inheriting the base class can use "acts_on", "generates", "triggered_by" and a few others. It's an interesting idea that might be useful to think on how to implement something similar. "generates" alerts of various types, or "triggered_by" another module in my example above https://github.com/certsocietegenerale/fame/blob/ab0e9cc3640b2337dbd873a41e03987ba1ba8035/docs/modules.rst#scope

jseidl commented 5 years ago

Hi Adam.

At the current state the plugin system has I s own "acts on" which are the mimetype of the files it should, well, act on. The triggered by idea is very good we should use as well.

The plugin chaining idea was for future proofing since the idea of having a separate class of plug-ins for intelligence already solves the immediate need for chaining.

As everything will be celery, plug-ins could call the depending plugin directly, async, and it would be run by another worker node (iirc) thus not hogging that much the pipeline. I don't know. This particular feature definitely needs more thinking.

Ping me directly so we can schedule our talk, Cheers!

On Sat, Jan 5, 2019, 5:50 AM Adam M. Swanda <notifications@github.com wrote:

Read through your presentation last night - good stuff! Overall it sounds like a really solid framework and the ideas on how to scale it, create the components separately, etc., are all awesome.

I saw you had that plugins would "run in order". I might have misread or skipped a part but is the idea for plugins to run one a a time on any given Processor, or would plugins for a MIME type run in parallel via threading/multiprocessing?

These are thoughts for way down the road but just had it on my mind after reading your PDF: Another idea could be to have the plugins have an order of execution per MIME type, so each plugin can act on the results of the last. For example, maybe a Zip file comes in so it hits the "brute_zip" plugin, inside is an executable so the "yara_scan" plugin runs; the results of "yara_scan" says that the executable is Trojan ABC- so the "malware_decoder" plugin runs, and then "extract_iocs" runs on the results on malware_decoder, and so on... That way you get the results of all the plugins still but get to provide deeper levels of context as opposed to say: If file == EXE run strings and extract_iocs, sort of thing

Basically sending files down different plugins "paths" depending on their MIME type and any useful information from the previous plugin.

Also, The malware framework FAME also has a pretty cool feature for their plugins where a plugin inheriting the base class can use "acts_on", "generates", "triggered_by" and a few others. It's an interesting idea that might be useful to think on how to implement something similar. "generates" alerts of various types, or "triggered_by" another module in my example above

https://github.com/certsocietegenerale/fame/blob/ab0e9cc3640b2337dbd873a41e03987ba1ba8035/docs/modules.rst#scope

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451656787, or mute the thread https://github.com/notifications/unsubscribe-auth/AARRoWCBHqGaYig-NVDQzpHegwE71kWsks5vAK2kgaJpZM4ZiM-K .

deadbits commented 5 years ago

Jan,

How's tomorrow sometime between 1PM and 4PM Eastern?

Otherwise I'm open essentially anytime after 5-6pm on weekdays. I'm kind of packed today being out and about.

Wire, Google Meet (of Duo) works for me, or there's https://appear.in/ which is pretty solid too. Just lmk what time works best for you and we can just directly email invite details.

If @merces wants to be in on the call too, that'd be great. You both have much more knowledge of the current state of things than me heh

Thanks! Hope to chat soon! --
Adam M. Swanda

PGP: https://keybase.io/deadbits

On Jan 5, 2019 at 3:29 PM, <Jan Seidl (mailto:notifications@github.com)> wrote:

Hi Adam.

At the current state the plugin system has I s own "acts on" which are the mimetype of the files it should, well, act on. The triggered by idea is very good we should use as well.

The plugin chaining idea was for future proofing since the idea of having a separate class of plug-ins for intelligence already solves the immediate need for chaining.

As everything will be celery, plug-ins could call the depending plugin directly, async, and it would be run by another worker node (iirc) thus not hogging that much the pipeline. I don't know. This particular feature definitely needs more thinking.

Ping me directly so we can schedule our talk, Cheers!

On Sat, Jan 5, 2019, 5:50 AM Adam M. Swanda <notifications@github.com wrote:

Read through your presentation last night - good stuff! Overall it sounds like a really solid framework and the ideas on how to scale it, create the components separately, etc., are all awesome.

I saw you had that plugins would "run in order". I might have misread or skipped a part but is the idea for plugins to run one a a time on any given Processor, or would plugins for a MIME type run in parallel via threading/multiprocessing?

These are thoughts for way down the road but just had it on my mind after reading your PDF: Another idea could be to have the plugins have an order of execution per MIME type, so each plugin can act on the results of the last. For example, maybe a Zip file comes in so it hits the "brute_zip" plugin, inside is an executable so the "yara_scan" plugin runs; the results of "yara_scan" says that the executable is Trojan ABC- so the "malware_decoder" plugin runs, and then "extract_iocs" runs on the results on malware_decoder, and so on... That way you get the results of all the plugins still but get to provide deeper levels of context as opposed to say: If file == EXE run strings and extract_iocs, sort of thing

Basically sending files down different plugins "paths" depending on their MIME type and any useful information from the previous plugin.

Also, The malware framework FAME also has a pretty cool feature for their plugins where a plugin inheriting the base class can use "acts_on", "generates", "triggered_by" and a few others. It's an interesting idea that might be useful to think on how to implement something similar. "generates" alerts of various types, or "triggered_by" another module in my example above

https://github.com/certsocietegenerale/fame/blob/ab0e9cc3640b2337dbd873a41e03987ba1ba8035/docs/modules.rst#scope

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451656787, or mute the thread https://github.com/notifications/unsubscribe-auth/AARRoWCBHqGaYig-NVDQzpHegwE71kWsks5vAK2kgaJpZM4ZiM-K .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub (https://github.com/merces/aleph/issues/72#issuecomment-451687974), or mute the thread (https://github.com/notifications/unsubscribe-auth/ABRWFeOnU79u-9hCnFSpA0CKGyATz-4Pks5vAQsmgaJpZM4ZiM-K).

jseidl commented 5 years ago

Hi Adam. 3pm works best for me (12pm here pdt). I'm up for duo.

On Sat, Jan 5, 2019, 12:55 PM Adam M. Swanda <notifications@github.com wrote:

Jan,

How's tomorrow sometime between 1PM and 4PM Eastern?

Otherwise I'm open essentially anytime after 5-6pm on weekdays. I'm kind of packed today being out and about.

Wire, Google Meet (of Duo) works for me, or there's https://appear.in/ which is pretty solid too. Just lmk what time works best for you and we can just directly email invite details.

If @merces wants to be in on the call too, that'd be great. You both have much more knowledge of the current state of things than me heh

Thanks! Hope to chat soon!

Adam M. Swanda

PGP: https://keybase.io/deadbits

On Jan 5, 2019 at 3:29 PM, <Jan Seidl (mailto:notifications@github.com)> wrote:

Hi Adam.

At the current state the plugin system has I s own "acts on" which are the mimetype of the files it should, well, act on. The triggered by idea is very good we should use as well.

The plugin chaining idea was for future proofing since the idea of having a separate class of plug-ins for intelligence already solves the immediate need for chaining.

As everything will be celery, plug-ins could call the depending plugin directly, async, and it would be run by another worker node (iirc) thus not hogging that much the pipeline. I don't know. This particular feature definitely needs more thinking.

Ping me directly so we can schedule our talk, Cheers!

On Sat, Jan 5, 2019, 5:50 AM Adam M. Swanda <notifications@github.com wrote:

Read through your presentation last night - good stuff! Overall it sounds like a really solid framework and the ideas on how to scale it, create the components separately, etc., are all awesome.

I saw you had that plugins would "run in order". I might have misread or skipped a part but is the idea for plugins to run one a a time on any given Processor, or would plugins for a MIME type run in parallel via threading/multiprocessing?

These are thoughts for way down the road but just had it on my mind after reading your PDF: Another idea could be to have the plugins have an order of execution per MIME type, so each plugin can act on the results of the last. For example, maybe a Zip file comes in so it hits the "brute_zip" plugin, inside is an executable so the "yara_scan" plugin runs; the results of "yara_scan" says that the executable is Trojan ABC- so the "malware_decoder" plugin runs, and then "extract_iocs" runs on the results on malware_decoder, and so on... That way you get the results of all the plugins still but get to provide deeper levels of context as opposed to say: If file == EXE run strings and extract_iocs, sort of thing

Basically sending files down different plugins "paths" depending on their MIME type and any useful information from the previous plugin.

Also, The malware framework FAME also has a pretty cool feature for their plugins where a plugin inheriting the base class can use "acts_on", "generates", "triggered_by" and a few others. It's an interesting idea that might be useful to think on how to implement something similar. "generates" alerts of various types, or "triggered_by" another module in my example above

https://github.com/certsocietegenerale/fame/blob/ab0e9cc3640b2337dbd873a41e03987ba1ba8035/docs/modules.rst#scope

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451656787, or mute the thread < https://github.com/notifications/unsubscribe-auth/AARRoWCBHqGaYig-NVDQzpHegwE71kWsks5vAK2kgaJpZM4ZiM-K

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub ( https://github.com/merces/aleph/issues/72#issuecomment-451687974), or mute the thread ( https://github.com/notifications/unsubscribe-auth/ABRWFeOnU79u-9hCnFSpA0CKGyATz-4Pks5vAQsmgaJpZM4ZiM-K ).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451689616, or mute the thread https://github.com/notifications/unsubscribe-auth/AARRod8R4b6m2xuraDcIQUfXo-KBPsGAks5vARE6gaJpZM4ZiM-K .

deadbits commented 5 years ago

Great! I am driving at the moment but when I get stationary I’ll shoot you an email with an invite

--
Adam M. Swanda

PGP: https://keybase.io/deadbits

On Jan 5, 2019 at 3:58 PM, <Jan Seidl (mailto:notifications@github.com)> wrote:

Hi Adam. 3pm works best for me (12pm here pdt). I'm up for duo.

On Sat, Jan 5, 2019, 12:55 PM Adam M. Swanda <notifications@github.com wrote:

Jan,

How's tomorrow sometime between 1PM and 4PM Eastern?

Otherwise I'm open essentially anytime after 5-6pm on weekdays. I'm kind of packed today being out and about.

Wire, Google Meet (of Duo) works for me, or there's https://appear.in/ which is pretty solid too. Just lmk what time works best for you and we can just directly email invite details.

If @merces wants to be in on the call too, that'd be great. You both have much more knowledge of the current state of things than me heh

Thanks! Hope to chat soon!

Adam M. Swanda

PGP: https://keybase.io/deadbits

On Jan 5, 2019 at 3:29 PM, <Jan Seidl (mailto:notifications@github.com)> wrote:

Hi Adam.

At the current state the plugin system has I s own "acts on" which are the mimetype of the files it should, well, act on. The triggered by idea is very good we should use as well.

The plugin chaining idea was for future proofing since the idea of having a separate class of plug-ins for intelligence already solves the immediate need for chaining.

As everything will be celery, plug-ins could call the depending plugin directly, async, and it would be run by another worker node (iirc) thus not hogging that much the pipeline. I don't know. This particular feature definitely needs more thinking.

Ping me directly so we can schedule our talk, Cheers!

On Sat, Jan 5, 2019, 5:50 AM Adam M. Swanda <notifications@github.com wrote:

Read through your presentation last night - good stuff! Overall it sounds like a really solid framework and the ideas on how to scale it, create the components separately, etc., are all awesome.

I saw you had that plugins would "run in order". I might have misread or skipped a part but is the idea for plugins to run one a a time on any given Processor, or would plugins for a MIME type run in parallel via threading/multiprocessing?

These are thoughts for way down the road but just had it on my mind after reading your PDF: Another idea could be to have the plugins have an order of execution per MIME type, so each plugin can act on the results of the last. For example, maybe a Zip file comes in so it hits the "brute_zip" plugin, inside is an executable so the "yara_scan" plugin runs; the results of "yara_scan" says that the executable is Trojan ABC- so the "malware_decoder" plugin runs, and then "extract_iocs" runs on the results on malware_decoder, and so on... That way you get the results of all the plugins still but get to provide deeper levels of context as opposed to say: If file == EXE run strings and extract_iocs, sort of thing

Basically sending files down different plugins "paths" depending on their MIME type and any useful information from the previous plugin.

Also, The malware framework FAME also has a pretty cool feature for their plugins where a plugin inheriting the base class can use "acts_on", "generates", "triggered_by" and a few others. It's an interesting idea that might be useful to think on how to implement something similar. "generates" alerts of various types, or "triggered_by" another module in my example above

https://github.com/certsocietegenerale/fame/blob/ab0e9cc3640b2337dbd873a41e03987ba1ba8035/docs/modules.rst#scope

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451656787, or mute the thread < https://github.com/notifications/unsubscribe-auth/AARRoWCBHqGaYig-NVDQzpHegwE71kWsks5vAK2kgaJpZM4ZiM-K

.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub ( https://github.com/merces/aleph/issues/72#issuecomment-451687974), or mute the thread ( https://github.com/notifications/unsubscribe-auth/ABRWFeOnU79u-9hCnFSpA0CKGyATz-4Pks5vAQsmgaJpZM4ZiM-K ).

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/merces/aleph/issues/72#issuecomment-451689616, or mute the thread https://github.com/notifications/unsubscribe-auth/AARRod8R4b6m2xuraDcIQUfXo-KBPsGAks5vARE6gaJpZM4ZiM-K .

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub (https://github.com/merces/aleph/issues/72#issuecomment-451689804), or mute the thread (https://github.com/notifications/unsubscribe-auth/ABRWFeOfONkhePzlKngOf1W6E55NKdS8ks5vARHlgaJpZM4ZiM-K).

deadbits commented 5 years ago

@jseidl we'll have to use Meet since Duo is mobile only and doesn't support screen share etc. I just need your email address or you can send me an invite to adam@deadbits.org for today at 3PM Eastern if that still works

deadbits commented 5 years ago

We can probably close this at this point 😏