wasted / netflow

Scala/Netty Netflow Collector used at wasted.io
http://netflow.io

Storage Abstraction #12

Closed ruckc closed 9 years ago

ruckc commented 9 years ago

Just wondering if there is any chance of creating a storage abstraction layer?

It appears you have fairly decent NetFlow receiving/parsing, but you're tightly coupled to Cassandra. I was looking for a modern NetFlow parsing library to embed into a troubleshooting tool.

fbettag commented 9 years ago

Sorry for the delayed reply, I'm currently on my honeymoon in the Maldives.

What kind of backend are you looking for? We were in the process of remodelling the storage part (yet) again, but I haven't had any groundbreaking ideas so far.

Thanks for the compliment on our parser/receiver :)


ruckc commented 9 years ago

Elasticsearch, due to its ease of integration as an embedded or distributed tool.

fbettag commented 9 years ago

Hmmmm. Interesting. I tried Solr a year ago and it died after about 5 minutes of ingestion at ~20 Gbit/s, even though it was only ingesting a portion of the traffic.

If you could give me the ES schema you'd like, I'd be happy to give it a go after I return from my honeymoon at the end of March.


ruckc commented 9 years ago

For 20 Gbit/s, I'd suggest spinning up a decent Elasticsearch cluster. I currently store 4000 small (1-2k) JSON documents per second on a single-node Elasticsearch instance. For the most part it scales horizontally fairly well.

As for the schema, I'd just map the NetFlow data directly into JSON objects. But I would prefer a more generic approach with an interface and separate implementations.
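
To make the request concrete, here is a minimal sketch of what "an interface and separate implementations" could look like. Every name in it (`Flow`, `FlowStorage`, `InMemoryStorage`, `JsonLineStorage`, `FlowJson`) is hypothetical and chosen for illustration; none of it is taken from the wasted/netflow code base, and a real flow record carries far more fields.

```scala
// Hypothetical names throughout; this is not the wasted/netflow API.

// A deliberately tiny flow record; real NetFlow records have many more fields.
final case class Flow(srcAddr: String, dstAddr: String, srcPort: Int, dstPort: Int, bytes: Long)

// The storage interface the collector would depend on instead of a concrete database.
trait FlowStorage {
  def save(flow: Flow): Unit
}

// Backend 1: keep flows in memory (handy for an embedded tool or for tests).
final class InMemoryStorage extends FlowStorage {
  private val buffer = scala.collection.mutable.ArrayBuffer.empty[Flow]
  def save(flow: Flow): Unit = { buffer += flow; () }
  def all: List[Flow] = buffer.toList
}

object FlowJson {
  // Maps a flow directly onto a flat JSON object.
  // No string escaping is done, which is only acceptable for plain IP literals.
  def toJson(f: Flow): String =
    s"""{"srcAddr":"${f.srcAddr}","dstAddr":"${f.dstAddr}","srcPort":${f.srcPort},"dstPort":${f.dstPort},"bytes":${f.bytes}}"""
}

// Backend 2: hand one JSON document per flow to a caller-supplied sink.
// The sink stands in for whatever would index the document (assumption: an HTTP or bulk client).
final class JsonLineStorage(sink: String => Unit) extends FlowStorage {
  def save(flow: Flow): Unit = sink(FlowJson.toJson(flow))
}
```

With this shape the parser only ever sees `FlowStorage`, so swapping Cassandra for another backend means adding one more class rather than touching the receiving code.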


fbettag commented 9 years ago

Yeah, for us that’s not gonna happen ;) “A decent size cluster” just for 20 Gbit/s, when 3 Cassandra nodes do the job :P

But I have an idea on how to make that particular thing fit your needs. Can you give me roughly 14 days?

best regards

Franz


ruckc commented 9 years ago

I'm flexible, this is just a shiny toy idea that might one day be cool.

As for cluster size, it really depends on flows per second and disk/CPU availability. My 4000 msg/s rate is on small text documents that the Lucene indexer has to split into tokens. Indexing NetFlow data should be much simpler, as it doesn't need tokens broken out.


fbettag commented 9 years ago

https://github.com/wasted/netflow/commit/2174eb43f32c8e089e955b90766db02b4c86cd52 should fix this issue.

Depending on your needs, you can implement either a small version (Redis) or a big version (Cassandra) which stores each flow.

Sorry it took a bit longer.

Enjoy

fbettag commented 9 years ago

I forgot to mention that this patch has not yet been tested. I simply took a few days to get the compiler results that I wanted, but the API should not change. Only minor bugs are to be expected. :)

It will be tested in production over the next few weeks. By then, finagle 6.25.0 (currently a SNAPSHOT, needed for Redis) should be final, so you will no longer have to compile twitter-utils, twitter-ostrich and finagle yourself and publish-local them.

The problem here is that finagle-redis for Scala 2.11 was only finished a few weeks ago and will make it into the next final release. Why do we need 2.11? Because some NetFlows contain more than 22 parameters, and lifting that limit is a tweak in 2.11. Sorry for this; I expect it to be tested at the beginning of May.
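
For readers unfamiliar with the 22-parameter limit: Scala 2.10 rejects case classes with more than 22 fields, while Scala 2.11 and later accept them. A minimal sketch, using a hypothetical `WideRecord` with placeholder field names `f1` to `f23` (not an actual NetFlow record from this project):

```scala
// Hypothetical 23-field record; the field names are placeholders for illustration only.
// Compiles on Scala 2.11+; Scala 2.10 fails because case classes were capped at 22 parameters.
final case class WideRecord(
  f1: Int, f2: Int, f3: Int, f4: Int, f5: Int, f6: Int, f7: Int, f8: Int,
  f9: Int, f10: Int, f11: Int, f12: Int, f13: Int, f14: Int, f15: Int, f16: Int,
  f17: Int, f18: Int, f19: Int, f20: Int, f21: Int, f22: Int, f23: Int
)

object WideRecordDemo {
  // Construct one instance to show that a 23-field case class is usable like any other.
  val sample: WideRecord =
    WideRecord(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23)
}
```

Tuples and function types are still limited to 22 elements, so such a class cannot be converted to a tuple.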