influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

Windows version #30

Closed JulienChampseix closed 8 years ago

JulienChampseix commented 9 years ago

Hi, is a Windows version planned? Thanks for your feedback.

discoduck2x commented 9 years ago

+1

discoduck2x commented 9 years ago

Any comments at least!?

jbrantly commented 9 years ago

+1

sparrc commented 9 years ago

At the moment, no, but it could be considered for a future enhancement. What are the sorts of services you're looking for monitoring?

discoduck2x commented 9 years ago

I would say initially have a plugin that supports any Windows performance counter, including wildcards on the instance name, for example \process(*)\% Processor Time or \process(sql*)\% Processor Time. There are some attempts out there which could serve as inspiration: collectm on GitHub uses a Node.js wrapper for typeperf, which is really non-intrusive CPU-wise. Among other flavors there's collectw, which has some nice touches such as checking the config file for changes, meaning you can push it out from a central place. Just remember WMI is expensive; use something more native.

Vye commented 9 years ago

No Windows support is a deal breaker. Looking forward to seeing this implemented.

UPDATE: A WMI plugin would cover 90% of what I need. I use Nagios for service status and some low resolution trending. There isn't a good cross-platform metrics publisher that I'm aware of.

What are Windows users using to ship (at least system) metrics to InfluxDB today?

elvarb commented 8 years ago

@Vye This has worked very well for me, as fully featured as you can get. https://github.com/MattHodge/Graphite-PowerShell-Functions

TopBeat from Elastic is in beta and supports windows but it does not support all wmi calls. What it does that is very good is that it monitors and ships metrics for all running processes. So you get per process memory and cpu usage. Very cool.

ymettier commented 8 years ago

Hello,

I saw the "Help Wanted" tag and maybe you would be interested in this : https://github.com/cloudfoundry/gosigar

I saw at least 2 collectors that use gosigar, including Topbeat (from Elastic).

You may also want to see how Mackerel works :

For WMI (including processes), you can use https://github.com/StackExchange/wmi. For processes, a code example is at the start of https://github.com/StackExchange/wmi/blob/master/wmi.go

I hope it helps.

Regards, Yves

sparrc commented 8 years ago

@ymettier gosigar specifically doesn't support windows

The other libraries look useful, thanks for the recommendations :+1:

ymettier commented 8 years ago

Hello,

About gosigar, I'm confused...

https://github.com/cloudfoundry/gosigar/blob/master/sigar_windows.go

func (self *ProcList) Get() error {
    return notImplemented()
}

OK, you are right. I'm sorry.

But from topbeat : https://github.com/elastic/topbeat/blob/master/Godeps/_workspace/src/github.com/elastic/gosigar/sigar_windows.go

func (self *ProcList) Get() error {

    var enumSize int
    var pids [1024]C.DWORD

    // If the function succeeds, the return value is nonzero.
    ret, _, _ := procEnumProcesses.Call(
        uintptr(unsafe.Pointer(&pids[0])),
        uintptr(unsafe.Sizeof(pids)),
        uintptr(unsafe.Pointer(&enumSize)),
    )
    if ret == 0 {
        return syscall.GetLastError()
    }

    results := []int{}

    pids_size := enumSize / int(unsafe.Sizeof(pids[0]))

    for _, pid := range pids[:pids_size] {
        results = append(results, int(pid))
    }

    self.List = results

    return nil
}

So I'm sorry for the wrong URL. I would understand if you do not want to use gosigar (I would not use it either after this confusion). However, you can get inspiration from the Topbeat version of gosigar.

Regards, Yves

ymettier commented 8 years ago

Hello again...

Reading again @discoduck2x's comment about typeperf and collectm...

Collectm uses http://markitondemand.github.io/node-perfmon/. Reading the code of this module is very interesting and I would have recommended a similar implementation 1 or 2 months ago... Every time you call perfmon() to add a counter, it will add it in a list. Then it will run typeperf with the list of counters. And restart it when the list changes (e.g. when you call perfmon() again).

Today I would not recommend this implementation because I noticed that the number of counters is limited. I have not investigated how many counters you can ask for, but the limit is probably due to the length of the command line. This is a bug in Collectm (and in node-perfmon).

But I have no idea on how to do it without calling as many typeperf as needed.
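
One way to stay under a command-line length limit is to split the counter list into batches and start one typeperf per batch. The following is a minimal sketch, assuming the limit really is the command-line length; `batchCounters` and the `maxLen` budget are hypothetical names and values, not part of Collectm or node-perfmon. typeperf also documents a `-cf <filename>` option that reads counters from a file, which may avoid the limit altogether; that is not tested here.

```go
package main

import "fmt"

// batchCounters splits counter paths into groups whose combined length
// (each path plus two quotes and a separating space) stays within maxLen.
// A single path longer than maxLen still gets a batch of its own.
func batchCounters(counters []string, maxLen int) [][]string {
	var batches [][]string
	var current []string
	used := 0
	for _, c := range counters {
		cost := len(c) + 3 // two quotes plus a separating space
		if len(current) > 0 && used+cost > maxLen {
			batches = append(batches, current)
			current, used = nil, 0
		}
		current = append(current, c)
		used += cost
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches
}

func main() {
	counters := []string{
		`\Processor(_Total)\% Processor Time`,
		`\Memory\Available MBytes`,
		`\LogicalDisk(C:)\% Disk Time`,
	}
	// 70 is an artificially small budget so the split is visible.
	for i, b := range batchCounters(counters, 70) {
		fmt.Printf("typeperf invocation %d: %d counter(s)\n", i+1, len(b))
	}
}
```

Each batch would become the argument list of one typeperf process, so the number of processes grows with the total length of the counter list rather than with the number of counters.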

EDIT. https://github.com/lxn/win/blob/master/pdh.go is probably a good starting point for typeperf, perfmon & co. This is just about pdh.dll.

Regards, Yves

samahee commented 8 years ago

I downloaded https://s3.amazonaws.com/get.influxdb.org/telegraf/telegraf_0.1.9_amd64.msi. The measurements are cpu and mem. How do I collect network and disk metrics?

oliverjanik commented 8 years ago

Looks like datadog client uses WMI and Event Log https://github.com/DataDog/dd-agent/tree/master/checks.d

sparrc commented 8 years ago

Yes, we will need to do something with the Windows event log, this is going to be difficult though because I guess we'll need to have our own log wrapper that uses either stdout/stderr or the windows event log depending on the system.
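
A log wrapper of that kind could look roughly like the sketch below. Everything in it is hypothetical (the `Logger` interface, `streamLogger`, `backendFor`) and is not Telegraf's API; the `asService` switch is an added assumption, and only the stdout/stderr backend is implemented. A Windows event log writer would sit behind the same interface.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"runtime"
)

// Logger is a hypothetical minimal logging interface, not Telegraf's API.
type Logger interface {
	Info(msg string)
	Error(msg string)
}

// streamLogger writes info messages to one stream and errors to another.
type streamLogger struct {
	out, err io.Writer
}

func (l streamLogger) Info(msg string)  { fmt.Fprintln(l.out, "INFO  "+msg) }
func (l streamLogger) Error(msg string) { fmt.Fprintln(l.err, "ERROR "+msg) }

// backendFor names the backend to use for a given OS and run mode.
// Assumption: the event log is only wanted when running as a Windows service.
func backendFor(goos string, asService bool) string {
	if goos == "windows" && asService {
		return "eventlog"
	}
	return "stream"
}

func main() {
	backend := backendFor(runtime.GOOS, false)
	// Only the stream backend is implemented in this sketch.
	var log Logger = streamLogger{out: os.Stdout, err: os.Stderr}
	log.Info("selected log backend: " + backend)
}
```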

https://github.com/golang/sys/tree/master/windows/svc/example

ghost commented 8 years ago

I tried https://s3.amazonaws.com/get.influxdb.org/telegraf/telegraf_0.1.9_amd64.msi on Windows Server and it works fine! But I didn't find an official download page for the Telegraf Windows version. Where can I find the newest Telegraf release for Windows?

sparrc commented 8 years ago

@dbellantuono It's not officially supported yet so we aren't distributing it. Packaging up 0.1.9 was a bit of a one-off, there are also many plugins you will find don't work properly.

JulienChampseix commented 8 years ago

Is it more official now? Where can I find a list of the MSI builds?

sparrc commented 8 years ago

@JulienChampseix It's not, sorry, I have many other core changes to telegraf to work on right now. I will be sure to update this case when I've made progress.

hurrycaine commented 8 years ago

I saw the question above about what people are using to get Windows metrics into InfluxDB. I am using the Sensu monitoring framework. There are a few checks (for Windows) and they are easy to extend. Normally Sensu spins up another process when doing a check, but you can also write extensions where the code runs in the main Ruby loop. You can write the checks in really anything, but most are in Ruby and shell out to run some kind of command line or use a WMI gem of some sort.

I am using this wmi_metrics currently, but I have the network stuff commented out as it caused WMI to hang eventually.

https://github.com/sensu/sensu-community-plugins/blob/master/extensions/checks/wmi_metrics.rb

Since I am using other metrics on Linux that also write out to Graphite, I am using this great handler here: https://github.com/jhrv/sensu-influxdb-extension

I then am using a check that queries influxdb and is really flexible on what to search for and what are the thresholds. https://github.com/zeroXten/check_influxdb_query

Then I'm using Graphite to show the metrics, which you can link or embed into Uchiwa (the Sensu dashboard).

cwegener commented 8 years ago

So, having read the comments in this issue, I still have no idea what the consensus is about designing such a feature for Telegraf. I'm rather brand new to Telegraf, but I do have a little bit of Windows experience. So, my question is, would it be reasonable to implement this feature via the go-com wrapper as StackExchange are doing in their go-wmi code? I am a bit concerned about the effect on code maintainability when using so many wrappers. I think a pure golang implementation of this feature might be a better way to go :smile: ... (sorry for the bad pun).

The only way that I can see to implement this purely in golang is WinRM.

For your reference, here is a good comparison of RPC,WMI and WinRM: http://blogs.technet.com/b/josebda/archive/2010/04/02/comparing-rpc-wmi-and-winrm-for-remote-server-management-with-powershell-v2.aspx

There also seems to be a bit of existing WinRM golang source out there: https://github.com/masterzen/winrm

EDIT: Forgot to comment on the other wrapper idea about using the https://github.com/lxn/win/blob/master/pdh.go wrapper. This approach would indeed alleviate the concern that I think exists with the COM-wrapper approach.

cwegener commented 8 years ago

As a general Windows perf counter project to use as inspiration and help, I guess the C# PerfTap implementation that sends Graphite output is a very useful reference: https://github.com/Iristyle/PerfTap#other-historical-notes

TheFlyingCorpse commented 8 years ago

@cwegener - I got a very rough proof of concept (outside of telegraf) working with the github.com/lxn/win example you hinted at. I'm new to programming in general; I'm used to scripting and "if it works, don't optimize". I need to figure out some basic stuff with Go and programming in general regarding pointer/reference handling. I hope to get this done this week, hopefully useful on most if not all Performance Counters.

TheFlyingCorpse commented 8 years ago

@cwegener - Got a lot further today, the proof of concept is working standalone in the expected way, I hope to move this to telegraf on Friday or Saturday. Tomorrow is maintenance evening @ work.

Sample of how I am specifying performance counters:

[[inputs.win_pdh.perfobjects]]
objectName = "DFS Namespace Service Referrals"
Counters = ["Requests Processed"]
Instances = ["*"]

[[inputs.win_pdh.perfobjects]]
objectName = "DFS Namespace Service Referrals"
Counters = ["Requests Processed","Avg. Response Time"]
Instances = ["Trusted Domain Referrals", "Sysvol-Netlogon Referrals"]

[[inputs.win_pdh.perfobjects]]
objectName = "DFS Replication Service Volumes"
Counters = ["Data Lookups"]
Instances = ["*"]

[[inputs.win_pdh.perfobjects]]
objectname = "LogicalDisk"
counters = ["% Disk Time","IntentionallyInvalid"]
instances = ["C:","F:","G:"]
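
For illustration, each perfobjects table above can be thought of as expanding into one PDH counter path per instance and counter, in the `\Object(Instance)\Counter` form. The sketch below assumes that mapping; `perfObject` and `counterPaths` are hypothetical names, not the plugin's code, and the no-instance case is an added assumption.

```go
package main

import "fmt"

// perfObject mirrors one [[inputs.win_pdh.perfobjects]] table from the sample.
type perfObject struct {
	ObjectName string
	Counters   []string
	Instances  []string
}

// counterPaths expands an object into PDH-style paths of the form
// \Object(Instance)\Counter, or \Object\Counter when no instances are given.
func counterPaths(o perfObject) []string {
	var paths []string
	if len(o.Instances) == 0 {
		for _, c := range o.Counters {
			paths = append(paths, fmt.Sprintf(`\%s\%s`, o.ObjectName, c))
		}
		return paths
	}
	for _, inst := range o.Instances {
		for _, c := range o.Counters {
			paths = append(paths, fmt.Sprintf(`\%s(%s)\%s`, o.ObjectName, inst, c))
		}
	}
	return paths
}

func main() {
	disk := perfObject{
		ObjectName: "LogicalDisk",
		Counters:   []string{"% Disk Time", "IntentionallyInvalid"},
		Instances:  []string{"C:", "F:", "G:"},
	}
	for _, p := range counterPaths(disk) {
		fmt.Println(p)
	}
}
```

Invalid entries such as "IntentionallyInvalid" still produce a path here; rejecting them would be the job of whatever validates the query against PDH.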

cwegener commented 8 years ago

@TheFlyingCorpse - That sounds great! I'm a lousy programmer myself. :wink: But I'm more than happy to code review and of course to deploy any telegraf code to some spare machines I have running around. :smile:

TheFlyingCorpse commented 8 years ago

@cwegener - Get the latest from git if possible, the pull request was accepted. Performance Counters wooo !

cwegener commented 8 years ago

@TheFlyingCorpse Excellent work. I will build a package and run it on a few systems. I just read through commit 15ec51a179141094b4f8b39d1d4079169cc354bb. One important comment should be added to plugins/inputs/win_perf_counters/README.md: only Windows Vista/2008 and higher will work with this plugin, due to the use of the following Win32 performance counter function:

    ret = win.PdhAddEnglishCounter(handle, query, 0, &counterHandle)

https://msdn.microsoft.com/en-us/library/windows/desktop/aa372536(v=vs.85).aspx

I am certainly not advocating that the plugin should support older versions than Vista! But I know from my professional experience that still to this day, there are a lot of older versions of Windows being used in many places. :anguished: People will have all sorts of different expectations and the README.md file is a good place to start managing those expectations I guess.

cwegener commented 8 years ago

@TheFlyingCorpse Package is built and working. Deploying to some machines now. Found one issue: the -sample-config output is broken for all the counters in your sample string that have a percentage sign in the counter name. I'll have a dig around and see why that is.

cwegener commented 8 years ago

@TheFlyingCorpse Also, the following counters are not being printed to stdout during telegraf -test:

This only affects -test. All counters are correctly gathered and sent to the output plugin (InfluxDB in my case).

cwegener commented 8 years ago

-sample-config output bug is fixed in pull request #620

discoduck2x commented 8 years ago

@cwegener if you gather a fair number of counters every second, how does that affect the system? What is the telegraf process CPU usage?

discoduck2x commented 8 years ago

@cwegener @TheFlyingCorpse, "all" the other attempts I've tested (collectm, collectw, PerfTap, PowerShell functions, etc.) induce a fair amount of CPU usage themselves, which is bad. I'm currently just parsing typeperf files and shipping their contents to Influx from another system. If Telegraf gets down to a resource footprint similar to running typeperf on each "monitored" host, then this will be the go-to solution for anyone wanting host metrics from Windows hosts. Note it's important to test with a polling interval of at least once per second.

I'm travelling at the moment but will most likely test this on Sunday when I get back - the hype is real!

cwegener commented 8 years ago

@discoduck2x I will collect and graph the CPU usage of telegraf.exe and report back. How many counters do you consider to be a 'fair amount'? My polling interval is currently at 10 seconds (since all my other influxdb databases are at 10 seconds). I can run some tests at 1 second.

discoduck2x commented 8 years ago

@cwegener I'd say 50-100. Not all would need a per-second interval, but CPU, NIC, process etc. would need a frequent interval. Is there a possibility to set the interval per counter group?

cwegener commented 8 years ago

@discoduck2x I have now added the collection of "% Processor Time" for all instances of "Process" to my test setup. This now gives me ~150 counters. I will leave it running for a bit and see how the median and average % Processor time of telegraf.exe looks ... For now, I have not been able to make telegraf.exe go beyond 0.2% Processor utilization (I'm still at 10 second interval though ...)

cwegener commented 8 years ago

@discoduck2x I have dropped my telegraf test database and I am now running a 1 second interval for all telegraf counters. I am still collecting ~150 counters on the test machine. The "% Processor Time" mean value for 'telegraf.exe' still sits below 0.2 percent. I will leave it running like that for a while and post a screenshot once I have 12 hours worth of data in the database.

cwegener commented 8 years ago

I can see a bit of Processor utilization by telegraf.exe now, but it is still considerably low (Over a 5 minute window. avg: 0.3%, max 5% on a test machine with 2 vCPU - Xeon E5 2640 v2)

[Image: screenshot, 2016-01-30 at 8:33:13 pm]

elvarb commented 8 years ago

Very promising everyone. In my experience with Graphite PowerShell and Windows counters, some counters take more power than others, so having the ability to group counters by polling interval would be a very useful addition.

One counter that I remember being heavy is disk utilization, free space. Would be interesting to see a test of one of those with 1 sec interval. This is also one example of a counter that is not that useful to have at 1 sec interval, but rather in 5 minutes or more intervals.

TheFlyingCorpse commented 8 years ago

Happy to hear that @cwegener , my intention was to gather with a very low footprint.

The plugin iterates over the configuration on startup to find the valid queries, then saves these with a handle to the search, so it can just query for the next value every interval instead of creating a new query, waiting for the results (1s+ needed on the first sample), cleaning up, waiting for the interval, and doing the whole process all over again. What this translates to, as it is now, is that if whatever you want to collect is not in the perfobjects on Telegraf's startup, it will not be discovered later when it appears; Telegraf must be restarted. It is simple to add support for this: it could rediscover which queries are valid after X gathers, say every 100 gathers or so.
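
The "rediscover after X gathers" idea can be sketched as follows. This is only an illustration under assumptions: `collector`, `discover`, and `rediscoverEvery` are hypothetical names, and the discovery step is a stand-in function rather than real PDH query validation.

```go
package main

import "fmt"

// collector caches the set of valid queries and refreshes it periodically.
// All names here are hypothetical; this is not the plugin's actual code.
type collector struct {
	discover        func() []string // returns the currently valid counter paths
	rediscoverEvery int             // refresh after this many gathers; 0 disables
	gathers         int
	valid           []string
}

// Gather returns the counter paths to query on this interval. Discovery runs
// on the first call and then again every rediscoverEvery calls.
func (c *collector) Gather() []string {
	if c.gathers == 0 || (c.rediscoverEvery > 0 && c.gathers%c.rediscoverEvery == 0) {
		c.valid = c.discover()
	}
	c.gathers++
	return c.valid
}

func main() {
	calls := 0
	c := &collector{
		discover: func() []string {
			calls++
			return []string{`\Processor(_Total)\% Processor Time`}
		},
		rediscoverEvery: 100,
	}
	for i := 0; i < 250; i++ {
		c.Gather()
	}
	fmt.Println("discovery runs:", calls)
}
```

Between refreshes the cached list is reused as-is, which keeps the per-interval cost at the cheap "query the next value" step.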

discoduck2x commented 8 years ago

@TheFlyingCorpse @cwegener, how do I test this? Is there a compiled binary somewhere, or do I have to download the source from GitHub?

One question: can you do wildcards on the counters? For example, "\process(*sql*)\% Processor Time" would give you all processes with sql in their process name.

TheFlyingCorpse commented 8 years ago

@discoduck2x - I can provide a compiled version if you'd like. It is not too hard to build your own, maybe 10-15 minutes tops. Install git, make, and Go, all on your PATH. Then run set gopath=C:\temp\work, go get github.com/influxdata/telegraf, cd %gopath%\src\github.com\influxdata\telegraf, and make windows. Inside that folder there should now (as of the current git version) be a telegraf.exe that you can play with :)

On the wildcard for counters, that is a good idea and I see the use case for it! The issue with it is that processes stop and start. One way could be to query for all instances containing chrome, or to get all instances and filter out those not matching chrome, then save these for querying later. This would likely need some basic functionality to rediscover which queries are valid at some interval, due to processes starting and stopping. A big downside of performance counters in general, when the instances are processes, is that one day it's sql#1, and tomorrow, if the service was restarted for some reason, it might be sql#9.
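
The "get all instances and filter" variant could be sketched like this. It is an assumption-laden illustration: `matchInstances` is a hypothetical helper, the instance list is hard-coded rather than read from PDH, and case-insensitive matching is assumed. It uses Go's `path.Match`, so `?` and `[` in a pattern are also treated as wildcards.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// matchInstances returns the instance names matching a shell-style pattern
// such as "sql*" or "*chrome*". Matching is case-insensitive.
func matchInstances(pattern string, instances []string) ([]string, error) {
	var out []string
	p := strings.ToLower(pattern)
	for _, inst := range instances {
		ok, err := path.Match(p, strings.ToLower(inst))
		if err != nil {
			return nil, err // malformed pattern
		}
		if ok {
			out = append(out, inst)
		}
	}
	return out, nil
}

func main() {
	// Hard-coded stand-in for the instances of the "Process" object.
	running := []string{"chrome", "chrome#1", "sqlservr", "SQLAGENT#9", "telegraf"}
	matched, err := matchInstances("sql*", running)
	if err != nil {
		fmt.Println("bad pattern:", err)
		return
	}
	fmt.Println(matched)
}
```

Because the match is on the name prefix rather than the full instance name, a restart that turns sql#1 into sql#9 would still be picked up on the next rediscovery.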

cwegener commented 8 years ago

@discoduck2x Regarding wildcards for instance names - I would probably prefer to simply collect all instances with the win_perf_counters plugin and do the filtering at a later stage due to the complexities involved when working with instance names. Though I clearly agree with your use case. However, the only benefit I can see right now from filtering down the number of instances being collected would be reduced storage footprint in your influxdb/otsdb/graphite ... Once all the measurements are in your database, it is absolutely trivial to filter the results (e.g. in InfluxQL):

SELECT * FROM win_process WHERE instance =~ /.*sql.*/ AND time > now() - 10s

discoduck2x commented 8 years ago

@cwegener & @TheFlyingCorpse, first, great discussing this with you! I don't think it makes sense to collect all instances; you will end up with too many, and with that too big a CPU footprint for the collection process. Sure, there might be people wanting to collect CPU, memory, I/O usage etc. for ALL running processes, and for all processes that will ever start after the telegraf process starts, but I don't think that makes much sense.

If you are keen on collecting metrics about a few processes that matter to you, then surely you are not allowing the system to spawn unwanted or uncontrolled processes here and there just like that, right? My use case is within the finance industry, where we need high performance and low latency, and we even skip the server manufacturers' "own" processes like HP Insight Management Agents because they themselves add load to and delay the system. At least make it controllable: if someone wants to collect ALL, then let that be possible, but don't collect all just because it seems easier from an implementation perspective, since it does not make sense.

Also, the counters, if collected with typeperf, work totally fine with stars as wildcards (I don't know how to type a star here without it getting replaced :) ). I think collectm or collectw (can't remember which one) monitors its config file for changes every 30 seconds or so, and if something gets added there, like a new counter, then that counter will be collected going forward. The problem with collectm/collectw, though, was that every time it rechecked or rescheduled due to a config file change, you'd peg one CPU core for 1-2 seconds... :)

Also, if using typeperf (sorry for spamming about typeperf all the time, but it's really great at not making a footprint while collecting a lot of counters), if, let's say, you monitor a process by instance name with a wildcard and that process terminates and starts up again after X time, it will be collected again (probably just as long as the process name is the same). I do agree with your comment on the SQL processes that seem to get a random number attached at startup (no biggie though, I think).

By the way, if I'm going to try building this myself, which Go version should I use, 1.4 or 1.5?

TheFlyingCorpse commented 8 years ago

@discoduck2x, I might have a solution for wildcards in the instance name.

On pre-Vista compatibility, I might also have a solution here.

In the Pull Requests for Telegraf there is now also support for collecting a specific object every X iteration of the telegraf interval.
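
Collecting an object only every Xth iteration amounts to a modulo check on the iteration count. A minimal sketch, with hypothetical names (`group`, `Every`, `due`) that are not taken from the pull request:

```go
package main

import "fmt"

// group is a hypothetical set of counters collected every Nth agent interval.
type group struct {
	Name  string
	Every int // 1 = every interval, 300 = every 300th interval
}

// due reports whether the group should be collected on the given iteration
// (iterations count from 0). Values of Every below 1 are treated as 1.
func due(g group, iteration int) bool {
	every := g.Every
	if every < 1 {
		every = 1
	}
	return iteration%every == 0
}

func main() {
	groups := []group{{"Processor", 1}, {"LogicalDisk free space", 300}}
	for _, it := range []int{0, 1, 300} {
		for _, g := range groups {
			if due(g, it) {
				fmt.Printf("iteration %d: collect %s\n", it, g.Name)
			}
		}
	}
}
```

With a 1 second agent interval, a group with Every set to 300 would be sampled once every 5 minutes, which fits slow-moving counters such as free disk space.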

On which Go version: I expect both 1.4 and 1.5 to work. I have tested only with 1.5, as that is what came down via Chocolatey.

TheFlyingCorpse commented 8 years ago

Pre-Vista compatibility is "added" in my local copy for now, and I'm testing a bit more. It will use the pre-Vista method of adding perfcounters, which will require users to use the localized counter names instead of the English ones that work across installed languages.
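
The fallback can be pictured as a choice between two PDH functions based on the Windows major version (Vista and Server 2008 are version 6.0). The sketch below only models that decision; `addCounterFunc` is a hypothetical helper, and how the version is obtained and how the chosen function is called are left out.

```go
package main

import "fmt"

// addCounterFunc picks the PDH function used to add a counter, based on the
// Windows major version (6 = Vista / Server 2008). The second result reports
// whether counter names must be given in the system's display language.
func addCounterFunc(major uint32) (name string, localized bool) {
	if major >= 6 {
		return "PdhAddEnglishCounter", false
	}
	return "PdhAddCounter", true
}

func main() {
	for _, v := range []uint32{5, 6, 10} {
		name, localized := addCounterFunc(v)
		fmt.Printf("major version %d: %s (localized names: %v)\n", v, name, localized)
	}
}
```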

TheFlyingCorpse commented 8 years ago

Pre-Vista support is committed locally; I'm awaiting the merge of my two other PRs before pushing it. On the wildcard match, the API doesn't seem to support it.

Depending on how many processes you want data for, for example for Chrome you can do Instances=["chrome","chrome#1","chrome#2","chrome#3","chrome#4"]. I can easily add a function so you can select counters to be handled as tags as well, so you can follow your processes between instances, with "ID Process" for example. This way it would be feasible to have a trigger or counter for rediscovering the valid queries, to find any that have started since the previous discovery.

discoduck2x commented 8 years ago

@TheFlyingCorpse then the API seems flawed? The native way of collecting counters does support wildcards in instance names, so is there no way round this? I will not be able to go for this unless I can have a generic config file which specifies all our proprietary service/process names (or parts of their names, that is) with wildcards to push out to 200+ physical boxes :( I will have to keep parsing typeperf files with PowerShell until something comes along that does it.

cwegener commented 8 years ago

@discoduck2x typeperf.exe is a cli tool that talks to the api. Implementation details of typeperf.exe are not easily accessible without a source code license from Microsoft. But if you have a look at @TheFlyingCorpse's implementation you will see that you can change the code yourself very easily to perform the discovery/validation of the pdh query on every call to gather.

My guess is that performing the discovery/validation of the pdh query every time gather gets called, probably isn't going to add a lot of processing overhead after all.

cwegener commented 8 years ago

On another note, I am working on the code to be able to run telegraf as a Windows service. I have a functional PoC running. Depending on how much time I can spend on telegraf development this week, I might have a pull request ready in the next week or so.

sparrc commented 8 years ago

@cwegener how are you implementing that? what is currently blocking it running as a service? I was planning on using something like https://nssm.cc/ to wrap the binary

discoduck2x commented 8 years ago

@cwegener, oh I see, sorry if I sounded a bit naive, I admit I did :) I'm an ops end user here pretty much, no dev skills.