influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License
14.61k stars 5.57k forks source link

Missing metrics {type="Database size"} for inputs.sqlserver when output to prometheus_client when running telegraf on windows. #3349

Closed mplisov closed 6 years ago

mplisov commented 7 years ago

Seeing strange behaviour with sqlserver metrics on telegraf 1.4.2 on Windows. Prometheus output misses {type="Database size"} metrics block completely. I can see those metrics in --test run of telegraf though. Second note: Issue doesn't appear when running same version of telegraf on linux.

SQL servers and connection credentials are the same (checked with sql server 2008,2012,2014)

danielnelson commented 7 years ago

Can you show what the output looks like on Linux vs Windows? Can you also show the relevant output when using --test?

mplisov commented 7 years ago

Output is quite big. Example of missing data Here is the sample from --test run (its identical on windows and linux):

> Log\ size\ (bytes),type=Database\ size,servername=SQL 5nine_monitor=1064960i,smg_pmc=72679424i,tempdb=18677760i,enterprise20=33751040i,agroplan_prod=789250048i,master=1835008i,model=1048576i,conversion3=78577664i,smg_itil=1538260992i,conversion2=445382656i,msdb=2621440i,ReportServerTempDB=1064960i,kmszarauto=15162540032i,SQLSentry=158334976i,agroplan_copy=789250048i,smg_docmngcorp=27262976i,smg_hrm_82_regl=838402048i,AgroPlan=789250048i,smg_modeling=251330560i,ReportServer=9633792i,MdxStepByStep=40239104i,psp_dw=174325760i,smg_82=20578304i,smg_buh_20=37158912i,Sqlbi_ManyToMany_Multidimensional=5242880i,courseconstructor=2097152i 1508285358000000000
> Rows\ size\ (bytes),servername=SQL,type=Database\ size SQLSentry=367001600i,agroplan_copy=78708736i,model=4259840i,ReportServerTempDB=4259840i,conversion3=153092096i,smg_modeling=233897984i,smg_hrm_82_regl=319881216i,smg_82=1016070144i,smg_itil=1988165632i,master=5111808i,enterprise20=5460852736i,conversion2=2602631168i,courseconstructor=58785792i,agroplan_prod=133234688i,ReportServer=5308416i,Sqlbi_ManyToMany_Multidimensional=10485760i,tempdb=180682752i,5nine_monitor=4259840i,AgroPlan=133234688i,smg_docmngcorp=375586816i,MdxStepByStep=106561536i,msdb=225640448i,kmszarauto=6821773312i,smg_pmc=29425664i,psp_dw=104857600i,smg_buh_20=7089881088i 1508285358000000000
mplisov commented 7 years ago

Figured out than metrics of type="Database IO" also missing on Windows instance. It seems that bug somewhere in parsing data to Prometheus format, because in --test output they are present. Example of missing data

> Log\ reads\ (bytes/sec),servername=SQL2012,type=Database\ IO,host=sql2012 model=0i,tempdb=0i,breedfarm=0i,newvysokovskoe=0i,crm_demo=0i,trade.standard=0i,erp2_demo=0i,enterprise20_copy=0i,trade11.standard=0i,newzalmilk=0i,erp2_agr_demo=0i,master=0i,erp2_demo1=0i,enterprise20_test=0i,erp_eurocond=0i,nmsztrade=0i,zalfoods=0i,docmngcorp_demo=0i,Total=0i,OpManagerDB=0i,consolidationprof=0i,learn_expert1c=0i,crm3_demo=0i,docmngcorp_test=0i,enterprise20_copy2=0i,msdb=0i,nmsztrade.sample=0i,enterprise20_demo=0i,enterprise_standard=0i 1508282208000000000
danielnelson commented 7 years ago

My access to a Windows machine is limited, but I will try to reproduce this Friday.

danielnelson commented 7 years ago

I created a powershell script that echos some of the lines from the --test output, and ran that as an exec plugin, but I wasn't missing any data from the prometheus output:

# HELP Log_reads__bytes_sec__enterprise20_demo Telegraf collected metric
# TYPE Log_reads__bytes_sec__enterprise20_demo untyped
Log_reads__bytes_sec__enterprise20_demo{host="sql2012",servername="SQL2012",type="Database IO"} 0

Can you enable a file output and check if the missing data is written?

mplisov commented 6 years ago

I think I've found the bug. For some strange reason if I specify connection host as fqdn/hostname, I get missing metrics, but when I specify IPv4 address like 127.0.0.1 it works as expected. This explains why metrics collected from telegraf on linux were ok, because linux machine was connecting using IPv4 address. Good question why this happens...

danielnelson commented 6 years ago

Wow, thats odd. Can you duplicate this just using curl or similar and diffing the output?

mplisov commented 6 years ago

Well, it was too quick decision. I was wrong on ipv6 thing, sorry. Attempt #2. I can reproduce the error-behaviour when telegraf is running as windows service as LocalSystem account. When change LocalSystem account to my domain account or run it from comandline everything is ok. I will check it on couple other servers in two days and report back.

allangood commented 6 years ago

Hi Daniel,

I'm getting this exactly behavior, but after upgrade from 1.5.3 to 1.6.X or 1.7.X for both, Windows or Linux boxes.

Some outputs like inputs.mem and inputs.sqlserver just disappears from the prometheus exposed list (http://myserver/metrics) but works perfectly with --test switch on command line.

I've tried to run as different user (on windows) and using IP address instead of name as well, and none worked. There's no relevant logs on log file or running with "debug" flag as true.

I will be glad to provide more information with you need.

Thank you.

My configuration:

[global_tags]
  collector = "telegraf"
  fqdn = "myserver.mydomain"
  location = "mylocation"
  ostype = "windows"
[[outputs.prometheus_client]]
  expiration_interval = "60s"
  listen = "0.0.0.0:9273"
[[inputs.mem]]
[[inputs.sqlserver]]
  query_version = 2
  servers = [ "Server=localhost\\instance;User Id=telegraf;Password=telegraf;app name=telegraf;encrypt=disable;log=0"]  

Output from command line:

c:\Program Files\telegraf>telegraf.exe -config "c:\Program Files\telegraf\telegraf.conf" --config-directory "c:\Program Files\telegraf\telegraf.d" --test --input-filter mem

> mem,collector=telegraf,fqdn=myserver,host=myserver,location=mylocation,ostype=windows active=0i,available=8255950848i,available_percent=48.057264930773286,buffered=0i,cached=0i,free=i,inactive=0i,slab=0i,total=17179402240i,used=8923451392i,used_percent=51.942735069226714,wire=0i 1522960875000000000

Output from curl "http://myserver/metrics":

# TYPE mem_available_percent counter
mem_available_percent{collector="telegraf",fqdn="myserver",host="myserver",location="mylocation",ostype="windows"} 48.13911625367473
# HELP mem_used_percent Telegraf collected metric
# TYPE mem_used_percent counter
mem_used_percent{collector="telegraf",fqdn="myserver",host="myserver",location="mylocation",ostype="windows"} 51.86088374632527

Output from version 1.5.3:

curl http://myserver02/metrics

# HELP mem_active Telegraf collected metric
# TYPE mem_active counter
mem_active{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 0
# HELP mem_available Telegraf collected metric
# TYPE mem_available counter
mem_available{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 5.204361216e+09
# HELP mem_available_percent Telegraf collected metric
# TYPE mem_available_percent counter
mem_available_percent{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 17.31076795555045
# HELP mem_buffered Telegraf collected metric
# TYPE mem_buffered counter
mem_buffered{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 0
# HELP mem_cached Telegraf collected metric
# TYPE mem_cached counter
mem_cached{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 0
# HELP mem_free Telegraf collected metric
# TYPE mem_free counter
mem_free{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 0
# HELP mem_inactive Telegraf collected metric
# TYPE mem_inactive counter
mem_inactive{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 0
# HELP mem_slab Telegraf collected metric
# TYPE mem_slab counter
mem_slab{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 0
# HELP mem_total Telegraf collected metric
# TYPE mem_total counter
mem_total{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 3.0064300032e+10
# HELP mem_used Telegraf collected metric
# TYPE mem_used counter
mem_used{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 2.4859938816e+10
# HELP mem_used_percent Telegraf collected metric
# TYPE mem_used_percent counter
mem_used_percent{collector="telegraf",fqdn="myserver02.mydomain",host="myserver02",location="mylocation",ostype="windows"} 82.68923204444955
danielnelson commented 6 years ago

@allangood Thanks for the help testing, I can reproduce this and will work on a fix. I think this is different from the original issue though, can you open this as a new issue?

danielnelson commented 6 years ago

@mplisov I hope you were able to resolve the original problem, I'm going to assume no new is good news and close this issue, let me know if it should be reopened.

allangood commented 6 years ago

Hi @danielnelson,

I opened issue number #3977

Thank you!