microsoft / SCXcore

System Center Cross Platform Provider for Operations Manager
Microsoft Public License

Issues with Docker Redeploys #89

Open samisms opened 7 years ago

samisms commented 7 years ago

Re-raising from https://github.com/Microsoft/OMS-docker/issues/76 Comments below were from @kevi5702 .

Really hoping someone can point me in the right direction here. Every time we redeploy our Docker instances, our SCX log (/var/opt/microsoft/scx/log/scx.log) begins to fill very rapidly with the following messages:

2017-09-29T15:02:39,523Z Error [scx.core.common.pal.system.disk.statisticallogicaldiskinstance:############] statvfs() failed for /var/lib/docker/overlay/######################/merged; errno = 2

Running systemctl restart omsagent##### seems to take care of this, but we expected the agent to notice when a container goes away and to stop trying to stat the directory where it used to be mounted.

So far we have tried removing and reinstalling the OMS bundle. Is there something else we are doing wrong here?

@kevi5702

Digging around, this error seems to match the code here in the PAL software:

https://github.com/Microsoft/pal/blob/master/source/code/scxsystemlib/disk/statisticallogicaldiskinstance.cpp#L269

I'm wondering if overlay needs to be added to an exclude list somewhere. However, manually creating a basic overlay mount doesn't produce the statvfs errors when it's unmounted, only the warning about overlay not being recognized.

I was able to reproduce this on a new CentOS image with a fresh OMS deployment, just by running a basic hello-world container.

Restarting omid.service appears to make this go away, so I'm not sure whether something needs to be told to update its state when a container is removed.

This only seems to trigger when logical disk performance counters are enabled, and only when the file system is a Docker overlay FS mount.
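To see how many removed container mounts are behind the flood, the failing paths can be pulled out of the log. The following is a minimal sketch, assuming the message format quoted above; the function name stale_statvfs_paths is hypothetical and not part of SCX or OMI.

```shell
#!/bin/sh
# Print each distinct path for which statvfs() failed with errno = 2 (ENOENT).
# Reads the files given as arguments, or standard input if none are given.
# Assumes the scx.log message format quoted in this issue.
stale_statvfs_paths() {
    sed -n 's/.*statvfs() failed for \(.*\); errno = 2$/\1/p' "$@" | sort -u
}

# Usage, with the log path mentioned in this issue:
# stale_statvfs_paths /var/opt/microsoft/scx/log/scx.log
```

Each path printed corresponds to one mount point the agent is still polling after it disappeared.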

kevi5702 commented 7 years ago

Thanks @samisms. Adding the manual removal command we started running whenever this shows up in the logs:


[root@dockercentos log]# /opt/omi/bin/omicli iv root/scx { SCX_FileSystem } RemoveByName { Name /var/lib/docker/overlay/######/merged }
instance of SCX_FileSystem
{
    Caption=File system information
    Description=Information about a logical unit of secondary storage
    [Key] Name=/var/lib/docker/overlay/######/merged
    [Key] CSCreationClassName=SCX_ComputerSystem
    [Key] CSName=dockercentos
    [Key] CreationClassName=SCX_FileSystem
    Root=/var/lib/docker/overlay/######/merged
    BlockSize=0
    FileSystemSize=0
    AvailableSpace=0
    ReadOnly=false
    EncryptionMethod=Unknown
    CompressionMethod=Unknown
    CaseSensitive=true
    CasePreserved=true
    MaxFileNameLength=0
    FileSystemType=overlay
    PersistenceType=0
    IsOnline=false
}
instance of RemoveByName
{
    ReturnValue=true
}
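When several containers have been removed, the same command has to be repeated for each stale path. The sketch below only prints the commands for review and never runs them. The omicli invocation is copied from the transcript above; the function name print_remove_commands is hypothetical, and whether RemoveByName is a supported workaround is not confirmed anywhere in this thread.

```shell
#!/bin/sh
# Read mount paths on standard input, one per line, and print the omicli
# RemoveByName command from this comment for each path that no longer exists.
# Dry run only: commands are printed, not executed.
print_remove_commands() {
    while IFS= read -r path; do
        [ -n "$path" ] || continue      # skip empty lines
        [ -e "$path" ] && continue      # path still exists, leave it alone
        printf '/opt/omi/bin/omicli iv root/scx { SCX_FileSystem } RemoveByName { Name %s }\n' "$path"
    done
}

# Usage: feed it a list of suspect paths, for example
# printf '%s\n' /var/lib/docker/overlay/<id>/merged | print_remove_commands
```

Printing rather than executing keeps a mistyped or still-mounted path from being removed from the provider by accident.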

kevi5702 commented 7 years ago

Still seeing this on boxes we create. We are having to configure omid.service to restart on log rotation to keep this file from using up all the disk space in its partition.

Is there a way to exclude a path from metric gathering, or maybe a dev version of the code?
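As a stopgap, the restart could also be triggered by log size instead of by log rotation. This is a minimal sketch under that assumption; the helper name log_exceeds and the 100 MiB threshold are hypothetical, and restarting only stops the messages without recovering space already used.

```shell
#!/bin/sh
# Succeed (exit status 0) when FILE exists and is larger than LIMIT bytes.
log_exceeds() {
    file=$1
    limit=$2
    [ -f "$file" ] || return 1
    # Arithmetic expansion strips any padding that wc adds on some systems.
    size=$(($(wc -c < "$file")))
    [ "$size" -gt "$limit" ]
}

# Example with a hypothetical 100 MiB threshold:
# if log_exceeds /var/opt/microsoft/scx/log/scx.log 104857600; then
#     systemctl restart omid.service
# fi
```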

nirsingh commented 7 years ago

@kevi5702 What happens when you run /opt/omi/bin/omicli ei root/scx SCX_FileSystem manually on the machine?

Can you please let us know the steps to reproduce the bug?

kevi5702 commented 7 years ago

Hey @nirsingh, here are the steps to reproduce on a new server running CentOS:

1. Verify OS
[root@dockertest ~]# cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)

2. Set up docker repo and install docker: 
[root@dockertest ~]# cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)

3. Start and enable docker
[root@dockertest ~]# systemctl start docker; systemctl enable docker

4. Pull and start a container
[root@dockertest ~]# docker pull centos; docker run -di centos

5. Verify docker is running
69e7499d8cdd        centos              "/bin/bash"         21 seconds ago      Up 10 seconds                           laughing_spence

6. Confirm we now have our overlay file systems
[root@dockertest ~]# df -h | grep docker
overlay          30G  1.6G   28G   6% /var/lib/docker/overlay/cdac30cc66d86147e4847d9cb689d3aa1c2ed6812fd69f17cef194c76eaeab0e/merged
shm              64M     0   64M   0% /var/lib/docker/containers/69e7499d8cdd6493c706ed08f3bb0be6f389969c1156802c80cf820f3d51a10e/shm

7. Install/Start OMS-Agent
wget https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh && sh onboard_agent.sh -w 14575a75-aa88-47ed-b4de-a69670969fc8 -s REDACTED -d opinsights.azure.com 

    7.1 You should see docker extensions getting installed
    ...

----- Updating bundled packages -----
Checking if Docker is installed...
Docker version greater or equal than 17.* found. Docker agent will be installed
Extracting...
Updating container agent ...
----- Updating package: docker-cimprov (docker-cimprov-1.0.0-27.universal.x86_64) -----
Checking if required dependencies for auoms are installed...
Extracting...
...

8. Turn on (at minimum disk) Linux Performance Counters in OMS
This will need to sync down so wait until you are seeing disk data in the OMS portal
This should show up in: /etc/opt/microsoft/omsagent/AGENTID/conf/omsagent.conf
9. Once you have OMS data showing in the portal we can proceed
You'll notice that the scx log complains it doesn't know what the overlay file system is:
2017-10-25T13:30:43,303Z Warning    [scx.core.common.pal.system.disk.statisticallogicaldiskinstance:270:4089:139900375953152] The diskstats map does not contain a key matching the device named "overlay", or only 0 columns were found
10. Manually mounting/unmounting an overlay-type FS doesn't cause any issues with the SCX log. However, if we now remove a Docker container, things start to break
[root@dockertest ~]# docker stop 69e7499d8cdd; docker rm 69e7499d8cdd

11. Right away the SCX log (/var/opt/microsoft/scx/log/scx.log) starts to fill with the following message, because it is trying to stat the now-gone overlay FS of the removed Docker container:

2017-10-25T13:38:53,546Z Error      [scx.core.common.pal.system.disk.statisticallogicaldiskinstance:239:4089:139900553468032] statvfs() failed for /var/lib/docker/overlay/cdac30cc66d86147e4847d9cb689d3aa1c2ed6812fd69f17cef194c76eaeab0e/merged; errno = 2

12. This continues until omid.service is restarted
[root@dockertest ~]# systemctl restart omid.service

Heavy Docker users on Azure find their /log directory using up all available space.

kevi5702 commented 7 years ago

@nirsingh at what point do you want me to run the command you asked for: before a Docker container is run, while it's running, or after it is removed?

nirsingh commented 7 years ago

@kevi5702 Please run the command after removing the container.

kevi5702 commented 7 years ago

@nirsingh please find output below:


instance of SCX_FileSystem
{
    Caption=File system information
    Description=Information about a logical unit of secondary storage
    [Key] Name=/
    [Key] CSCreationClassName=SCX_ComputerSystem
    [Key] CSName=dockertest.3wjaneoxgxuuhlwpgr0x0krdjb.jx.internal.cloudapp.net
    [Key] CreationClassName=SCX_FileSystem
    Root=/
    BlockSize=4096
    FileSystemSize=31671447552
    AvailableSpace=29511827456
    ReadOnly=false
    EncryptionMethod=Not Encrypted
    CompressionMethod=Not Compressed
    CaseSensitive=true
    CasePreserved=true
    MaxFileNameLength=255
    FileSystemType=xfs
    PersistenceType=2
    NumberOfFiles=58138
    IsOnline=true
    TotalInodes=15472128
    FreeInodes=15413990
}
instance of SCX_FileSystem
{
    Caption=File system information
    Description=Information about a logical unit of secondary storage
    [Key] Name=/boot
    [Key] CSCreationClassName=SCX_ComputerSystem
    [Key] CSName=dockertest.3wjaneoxgxuuhlwpgr0x0krdjb.jx.internal.cloudapp.net
    [Key] CreationClassName=SCX_FileSystem
    Root=/boot
    BlockSize=4096
    FileSystemSize=520785920
    AvailableSpace=438382592
    ReadOnly=false
    EncryptionMethod=Not Encrypted
    CompressionMethod=Not Compressed
    CaseSensitive=true
    CasePreserved=true
    MaxFileNameLength=255
    FileSystemType=xfs
    PersistenceType=2
    NumberOfFiles=326
    IsOnline=true
    TotalInodes=256000
    FreeInodes=255674
}
instance of SCX_FileSystem
{
    Caption=File system information
    Description=Information about a logical unit of secondary storage
    [Key] Name=/mnt/resource
    [Key] CSCreationClassName=SCX_ComputerSystem
    [Key] CSName=dockertest.3wjaneoxgxuuhlwpgr0x0krdjb.jx.internal.cloudapp.net
    [Key] CreationClassName=SCX_FileSystem
    Root=/mnt/resource
    BlockSize=4096
    FileSystemSize=7262887936
    AvailableSpace=7229865984
    ReadOnly=false
    EncryptionMethod=Not Encrypted
    CompressionMethod=Not Compressed
    CaseSensitive=true
    CasePreserved=true
    MaxFileNameLength=255
    FileSystemType=ext4
    PersistenceType=2
    NumberOfFiles=12
    IsOnline=true
    TotalInodes=458752
    FreeInodes=458740
}
instance of SCX_FileSystem
{
    Caption=File system information
    Description=Information about a logical unit of secondary storage
    [Key] Name=/var/lib/docker/plugins
    [Key] CSCreationClassName=SCX_ComputerSystem
    [Key] CSName=dockertest.3wjaneoxgxuuhlwpgr0x0krdjb.jx.internal.cloudapp.net
    [Key] CreationClassName=SCX_FileSystem
    Root=/var/lib/docker/plugins
    BlockSize=4096
    FileSystemSize=31671447552
    AvailableSpace=29511827456
    ReadOnly=false
    EncryptionMethod=Not Encrypted
    CompressionMethod=Not Compressed
    CaseSensitive=true
    CasePreserved=true
    MaxFileNameLength=255
    FileSystemType=xfs
    PersistenceType=2
    NumberOfFiles=58138
    IsOnline=true
    TotalInodes=15472128
    FreeInodes=15413990
}
instance of SCX_FileSystem
{
    Caption=File system information
    Description=Information about a logical unit of secondary storage
    [Key] Name=/var/lib/docker/overlay
    [Key] CSCreationClassName=SCX_ComputerSystem
    [Key] CSName=dockertest.3wjaneoxgxuuhlwpgr0x0krdjb.jx.internal.cloudapp.net
    [Key] CreationClassName=SCX_FileSystem
    Root=/var/lib/docker/overlay
    BlockSize=4096
    FileSystemSize=31671447552
    AvailableSpace=29511827456
    ReadOnly=false
    EncryptionMethod=Not Encrypted
    CompressionMethod=Not Compressed
    CaseSensitive=true
    CasePreserved=true
    MaxFileNameLength=255
    FileSystemType=xfs
    PersistenceType=2
    NumberOfFiles=58138
    IsOnline=true
    TotalInodes=15472128
    FreeInodes=15413990
}

Commands run to produce:

[root@dockertest ~]# docker run -di centos
Created following file systems:

overlay          30G  2.1G   28G   7% /var/lib/docker/overlay/a34a9038958110dd427626f35766d59559126611563d43600a14e56ee9fae79b/merged
shm              64M     0   64M   0% /var/lib/docker/containers/3f9972e0f3d192d7c7fc559d09044115512c53f86c81b0850b7100704c3ebc6f/shm

[root@dockertest ~]# docker stop 3f9972e0f3d1; docker rm 3f9972e0f3d1
3f9972e0f3d1

Errors in log files:

2017-11-02T20:40:08,852Z Error      [scx.core.common.pal.system.disk.statisticallogicaldiskinstance:239:5326:139941694834816] statvfs() failed for /var/lib/docker/overlay/a34a9038958110dd427626f35766d59559126611563d43600a14e56ee9fae79b/merged; errno = 2
2017-11-02T20:40:08,853Z Error      [scx.core.common.pal.system.disk.statisticallogicaldiskinstance:239:5326:139941694834816] statvfs() failed for /var/lib/docker/overlay/a34a9038958110dd427626f35766d59559126611563d43600a14e56ee9fae79b/merged; errno = 2

kevi5702 commented 6 years ago

Wanted to add: you might need to install Docker using the Docker repos, as they have the latest configurations:

wget -qO- https://get.docker.com/ | sh

Convenience install from Docker.

kevi5702 commented 6 years ago

Hello All,

I just wanted to see if there is any update on this. Is there any information we can provide to help this along?

sarojcare commented 6 years ago

We have a repro for this issue and are investigating it. We will let you know our decision on a fix as soon as possible.

ThoRumAT commented 6 years ago

Any updates on this issue?

mako274 commented 6 years ago

Also having issues with this. Any updates?

punya commented 4 years ago

@sarojcare any updates on this? It's been 2 years. Did the team decide to fix this problem?

Currently we're seeing system logs fill up with spurious warnings because of this issue.

arno-pons commented 3 years ago

@sarojcare any updates on this? It's been 3 years. Did the team decide to fix this problem? We have the same issue on Kubernetes worker nodes.