Open samisms opened 7 years ago
Thanks @samisms - adding the manual remove line we started running whenever this shows up in the logs:
[root@dockercentos log]# /opt/omi/bin/omicli iv root/scx { SCX_FileSystem } RemoveByName { Name /var/lib/docker/overlay/######/merged }
instance of SCX_FileSystem
{
Caption=File system information
Description=Information about a logical unit of secondary storage
[Key] Name=/var/lib/docker/overlay/######/merged
[Key] CSCreationClassName=SCX_ComputerSystem
[Key] CSName=dockercentos
[Key] CreationClassName=SCX_FileSystem
Root=/var/lib/docker/overlay/######/merged
BlockSize=0
FileSystemSize=0
AvailableSpace=0
ReadOnly=false
EncryptionMethod=Unknown
CompressionMethod=Unknown
CaseSensitive=true
CasePreserved=true
MaxFileNameLength=0
FileSystemType=overlay
PersistenceType=0
IsOnline=false
}
instance of RemoveByName
{
ReturnValue=true
}
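For anyone who wants to script the manual removal above, here is a minimal sketch, not an official fix. It assumes the log lines keep the "statvfs() failed for PATH; errno = 2" shape quoted in this thread, and it reuses the same omicli call. The function names and the DRY_RUN, SCX_LOG and OMICLI variables are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: function and variable names here are hypothetical.
SCX_LOG="${SCX_LOG:-/var/opt/microsoft/scx/log/scx.log}"
OMICLI="${OMICLI:-/opt/omi/bin/omicli}"

# Print each unique path named in a "statvfs() failed ... errno = 2" line
# of the log file given as $1.
stale_paths() {
    sed -n 's/.*statvfs() failed for \(.*\); errno = 2$/\1/p' "$1" | sort -u
}

# For every reported path that no longer exists on disk, run (or, by default,
# just print) the RemoveByName call from this thread.
# DRY_RUN defaults to 1 here as a safety assumption.
remove_stale() {
    local path
    stale_paths "$1" | while IFS= read -r path; do
        [ -e "$path" ] && continue   # path still exists: leave it alone
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$OMICLI iv root/scx { SCX_FileSystem } RemoveByName { Name $path }"
        else
            # Braces are quoted so the shell passes them through as plain arguments.
            "$OMICLI" iv root/scx '{' SCX_FileSystem '}' RemoveByName '{' Name "$path" '}'
        fi
    done
}

# Example (prints the commands without running them):
#   remove_stale "$SCX_LOG"
```

With DRY_RUN=0 the calls are actually executed, so it is worth checking the printed commands first.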
Still seeing this on boxes we create. We are having to configure omid.service to restart on log rotate to keep this file from using up all the disk space in its partition.
Is there a way to exclude a path from metric gathering, or maybe a dev version of the code?
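As an alternative to tying the restart to log rotation, the restart could be triggered by the error rate itself. A minimal sketch, assuming it is run periodically (for example from cron) and that the thresholds below are acceptable; they are arbitrary. RESTART_CMD is a hypothetical variable that defaults to the omid.service restart mentioned in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: names and default thresholds are assumptions.
RESTART_CMD="${RESTART_CMD:-systemctl restart omid.service}"

# Count "statvfs() failed" lines among the last N lines (default 1000) of a log.
count_statvfs_failures() {
    local log="$1" lines="${2:-1000}"
    # grep -c exits non-zero when the count is 0, so mask that status.
    tail -n "$lines" "$log" | grep -cF 'statvfs() failed' || true
}

# Run RESTART_CMD when the recent failure count reaches the threshold (default 100).
restart_if_spamming() {
    local log="$1" threshold="${2:-100}" count
    count=$(count_statvfs_failures "$log")
    if [ "$count" -ge "$threshold" ]; then
        $RESTART_CMD   # intentionally unquoted so the command string is word-split
    fi
}

# Example:
#   restart_if_spamming /var/opt/microsoft/scx/log/scx.log 100
```

This only works around the symptom; the stale entries come back the next time a container is removed.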
@kevi5702 What happens when you run /opt/omi/bin/omicli ei root/scx SCX_FileSystem manually on the machine?
Can you please let us know the steps to reproduce the bug?
Hey @nirsingh, steps to reproduce are below.
This is a new server running CentOS:
1. Verify OS
[root@dockertest ~]# cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)
2. Set up docker repo and install docker:
[root@dockertest ~]# cat /etc/centos-release
CentOS Linux release 7.4.1708 (Core)
3. Start and enable docker
[root@dockertest ~]# systemctl start docker; systemctl enable docker
4. Pull and start a container
[root@dockertest ~]# docker pull centos; docker run -di centos
5. Verify docker is running
69e7499d8cdd centos "/bin/bash" 21 seconds ago Up 10 seconds laughing_spence
6. Confirm we now have our overlay file systems
[root@dockertest ~]# df -h | grep docker
overlay 30G 1.6G 28G 6% /var/lib/docker/overlay/cdac30cc66d86147e4847d9cb689d3aa1c2ed6812fd69f17cef194c76eaeab0e/merged
shm 64M 0 64M 0% /var/lib/docker/containers/69e7499d8cdd6493c706ed08f3bb0be6f389969c1156802c80cf820f3d51a10e/shm
7. Install/Start OMS-Agent
wget https://raw.githubusercontent.com/Microsoft/OMS-Agent-for-Linux/master/installer/scripts/onboard_agent.sh && sh onboard_agent.sh -w 14575a75-aa88-47ed-b4de-a69670969fc8 -s REDACTED -d opinsights.azure.com
7.1 You should see docker extensions getting installed
...
----- Updating bundled packages -----
Checking if Docker is installed...
Docker version greater or equal than 17.* found. Docker agent will be installed
Extracting...
Updating container agent ...
----- Updating package: docker-cimprov (docker-cimprov-1.0.0-27.universal.x86_64) -----
Checking if required dependencies for auoms are installed...
Extracting...
...
8. Turn on (at minimum disk) Linux Performance Counters in OMS
This will need to sync down so wait until you are seeing disk data in the OMS portal
This should show up in: /etc/opt/microsoft/omsagent/AGENTID/conf/omsagent.conf
9. Once you have OMS data showing in the portal we can proceed
You'll notice that the scx log complains it doesn't know what the overlay file system is:
2017-10-25T13:30:43,303Z Warning [scx.core.common.pal.system.disk.statisticallogicaldiskinstance:270:4089:139900375953152] The diskstats map does not contain a key matching the device named "overlay", or only 0 columns were found
10. Manually mounting/unmounting an overlay-type FS doesn't cause any issues with the SCX log. However, if we now remove a docker container, things start to break
[root@dockertest ~]# docker stop 69e7499d8cdd; docker rm 69e7499d8cdd
11. Right away the SCX log (/var/opt/microsoft/scx/log/scx.log) starts to fill with the following message, because it is still trying to stat the now-removed overlay FS for the docker container:
2017-10-25T13:38:53,546Z Error [scx.core.common.pal.system.disk.statisticallogicaldiskinstance:239:4089:139900553468032] statvfs() failed for /var/lib/docker/overlay/cdac30cc66d86147e4847d9cb689d3aa1c2ed6812fd69f17cef194c76eaeab0e/merged; errno = 2
12. This continues until omid.service is restarted
[root@dockertest ~]# systemctl restart omid.service
Heavy docker users on Azure find their /log directory utilizing all available space.
@nirsingh at what point do you want me to run the command you asked about: before a docker container is run, while it's running, or after it is removed?
@kevi5702 Please run the command after removing the container.
@nirsingh please find output below:
instance of SCX_FileSystem
{
Caption=File system information
Description=Information about a logical unit of secondary storage
[Key] Name=/
[Key] CSCreationClassName=SCX_ComputerSystem
[Key] CSName=dockertest.3wjaneoxgxuuhlwpgr0x0krdjb.jx.internal.cloudapp.net
[Key] CreationClassName=SCX_FileSystem
Root=/
BlockSize=4096
FileSystemSize=31671447552
AvailableSpace=29511827456
ReadOnly=false
EncryptionMethod=Not Encrypted
CompressionMethod=Not Compressed
CaseSensitive=true
CasePreserved=true
MaxFileNameLength=255
FileSystemType=xfs
PersistenceType=2
NumberOfFiles=58138
IsOnline=true
TotalInodes=15472128
FreeInodes=15413990
}
instance of SCX_FileSystem
{
Caption=File system information
Description=Information about a logical unit of secondary storage
[Key] Name=/boot
[Key] CSCreationClassName=SCX_ComputerSystem
[Key] CSName=dockertest.3wjaneoxgxuuhlwpgr0x0krdjb.jx.internal.cloudapp.net
[Key] CreationClassName=SCX_FileSystem
Root=/boot
BlockSize=4096
FileSystemSize=520785920
AvailableSpace=438382592
ReadOnly=false
EncryptionMethod=Not Encrypted
CompressionMethod=Not Compressed
CaseSensitive=true
CasePreserved=true
MaxFileNameLength=255
FileSystemType=xfs
PersistenceType=2
NumberOfFiles=326
IsOnline=true
TotalInodes=256000
FreeInodes=255674
}
instance of SCX_FileSystem
{
Caption=File system information
Description=Information about a logical unit of secondary storage
[Key] Name=/mnt/resource
[Key] CSCreationClassName=SCX_ComputerSystem
[Key] CSName=dockertest.3wjaneoxgxuuhlwpgr0x0krdjb.jx.internal.cloudapp.net
[Key] CreationClassName=SCX_FileSystem
Root=/mnt/resource
BlockSize=4096
FileSystemSize=7262887936
AvailableSpace=7229865984
ReadOnly=false
EncryptionMethod=Not Encrypted
CompressionMethod=Not Compressed
CaseSensitive=true
CasePreserved=true
MaxFileNameLength=255
FileSystemType=ext4
PersistenceType=2
NumberOfFiles=12
IsOnline=true
TotalInodes=458752
FreeInodes=458740
}
instance of SCX_FileSystem
{
Caption=File system information
Description=Information about a logical unit of secondary storage
[Key] Name=/var/lib/docker/plugins
[Key] CSCreationClassName=SCX_ComputerSystem
[Key] CSName=dockertest.3wjaneoxgxuuhlwpgr0x0krdjb.jx.internal.cloudapp.net
[Key] CreationClassName=SCX_FileSystem
Root=/var/lib/docker/plugins
BlockSize=4096
FileSystemSize=31671447552
AvailableSpace=29511827456
ReadOnly=false
EncryptionMethod=Not Encrypted
CompressionMethod=Not Compressed
CaseSensitive=true
CasePreserved=true
MaxFileNameLength=255
FileSystemType=xfs
PersistenceType=2
NumberOfFiles=58138
IsOnline=true
TotalInodes=15472128
FreeInodes=15413990
}
instance of SCX_FileSystem
{
Caption=File system information
Description=Information about a logical unit of secondary storage
[Key] Name=/var/lib/docker/overlay
[Key] CSCreationClassName=SCX_ComputerSystem
[Key] CSName=dockertest.3wjaneoxgxuuhlwpgr0x0krdjb.jx.internal.cloudapp.net
[Key] CreationClassName=SCX_FileSystem
Root=/var/lib/docker/overlay
BlockSize=4096
FileSystemSize=31671447552
AvailableSpace=29511827456
ReadOnly=false
EncryptionMethod=Not Encrypted
CompressionMethod=Not Compressed
CaseSensitive=true
CasePreserved=true
MaxFileNameLength=255
FileSystemType=xfs
PersistenceType=2
NumberOfFiles=58138
IsOnline=true
TotalInodes=15472128
FreeInodes=15413990
}
Commands run to produce:
[root@dockertest ~]# docker run -di centos
Created following file systems:
overlay 30G 2.1G 28G 7% /var/lib/docker/overlay/a34a9038958110dd427626f35766d59559126611563d43600a14e56ee9fae79b/merged
shm 64M 0 64M 0% /var/lib/docker/containers/3f9972e0f3d192d7c7fc559d09044115512c53f86c81b0850b7100704c3ebc6f/shm
[root@dockertest ~]# docker stop 3f9972e0f3d1; docker rm 3f9972e0f3d1
3f9972e0f3d1
Errors in log files:
2017-11-02T20:40:08,852Z Error [scx.core.common.pal.system.disk.statisticallogicaldiskinstance:239:5326:139941694834816] statvfs() failed for /var/lib/docker/overlay/a34a9038958110dd427626f35766d59559126611563d43600a14e56ee9fae79b/merged; errno = 2
2017-11-02T20:40:08,853Z Error [scx.core.common.pal.system.disk.statisticallogicaldiskinstance:239:5326:139941694834816] statvfs() failed for /var/lib/docker/overlay/a34a9038958110dd427626f35766d59559126611563d43600a14e56ee9fae79b/merged; errno = 2
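To confirm that paths like the one in these errors really are gone from the mount table, and not just temporarily unreadable, they can be compared against /proc/mounts. A minimal sketch, assuming the log format shown above and mount points without spaces (which /proc/mounts escapes); the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# Succeed if mount point $1 is listed in the mount table $2 (default /proc/mounts).
# The mount point is the second whitespace-separated field of each line.
is_mounted() {
    local target="$1" table="${2:-/proc/mounts}"
    awk -v t="$target" '$2 == t { found = 1 } END { exit !found }' "$table"
}

# Read scx.log lines on stdin and print each reported path that is
# no longer present in the mount table.
unmounted_failures() {
    local table="${1:-/proc/mounts}" path
    sed -n 's/.*statvfs() failed for \(.*\); errno = 2$/\1/p' | sort -u |
    while IFS= read -r path; do
        is_mounted "$path" "$table" || echo "$path"
    done
}

# Example:
#   unmounted_failures < /var/opt/microsoft/scx/log/scx.log
```

Any path this prints is one the provider is still polling even though the kernel no longer lists it as mounted.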
Wanted to add that you might need to install this using the docker repos, as they have the latest configurations:
wget -qO- https://get.docker.com/ | sh
Convenience install from Docker.
Hello All,
I just wanted to see if there was any update on this, any information we can provide to help this along?
We have a repro for this issue and are investigating it. We will let you know our decision on a fix ASAP.
Any updates on this issue?
Also having issues with this. Any updates?
@sarojcare any updates on this? It's been 2 years, did the team decide to fix this problem?
Currently we're seeing system logs fill up with spurious warnings because of this issue.
@sarojcare any updates on this? It's been 3 years, did the team decide to fix this problem? We have the same issue on Kubernetes worker nodes.
Re-raising from https://github.com/Microsoft/OMS-docker/issues/76 (comments below were from @kevi5702).
Really hoping someone can point me in the right direction here. Every time we redeploy our docker instances, our SCX logs (at /var/opt/microsoft/scx/log/scx.log) begin to fill very rapidly with the following messages:
2017-09-29T15:02:39,523Z Error [scx.core.common.pal.system.disk.statisticallogicaldiskinstance:############] statvfs() failed for /var/lib/docker/overlay/######################/merged; errno = 2
Running systemctl restart omsagent##### seems to take care of this, but we were expecting the agent to be aware when a container went away and to stop trying to stat the directory it used to be mounted on.
So far we have tried removing and reinstalling the OMS bundle, but we are curious whether there is something else we are doing wrong here.
@kevi5702
Digging around, this error seems to match the code here in the PAL software:
https://github.com/Microsoft/pal/blob/master/source/code/scxsystemlib/disk/statisticallogicaldiskinstance.cpp#L269
I'm wondering if overlay needs to be added to excludes somewhere. However, manually making a basic overlay mount doesn't produce the statvfs errors when it's unmounted, only the warning about overlay not being recognized.
I was able to reproduce this on a new CentOS image with a fresh OMS deploy, just running a basic hello world container.
Restarting omid.service makes this go away, so I'm not sure whether something needs to be notified to update this when a container is removed.
This only seems to trigger when logical disk performance counters are enabled and only when the file system was a docker overlay FS mount.