opskumu commented 7 years ago

Weekly reports by year

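CNUTCon Global Operations Technology Conference 2017: slide deck download collection

Huawei: optimization practices for supporting system containers with Docker
- Resource information isolation --> use lxcfs; see "Display problems in Docker containers and how to fix them"
Tencent Games: the evolution of its container cloud platform
- P2P image distribution
- Storage driver choice --> OverlayFS + XFS
NetEase: optimizing Kubernetes Service load-balancing performance at large scale
- The introduction to iptables is worth studying closely

I skimmed essentially all of the container-related decks. Judging by content alone, there is not much to borrow. On the other hand, I have been reading the official documentation lately and have picked up many details from it. If you have the time, reading the official Docker documentation through a few times will teach you far more than these slides do.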

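Further reading

operators
- A speaker at the Gopher 2017 conference introduced operators, but I never got around to reading up on them. Judging by the introduction they are an interesting idea, and I still plan to study them properly.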

《Docker in the Trenches》

Do not run ssh in your containers.

Once an image is created, the image never gets modified, so all you can do is build new images off of it.

opskumu commented 7 years ago

Kubernetes

opskumu commented 6 years ago

Java resource limits in Docker

Early on this was handled with a script: https://github.com/fabric8io-images/java/tree/master/images/centos/openjdk8/jdk

Java SE 8u131 now supports it directly: https://blogs.oracle.com/java-platform-group/java-se-support-for-docker-cpu-and-memory-limits

opskumu commented 6 years ago

2018-07-09~13

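KubeDNS timeouts and packet loss (tcpdump captures confirmed that DNS queries were being dropped)
- Load testing
  - Load-testing KubeDNS with dnsperf showed unimpressive performance (needs further confirmation)
  - Kubernetes official performance tests and benchmarks: see kubernetes/perf-tests
- KubeDNS latency-related issue

Calico BGP performance test (run with qperf; the method follows the Tencent Cloud container network VPC vs. VXLAN performance comparison)
- Measured throughput and network latency are both close to bare-metal networking

Wireshark filter for DNS queries that got no response: dns && (dns.flags.response == 0) && ! dns.response_in (source: Filter DNS queries without matched responses)

Choosing an open-source Docker registry
- SUSE/Portus
- vmware/harbor

I have not tried SUSE Portus; I mainly use Harbor. Harbor has its share of annoyances in day-to-day use, but it is still much better than the bare distribution registry, so I have no real complaints.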

opskumu commented 6 years ago

2018-07-16~20

Docker overlay2 size limit (requires the backing xfs filesystem to be mounted with the pquota option)

--storage-opt overlay2.size=30G

Sets the default max size of the container. It is supported only when the backing fs is xfs and mounted with pquota mount option. Under these conditions the user can pass any size less than the backing fs size. (source: OVERLAY2 OPTIONS)

How to Enable Disk Quotas on an XFS File System

Arrays, slices (and strings): The mechanics of 'append'

opskumu commented 6 years ago

2018-07-23~28

This morning the kernel on a production Docker host logged the following error:

unregister_netdevice: waiting for lo to become free. Usage count = 1

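A quick search turns up many similar reports; most describe it as a kernel-level bug. A few typical ones for reference:

Red Hat's official answer is that upgrading the kernel fixes it; see "kernel:unregister_netdevice: waiting for lo to become free. Usage count = 1". Access requires a subscription. [Screenshot of the Red Hat solution page]

Maximizing the current vim pane

Ctrl+W + | maximizes the width of the current pane
Ctrl+W + = restores all panes to equal size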

Intermittent KubeDNS resolution failures

https://github.com/kubernetes/kubernetes/issues/47142

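Kubernetes Job restart policy

A Job's restart policy can be Never or OnFailure; using only OnFailure is recommended.

Never is not recommended: when a Pod of a Job with the Never policy fails, a new Pod is created instead of the container being restarted in place, so a large number of failed Pods can pile up in a short time.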

opskumu commented 6 years ago

2018-07-30~08-03

Search tools inside vim
- Fuzzy search with fzf: https://github.com/junegunn/fzf.vim
- File content search with https://github.com/ggreer/the_silver_searcher
How to remove ^H in a file using vim?
git over SSH through a proxy (in ~/.ssh/config)

Host github.com
    User                    git
    ProxyCommand            nc -X 5 -x 127.0.0.1:1080 %h %p

opskumu commented 6 years ago

2018-08-06~10

RHEL7: kernel crash in xfs_vm_writepage - kernel BUG at fs/xfs/xfs_aops.c:1062!
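Fixing browser caching problems when static assets (React and the like) are updated
- How to configure nginx for a Vue SPA so that index.html is not cached
- nginx: disable caching of a single file with try_files directive

Pod phase values

Pending: the Pod has been accepted by the system but is not running yet (it may still be waiting to be scheduled, or it is scheduled and pulling images)
Running: the Pod is bound to a node and all of its containers have been created; at least one container is running, or is starting or restarting
Succeeded: all containers in the Pod terminated with exit code 0, and the system will not restart them
Failed: all containers in the Pod have terminated, and at least one of them exited with a non-zero status or was stopped by the system
Unknown: the Pod's state cannot be obtained for some reason, usually a communication error with the node hosting the Pod

These values cover only the Pod's phase. When building tooling, show as much detail about the application's Pods as possible: for a Pending Pod, for example, tell the user plainly whether it is pulling an image or cannot be scheduled for lack of resources. The same applies to Running: the Pod may be terminating, crash-looping, or failing to pull an image, and when probes are used the service may be Not Ready. So you also need to inspect PodCondition, ContainerStatus, and whether the Pod metadata carries a DeletionTimestamp.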

Pod Lifecycle
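The extra checks described above could be combined roughly as follows. This is a minimal sketch, not the logic kubectl itself uses: it assumes the Pod is a dict parsed from `kubectl get pod <name> -o json`, and the function name `display_status` and the order of the checks are illustrative choices.

```python
def display_status(pod):
    """Derive a user-facing status string from a Pod object (parsed JSON)."""
    meta = pod.get("metadata", {})
    status = pod.get("status", {})
    phase = status.get("phase", "Unknown")

    # A Pod that is being deleted still reports its old phase.
    if meta.get("deletionTimestamp"):
        return "Terminating"

    # A waiting container carries the most specific reason,
    # e.g. ImagePullBackOff or CrashLoopBackOff.
    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting and waiting.get("reason"):
            return waiting["reason"]

    conditions = {c["type"]: c for c in status.get("conditions", [])}

    # Pending and not scheduled: report why (typically "Unschedulable").
    scheduled = conditions.get("PodScheduled")
    if phase == "Pending" and scheduled and scheduled.get("status") == "False":
        return scheduled.get("reason", "Unschedulable")

    # Running but failing its readiness probe.
    ready = conditions.get("Ready")
    if phase == "Running" and ready and ready.get("status") != "True":
        return "NotReady"

    return phase
```

A caller would typically run this over every item of `kubectl get pods -o json` and show the result next to the raw phase.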

opskumu commented 6 years ago

2018-08-13~17

`docker system prune` Bug

Running docker system prune to clean up images produced the following error:

No such file or directory for /var/lib/docker/overlay2

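As a direct result, containers failed to start. It appears to be a Docker bug: https://github.com/docker/for-mac/issues/1396

`nsenter`: entering a Docker container's network namespace

docker inspect --format "{{.State.Pid}}" <container ID or name>

Use the command above to get the PID of the container's process, then enter its network namespace with nsenter:

nsenter -n -t <Pid>

Once inside, you can capture packets and so on.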

Example:

# docker inspect --format "{{.State.Pid}}" 9ad0c4037f8f
61507
# nsenter -n -t 61507
# ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0@if1475: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 3e:b2:29:f1:ce:aa brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.16.1.135/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::3cb2:29ff:fef1:ceaa/64 scope link
       valid_lft forever preferred_lft forever
# exit                 # use exit to leave the current namespace
logout

opskumu commented 6 years ago

2018-08-20~24

Slacked off this week 👎 no excuses 🥇

Getting a disk's UUID with blkid
Ansible Ignore errors in tasks and fail at end of the playbook if any tasks had errors

opskumu commented 6 years ago

2018-08-27~31

kube-proxy with ipvs

https://github.com/kubernetes/kubernetes/tree/master/pkg/proxy/ipvs

CentOS7

Set the hostname

hostnamectl set-hostname your-new-hostname

List services and whether they are enabled at boot

systemctl list-unit-files --type service

etcd V3

ETCDCTL_API=3 etcdctl get / --prefix --keys-only

Kubernetes 1.5.8 --> 1.6.x

Compatibility notes

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.6.md#node-components

opskumu commented 6 years ago

2018-09-03~07

Nginx ingress metrics

Official Grafana dashboard templates are now available for the Nginx ingress metrics:

https://github.com/kubernetes/ingress-nginx/tree/master/deploy/grafana/dashboards

Since version 0.16.0, metrics are collected by a new implementation rather than the old VTS module; see "New prometheus metric implementation (VTS module was removed)".

Golang: JSON Marshalling empty slices as empty arrays instead of null

https://apoorvam.github.io/golang/json/marshal/slice/empty/null/2017/01/19/golang-json-marshalling.html

https://github.com/golang/go/issues/2278

ulimit soft & hard

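Every ulimit resource has a soft limit and a hard limit. The limit actually enforced, whether on open file handles or on the number of processes, is the soft value: usage cannot exceed it. An ordinary user may change the soft limit, but cannot raise it above the hard limit.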
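The soft/hard relationship can be observed from a process with Python's standard `resource` module (Unix only). A minimal sketch: the helper name `set_soft_limit` is made up for illustration, and it clamps the request to the hard limit because asking for a soft value above the hard one is rejected.

```python
import resource


def set_soft_limit(requested, res=resource.RLIMIT_NOFILE):
    """Set the soft limit of `res` to `requested` (a non-negative int),
    clamped to the hard limit. Returns the resulting (soft, hard) pair."""
    _, hard = resource.getrlimit(res)
    if hard != resource.RLIM_INFINITY:
        # An unprivileged process cannot go past the hard limit.
        requested = min(requested, hard)
    # The hard limit is passed back unchanged; only the soft value moves.
    resource.setrlimit(res, (requested, hard))
    return resource.getrlimit(res)


if __name__ == "__main__":
    print("before:", resource.getrlimit(resource.RLIMIT_NOFILE))
    print("after: ", set_soft_limit(256))
```

The change applies to the calling process and its children only, which mirrors how `ulimit -n` affects the current shell.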

opskumu commented 6 years ago

2018-09-10~14

➜  ~ cat ~/.ansible.cfg
[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp
fact_caching_timeout = 86400

https://docs.ansible.com/ansible/latest/user_guide/playbooks_variables.html#fact-caching

Why is my SSH login slow?

opskumu commented 6 years ago

2018-09-17~21

istio https://istio.io/docs/setup/kubernetes/quick-start/
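Resolving "Error: etcdserver: mvcc: database space exceeded"

While upgrading K8s last week, etcd ran into its database space quota. The fix is to compact old revisions, defragment, and then disarm the alarm: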

# get current revision
$ rev=$(ETCDCTL_API=3 etcdctl --endpoints=:2379 endpoint status --write-out="json" | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')
# compact away all old revisions
$ ETCDCTL_API=3 etcdctl compact $rev
compacted revision 1516
# defragment away excessive space
$ ETCDCTL_API=3 etcdctl defrag
Finished defragmenting etcd member[127.0.0.1:2379]
# disarm alarm
$ ETCDCTL_API=3 etcdctl alarm disarm
memberID:13803658152347727308 alarm:NOSPACE

https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/maintenance.md#space-quota
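The egrep pipeline above pulls the revision out of the JSON with regular expressions. As an alternative sketch, the same value can be read by parsing the JSON. The helper name `find_revision` and the sample document in the comment are illustrative; the sample only assumes that the output contains an integer field named "revision" somewhere, which is also all the egrep version relies on.

```python
import json


def find_revision(status_json):
    """Return the first integer "revision" field found in the JSON text
    produced by `etcdctl endpoint status --write-out=json`."""

    def walk(node):
        if isinstance(node, dict):
            if isinstance(node.get("revision"), int):
                return node["revision"]
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            return None
        for child in children:
            found = walk(child)
            if found is not None:
                return found
        return None

    revision = walk(json.loads(status_json))
    if revision is None:
        raise ValueError("no revision field found")
    return revision


# Hypothetical sample, shaped like a single-endpoint status document:
# find_revision('[{"Status": {"header": {"revision": 1516}}}]')
```

The returned number is what would be passed to `etcdctl compact`.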

Kube-proxy with ipvs https://github.com/kubernetes/kubernetes/tree/master/pkg/proxy/ipvs#ipvs

opskumu commented 6 years ago

2018-09-25~30

Too lazy this week; nothing to report 🥇

opskumu commented 6 years ago

2018-10-08~12

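Pods in the Evicted state are not deleted automatically. kube-controller-manager has a global GC threshold, set with the --terminated-pod-gc-threshold option; garbage collection only kicks in once the number of terminated Pods exceeds it, and the default is 12500.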

Kubelet does not delete evicted pods
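Until the GC threshold is reached, evicted Pods have to be found and removed by hand. A minimal sketch for listing them, assuming the input is the parsed output of `kubectl get pods --all-namespaces -o json`; evicted Pods show up with phase Failed and reason Evicted. The function name `evicted_pods` is illustrative.

```python
def evicted_pods(pod_list):
    """Return (namespace, name) pairs for every evicted Pod in a PodList dict."""
    result = []
    for item in pod_list.get("items", []):
        status = item.get("status", {})
        if status.get("phase") == "Failed" and status.get("reason") == "Evicted":
            meta = item.get("metadata", {})
            result.append((meta.get("namespace", "default"), meta.get("name")))
    return result
```

Each pair can then be handed to `kubectl delete pod <name> -n <namespace>`.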

opskumu commented 6 years ago

2018-10-16~19

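Ways to deploy microservices

Rolling update (RollingUpdate)

Old and new versions serve traffic at the same time; new instances are started and old ones shut down in a fixed ratio until the whole application has been updated.

Gray release / canary release (Canary Deploy), A/B test

A gray release moves a version smoothly from "black" to "white", that is, from old to new. An A/B test is one way of doing a gray release: by some rule, part of the users stay on A while the rest use B, and once B has proven problem-free, all users are moved to the new version B.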

AB test

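Blue-green deployment (Blue-Green)

Deploy the new version of the application, test it, and then switch traffic over to it. Once it is confirmed to work, delete the old version. No downtime is needed, but resource consumption is high (double).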

Blue Green Deployment, Canary Run, A/B testing difference

Kubernetes deployment strategies
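The "by some rule" part of a canary or A/B rollout is often a stable hash of the user identity. The sketch below is only one possible rule, not something prescribed by the articles above; the function name `pick_version` and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib


def pick_version(user_id, percent_b):
    """Route a user to version "B" for roughly `percent_b` percent of users.

    The bucket is derived from a hash of the user id, so the same user
    always lands on the same version while the percentage is unchanged,
    and raising the percentage only moves users from A to B.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return "B" if bucket < percent_b else "A"
```

Starting at a low percentage and stepping it up to 100 gives the gradual rollout described above; setting it back to 0 rolls everyone back to A.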

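Extending the macOS Dictionary

Import tutorial --> "Extending the built-in macOS Dictionary" (苹果Mac自带词典完美扩充)

Dictionary files: http://download.huzheng.org/zh_CN/
Dictionary import tool: Mac Dictionary Kit, a.k.a. DictUnifier

Container CPU and memory percentages, and CPU load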

opskumu commented 6 years ago

2018-10-22~26

Notes on cAdvisor metrics

Monitoring cAdvisor with Prometheus

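Container time zones

I had assumed that changing a container's time zone was a hassle (symlinking /etc/localtime), but injecting a single environment variable turns out to be enough. How little I knew. (Of course, the container must contain the relevant time zone files; otherwise setting the variable has no effect.)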

➜  ~ docker run -it --rm centos:6.7 bash
[root@5517ffc4c3e8 /]# date
Fri Oct 26 08:14:41 UTC 2018
[root@5517ffc4c3e8 /]# exit
exit
➜  ~ docker run -it -e TZ=Asia/Shanghai --rm centos:6.7 bash
[root@5a9414f30610 /]# date
Fri Oct 26 16:15:40 CST 2018
[root@5a9414f30610 /]#

herosea commented 6 years ago

Giving this a manual thumbs-up 👍

opskumu commented 6 years ago

2018-10-29~11-02

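Adding an extra network device inside a Docker container, such as the VPN device `tun`

docker run -it -d --cap-add=NET_ADMIN --device /dev/net/tun:/dev/net/tun <image name>

With Docker, the --cap-add=NET_ADMIN and --device /dev/net/tun:/dev/net/tun options are all that is needed. (You could grant the container broader privileges with --privileged=true, but that is not recommended; stick to least privilege.)

For K8s, add the following: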

        securityContext:
          capabilities:
            add:
            - NET_ADMIN

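Kubernetes has no equivalent of the --device option (see the issue "add support for host devices"), so the device node has to be created with a few extra steps inside the container: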

mkdir -p /dev/net
mknod /dev/net/tun c 10 200
chmod 600 /dev/net/tun

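How to setup a VPN connection from inside a pod in Kubernetes

Runtime privilege and Linux capabilities

K8s volume mount permissions

When a container runs as a non-root user, volumes are still mounted as root, which causes read/write permission problems. Setting fsGroup in the Pod-level securityContext (PodSecurityContext) solves this: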

  securityContext:
    fsGroup: 2000

Configure a Security Context for a Pod or Container

K8s design documents and proposals

Kubernetes Design Documents and Proposals

Giving myself a thumbs-up. Yes, it is a matter of faith~ 💯

opskumu commented 5 years ago

2018-11-05~09

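Design principles for container-based cloud-native applications

Build time
- Image Immutability Principle: the same application image can be deployed unchanged to the Dev, Test, and Prod environments
- Single Concern Principle: each container addresses one concern and does it well; in other words, one process per container
- Self-Containment Principle: a container relies only on the Linux kernel; any other libraries are added at build time
Runtime
- High Observability Principle: every container must implement all the APIs needed to help the platform observe and manage the application in the best way possible
- Lifecycle Conformance Principle: a container must be able to catch events coming from the platform and react to them
- Process Disposability Principle: a container can be replaced at any time
- Runtime Confinement Principle: every container must declare its resource limits (CPU, memory, and so on)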

Cloud Native Container Design Principles Cloud Native Container Design Whitepaper

Deregistering a Consul service

$ curl \
--request PUT \
http://127.0.0.1:8500/v1/agent/service/deregister/my-service-id

https://stackoverflow.com/questions/41818020/how-to-delete-a-consul-service

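Removing a Kubernetes node correctly

To remove a Kubernetes node, first evict the Pods assigned to it with kubectl drain <node_name> so that the system reschedules them, rather than deleting the node outright.

Kubernetes audit tooling

K8GUARD -- The guardian angel for Kubernetes

chaoskube: stirring up trouble in K8s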

chaoskube periodically kills random pods in your Kubernetes cluster https://github.com/linki/chaoskube

kubectl

Shell completion for kubectl:

kubectl completion -h

The help output of the command above explains how to enable kubectl completion in zsh and bash.

kubectl combined with jq

kubectl get pods -n kube-system -o json | jq '.items[].metadata.name'
"gotty-4trf9"
... ...

# kubectl get pods -o json --all-namespaces | jq '.items |
group_by(.spec.nodeName) | map({"nodeName": .[0].spec.nodeName,
"count": length}) | sort_by(.count) | reverse'
[
  {
    "nodeName": "192.168.64.47",
    "count": 14
  },
  {
    "nodeName": "192.168.64.48",
    "count": 10
  },
  {
    "nodeName": "192.168.64.49",
    "count": 2
  }
]

https://jqplay.org/

https://stedolan.github.io/jq/manual/

kubectl alpha diff

opskumu commented 5 years ago

2018-11-12~16

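`sed -i` fails to modify `/etc/resolv.conf` inside a container

The main reason is that sed -i by default writes a temporary file and then renames it over the original, while files such as /etc/resolv.conf are mounted into the container. sed -i fails on every one of the following files: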

[root@df667d093644 /]# mount | grep /dev/sda1
/dev/sda1 on /etc/resolv.conf type ext4 (rw,relatime,data=ordered)
/dev/sda1 on /etc/hostname type ext4 (rw,relatime,data=ordered)
/dev/sda1 on /etc/hosts type ext4 (rw,relatime,data=ordered)
[root@df667d093644 /]# sed -i 's/ //g' /etc/resolv.conf
sed: cannot rename /etc/sedeaaVp8: Device or resource busy

The sed man page in this image documents a -c option that solves the problem:

       -c, --copy
              use copy instead of rename when shuffling files in -i mode

With -c added, the problem goes away:

[root@df667d093644 /]# sed -c -i 's/ //g' /etc/resolv.conf
[root@df667d093644 /]#
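The difference between the two modes comes down to whether the file keeps its inode. The sketch below imitates both strategies in Python so the effect can be observed on any ordinary file; the function names are made up, and this is an approximation of what sed does, not its actual implementation.

```python
import os
import tempfile


def edit_by_rename(path, transform):
    """Write the new content to a temp file, then rename it over `path`.

    This is the default `sed -i` strategy. The rename swaps in a new inode,
    which is the step that fails with "Device or resource busy" when `path`
    is itself a mount point, as /etc/resolv.conf is in a container.
    """
    directory = os.path.dirname(os.path.abspath(path))
    with open(path) as src:
        data = transform(src.read())
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as dst:
        dst.write(data)
    os.replace(tmp, path)


def edit_in_place(path, transform):
    """Rewrite the content of the existing file, keeping its inode.

    This is roughly what the copy mode achieves: nothing is renamed,
    so a mounted file can still be modified.
    """
    with open(path, "r+") as f:
        data = transform(f.read())
        f.seek(0)
        f.write(data)
        f.truncate()
```

Comparing `os.stat(path).st_ino` before and after each call shows the inode changing with the rename strategy and staying the same with the in-place one.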

How to Add Plugins to CoreDNS

https://coredns.io/2017/03/01/how-to-add-plugins-to-coredns/

opskumu commented 5 years ago

2018-11-19~23

Logging the real client IP in the Spring Boot Tomcat access log

server.tomcat.access-log-pattern=%{X-Forwarded-For}i %l %u %t "%r" %s %b

https://stackoverflow.com/questions/36356612/how-to-log-the-real-client-ip-on-embedded-tomcat-access-log-on-spring-boot-appli?rq=1

opskumu commented 5 years ago

2018-11-26~30

Custom zone resolution in CoreDNS

I read through some of the CoreDNS source code. Its plugin architecture is quite elegant, and CoreDNS is far more flexible than kubeDNS.

coredns.io:5300 {
    file db.coredns.io
}

example.io:53 {
    log
    errors
    file db.example.io
}

example.net:53 {
    file db.example.net
}

.:53 {
    kubernetes
    proxy . 8.8.8.8
    log
    errors
    cache
}

The zone file format is as follows (the same as BIND):

$ORIGIN example.org.
@   3600 IN SOA sns.dns.icann.org. noc.dns.icann.org. (
                2017042745 ; serial
                7200       ; refresh (2 hours)
                3600       ; retry (1 hour)
                1209600    ; expire (2 weeks)
                3600       ; minimum (1 hour)
                )

    3600 IN NS a.iana-servers.net.
    3600 IN NS b.iana-servers.net.

www     IN A     127.0.0.1
        IN AAAA  ::1

https://coredns.io/manual/toc/#configuration

https://github.com/coredns/coredns.io/blob/master/content/blog/custom-dns-and-kubernetes.md

Nginx ingress ssl-passthrough

Add the following annotation to the ingress:

nginx.ingress.kubernetes.io/secure-backends: "true"

https://github.com/kubernetes/ingress-nginx/issues/1947

opskumu commented 5 years ago

2018-12-03~07

Go Code Review Comments

Kubernetes RBAC

https://kubernetes.io/docs/reference/access-authn-authz/rbac/

On Go's `resp.Body.Close`: why Close must be called even when the Body is never read

is resp.Body.Close() necessary if we don't read anything from the body?

Go Gotcha: Closing a Nil HTTP Response Body With Defer

opskumu commented 5 years ago

2018-12-10~14

Docker Live Restore 特性分析

Running a kubectl exec command that contains a pipe

kubectl exec -it <pod> -- sh -c "<command> | <command>"

Add -- before the command and wrap the pipeline in sh -c so that the pipe runs inside the container. The official explanation (kubectl exec --help) reads:

# List contents of /usr from the first container of pod 123456-7890 and sort by modification time.
# If the command you want to execute in the pod has any flags in common (e.g. -i),
# you must use two dashes (--) to separate your command's flags/arguments.
# Also note, do not surround your command and its flags/arguments with quotes
# unless that is how you would execute it normally (i.e., do ls -t /usr, not "ls -t /usr").
kubectl exec 123456-7890 -i -t -- ls -t /usr

The quoted argument to sh -c is not affected by the note above; those quotes are required.
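When the same call is issued from a script rather than typed into a shell, the quoting question becomes a question of how the argument list is built. A minimal sketch, assuming kubectl is on the PATH; the helper name `kubectl_exec_argv` and the example pipeline are illustrative.

```python
def kubectl_exec_argv(pod, pipeline, flags=("-i",)):
    """Build the argv for `kubectl exec <flags> <pod> -- sh -c <pipeline>`.

    The whole pipeline is passed as ONE argument to `sh -c`, so the pipe
    is interpreted by the shell inside the container, not by the caller.
    """
    return ["kubectl", "exec", *flags, pod, "--", "sh", "-c", pipeline]


# Example (not executed here):
#   import subprocess
#   subprocess.run(kubectl_exec_argv("my-pod", "ps aux | grep java"))
```

Because the list is handed to the process directly, no extra quoting of the pipeline is needed on the calling side.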

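Replacing KubeDNS with CoreDNS

CoreDNS deployment kubernetes

CoreDNS rewrite rules

Earlier, we patched the KubeDNS source so that resolution could be driven by a domain field in the service annotations. Reading the source of the CoreDNS Kubernetes plugin shows that this approach is not feasible in CoreDNS. A CoreDNS rewrite rule covers this case instead, for example: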

  Corefile: |
    .:53 {
        errors
        health
        rewrite name <domain>.<namespace>.svc.cluster.local <service_name>.<namespace>.svc.cluster.local
        kubernetes  cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          upstream
          fallthrough in-addr.arpa
        }
        prometheus :9153
        cache 30
        loop
        reload
        loadbalance
    }

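This hijacks the <domain> name within the given namespace and points it at the application in that namespace.

Note: in real use, replace <domain>, <service_name>, and <namespace> with actual values.

Off topic: supporting pinned issues is fine, but why record the pin action in the timeline? Infuriating. It is torture for anyone with a compulsion for tidiness...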

opskumu commented 5 years ago

2018-12-17~21

NSFOCUS container security technical report (绿盟容器安全技术报告.pdf)

ansible yum with_items loop warning

[DEPRECATION WARNING]: Invoking "yum" only once while using a loop via squash_actions is deprecated. Instead of using a loop to supply multiple items and specifying `pkg: {{ item }}`

Change the original task from:

yum: pkg={{ item }} state=latest
with_items:
  - tuned
  - ceph-common

to -->

yum:
  name:
    - tuned
    - ceph-common
  state: latest

Using sysctls in a Kubernetes Cluster

Using sysctls in a Kubernetes Cluster

The MySQL `explicit_defaults_for_timestamp` option

This system variable determines whether the server enables certain nonstandard behaviors for default values and NULL-value handling in TIMESTAMP columns. By default, explicit_defaults_for_timestamp is disabled, which enables the nonstandard behaviors. https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_explicit_defaults_for_timestamp

With the option set to 0:

root@t1 10:50:25>desc t;
+-------+-----------+------+-----+-------------------+-------+
| Field | Type      | Null | Key | Default           | Extra |
+-------+-----------+------+-----+-------------------+-------+
| date  | timestamp | NO   |     | CURRENT_TIMESTAMP |       |
+-------+-----------+------+-----+-------------------+-------+
1 row in set (0.00 sec)

root@t1 10:51:07>insert into t(date) values(null);
Query OK, 1 row affected (0.01 sec)

root@t1 10:51:22>select * from t;
+---------------------+
| date                |
+---------------------+
| 2018-12-24 10:51:22 |
+---------------------+
1 row in set (0.00 sec)

root@t1 10:51:29>

With the option set to 1:

root@t1 10:52:07>insert into t(date) values(null);
ERROR 1048 (23000): Column 'date' cannot be null

explicit_defaults_for_timestamp is a read-only option; changing it requires editing the configuration and restarting the server.

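CoreDNS does not resolve a Service of type `ExternalName` that points at an IP

With KubeDNS, a Service of type ExternalName whose value is an IP address resolves normally, but after switching to CoreDNS the resolution stops working. See the CoreDNS issue "The service with IP as ExternalName did not work" for details. The workaround, as described in the issue, is to create a matching Service and Endpoints: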

apiVersion: v1
kind: Service
metadata:
  name: gdns
  namespace: default
spec:
  clusterIP: None
  ports:
  - name: dns
    port: 53
    protocol: UDP
---
kind: Endpoints
apiVersion: v1
metadata:
  name: gdns
  namespace: default
subsets:
  - addresses:
      - ip: 8.8.8.8
    ports:
      - port: 53
        name: dns
        protocol: UDP

The latest official documentation (earlier versions had no such remark) adds a special note for the case where ExternalName consists of digits, and recommends headless services when the target is an IP:

Note: ExternalName accepts an IPv4 address string, but as a DNS name comprised of digits, not as an IP address. ExternalNames that resemble IPv4 addresses are not resolved by CoreDNS or ingress-nginx because ExternalName is intended to specify a canonical DNS name. To hardcode an IP address, consider headless services. https://kubernetes.io/docs/concepts/services-networking/service/#externalname

The article "Kubernetes best practices: mapping external services" also covers mapping an external service to an IP, and its solution is likewise a headless service.
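Before migrating from KubeDNS to CoreDNS, it can help to find the Services affected by this. A minimal sketch, assuming each Service is a dict parsed from `kubectl get svc -o json`; the function name `external_name_is_ip` is illustrative.

```python
import ipaddress


def external_name_is_ip(service):
    """Return True if the Service is of type ExternalName and its
    externalName parses as an IP address rather than a DNS name."""
    spec = service.get("spec", {})
    if spec.get("type") != "ExternalName":
        return False
    try:
        ipaddress.ip_address(spec.get("externalName", ""))
    except ValueError:
        return False
    return True
```

Services for which this returns True are candidates for the headless Service plus Endpoints replacement shown above.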

opskumu commented 5 years ago

2018-12-24~30

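Troubleshooting `ping` latency

The symptom: when pinging an internal domain name (xxx.xxx.com), there was a long wait before the first ping reply appeared.

# ping xxx.xxx.com
// name resolution has in fact already completed at this point
PING xxx.xxx.com (192.168.10.11) 56(84) bytes of data.
// but after resolution there is a pause before the first reply arrives
64 bytes from 192.168.10.11: icmp_seq=1 ttl=63 time=0.231 ms
....

This happened only with ping; resolving the name on its own with dig was very fast and showed no problem. So, without further ado, bring out strace to see where things get stuck: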

# strace -f ping xxx.xxx.com
......
... sendto(4, "\7\205\203\0\1\0\0\0\1\0\1\00211\00210\003168\003192\7in-ad"..., 44, MSG_NOSIGNAL, NULL, 0) = 44
poll([{fd=4, events=POLLIN}], 1, 5000)  = 0 (Timeout)
gettimeofday({1545886622, 812281}, NULL) = 0
......
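// the stall is a reverse-lookup (PTR) request that timed out; the DNS timeout defaults to 5s

So the process is stuck on the reverse lookup of the IP. Searching Google for the keywords ping dns reverse leads to "Ping domain name slow, ping ip fast, nslookup fast, what's the problem", which confirms the issue.

# ping -n xxx.xxx.com    // adding -n skips the reverse lookup, and ping responds without delay

Finally, I checked reverse resolution on the internal DNS; it turned out to be an internal DNS configuration problem, which is now fixed.

The addresses and similar details above are made up and can be ignored.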
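The payload in the sendto line of the strace output can be recognized as a PTR query by rebuilding the query name by hand. A minimal sketch using only the standard library; the function name `ptr_query_name` is illustrative.

```python
import ipaddress


def ptr_query_name(ip):
    """Return the reverse-lookup name for `ip` and its DNS wire encoding.

    In the wire format each label is prefixed with its length and the
    name ends with a zero byte.
    """
    name = ipaddress.ip_address(ip).reverse_pointer
    wire = b"".join(
        bytes([len(label)]) + label.encode("ascii") for label in name.split(".")
    ) + b"\x00"
    return name, wire


if __name__ == "__main__":
    name, wire = ptr_query_name("192.168.10.11")
    print(name)       # 11.10.168.192.in-addr.arpa
    print(len(wire))  # 28
```

The encoded name begins with the same bytes that strace prints as "\00211\00210\003168\003192\7in-ad". Its 28 bytes, plus a 12-byte DNS header and 4 bytes of query type and class, add up to the 44 bytes reported by sendto.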

rpc error: code = 14 desc = grpc: the connection is unavailable

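This one is a Docker bug; I have hit it several times in the test environment: https://github.com/moby/moby/issues/30984

An approach to troubleshooting UDP packet loss on Linux

Analyzing UDP packet loss on Linux systems (linux 系统 UDP 丢包问题分析思路)

Note: do not use the dropwatch command lightly; it may cause system anomalies.

Setting container log options through kubelet (not supported with Docker)

Set the options as described in "Set Kubelet parameters via a config file":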

kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
containerLogMaxSize: "100Mi"

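See KubeletConfiguration for more options. Going forward, many options no longer take effect when passed on the command line and have to be set in the configuration file.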

https://github.com/kubernetes/kubernetes/pull/59898 CRIContainerLogRotation

Nginx ingress ssl-passthrough

The annotation used earlier stopped working, which, embarrassingly, caused a minor outage.

It has since been replaced by nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"

https://github.com/kubernetes/ingress-nginx/blob/master/docs/user-guide/nginx-configuration/annotations.md

opskumu commented 5 years ago

opskumu / issues

Weekly Study Notes「2018」 #19

Weekly reports by year

Kubernetes tips #10 (early notes)

Weekly Study Notes「2018」 #19

Weekly Study Notes「2019」 #23

Weekly Study Notes「2020」 #26
