Closed TheLogFather closed 7 years ago
Cloud logs still showing as empty... [bump, I guess]
/cc @colbygk @rschamp
Checking on the monitor host:
There are loads of shard errors, with what appear to be filesystem level I/O read errors.
Steps taken:
1) Unmounted the /var/lib/elasticsearch filesystem (/dev/sdb1)
fsck -y /dev/sdb1
Lots and lots and lots of correctable errors reported.
2) Remounted /var/lib/elasticsearch
3) New errors showed up, and the filesystem went read-only as soon as any writes occurred (triggered by CLI operations; elasticsearch never went green and did not initialize its shards properly)
4) /dev/sdb1 is a RAID10, 4 x 111GB Intel SSDs (donated several years before)
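A quick way to confirm the read-only flip described in step 3 is to look at the mount options the kernel reports. The following is a minimal sketch, not what was run on the host: `is_mounted_readonly` is a hypothetical helper name, and it assumes a Linux system where `/proc/mounts` lists the options in the fourth field.

```shell
#!/bin/sh
# is_mounted_readonly MOUNTPOINT [MOUNTS_FILE]
# Hypothetical helper: exits 0 if MOUNTPOINT is currently mounted with the
# "ro" option, 1 otherwise (including when it is not mounted at all).
# MOUNTS_FILE defaults to /proc/mounts (Linux-specific assumption).
is_mounted_readonly() {
    mountpoint=$1
    mounts_file=${2:-/proc/mounts}
    awk -v mp="$mountpoint" '
        $2 == mp {
            # Field 4 is a comma-separated option list, e.g. "ro,relatime".
            n = split($4, opts, ",")
            ro = 0
            for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
            found = 1   # if the mount point is listed twice, the last entry wins
        }
        END { exit (found && ro) ? 0 : 1 }
    ' "$mounts_file"
}
```

For example, `is_mounted_readonly /var/lib/elasticsearch && echo "read-only"` would report the state after a write attempt.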
These appear to be genuine read errors of some sort. Rebuilding disk from scratch:
1) copied /var/lib/elasticsearch/scratch-prod-elasticsearch/nodes/0/_state to /tmp
2) umount /var/lib/elasticsearch
3) mkfs -t ext4 /dev/sdb1
4) mount /var/lib/elasticsearch
5) copied _state back to /var/lib/elasticsearch/scratch-prod-elasticsearch/nodes/0/ with correct ownership/permissions
6) restarted elasticsearch
7) no more shard errors
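For reference, the rebuild steps above could be scripted roughly as follows. This is only a sketch under stated assumptions, not the exact commands that were typed: `run`, `rebuild_es_volume`, and `DRY_RUN` are names introduced here, the `elasticsearch:elasticsearch` owner and the `service` restart command are guesses about this host, and `mount` with only a mount point relies on an /etc/fstab entry. It defaults to dry-run, so it only prints what it would do.

```shell
#!/bin/sh
# Dry-run by default: commands are printed with a "+ " prefix, not executed.
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it
# and abort on the first failure.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@" || { echo "failed: $*" >&2; exit 1; }
    fi
}

rebuild_es_volume() {
    dev=/dev/sdb1
    mnt=/var/lib/elasticsearch
    node_dir=$mnt/scratch-prod-elasticsearch/nodes/0

    run cp -a "$node_dir/_state" /tmp/      # 1) save _state, keeping ownership and modes
    run umount "$mnt"                       # 2)
    run mkfs -t ext4 "$dev"                 # 3) DESTROYS everything on the device
    run mount "$mnt"                        # 4) assumes an /etc/fstab entry for $mnt
    run mkdir -p "$node_dir"                # the new filesystem is empty
    run cp -a /tmp/_state "$node_dir/"      # 5) restore _state
    run chown -R elasticsearch:elasticsearch "$mnt"   # assumption: service user/group
    run service elasticsearch restart       # 6) assumption: init system in use
}
```

Calling `rebuild_es_volume` with `DRY_RUN=1` lists the eight commands; setting `DRY_RUN=0` would actually reformat the device, so the printed plan should be reviewed first.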
    root@:/var/lib/elasticsearch/scratch-prod-elasticsearch/nodes/0# curl -s -XGET http://localhost:9200/_cluster/health?pretty | awk '{print " "$0}'
    {
      "cluster_name" : "scratch-prod-elasticsearch",
      "status" : "yellow",
      "timed_out" : false,
      "number_of_nodes" : 1,
      "number_of_data_nodes" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 5
    }
    root@:/var/lib/elasticsearch/scratch-prod-elasticsearch/nodes/0# curl -s -XGET http://localhost:9200/_cat/shards | awk '{print " "$0}'
    cloudlogs 2 p STARTED 6013 3.7mb ...
    cloudlogs 2 r UNASSIGNED
    cloudlogs 0 p STARTED 5954 7.2mb ...
    cloudlogs 0 r UNASSIGNED
    cloudlogs 3 p STARTED 6049 3.4mb ...
    cloudlogs 3 r UNASSIGNED
    cloudlogs 1 p STARTED 6045 3.5mb ...
    cloudlogs 1 r UNASSIGNED
    cloudlogs 4 p STARTED 6008 3.2mb ... *
    cloudlogs 4 r UNASSIGNED
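The "yellow" status here is expected rather than a leftover fault: all 5 primaries are STARTED, and the 5 unassigned shards are the replicas, which Elasticsearch will not place on the same node as their primary, so a single-node cluster cannot assign them. A small sketch for summarizing `_cat/shards` output makes that easy to see at a glance; `summarize_shards` is a hypothetical helper name, and it assumes the default column order (index, shard, prirep, state, ...).

```shell
#!/bin/sh
# summarize_shards: read `_cat/shards` output on stdin and print one line per
# (primary/replica, state) pair with the number of shards in that state.
# Assumes the default columns: index shard prirep state [docs store ip node].
summarize_shards() {
    awk 'NF >= 4 { count[$3 " " $4]++ }
         END { for (k in count) print k, count[k] }' | sort
}
```

Used as `curl -s http://localhost:9200/_cat/shards | summarize_shards`, the listing above would collapse to two lines: the started primaries and the unassigned replicas.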
The onset of heavier-than-normal load on the host appears to correspond with the power interruption on June 1-2.
Went to https://scratch.mit.edu/cloudmonitor/<projectid>/ and confirmed that cloud data logging is working again...
I'm getting "no Cloud data activity" at the moment, for all projects I've looked at. (Even though I can see cloud is being updated for the projects.)