rockstor / rockstor-core

Linux/BTRFS based Network Attached Storage (NAS)
http://rockstor.com/docs/contribute_section.html
GNU General Public License v3.0

"DEV ERRORS DETECTED" on all disk pools #1973

Open barnhill opened 5 years ago

barnhill commented 5 years ago

After 3.9.2-41 update or sometime recently this error started showing up on both of my disk pools.

(Screenshot: "DEV ERRORS DETECTED" warning shown on both pools.)

I highly doubt the SSD and the separate RAID array both went bad at exactly the same time.

phillxnet commented 5 years ago

@barnhill Thanks for your report, but those outputs are directly sourced from the following command:

btrfs dev stats /mnt2/pool-name

so your doubts should hopefully be addressed by the output of that command on each of the pools.

But note that this is not necessarily an indication of hardware error, re: "I highly doubt the SSD and the Raid array separate from it are both bad at the exact same time." It is an indication of errors found at the filesystem level (though basic IO errors are included as well), which may already have been corrected by the filesystem (check your recent scrub reports). Also note each pool's details page, which has more information on the affected drives within the pool, again referenced from "btrfs dev stats" but this time using each drive as a parameter.
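For example, to query a single member drive directly (device name here is illustrative):

btrfs dev stats /dev/sda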

Hope that allays your doubts. Also note, as described on each pool's details page, that this is a cumulative report and you can reset it to zero (the command to do so is documented on the pool details page as well).
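For reference, the reset uses the -z (--reset) flag of the same command (pool name illustrative):

btrfs dev stats -z /mnt2/pool-name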

Let us know the output from each of your pools for the quoted command and whether it in fact tallies with the Web-UI. And keep in mind that a scrub can correct a number of these types of errors, but obviously be wary of a controller or disk that has 'all the errors'.

Thanks again for your feedback and do be sure to check out each pool's details page (clicking on the pool name usually gets you there).

shocker2 commented 5 years ago

Hello, I'm seeing the same error, but the output of the command shows none.

[root@terente ~]# btrfs dev stats /mnt2/rockstor_rockstor/
[/dev/md127].write_io_errs    0
[/dev/md127].read_io_errs     0
[/dev/md127].flush_io_errs    0
[/dev/md127].corruption_errs  0
[/dev/md127].generation_errs  0
[root@host ~]# btrfs dev stats /mnt2/mounting_point/
[/dev/sdc].write_io_errs    0
[/dev/sdc].read_io_errs     0
[/dev/sdc].flush_io_errs    0
[/dev/sdc].corruption_errs  0
[/dev/sdc].generation_errs  0
[/dev/sdd].write_io_errs    0
[/dev/sdd].read_io_errs     0
[/dev/sdd].flush_io_errs    0
[/dev/sdd].corruption_errs  0
[/dev/sdd].generation_errs  0
[/dev/sde].write_io_errs    0
[/dev/sde].read_io_errs     0
[/dev/sde].flush_io_errs    0
[/dev/sde].corruption_errs  0
[/dev/sde].generation_errs  0
[/dev/sdf].write_io_errs    0
[/dev/sdf].read_io_errs     0
[/dev/sdf].flush_io_errs    0
[/dev/sdf].corruption_errs  0
[/dev/sdf].generation_errs  0
[/dev/sdg].write_io_errs    0
[/dev/sdg].read_io_errs     0
[/dev/sdg].flush_io_errs    0
[/dev/sdg].corruption_errs  0
[/dev/sdg].generation_errs  0
[/dev/sdh].write_io_errs    0
[/dev/sdh].read_io_errs     0
[/dev/sdh].flush_io_errs    0
[/dev/sdh].corruption_errs  0
[/dev/sdh].generation_errs  0
[/dev/sdi].write_io_errs    0
[/dev/sdi].read_io_errs     0
[/dev/sdi].flush_io_errs    0
[/dev/sdi].corruption_errs  0
[/dev/sdi].generation_errs  0
[/dev/sdj].write_io_errs    0
[/dev/sdj].read_io_errs     0
[/dev/sdj].flush_io_errs    0
[/dev/sdj].corruption_errs  0
[/dev/sdj].generation_errs  0
[/dev/sdk].write_io_errs    0
[/dev/sdk].read_io_errs     0
[/dev/sdk].flush_io_errs    0
[/dev/sdk].corruption_errs  0
[/dev/sdk].generation_errs  0
[/dev/sdl].write_io_errs    0
[/dev/sdl].read_io_errs     0
[/dev/sdl].flush_io_errs    0
[/dev/sdl].corruption_errs  0
[/dev/sdl].generation_errs  0
[/dev/sdm].write_io_errs    0
[/dev/sdm].read_io_errs     0
[/dev/sdm].flush_io_errs    0
[/dev/sdm].corruption_errs  0
[/dev/sdm].generation_errs  0
[/dev/sdn].write_io_errs    0
[/dev/sdn].read_io_errs     0
[/dev/sdn].flush_io_errs    0
[/dev/sdn].corruption_errs  0
[/dev/sdn].generation_errs  0
[/dev/sdo].write_io_errs    0
[/dev/sdo].read_io_errs     0
[/dev/sdo].flush_io_errs    0
[/dev/sdo].corruption_errs  0
[/dev/sdo].generation_errs  0
[/dev/sdp].write_io_errs    0
[/dev/sdp].read_io_errs     0
[/dev/sdp].flush_io_errs    0
[/dev/sdp].corruption_errs  0
[/dev/sdp].generation_errs  0
[/dev/sdq].write_io_errs    0
[/dev/sdq].read_io_errs     0
[/dev/sdq].flush_io_errs    0
[/dev/sdq].corruption_errs  0
[/dev/sdq].generation_errs  0
[root@host ~]#

Running v3.9.2-44

phillxnet commented 5 years ago

@shocker2 Thanks for potentially confirming this 'false alert'. Could you also, as per my request to @barnhill, confirm that the pool details page also shows no errors in its breakdown of this command in the bottom-of-page table?

I now have an idea of what this might be and just need a little more info to narrow this down.

I am assuming that, as per @barnhill, you are seeing this alert on both of your pools. In which case, can you also execute the following command sequence as root:

btrfs dev stats -c /mnt2/rockstor_rockstor
echo $?

and again with your data pool:

btrfs dev stats -c /mnt2/mounting_point
echo $?

We are after the return codes for each of the commands in turn, hence it's important not to execute anything else in between the btrfs command and the echo command.

My suspicion is that this is, as yet, not a reliable indicator; that would explain the all zeros (assuming they are also displayed as all zeros on the two pools' details pages) alongside the observed warning.

I.e. we use this documented return code (bitwise AND with 64) as a quick indicator rather than parsing the entire output of the command. But you and @barnhill may have systems where this is not a reliable indicator; I haven't seen this myself, but it would be great to resolve the cause of this alert anomaly.
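For the record, a minimal sketch of that style of check, assuming the command is simply run and its exit status tested (subprocess is used here for illustration; the actual helper in btrfs.py may differ):

import subprocess

def dev_stats_zero(mnt_pt):
    # 'btrfs dev stats -c' is documented to set bit 6 (value 64)
    # of its exit code when any error counter is non-zero.
    rc = subprocess.call(['btrfs', 'dev', 'stats', '-c', mnt_pt],
                         stdout=subprocess.DEVNULL)
    return (rc & 64) == 0  # True when no errors are reported

A False result here is what would drive the Web-UI warning; the anomaly in this issue is the warning showing despite the all-zero counters above.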

Also, could you confirm your kernel version via

uname -a

Thanks again for your report. This is a relatively new feature and an important one so I would like to tend to this as soon as possible, pending your reply.

shocker2 commented 5 years ago

[root@host ~]# btrfs dev stats -c /mnt2/rockstor_rockstor
[/dev/md127].write_io_errs    0
[/dev/md127].read_io_errs     0
[/dev/md127].flush_io_errs    0
[/dev/md127].corruption_errs  0
[/dev/md127].generation_errs  0
[root@host ~]# echo $?
0

[root@host ~]# btrfs dev stats -c /mnt2/path/
[/dev/sdc].write_io_errs    0
[/dev/sdc].read_io_errs     0
[/dev/sdc].flush_io_errs    0
[/dev/sdc].corruption_errs  0
[/dev/sdc].generation_errs  0
[/dev/sdd].write_io_errs    0
[/dev/sdd].read_io_errs     0
[/dev/sdd].flush_io_errs    0
[/dev/sdd].corruption_errs  0
[/dev/sdd].generation_errs  0
[/dev/sde].write_io_errs    0
[/dev/sde].read_io_errs     0
[/dev/sde].flush_io_errs    0
[/dev/sde].corruption_errs  0
[/dev/sde].generation_errs  0
[/dev/sdf].write_io_errs    0
[/dev/sdf].read_io_errs     0
[/dev/sdf].flush_io_errs    0
[/dev/sdf].corruption_errs  0
[/dev/sdf].generation_errs  0
[/dev/sdg].write_io_errs    0
[/dev/sdg].read_io_errs     0
[/dev/sdg].flush_io_errs    0
[/dev/sdg].corruption_errs  0
[/dev/sdg].generation_errs  0
[/dev/sdh].write_io_errs    0
[/dev/sdh].read_io_errs     0
[/dev/sdh].flush_io_errs    0
[/dev/sdh].corruption_errs  0
[/dev/sdh].generation_errs  0
[/dev/sdi].write_io_errs    0
[/dev/sdi].read_io_errs     0
[/dev/sdi].flush_io_errs    0
[/dev/sdi].corruption_errs  0
[/dev/sdi].generation_errs  0
[/dev/sdj].write_io_errs    0
[/dev/sdj].read_io_errs     0
[/dev/sdj].flush_io_errs    0
[/dev/sdj].corruption_errs  0
[/dev/sdj].generation_errs  0
[/dev/sdk].write_io_errs    0
[/dev/sdk].read_io_errs     0
[/dev/sdk].flush_io_errs    0
[/dev/sdk].corruption_errs  0
[/dev/sdk].generation_errs  0
[/dev/sdl].write_io_errs    0
[/dev/sdl].read_io_errs     0
[/dev/sdl].flush_io_errs    0
[/dev/sdl].corruption_errs  0
[/dev/sdl].generation_errs  0
[/dev/sdm].write_io_errs    0
[/dev/sdm].read_io_errs     0
[/dev/sdm].flush_io_errs    0
[/dev/sdm].corruption_errs  0
[/dev/sdm].generation_errs  0
[/dev/sdn].write_io_errs    0
[/dev/sdn].read_io_errs     0
[/dev/sdn].flush_io_errs    0
[/dev/sdn].corruption_errs  0
[/dev/sdn].generation_errs  0
[/dev/sdo].write_io_errs    0
[/dev/sdo].read_io_errs     0
[/dev/sdo].flush_io_errs    0
[/dev/sdo].corruption_errs  0
[/dev/sdo].generation_errs  0
[/dev/sdp].write_io_errs    0
[/dev/sdp].read_io_errs     0
[/dev/sdp].flush_io_errs    0
[/dev/sdp].corruption_errs  0
[/dev/sdp].generation_errs  0
[/dev/sdq].write_io_errs    0
[/dev/sdq].read_io_errs     0
[/dev/sdq].flush_io_errs    0
[/dev/sdq].corruption_errs  0
[/dev/sdq].generation_errs  0
[root@host ~]# echo $?
0

Linux host 4.12.4-1.el7.elrepo.x86_64 #1 SMP Thu Jul 27 20:03:28 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

phillxnet commented 5 years ago

@shocker2 Thanks for the output.

OK, it doesn't look like our quick return code check was to blame for this anomaly then. Also, from before: "Could you also, as per my request to @barnhill, confirm that the pool details page also shows no errors in its breakdown of this command in the bottom-of-page table?"

It now looks like you possibly have some kind of db lock issue.

Take a look at /opt/rockstor/var/log/rockstor.log (the less command should help) to see if any Django errors are occurring there. Our code to surface this state is actually very simple:

https://github.com/rockstor/rockstor-core/blob/master/src/rockstor/fs/btrfs.py#L98-L113

So this is again quite a conundrum. Let me know the answer to the above pool details page question, or maybe a pic, and any issues you find in that log file.
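If that log is large, something along these lines (plain grep, nothing Rockstor specific) should surface any Python tracebacks with some context:

grep -n -A 10 'Traceback' /opt/rockstor/var/log/rockstor.log | less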

Thanks for persevering with helping to track this one down.

shocker2 commented 5 years ago

Screenshots attached (three captures from 2018-11-23).

phillxnet commented 5 years ago

@shocker2 Thanks for the pics. This has got us a little further. I'm currently working on another issue now, but I'll make some notes for my future self, or others, to pick up from.

Issue notes: The "device stats unsupported" message is the next clue here. It is displayed by the js Handlebars helper ioErrorStatsTableData:

Handlebars.registerHelper('ioErrorStatsTableData', function (stats) {

https://github.com/rockstor/rockstor-core/blob/master/src/rockstor/storageadmin/static/storageadmin/js/views/pool_details_layout_view.js#L452-L459

which fails over to this message if it finds it has no info to display, or the info fails a json test.

From the caveats section of the pull request that added this dev stats error surfacing ("Show warn on dashboard if errors occurs on IO or fs. Fixes #1532" #1958), there is a note which would suggest a potential dev name resolution issue, but we have displayed both by-id and temp names for all devices.

Pool model uses: dev_stats_zero(mnt_pt)
Disk model uses: get_dev_io_error_stats(str(self.target_name))

Unit tests with a variety of test data are written for both.
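Purely as an illustration of the approach (this is not the actual Rockstor test suite; the helper body is the earlier sketch from this thread, and all names are assumptions): a return-code check like this can be unit tested by stubbing out the command call:

import subprocess
import unittest
from unittest import mock

def dev_stats_zero(mnt_pt):
    # same illustrative helper as sketched earlier in this thread
    rc = subprocess.call(['btrfs', 'dev', 'stats', '-c', mnt_pt],
                         stdout=subprocess.DEVNULL)
    return (rc & 64) == 0

class DevStatsZeroTests(unittest.TestCase):
    @mock.patch('subprocess.call', return_value=0)
    def test_clean_pool(self, mock_call):
        # exit code 0: no error counters set, so no warning expected
        self.assertTrue(dev_stats_zero('/mnt2/test-pool'))

    @mock.patch('subprocess.call', return_value=64)
    def test_errors_found(self, mock_call):
        # bit 6 (value 64) set: errors found, warning should show
        self.assertFalse(dev_stats_zero('/mnt2/test-pool'))

if __name__ == '__main__':
    unittest.main()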

vzz3 commented 5 years ago

After entering a subscription code and executing yum update, I had the exact same issue as @shocker2. However, after a reboot the "DEV ERRORS DETECTED" message is not shown any more.

phillxnet commented 5 years ago

@vzz3 Thanks for your feedback on this one. Much appreciated.

@barnhill and @shocker2 I'd like to know if @vzz3's suggestion helps in your cases?

Also, there is now a 3.9.2-49, as of the 2nd of September. In your cases it may very well be worth giving this version a try, as it has a number of speedups that will particularly benefit large drive counts, most relevant here for @shocker2.

I haven't forgotten this issue, and there has been some progress (hopefully) on what may be leading to this false alarm scenario.