[SATA ALPM] ssd hard drive error due to file system remount in read only

pvanhauw commented 10 years ago

I use Linux Mint 17 Qiana 64-bit with Cinnamon. I installed a new SSD, a Crucial MX100 512GB, in a Samsung ATIV Book 8 (NP870).

After some time, and ALWAYS when the laptop stays idle with the screen turned off, the file system is remounted read-only because of an error.

You can find all the information here: http://forums.linuxmint.com/viewtopic.php?f=49&t=174315

The most important part is the dmesg:

[ 1982.874590] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x40000 action 0x6 frozen
[ 1982.874595] ata5: SError: { CommWake }
[ 1982.874598] ata5.00: failed command: FLUSH CACHE EXT
[ 1982.874602] ata5.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
[ 1982.874602] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[ 1982.874604] ata5.00: status: { DRDY }
[ 1982.874607] ata5: hard resetting link
[ 1988.238907] ata5: link is slow to respond, please be patient (ready=0)
[ 1992.890664] ata5: COMRESET failed (errno=-16)
[ 1992.890670] ata5: hard resetting link
[ 1998.254987] ata5: link is slow to respond, please be patient (ready=0)
[ 2002.906743] ata5: COMRESET failed (errno=-16)
[ 2002.906750] ata5: hard resetting link
[ 2008.271052] ata5: link is slow to respond, please be patient (ready=0)
[ 2037.975036] ata5: COMRESET failed (errno=-16)
[ 2037.975042] ata5: limiting SATA link speed to 3.0 Gbps
[ 2037.975044] ata5: hard resetting link
[ 2043.003094] ata5: COMRESET failed (errno=-16)
[ 2043.003101] ata5: reset failed, giving up
[ 2043.003103] ata5.00: disabled
[ 2043.003105] ata5.00: device reported invalid CHS sector 0
[ 2043.003114] ata5: EH complete
[ 2043.003151] sd 4:0:0:0: [sda] Unhandled error code
[ 2043.003153] sd 4:0:0:0: [sda]
[ 2043.003154] sd 4:0:0:0: [sda] Unhandled error code
[ 2043.003156] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 2043.003158] sd 4:0:0:0: [sda] CDB:
[ 2043.003163] sd 4:0:0:0: [sda]
[ 2043.003163] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[ 2043.003165] sd 4:0:0:0: [sda] CDB:
[ 2043.003159] Write(10): 2a 00
[ 2043.003166] Write(10): 2a 00 0e a9 70 c8 00 00 08 00
[ 2043.003176] end_request: I/O error, dev sda, sector 245985480
[ 2043.003179] EXT4-fs warning (device sda8): ext4_end_bio:317: I/O error writing to inode 916782 (offset 0 size 4096 starting block 30748186)
[ 2043.003183] Buffer I/O error on device sda8, logical block 3761689
[ 2043.003180] 0f 22 6f e0 00 00 08 00

linrunner commented 10 years ago

Hi,

some devices don't work reliably with ALPM. Try

SATA_LINKPWR_ON_BAT=max_performance

or

SATA_LINKPWR_ON_BAT=medium_power
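
For anyone trying the workaround: after changing the setting it can be useful to confirm which policy is actually in effect. The sketch below assumes the kernel exposes the policy per SATA host as /sys/class/scsi_host/host*/link_power_management_policy. The function name show_link_policies and its optional directory argument are made up for illustration and are not part of TLP.

```shell
#!/bin/sh
# Print the active SATA link power management policy of every host adapter.
# The optional directory argument (hypothetical, for illustration) lets the
# function be tried against a copy of the tree; by default the sysfs path is used.
show_link_policies() {
    root="${1:-/sys/class/scsi_host}"
    for f in "$root"/host*/link_power_management_policy; do
        # Skip the unexpanded glob when no host exposes the attribute.
        [ -r "$f" ] || continue
        host="${f%/link_power_management_policy}"
        printf '%s: %s\n' "${host##*/}" "$(cat "$f")"
    done
}
```

Calling show_link_policies without arguments on battery should list one line per host; if the workaround took effect, the host the SSD is attached to would be expected to show the value set in SATA_LINKPWR_ON_BAT.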

ConvMe commented 10 years ago

Can confirm this issue on an Acer Aspire V5-573G. Linux Mint 17 Qiana 64 with Cinnamon desktop and Crucial MX100 512GB. On battery I get the I/O error.

Changing "SATA_LINKPWR_ON_BAT" solved this problem for me. "medium_power" was good enough.

Thanks

linrunner commented 10 years ago

@pvanhauw: did my suggested workaround help?

lupinix commented 10 years ago

Had the same issue with ThinkPad R400 and MX100 512GB, SATA_LINKPWR_ON_BAT=max_performance solved the problem

falense commented 10 years ago

@linrunner I can tentatively confirm that your workaround helps. I had the same issues (Asus UX32LN + MX100 512GB + Linux Mint 17) and have been trying to reproduce the error for the past 3 days. So far no crashes. For me the crashes were seemingly random and somewhat far apart, I will report back in another couple of days.

linrunner commented 10 years ago

@sondree: which of the two suggested values?

falense commented 10 years ago

@linrunner SATA_LINKPWR_ON_BAT=medium_power

halocaridina commented 10 years ago

FYI for others running across this issue:

An upstream report can be found at: https://bugzilla.kernel.org/show_bug.cgi?id=72191

Please note Comment #23; specifically, the "medium_power" workaround for laptop models such as the Lenovo T440S appears to be SSD model sensitive/specific. So be sure to carefully track your journal/logs if using an ALPM setting besides "max_performance".

linrunner commented 10 years ago

Hi guys,

could you post some more dmesg snippets so I can design a regexp for this?

I'm considering adding a warning to the tlp-stat output.

halocaridina commented 10 years ago

Just searched my journal and no longer have any of the past errors available for posting (and rather not force the issue by inducing them). However, they were similar/identical to those under the initial post here:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/539467

linrunner commented 10 years ago

I have implemented a check for the above errors in tlp-stat – sample output:

+++ Warnings
* Kernel log shows ata errors (2) possibly caused by the configuration: SATA_LINKPWR_ON_AC/BAT=min/medium_power
  --> Consider using medium_power or max_performance instead!
  --> Check yourself with:
      dmesg | egrep -A 5 "ata[0-9]+: SError: { .*CommWake }"
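
For readers who want to run the same kind of check on a saved log, here is a minimal sketch that counts matching lines on standard input. It reuses the pattern from the sample output above, with the braces written as bracket expressions so that grep -E treats them literally. The function name count_commwake_errors is hypothetical, and this is not how tlp-stat itself is implemented.

```shell
#!/bin/sh
# Count kernel log lines on stdin that report a CommWake SError.
# grep -c prints the number of matching lines; "|| :" hides grep's
# non-zero exit status when the count is 0.
count_commwake_errors() {
    grep -E -c 'ata[0-9]+: SError: [{] .*CommWake [}]' || :
}
```

Typical use would be dmesg | count_commwake_errors, or feeding it a saved journal excerpt. On some systems reading the kernel log requires root.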

pvanhauw commented 10 years ago

Yep, I confirm

SATA_LINKPWR_ON_BAT=max_performance

worked. I have not done extensive testing with the medium option yet.

Pierre

falense commented 10 years ago

@linrunner @pvanhauw I have been using medium_power on my setup for the past 2 weeks. No errors related to this so far, seems all good :)

linrunner commented 10 years ago

Check released with 0.6.

linrunner commented 10 years ago

I leave this open. More reports are welcome.

saschalalala commented 9 years ago

Thinkpad L420, Fedora 22, same problem with a Crucial MX100, set it to max_performance and it works.

sdh4 commented 9 years ago

Thinkpad Yoga, Fedora 22, same problem with a Crucial MX100, set it to max_performance and it works.

I also note a number of comments on the Crucial support forums mentioning system stability or slowdowns under Windows until SATA link power management is disabled.

I plan on commenting on this kernel bug: https://bugzilla.kernel.org/show_bug.cgi?id=89261 and suggesting a blacklist entry that disables SATA link power management for the MX100.

dezull commented 9 years ago

On a Thinkpad T450 with a Crucial MX100 512GB SSD and SATA_LINKPWR_ON_BAT=min_power: if I put the laptop to sleep before unplugging the power source, it works normally after that.

UrsMetz commented 9 years ago

I'm running Debian Jessie on a Thinkpad T440s with an ordinary HDD (i.e. no SSD). When switching to SATA_LINKPWR_ON_BAT=medium_power I get the following in dmesg --human:

[Nov 8 11:47] ata1.00: exception Emask 0x10 SAct 0x6000000 SErr 0x50000 action 0xe frozen
[  +0,000005] ata1.00: irq_stat 0x00400000, PHY RDY changed
[  +0,000002] ata1: SError: { PHYRdyChg CommWake }
[  +0,000003] ata1.00: failed command: WRITE FPDMA QUEUED
[  +0,000004] ata1.00: cmd 61/08:c8:38:09:44/00:00:15:00:00/40 tag 25 ncq 4096 out
         res 50/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[  +0,000002] ata1.00: status: { DRDY }
[  +0,000002] ata1.00: failed command: WRITE FPDMA QUEUED
[  +0,000004] ata1.00: cmd 61/08:d0:28:0b:91/00:00:15:00:00/40 tag 26 ncq 4096 out
         res 50/00:03:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[  +0,000001] ata1.00: status: { DRDY }

When using SATA_LINKPWR_ON_BAT=max_performance the error messages disappear. Currently tlp-stat does not issue a warning (I'm using TLP from http://repo.linrunner.de/debian, I guess it is version 0.8). So I'm providing my dmesg output as requested by @linrunner :-).

UrsMetz commented 9 years ago

I just found out you have to activate warnings (tlp-stat -w) in order to see the message. So the check also works on my machine, please ignore my previous comment.

linrunner commented 9 years ago

Hi Urs, nevertheless thanks for your report.

Since 0.8 [1], tlp-stat without parameters and tlp-stat -d should both produce the warnings section at the end of the output. I'd appreciate it if you could retest this.

[1] https://github.com/linrunner/TLP/blob/master/tlp-stat.in#L1103

UrsMetz commented 9 years ago

@linrunner I've just retested it: both tlp-stat without parameters and tlp-stat -d issue the warning as expected. It could be that when I checked this morning I had already switched back to SATA_LINKPWR_ON_BAT=max_performance.

linrunner commented 9 years ago

@UrsMetz: thanks for your reassuring feedback.

nlowe commented 8 years ago

Just got bit by this on vacation this past weekend, System76 Kudu Professional, with an aftermarket Crucial MX100 512gb installed by myself. Wish I found this before spending the past 6.5 hours doing fsck -vccfk /dev/sdb1 ! Didn't even make the connection that I only got hit by this on battery... It's late my time but I'll try this out tomorrow, seems like a possible fix.

EDIT: Yup, the medium_performance setting works for me! Thanks! :+1:

jwrdegoede commented 7 years ago

Hi All,

I've been working on a kernel patch adding a new SATA LPM policy: med_power_with_dipm, which matches the power-management defaults from the Intel RST Windows drivers. It would be interesting if people who were having issues with min_power, but are fine with medium_power, could test this new policy. It saves almost as much power as min_power and hopefully, since it mimics Windows, should not hit any SSD firmware bugs like min_power sometimes does.

For more info see: https://hansdegoede.livejournal.com/18412.html

Regards,

Hans
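
For testers who want to try the new policy by hand rather than through TLP, a minimal sketch follows. It assumes the same per-host sysfs attribute, /sys/class/scsi_host/host*/link_power_management_policy, accepts the policy name when written as root; according to the later comments in this thread that needs a kernel with the patch (4.15+). The function name set_link_policy and its optional directory argument are hypothetical.

```shell
#!/bin/sh
# Write a link power management policy name to every SATA host.
# Usage: set_link_policy POLICY [DIR]
# DIR (hypothetical, for illustration) defaults to the sysfs location.
# Attributes that are not writable (e.g. when not running as root) are
# skipped; rc becomes 1 if the shell reports a failed write.
set_link_policy() {
    policy="$1"
    root="${2:-/sys/class/scsi_host}"
    rc=0
    for f in "$root"/host*/link_power_management_policy; do
        [ -w "$f" ] || continue
        printf '%s\n' "$policy" > "$f" || rc=1
    done
    return "$rc"
}
```

For example, set_link_policy med_power_with_dipm run as root. TLP applies the configured policy again on its own, so a manual change like this is only useful for a quick test.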

UrsMetz commented 7 years ago

@jwrdegoede Is there any easy way to test your patch on Debian (in fact jessie as I haven't updated yet)? I just had a quick look at your blog post and you only mention Fedora.

jwrdegoede commented 7 years ago

Hi,

On 15-09-17 22:25, Urs Metz wrote:

> @jwrdegoede Is there any easy way to test your patch on Debian (in fact jessie as I haven't updated yet)? I just had a quick look at your blog post and you only mention Fedora.

Nope, sorry you will need to build a kernel with the patch yourself.

Perhaps someone with Debian experience is reading along and can do a pre-patched kernel pkg like I've done for Fedora ?

Regards,

Hans

linrunner commented 7 years ago

I've uploaded patched Ubuntu kernel packages here: http://download.linrunner.de/packages/

They won't work with Debian however, because kernel infrastructure is a bit different there.

ps. and yes, you may specify

SATA_LINKPWR_ON_AC=med_power_with_dipm
SATA_LINKPWR_ON_BAT=med_power_with_dipm

with any version of TLP :-).

jwrdegoede commented 7 years ago

Hi,

On 18-10-17 19:42, linrunner wrote:

> I've uploaded patched Ubuntu kernel packages here http://download.linrunner.de/packages/.
>
> They won't work with Debian however, because kernel infrastructure is a bit different there.

Cool, thank you!

It would be great if someone with one of the affected SSDs could test this kernel with the new med_power_with_dipm setting. This saves almost as much power as min_power, and if this turns out to be safe for more disks / SSDs it might make a better default going forward.

Regards,

Hans

UrsMetz commented 7 years ago

@jwrdegoede I just reread part of this issue and am asking myself whether the patch is only relevant (and thus the power saving only happens) for SSDs and not good ol' HDDs? I'm still using an HDD. In case this is also relevant for HDDs I might give your patch a try, but I can't promise I'll find the time to create a patched kernel and test it.

linrunner commented 7 years ago

The patch is relevant for HDD too.

jwrdegoede commented 7 years ago

Hi,

On 18-10-17 20:01, Urs Metz wrote:

> @jwrdegoede I just reread part of this issue and am asking myself whether the patch is only relevant (and thus the power saving only happens) for SSDs and not good ol' HDDs? I'm still using an HDD. In case this is also relevant for HDDs I might give your patch a try, but I can't promise I'll find the time to create a patched kernel and test it.

The power saving should be about the same on HDDs, and testing with HDDs is also good to have. Although the main thing I'm interested in from this specific GitHub issue is testing with the Crucial SSDs which were triggering the issue as originally described.

Regards,

Hans

halocaridina commented 7 years ago

All,

Thought it might be helpful to comment on this issue given my posts on it from Sept. 2014 above.

Currently have four Lenovo Thinkpads (models T440s (referenced in the Sept 2014 posts), X200s, X220 and X250) all running Arch Linux and TLP. All of them have Crucial MX300 SSDs of either 275GB or 525GB capacities on the most current firmware (M0CR060) and running SATA_LINKPWR_ON_BAT=min_power without any issues whatsoever.

The machines differ in age, BIOS vs. UEFI, etc., but they all use the same make and model of SSD, which suggests that (at least) the Crucial MX300 is not plagued by this issue. Perhaps the SSD technology has evolved in the last 3+ years to a point where this setting will no longer be a problem for SSDs (or at least some number of them)?

Cheers, Halocaridina

jwrdegoede commented 6 years ago

According to a bug report which I just received, this still seems to be happening on Crucial MX100 SSDs with my new med_power_with_dipm policy. So I'm going to add a LPM blacklist entry for this SSD to the kernel.

I notice that all reporters in this and other bugs about the MX100 have the 512GB model, or are not specifying their SSD's size. If you've seen this problem with a Crucial MX100 which is not 512GB, please let me know ASAP, as for now I plan to limit the blacklist entry to the 512GB model.

I would also appreciate the output of: "dmesg | grep Crucial" from machines where people have seen this problem.

khfeng commented 6 years ago

Hans, is the specific model CT500BX100SSD1?

Here's another bug report [1] about that particular model. I am also waiting for another user's feedback.

If you don't mind, can you let me send the patch, also help to review the patch? Thanks.

[1] https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1726930

jwrdegoede commented 6 years ago

Hi,

On 14-02-18 12:01, Kai-Heng Feng wrote:

> Hans, is the specific model CT500BX100SSD1?
>
> Here's another bug report [1] about that particular model. I am also waiting for another user's feedback.
>
> If you don't mind, can you let me send the patch, also help to review the patch? Thanks.
>
> [1] https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1726930

No, it is a Crucial_CT512MX100, so an MX100, not a BX100, which AFAIK are quite different models. The patch I'm preparing for it looks like this:

--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4530,6 +4530,11 @@ static const struct ata_blacklist_entry ata_device_blacklist [] = {
  { "PIONEER DVD-RW DVR-212D",    NULL,   ATA_HORKAGE_NOSETXFER },
  { "PIONEER DVD-RW DVR-216D",    NULL,   ATA_HORKAGE_NOSETXFER },
 
+ /* The 512GB version of the MX100 has both queued TRIM and LPM issues */
+ { "Crucial_CT512MX100*",        NULL,   ATA_HORKAGE_NO_NCQ_TRIM |
+                                         ATA_HORKAGE_ZERO_AFTER_TRIM |
+                                         ATA_HORKAGE_NOLPM, },
+
  /* devices that don't properly handle queued TRIM commands */
  { "Micron_M500_*",              NULL,   ATA_HORKAGE_NO_NCQ_TRIM |
                                          ATA_HORKAGE_ZERO_AFTER_TRIM, },

Note it would be good to get people to test with 4.14+ and med_power_with_dipm, at least for an MX300 I've reports of that fixing issues which are seen when using min_power.

Regards,

Hans
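
To check whether a given drive would fall under the proposed entry, the model string can be compared against the pattern from the patch. As far as I know libata treats these blacklist names as glob patterns, which a shell case statement approximates closely enough for a quick check. This is only a sketch: the function name matches_mx100_512 is hypothetical, and the full model string should be taken from your own dmesg output.

```shell
#!/bin/sh
# Return 0 if the given ATA model string matches the pattern used in the
# proposed blacklist entry ("Crucial_CT512MX100*"), 1 otherwise.
# The shell glob here only approximates the kernel's own matching.
matches_mx100_512() {
    case "$1" in
        Crucial_CT512MX100*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, matches_mx100_512 "CT500BX100SSD1" returns 1, in line with the remark above that the BX100 from the Launchpad report is a different model and is not covered by this entry.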

khfeng commented 6 years ago

On 14 Feb 2018, at 7:24 PM, Hans de Goede wrote:

> No it is a Crucial_CT512MX100, so a MX100 not a BX100, which AFAIK are quite different models. The patch I'm preparing for it looks like this:

Thanks. I’ll send a separate patch for the model in Launchpad bug.

> Note it would be good to get people to test with 4.14+ and med_power_with_dipm, at least for an MX300 I've reports of that fixing issues which are seen when using min_power.

Hmm, maybe let distro kernels use med_power_with_dipm as default through CONFIG_SATA_MOBILE_LPM_POLICY?

jwrdegoede commented 6 years ago

Hi,

On 14-02-18 20:32, Kai-Heng Feng wrote:

> Thanks. I’ll send a separate patch for the model in Launchpad bug.

Before you do so, please double check it is broken even with med_power_with_dipm…

>> Note it would be good to get people to test with 4.14+ and med_power_with_dipm, at least for an MX300 I've reports of that fixing issues which are seen when using min_power.

Erm, that 4.14+ above should be 4.15+, sorry.

> Hmm, maybe let distro kernels use med_power_with_dipm as default through CONFIG_SATA_MOBILE_LPM_POLICY?

Yes, that is the whole purpose of CONFIG_SATA_MOBILE_LPM_POLICY; note that I authored the patch adding that new Kconfig option :)

Regards,

Hans
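
To see what a distro kernel was built with, the option can be read from the kernel's config file. A minimal sketch, assuming the config is installed as /boot/config-$(uname -r), which is common but not universal; the function name mobile_lpm_policy is hypothetical. Per the follow-up comments in this thread, the value 3 corresponds to med_power_with_dipm; the meaning of the other numbers should be checked against the Kconfig help of the kernel in question.

```shell
#!/bin/sh
# Print the value of CONFIG_SATA_MOBILE_LPM_POLICY from a kernel config
# file, or "unset" if the option is absent or the file cannot be read.
# Usage: mobile_lpm_policy [CONFIG_FILE]
mobile_lpm_policy() {
    cfg="${1:-/boot/config-$(uname -r)}"
    val=$(sed -n 's/^CONFIG_SATA_MOBILE_LPM_POLICY=//p' "$cfg" 2>/dev/null)
    printf '%s\n' "${val:-unset}"
}
```

Calling mobile_lpm_policy without arguments inspects the running kernel's config, if it is present at the assumed path.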

khfeng commented 6 years ago

On 15 Feb 2018, at 5:05 AM, Hans de Goede wrote:

> Before you do so, please double check it is broken even with med_power_with_dipm…

Yes, I can confirm it. I wrote a new quirk that will fall back to med_power_with_dipm when min_power gets selected.

The user confirmed the policy in use was med_power_with_dipm [1], but the same issue happened.

> Yes that is the whole purpose of CONFIG_SATA_MOBILE_LPM_POLICY, note that I authored the patch adding that new Kconfig option :)

That’s good to know. Are you going to use default 3 (med_power_with_dipm) on Fedora?

[1] https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1726930/comments/30

Kai-Heng

jwrdegoede commented 6 years ago

Hi,

On 15-02-18 14:20, Kai-Heng Feng wrote:

>> Yes that is the whole purpose of CONFIG_SATA_MOBILE_LPM_POLICY, note that I authored the patch adding that new Kconfig option :)

> That’s good to know. Are you going to use default 3 (med_power_with_dipm) on Fedora?

Yes that is the plan.

Regards,

Hans

linrunner commented 4 years ago

All kernels should be patched by now.

linrunner / TLP

[SATA ALPM] ssd hard drive error due to file system remount in read only #84