pop-os / pop

A project for managing all Pop!_OS sources
https://system76.com/pop

problem with NVMe SSD Controller SM981/PM981/PM983 #775

Open ghost opened 4 years ago

ghost commented 4 years ago

Distribution (run cat /etc/os-release):

NAME="Pop!_OS"
VERSION="19.10"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Pop!_OS 19.10"
VERSION_ID="19.10"
HOME_URL="https://system76.com/pop"
SUPPORT_URL="http://support.system76.com"
BUG_REPORT_URL="https://github.com/pop-os/pop/issues"
PRIVACY_POLICY_URL="https://system76.com/privacy"
VERSION_CODENAME=eoan
UBUNTU_CODENAME=eoan
LOGO=distributor-logo-pop-os

Related Application and/or Package Version (run apt policy $PACKAGE_NAME):

NVMe SSD Controller SM981/PM981/PM983

Issue/Bug Description:

The laptop crashes after the SSD controller fails. kernel.log is flooded with error messages such as:

Dec  5 12:10:16 pop-os kernel: [ 8385.574748] pcieport 0000:00:1d.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Dec  5 12:10:16 pop-os kernel: [ 8385.574750] pcieport 0000:00:1d.0: AER:   device [8086:a330] error status/mask=00001000/00002000
Dec  5 12:10:16 pop-os kernel: [ 8385.574751] pcieport 0000:00:1d.0: AER:    [12] Timeout               
Dec  5 12:10:16 pop-os kernel: [ 8385.574756] nvme 0000:03:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Dec  5 12:10:16 pop-os kernel: [ 8385.574757] nvme 0000:03:00.0: AER:   device [144d:a808] error status/mask=00000001/0000e000
Dec  5 12:10:16 pop-os kernel: [ 8385.574759] nvme 0000:03:00.0: AER:    [ 0] RxErr                 
Dec  5 12:10:16 pop-os kernel: [ 8385.574859] nvme 0000:03:00.0: AER:   Error of this Agent is reported first
Dec  5 12:10:16 pop-os kernel: [ 8385.574880] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:03:00.0
Dec  5 12:10:16 pop-os kernel: [ 8385.574972] pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:03:00.0
Dec  5 12:10:16 pop-os kernel: [ 8385.574982] nvme 0000:03:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Dec  5 12:10:16 pop-os kernel: [ 8385.574985] nvme 0000:03:00.0: AER:   device [144d:a808] error status/mask=00000001/0000e000
Dec  5 12:10:16 pop-os kernel: [ 8385.574986] nvme 0000:03:00.0: AER:    [ 0] RxErr                 
Dec  5 12:10:16 pop-os kernel: [ 8385.575084] pcieport 0000:00:1d.0: AER: Multiple Corrected error received: 0000:03:00.0
Dec  5 12:10:16 pop-os kernel: [ 8385.575094] nvme 0000:03:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Dec  5 12:10:16 pop-os kernel: [ 8385.575097] nvme 0000:03:00.0: AER:   device [144d:a808] error status/mask=00000001/0000e000
Dec  5 12:10:16 pop-os kernel: [ 8385.575098] nvme 0000:03:00.0: AER:    [ 0] RxErr   
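
To get a rough picture of which device reports which error, log lines like the ones above can be tallied with a short script. This is a minimal sketch, not a kernel tool: the function names (`decode_bits`, `tally_corrected_errors`) are made up for illustration, the regular expressions assume the exact line layout shown in this log, and the bit table assumes the standard layout of the PCIe AER Correctable Error Status register with the short labels the kernel prints (for example `[12] Timeout`, `[ 0] RxErr`).

```python
import re
from collections import Counter

# Assumed bit layout of the AER Correctable Error Status register,
# labelled with the short names seen in kernel AER messages.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

ADDR = r"(?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
SEVERITY_RE = re.compile(ADDR + r": AER: PCIe Bus Error: severity=(?P<sev>\w+)")
STATUS_RE = re.compile(
    ADDR + r": AER:\s+device \[[0-9a-f]{4}:[0-9a-f]{4}\] "
    r"error status/mask=(?P<status>[0-9a-f]{8})/(?P<mask>[0-9a-f]{8})"
)


def decode_bits(value):
    """Return the names of all known bits set in a status or mask value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def tally_corrected_errors(lines):
    """Count corrected errors per (PCI address, error name)."""
    counts = Counter()
    last_severity = {}  # PCI address -> severity from its latest header line
    for line in lines:
        sev = SEVERITY_RE.search(line)
        if sev:
            last_severity[sev.group("addr")] = sev.group("sev")
            continue
        status = STATUS_RE.search(line)
        # Only decode status words that follow a "severity=Corrected" header,
        # because uncorrectable errors use a different bit table.
        if status and last_severity.get(status.group("addr")) == "Corrected":
            for name in decode_bits(int(status.group("status"), 16)):
                counts[(status.group("addr"), name)] += 1
    return counts
```

Feeding it the lines of the kernel log (for example `tally_corrected_errors(open(path))`, with `path` pointing at wherever the log lives on the system) gives a counter keyed by PCI address and error name. Applied to the masks above, `decode_bits(0x2000)` and `decode_bits(0xe000)` list which correctable error types the root port and the NVMe device have masked.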

Steps to reproduce (if you know):

No known steps; the laptop crashes at random.

Other Notes:

After digging a bit, I am fairly sure the problem is similar to this one: https://bugzilla.kernel.org/show_bug.cgi?id=195039 I also have a Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983. It dies at random. Are there any updates on the kernel side?

Full description: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1852479

ferror18 commented 3 years ago

I'm having a similar issue on a Threadripper system running Proxmox.

D13410N3 commented 2 years ago

I'm having a similar issue on Proxmox with one of my four Kingston NV1 NVMe drives, which sit in a cheap NVMe controller from AliExpress (bifurcation + software RAID 10). It doesn't affect my work, but the error count keeps increasing, which is annoying. I've heard a theory that it is caused by some unknown ATA commands, but I have four identical drives and only one of them shows the problem.
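
One way to watch the error count for a single drive without grepping logs is to read the per-device AER counters from sysfs. The sketch below assumes a kernel that exposes an `aer_dev_correctable` file under `/sys/bus/pci/devices/<address>/` (AER has to be enabled for the device) and assumes the file holds one `NAME COUNT` pair per line. The function name `read_aer_correctable` and its `sysfs_root` parameter are hypothetical, chosen only for this example.

```python
from pathlib import Path


def read_aer_correctable(device, sysfs_root="/sys/bus/pci/devices"):
    """Return {error name: count} from a device's aer_dev_correctable file.

    `device` is a PCI address such as "0000:03:00.0". The path layout and
    the "NAME COUNT" line format are assumptions about the running kernel.
    """
    path = Path(sysfs_root) / device / "aer_dev_correctable"
    counters = {}
    for line in path.read_text().splitlines():
        parts = line.split()
        # Skip anything that is not a simple "NAME COUNT" pair.
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters
```

Calling it for each of the four drives and comparing the results a few hours apart would show whether only one slot accumulates errors, which would point at that slot, cable, or drive rather than at the drive model in general.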