data corruption in zram with linux-2.6.37 vanilla kernel

GoogleCodeExporter commented 9 years ago

maybe I used the wrong thread at first (#82), so here is a separate one:

the attached code will show data corruption in the zram device:

after a few cycles you will get e.g.:
cycle 8:
536184  tmpmnt
62398464
tmpmnt/tmpfile zram0mnt/f differ: char 4097, line 64
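
A side note on reading that last line: cmp counts bytes from 1, so "char 4097" is the very first byte after a 4096-byte boundary. A minimal sketch for mapping such an offset to a page index follows; the function name is hypothetical and the 4096-byte page size is an assumption (the usual x86 value), not something stated in the report.

```shell
#!/bin/sh
# Hypothetical helper: convert a 1-based cmp byte offset into a 0-based
# page index plus the byte offset inside that page.
# The default page size of 4096 is an assumption (typical for x86).
page_of_offset() {
    offset=$1
    page_size=${2:-4096}
    echo "$(( (offset - 1) / page_size )) $(( (offset - 1) % page_size ))"
}

page_of_offset 4097   # prints "1 0": first byte of the second page
```

Under that assumption the first page compared equal and the mismatch starts exactly at the beginning of the second one.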

Warning: this code may lead to OOM conditions and therefore crash your system.

I also tried hg pull from 2011/01/25 and copied the following files into the 
kernel tree (in order to have a static (non-modular) version):

zram_drv.c
zram_sysfs.c
sub-projects/allocators/xvmalloc-kmod/xvmalloc.c

The problem occurs with and without CONFIG_HIGHMEM4G

hth.

Original issue reported on code.google.com by fadb24bb...@drewag.de on 26 Jan 2011 at 12:35

Attachments:

zramtest.sh

GoogleCodeExporter commented 9 years ago

Can you see messages like this in the log (/var/log/messages)?
 "zram: Error allocating memory for compressed page: xxx"

If zram fails to allocate memory for incoming pages, the write fails and you 
will get a data mismatch as in your test. It's NOT data corruption.

Anyway, please upload your log file so we can look into this further.
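
One way to check for that message is sketched below. The message text is taken from the comment above; the function name is hypothetical, and the default log path is an assumption, since many systems log kernel messages elsewhere (or only via dmesg).

```shell
#!/bin/sh
# Hypothetical helper: count zram allocation-failure lines in a log file.
# The default path /var/log/messages is an assumption; pass another file
# (for example a saved copy of the dmesg output) as the first argument.
zram_alloc_errors() {
    logfile="${1:-/var/log/messages}"
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches, so "|| true" keeps callers running under "set -e".
    grep -c 'zram: Error allocating memory for compressed page' "$logfile" || true
}
```

To check the kernel ring buffer instead, save it first with "dmesg > dmesg.txt" and run "zram_alloc_errors dmesg.txt".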

Original comment by nitingupta910@gmail.com on 27 Jan 2011 at 3:35

GoogleCodeExporter commented 9 years ago

There is no such message: not in the log, not on the console, and not in the 
dmesg output.
Even if there were, I would expect an error from the block layer rather than 
silent data corruption. Applications don't read logfiles :-).
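
To illustrate what "an error from the block layer" means for a script: a failed write should show up as a non-zero exit status that the caller can act on. This is a minimal sketch with a hypothetical helper name; it assumes a dd that supports conv=fsync (GNU coreutils and busybox do) and is not taken from the attached test script.

```shell
#!/bin/sh
# Hypothetical helper: copy a file and fail loudly if the data could not be
# written. conv=fsync makes dd call fsync() on the output before exiting, so
# write errors that are reported late still end up in the exit status.
checked_copy() {
    src=$1
    dst=$2
    if ! dd if="$src" of="$dst" bs=1M conv=fsync 2>/dev/null; then
        echo "write to $dst failed" >&2
        return 1
    fi
}
```

With a helper like this, a failed page allocation that the driver reports as an I/O error would stop the test immediately instead of surfacing later as a cmp mismatch.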

Btw., some history: when 2.6.37 came out with the "real" implementation of 
zram, I made a filesystem on the zram device and tried to compile a kernel in 
it (always a good stress test :-). The system had 2G of RAM, and there were no 
OOM conditions, but the compiler failed with an "impossible" error. To track 
down the problem I wrote this script. Another version of the script tries the 
same without a filesystem, but it tends to crash the system before showing the 
error. Anyway, I attach it here.

regards

Original comment by fadb24bb...@drewag.de on 28 Jan 2011 at 9:40

Attachments:

zramtest_nofs.sh

GoogleCodeExporter commented 9 years ago

I just compiled a kernel on zram with a disksize of 4G: no problems at all. 
With a 2G disksize, I got a "No space left on device" error and no zram memory 
allocation error in the logs (which is a good thing). So, maybe you just ran 
out of disk space when compiling the kernel on zram?

I also tried a random-data test like the one in your script. No problems again:

$ openssl rand -base64 "$((1*1024*1024*1024))" > ~/temp/rand.orig
$ cp ~/temp/rand.orig ./rand  # copied to mounted /dev/zram0 of 4G disksize

$ md5sum ~/temp/rand.orig # version on disk
aaed1e376f3a9332fd3ad5ce07f19d37  /home/ngupta/temp/rand.orig

$ md5sum rand # version on zram
aaed1e376f3a9332fd3ad5ce07f19d37  rand
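
The manual comparison above can be wrapped in a small function so it is easy to repeat in a loop. This is only a sketch with a hypothetical function name; it assumes md5sum from coreutils is available, as in the transcript.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) only if both files have the same MD5.
# Reading via stdin makes md5sum print "<hash>  -" for both files, so the
# two outputs can be compared as plain strings.
same_md5() {
    a=$(md5sum < "$1") || return 2
    b=$(md5sum < "$2") || return 2
    [ "$a" = "$b" ]
}
```

For example: same_md5 ~/temp/rand.orig ./rand && echo "copy matches". An exit status of 2 means one of the files could not be read.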

Original comment by nitingupta910@gmail.com on 2 Feb 2011 at 1:49

GoogleCodeExporter commented 9 years ago

Ran your script as-is, still no problems found:

3468316672
0       tmpmnt
0
             total       used       free     shared    buffers     cached
Mem:      16400548    8236912    8163636          0     198328    6418696
-/+ buffers/cache:    1619888   14780660
Swap:     18481148          0   18481148
cycle 0:
59636   tmpmnt
60940288
cmprc is cmp: EOF on tmpmnt/tmpfile
             total       used       free     shared    buffers     cached
Mem:      16400548    8357396    8043152          0     198336    6478360
-/+ buffers/cache:    1680700   14719848
Swap:     18481148          0   18481148
cycle 1:
119272  tmpmnt
60940288
cmprc is cmp: EOF on tmpmnt/tmpfile
             total       used       free     shared    buffers     cached
Mem:      16400548    8416540    7984008          0     198336    6537604
-/+ buffers/cache:    1680600   14719948
Swap:     18481148          0   18481148
cycle 2:
178908  tmpmnt

Original comment by nitingupta910@gmail.com on 2 Feb 2011 at 1:54

GoogleCodeExporter commented 9 years ago

Very strange. Of course your mileage may vary depending on memory and the 
parameters in the script. Might I have missed some update? In order to get 
(hopefully) really reproducible conditions I made a qemu image. Can you please 
unpack it and run it with
"qemu -m 384 -hda hdaz". You will find the kernel config in /boot/... . The 
image is also mountable with "mount -ro loop,offset=32256 hdaz"
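
About the offset in that mount command: 32256 equals 63 * 512, which matches a first partition starting at sector 63 with 512-byte sectors (the traditional DOS layout). That reading is an assumption about this image; "fdisk -l hdaz" would show the actual start sector. A minimal sketch of the arithmetic, with a hypothetical function name:

```shell
#!/bin/sh
# Hypothetical helper: byte offset of a partition inside a raw disk image.
# The default sector size of 512 bytes is an assumption; pass another value
# as the second argument if the image uses a different one.
partition_offset() {
    start_sector=$1
    sector_size=${2:-512}
    echo $(( start_sector * sector_size ))
}

partition_offset 63   # prints 32256
```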

regards

Hmmm, upload is restricted, so please collect all the parts and run
   "cat xaa xab xac xad | gunzip >hdaz"
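
Since the image arrives in several pieces, it may be worth verifying the gzip stream before unpacking. The following is a sketch with a hypothetical function name; it relies only on gzip -t, which checks the integrity of a compressed stream without writing any output.

```shell
#!/bin/sh
# Hypothetical helper: join split parts of a gzip file and unpack them.
# Usage: reassemble OUTPUT PART...
# The parts must be given in their original order (xaa, xab, ...).
reassemble() {
    out=$1
    shift
    # First pass: test the concatenated stream; stop if it is incomplete
    # or damaged, so no partial image is written.
    cat "$@" | gzip -t || return 1
    # Second pass: actually unpack.
    cat "$@" | gunzip > "$out"
}
```

For this thread that would be: reassemble hdaz xaa xab xac xad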

Original comment by fadb24bb...@drewag.de on 3 Feb 2011 at 1:42

Attachments:

xaa

GoogleCodeExporter commented 9 years ago

part #2

Original comment by fadb24bb...@drewag.de on 3 Feb 2011 at 1:43

Attachments:

xab

GoogleCodeExporter commented 9 years ago

part #3

Original comment by fadb24bb...@drewag.de on 3 Feb 2011 at 1:44

Attachments:

xac

GoogleCodeExporter commented 9 years ago

finally: part #4

Original comment by fadb24bb...@drewag.de on 3 Feb 2011 at 1:45

Attachments:

xad

GoogleCodeExporter commented 9 years ago

Thanks for the VM image. I tested this on another 32-bit (Fedora) VM and 
strangely enough it happens consistently on any 32-bit system and NOT on 64-bit.

Original comment by nitingupta910@gmail.com on 4 Feb 2011 at 6:14

GoogleCodeExporter commented 9 years ago

Found a bug which was causing read/write from/to incorrect sectors. Can you try 
the patch attached? (all tests now pass on my side)

Original comment by nitingupta910@gmail.com on 5 Feb 2011 at 11:54

Attachments:

zram_fix_issue_83.patch

GoogleCodeExporter commented 9 years ago

I have committed this change to the repository, and gregkh promised it would be 
included in 2.6.38 and probably in a maintenance release of 2.6.37 too.

Please reopen if you still hit this issue.

Original comment by nitingupta910@gmail.com on 8 Feb 2011 at 1:51

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

I confirm it is working. The script does not fail and the kernel compiles 
successfully :-).
Thanks for your effort.

Original comment by fadb24bb...@drewag.de on 8 Feb 2011 at 3:25

mfrw / compcache

data corruption in zram with linux-2.6.37 vanilla kernel #83