juicedata / juicefs

JuiceFS is a distributed POSIX file system built on top of Redis and S3.
https://juicefs.com
Apache License 2.0

Data corruption: zeroes are sometimes written instead of the actual data #5038

Open abbradar opened 3 months ago

abbradar commented 3 months ago

What happened: When writing a file on a host, sometimes a zero-filled region may be written instead of the actual data.

What you expected to happen: No data corruption.

How to reproduce it (as minimally and precisely as possible):

On one host:

pv -L 3k /dev/urandom > /mnt/jfs-partition/test-file

On another host:

while ! hexdump -C /mnt/jfs-partition/test-file | grep '00 00 00 00'; do sleep 1; done

After some time, the command on the second host finds a large run of zeroes (>1 KiB) and stops. In the file you then see, for instance:

00033130  26 32 91 00 00 00 00 00  00 00 00 00 00 00 00 00  |&2..............|
00033140  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00036000  88 ff a0 0b c6 17 26 95  78 b7 e3 28 f5 35 8b 98  |......&.x..(.5..|

Anything else we need to know?

More details:

Unfortunately, we have many irrelevant logs on the writing server because this is a production host. The following lines may be relevant (the file inode is 2439005):

2024/07/30 07:53:45.891705 juicefs[4019926] <ERROR>: write inode:2439005 indx:0  input/output error [writer.go:211]
2024/07/30 07:53:52.457389 juicefs[4019926] <WARNING>: slow request: PUT chunks/20/20750/20750481_3_4194304 (req_id: "", err: Put "https://somehost/somedb/%2Fchunks%2F20%2F20750%2F20750481_3_4194304": write tcp 10.42.43.15:60050->100.118.102.36:443: use of closed network connection, cost: 59.994437087s) [cached_store.go:667]
[mysql] 2024/07/30 07:54:35 packets.go:37: read tcp 10.42.43.15:47762->10.0.8.236:3306: read: connection reset by peer
2024/07/30 07:54:35.379260 juicefs[4019926] <WARNING>: Upload chunks/20/20750/20750498_5_4194304: timeout after 1m0s: function timeout (try 1) [cached_store.go:407]
2024/07/30 07:54:39.800111 juicefs[4019926] <INFO>: slow operation: flush (2439005,17488,6D9AF143D1C50E5B) - input/output error <53.869292> [accesslog.go:83]

What else we plan to try:

Environment:

jiefenghuang commented 3 months ago

It seems similar to a previous issue; it could be caused by a concurrent chunk commit that leads to file length changes. Does it read back normally again after a while?

abbradar commented 3 months ago

No, the file that was corrupted in this way stays the same. I just checked the one that I tested this on. The flow has been:

jiefenghuang commented 3 months ago

@abbradar This is not a bug. JuiceFS provides close-to-open consistency: once a file is written and closed, subsequent opens and reads from any client are guaranteed to see the written data.

  1. If one client is writing while another client is reading simultaneously, inconsistencies may occur. You should expect to obtain valid data through "read-after-flush", not "read-after-write" (see the sketch after this list).
  2. If a client write failed due to object storage (or other) issues, that part of the file is meaningless. You should not expect to read valid data from it.
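
For illustration only (not from the thread), a minimal Python sketch of the "read-after-flush" pattern in point 1, assuming a JuiceFS mount at /mnt/jfs-partition and a hypothetical file name:

import os

# Writer side: flush and fsync before expecting a reader on another host
# to see the data. Bytes written but not yet flushed may not be visible,
# or may read back inconsistently, on another client.
with open("/mnt/jfs-partition/appended-file", "ab") as f:
    f.write(b"record-1\n")
    f.flush()             # move Python's own buffer into the kernel
    os.fsync(f.fileno())  # ask the file system to persist what was written
    # Only after this point should a reader on another host rely on seeing "record-1".
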
abbradar commented 3 months ago

Additionally, if a client write failed due to object storage issues, the behavior at the point of the write failure is undefined; the file may contain a zero-filled region.

Do you mean that if the object storage itself exhibits this behavior and may leave corrupted zero-filled objects, then this propagates to the FS level? I would have expected the region to be random garbage, or the read to fail outright, because we use JuiceFS encryption on top; please correct me if I'm wrong.

BTW, does JuiceFS use the Content-MD5 header during uploads for S3-compatible storage?
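
(For reference only, and not a claim about JuiceFS internals: on an S3-compatible PUT, Content-MD5 is the base64-encoded MD5 of the request body. A hypothetical boto3 illustration, reusing a bucket and object name from the logs purely as placeholders:)

import base64
import hashlib

import boto3

# The server verifies the body against the digest and rejects the upload
# on mismatch, which would turn silent corruption into an explicit error.
s3 = boto3.client("s3")
data = b"chunk payload"
md5_b64 = base64.b64encode(hashlib.md5(data).digest()).decode()
s3.put_object(
    Bucket="somedb",
    Key="chunks/20/20750/20750481_3_4194304",
    Body=data,
    ContentMD5=md5_b64,
)
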

EDIT:

JuiceFS provides close-to-open consistency: once a file is written and closed, subsequent opens and reads from any client are guaranteed to see the written data.

Sorry I didn't ask this at first, but just to be clear: do these inconsistencies only involve synchronization (say, if we update different parts of the file, the order may not be preserved, or the file might not appear updated at all until it is closed), or can they lead to the client reading completely invalid data (zeroes or other garbage)?

abbradar commented 3 months ago

@jiefenghuang I just managed to reproduce the other behavior you mentioned. The setup was the same (one writing and another reading host), but I wrote a small script for the reading side to detect zero regions:

import time

# Read the file sequentially in 2048-byte chunks and report any chunk
# that consists entirely of zero bytes.
pat = b"\0"

f = open("/mnt/jfs-partition/test-file", "rb")
pos = 0
while True:
    chunk = f.read(2048)
    if len(chunk) == 0:
        # Reached the current end of the file; wait for the writer to append more.
        time.sleep(1)
    elif chunk.count(pat) == len(chunk):
        # Every byte in this chunk is zero.
        print(f"Pos {pos}: Zeroes chunk detected! Size: {len(chunk)}", flush=True)
    pos += len(chunk)

On the writing side, I ran pv -L 3k /dev/urandom > /mnt/jfs-partition/test-file again. I also ran juicefs info /mnt/jfs-partition/test-file in a loop on the reading side. Eventually I got:

Pos 38912: Zeroes chunk detected! Size: 2048
Pos 40960: Zeroes chunk detected! Size: 2048
Pos 43008: Zeroes chunk detected! Size: 2048
Pos 45056: Zeroes chunk detected! Size: 2048
Pos 47104: Zeroes chunk detected! Size: 2048                                                                                                          
Pos 49152: Zeroes chunk detected! Size: 2048
Pos 51200: Zeroes chunk detected! Size: 1331

And:

/mnt/jfs-partition/test-file :
  inode: 2439006
  files: 1
   dirs: 0
 length: 51.30 KiB (52531 Bytes)
   size: 52.00 KiB (53248 Bytes)
   path: /test-file
 objects:
+------------+------------------------------------------------+-------+--------+--------+
| chunkIndex |                   objectName                   |  size | offset | length |
+------------+------------------------------------------------+-------+--------+--------+
|          0 | jfs-partition/chunks/20/20038/20038936_0_21196 | 21196 |      0 |  21196 |
|          0 | jfs-partition/chunks/20/20834/20834060_0_15668 | 15668 |      0 |  15668 |
|          0 | jfs-partition/chunks/20/20834/20834062_0_15667 | 15667 |      0 |  15667 |
+------------+------------------------------------------------+-------+--------+--------+

Notice that the zeroes begin right after the boundary between the first two objects (21196 + 15668 == 36864) and stretch to the end of the last object.
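
(A quick check of the numbers; the script reads in 2048-byte steps, so the first reported position is rounded to a 2048-byte boundary:)

boundary = 21196 + 15668           # 36864, end of the second object (= 18 * 2048)
first_zero_read = 38912            # first position printed by the script
print(first_zero_read - boundary)  # 2048, i.e. one read past the boundary
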

But when I reopened the file later, no zero regions were there!

Can you point me to the known issue?

jiefenghuang commented 3 months ago

Can you point me to the known issue?

It was a question asked by a user in a WeCom user group; it has not been posted as a GitHub issue.

@abbradar This is not a bug. JuiceFS provides close-to-open consistency: once a file is written and closed, subsequent opens and reads from any client are guaranteed to see the written data.

  1. If one client is writing while another client is reading simultaneously, inconsistencies may occur. You should expect to obtain valid data through "read-after-flush", not "read-after-write".
  2. If a client write failed due to object storage (or other) issues, that part of the file is meaningless. You should not expect to read valid data from it.

I've made some modifications above; they may resolve your question.

abbradar commented 3 months ago

@jiefenghuang Thanks; I have several questions about these.

For (1), with no write failures, does this property mean:

We use JuiceFS for writing append-only binary logs and read them sequentially on another host in real time. We are not concerned about the lack of a timing guarantee, but the possibility of reading invalid data breaks this pattern.
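
As an illustration of that pattern, a minimal sketch of our reading side (the path and the handler are hypothetical):

import time

def process(record: bytes) -> None:
    # Hypothetical handler for the received log bytes.
    print(f"got {len(record)} bytes")

# Follow the growing file sequentially, sleeping at the current end of file.
# The pattern assumes that any bytes returned by read() are valid, which is
# exactly what breaks if zero-filled regions can be observed mid-write.
with open("/mnt/jfs-partition/binlog", "rb") as f:
    while True:
        record = f.read(4096)
        if not record:
            time.sleep(1)  # writer has not appended (or flushed) more data yet
            continue
        process(record)
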

(2) is a problem for us. I would have expected JuiceFS to behave like a checksummed, journaled file system here, i.e. to exclude the possibility of corrupted data, so that the worst case is either "reading gives you EIO" or "the write happened partially or did not happen at all". Can you explain a bit more how this works internally: what sequence of metadata and object operations can result in reading zero-filled chunks, and what happens later when they get magically fixed?

davies commented 2 months ago

We could improve this case in the next 1.3 release.

abbradar commented 2 months ago

Thanks! We have lots of use cases for "one host appends, another reads", so we are motivated to help with this.