Data corruption: zeroes are sometimes written instead of the actual data

abbradar commented 3 months ago

What happened: When writing a file on a host, sometimes a zero-filled region may be written instead of the actual data.

What you expected to happen: No data corruption.

How to reproduce it (as minimally and precisely as possible):

On one host: pv -L 3k /dev/urandom > /mnt/jfs-partition/test-file On another host: while ! hexdump -C /mnt/jfs-partition/test-file | grep '00 00 00 00'; do sleep 1; done

After some time, the command on the second host will find a large cluster of zeroes (>1k) and stop. In the file, you see, for instance:

00033130  26 32 91 00 00 00 00 00  00 00 00 00 00 00 00 00  |&2..............|
00033140  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00036000  88 ff a0 0b c6 17 26 95  78 b7 e3 28 f5 35 8b 98  |......&.x..(.5..|

Anything else we need to know?

More details:

The network on the writing host is quite overloaded/unstable; I believe this may be related because the issue only occurs when the network is overloaded. No other (hardware, disk) failures are observed on the host;
The zero byte clusters are often, but not always, have 1k-divisible size;
EIO errors sometimes happen on the writing side (because of the bad networking), but they don't seem to correlate to the issues directly;
Zero clusters appear in the middle of the file, not at the beginning or end;
Compression and encryption are enabled.

Unfortunately, we have many irrelevant logs on the writing server because this is a production host. These ones may be relevant (the file inode is 2439005):

2024/07/30 07:53:45.891705 juicefs[4019926] <ERROR>: write inode:2439005 indx:0  input/output error [writer.go:211]
2024/07/30 07:53:52.457389 juicefs[4019926] <WARNING>: slow request: PUT chunks/20/20750/20750481_3_4194304 (req_id: "", err: Put "https://somehost/somedb/%2Fchunks%2F20%2F20750%2F20750481_3_4194304": write tcp 10.42.43.15:60050->100.118.102.36:443: use of closed network connection, cost: 59.994437087s) [cached_store.go:667]
[mysql] 2024/07/30 07:54:35 packets.go:37: read tcp 10.42.43.15:47762->10.0.8.236:3306: read: connection reset by peer
2024/07/30 07:54:35.379260 juicefs[4019926] <WARNING>: Upload chunks/20/20750/20750498_5_4194304: timeout after 1m0s: function timeout (try 1) [cached_store.go:407]
2024/07/30 07:54:39.800111 juicefs[4019926] <INFO>: slow operation: flush (2439005,17488,6D9AF143D1C50E5B) - input/output error <53.869292> [accesslog.go:83]

What else do we plan to try:

Writing from different hosts with similar networking to completely figure out hardware issues;
Checking the metadata to see if the zero clusters correspond to single chunks.

Environment:

JuiceFS version (use juicefs --version) or Hadoop Java SDK version: juicefs version 1.2.0+2024-06-18.873c47b922ba (both hosts)
Cloud provider or hardware configuration running JuiceFS: Bare-metal host on the writing side, Aliyun ECS on the reading side
OS (e.g cat /etc/os-release): NixOS 24.11 (Vicuna) (both hosts)
Kernel (e.g. uname -a): 6.1.90 (both hosts)
Object storage (cloud provider and region, or self maintained): Aliyun OSS
Metadata engine info (version, cloud provider managed or self maintained): Aliyun RDS, MySQL
Network connectivity (JuiceFS to metadata engine, JuiceFS to object storage): Private Aliyun networking on the reading side, public Internet on the writing side
Others:

jiefenghuang commented 3 months ago

It seems similar to one previous issue, it could be caused by concurrent chunk commit that lead to file length changes. Would it be normal reading again after a while?

abbradar commented 3 months ago

No, the file that was corrupted in this way stays the same. I just checked the one that I tested this on. The flow has been:

Run the commands above;
First, the reading one (hexdump) detects a zero region and stops;
In a minute, the writing one (pv) crashes with EIO (but I think I saw instances of this bug happening without the writing side crashing);
The written file is not touched for ~3 hours, and then (right now) I check if it still has the zero regions — it does.

jiefenghuang commented 3 months ago

@abbradar This is not a bug. JuiceFS provides close-to-open consistency, once a file is written and closed, it is guaranteed to view the written data in the following opens and reads from any client.

If one client is writing while another client is reading simultaneously, inconsistencies may occur. You should expect to obtain valid data through "read-after-flush", not "read-after-write".
If the client write failed due to object storage (or other) issues, this part of file is meaningless. You should not expect to read valid data from it.

abbradar commented 3 months ago

Additionally, if the client write failed due to object storage issues, the behavior at the point of write failure is undefined, the file may contain zero region.

Do you mean that if the object storage itself exhibits this behavior and may leave corrupted zero-filled objects, then this propagates to the FS level? I would have expected the region to be random garbage or for a read error to happen then because we use JuiceFS encryption on top; please correct me if I'm wrong.

BTW, does JuiceFS use the Content-MD5 header during uploads for S3-compatible storage?

EDIT:

JuiceFS provides close-to-open consistency, once a file is written and closed, it is guaranteed to view the written data in the following opens and reads from any client.

Sorry I didn't ask this at first, but just to be clear — do these inconsistencies only include synchronization (say, if we update different parts of the file, the order may not be preserved, or the file might not be updated at all until closed), or may lead to the client reading completely invalid data (zeroes or other garbage)?

abbradar commented 3 months ago

@jiefenghuang I just managed to reproduce the other behavior you mentioned. The setup was the same (one writing and another reading host), but I wrote a small script for the reading side to detect zero regions:

import time
import sys

pat = b"\0"

f = open("/mnt/jfs-partition/test-file", "rb")
pos = 0
while True:
    chunk = f.read(2048)
    if len(chunk) == 0:
        time.sleep(1)
    elif chunk.count(pat) == len(chunk):
        print(f"Pos {pos}: Zeroes chunk detected! Size: {len(chunk)}", flush=True)
    pos += len(chunk)

On the writing side, I ran pv -L 3k /dev/urandom > /mnt/jfs-partition/test-file again. I also ran juicefs info /mnt/jfs-partition/test-file in a loop at the reading side. Eventually I got:

Pos 38912: Zeroes chunk detected! Size: 2048
Pos 40960: Zeroes chunk detected! Size: 2048
Pos 43008: Zeroes chunk detected! Size: 2048
Pos 45056: Zeroes chunk detected! Size: 2048
Pos 47104: Zeroes chunk detected! Size: 2048                                                                                                          
Pos 49152: Zeroes chunk detected! Size: 2048
Pos 51200: Zeroes chunk detected! Size: 1331

And:

/mnt/jfs-partition/test-file :
  inode: 2439006
  files: 1
   dirs: 0
 length: 51.30 KiB (52531 Bytes)
   size: 52.00 KiB (53248 Bytes)
   path: /test-file
 objects:
+------------+--------------------------------------------+-------+--------+--------+
| chunkIndex |                 objectName                 |  size | offset | length |
+------------+--------------------------------------------+-------+--------+--------+
|          0 | jfs-partition/chunks/20/20038/20038936_0_21196 | 21196 |      0 |  21196 |
|          0 | jfs-partition/chunks/20/20834/20834060_0_15668 | 15668 |      0 |  15668 |
|          0 | jfs-partition/chunks/20/20834/20834062_0_15667 | 15667 |      0 |  15667 |
+------------+--------------------------------------------+-------+--------+--------+

Notice that zeroes begin at the chunk border (21196 + 15668 == 38912), and indeed, stretch the entire chunk.

But when I reopened the file later, no zero regions were there!

Can you point me to the known issue?

jiefenghuang commented 3 months ago

@jiefenghuang I just managed to reproduce the other behavior you mentioned. The setup was the same (one writing and another reading host), but I wrote a small script for the reading side to detect zero regions:

import time
import sys

pat = b"\0"

f = open("/mnt/jfs-partition/test-file", "rb")
pos = 0
while True:
    chunk = f.read(2048)
    if len(chunk) == 0:
        time.sleep(1)
    elif chunk.count(pat) == len(chunk):
        print(f"Pos {pos}: Zeroes chunk detected! Size: {len(chunk)}", flush=True)
    pos += len(chunk)

On the writing side, I ran pv -L 3k /dev/urandom > /mnt/jfs-partition/test-file again. I also ran juicefs info /mnt/jfs-partition/test-file in a loop at the reading side. Eventually I got:

Pos 38912: Zeroes chunk detected! Size: 2048
Pos 40960: Zeroes chunk detected! Size: 2048
Pos 43008: Zeroes chunk detected! Size: 2048
Pos 45056: Zeroes chunk detected! Size: 2048
Pos 47104: Zeroes chunk detected! Size: 2048                                                                                                          
Pos 49152: Zeroes chunk detected! Size: 2048
Pos 51200: Zeroes chunk detected! Size: 1331

And:

/mnt/jfs-partition/test-file :
  inode: 2439006
  files: 1
   dirs: 0
 length: 51.30 KiB (52531 Bytes)
   size: 52.00 KiB (53248 Bytes)
   path: /test-file
 objects:
+------------+--------------------------------------------+-------+--------+--------+
| chunkIndex |                 objectName                 |  size | offset | length |
+------------+--------------------------------------------+-------+--------+--------+
|          0 | jfs-partition/chunks/20/20038/20038936_0_21196 | 21196 |      0 |  21196 |
|          0 | jfs-partition/chunks/20/20834/20834060_0_15668 | 15668 |      0 |  15668 |
|          0 | jfs-partition/chunks/20/20834/20834062_0_15667 | 15667 |      0 |  15667 |
+------------+--------------------------------------------+-------+--------+--------+

Notice that zeroes begin at the chunk border (21196 + 15668 == 38912), and indeed, stretch the entire chunk.

But when I reopened the file later, no zero regions were there!

Can you point me to the known issue?

It is a question asked by a user in a WeCom user group, and has not been posted on GitHub Issue.

@abbradar This is not a bug. JuiceFS provides close-to-open consistency, once a file is written and closed, it is guaranteed to view the written data in the following opens and reads from any client.

If one client is writing while another client is reading simultaneously, inconsistencies may occur. You should expect to obtain valid data through "read-after-flush", not "read-after-write".

If the client write failed due to object storage (or other) issues, this part of file is meaningless. You should not expect to read valid data from it.

I've made some modifications, it may have resolved your question.

abbradar commented 3 months ago

@jiefenghuang Thanks; I have several questions about these.

For (1), with no write failures, does this property mean:

Parts of the file are not synchronized (e.g., the order of random writes is not preserved, and writes are not guaranteed to be visible) before close();
Or the reader might read completely invalid data?

We use JuiceFS for writing append-only binary logs, and read them sequentially on another host in real time. We are not concerned with not having a time guarantee, but the possibility of reading invalid data breaks this pattern.

(2) is a problem for us. I would have expected JuiceFS to behave like a checksummed+journaled file system here—i.e., exclude the possibility of corrupted data, so the worst case becomes either "reading gives you EIO" or "write happened partially/did not happen". Can you clarify a bit more on how this works internally — what sequence of metadata+object operations can result in reading zero chunks, and what happens later when they get magically fixed?

davies commented 2 months ago

We could improve this case in next 1.3 release

abbradar commented 2 months ago

Thanks! We have lots of use cases for "one host appends, another reads", so we are motivated to help with this.

juicedata / juicefs

Data corruption: zeroes are sometimes written instead of the actual data #5038