Closed bimcas closed 3 years ago
Hi!
unexpected EOF
is broken file download. WAL-G retries files few times with lower parallelism. Butif the same file failes again and again - WAL-G will give up.
Hi
Hmm ... It's strange. I tried manually upload/download archives (like part_001.tar.lz4 or similar) to/from S3 storage. Everything works perfectly. The connection is stable and the speed is good enough, but the restoration from backup is failing with error messages:
DecryptAndDecompressTar: lz4 decompress failed. Is archive encrypted ?: DecompressLz4: lz4 write failed: unexpected EOF
extractOne: Interpret failed: Interpret: copy failed: unexpected EOF
Failed to fetch backup: failed to extract files: part_001.tar.lz4
Also, I tried to use different S3 storages (different servers), but the restoration from backup is almost failing. No error messages during the creation of backups and uploads them to S3 storage. Problems occur only with the restoration from backup . I'm rejecting the idea that something is wrong with S3 storage connectivity, stability, or file downloading issue. Manual downloads are fast, always successful and downloaded files from S3 aren't corrupted.
Maybe something wrong could be:
envdir "${ENV_DIR}" wal-g backup-push "${PGDATA}"
envdir "${ENV_DIR}" backup-fetch "${PGDATA}" LATEST
Now I'm using wal-e. I think to start use wal-g, but those issues stop my idea to do it. How to solve this problem? What could you recommend?
Decompression occurs on the fly right after receiving bytes from the network. No extra libs are necessary. It's strange that you encounter an error with exactly first file of a backup. Maybe there some kind of privileges issues? You can enable debug log by setting WALG_LOG_LEVEL=DEBUG.
I get the same error. I'm using Backblaze's S3 API and the first two parts are always failing to decompress.
Might be a duplicate of https://github.com/wal-g/wal-g/issues/449 that contains an interesting finding https://github.com/wal-g/wal-g/issues/449#issuecomment-607828734
I also have the same problem running v0.2.19
, except I use brotli
INFO: 2021/05/06 21:53:42.195928 Finished decompression of part_001.tar.br
ERROR: 2021/05/06 21:53:42.196046 part_001.tar.br DecryptAndDecompressTar: br decompress failed. Is archive encrypted?: unexpected EOF
INFO: 2021/05/06 21:53:42.197454 Finished extraction of part_001.tar.br
ERROR: 2021/05/06 21:53:42.197563 Extraction error in part_001.tar.br: extractOne: Interpret failed: Interpret: copy failed: unexpected EOF
I'm using Digital Ocean droplet with attached volume (https://docs.digitalocean.com/products/volumes/)
Interesting thing here is that it fails when I try to restore to an attached volume, but it goes smoothly if I restore to /tmp
On the Volumes product page they mention that they are using Ceph. Could it be that Ceph is a source of this problem?
Disk write speed is different, almost x2 times difference for me:
> dd if=/dev/zero of=/mnt/data_db_01_master/tmp/testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 3.82629 s, 281 MB/s
> dd if=/dev/zero of=/root/testfile bs=1G count=1 oflag=direct
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.14976 s, 499 MB/s
I tried WALG_DOWNLOAD_CONCURRENCY=1
and results are the same.
Maybe tcpdump will be helpful:
when it fails to restore (RST
is sent by Object Storage)
and when it succeeds
Looks like connection is closed by Object Storage?
Also interesting finding:
TCP Zero Window
is sent by the client, and maybe that's why there is an RST
from DO S3 compatible storage coming after
Normal disk (I have to mention, that part_001.tar.br is only 9.9 Mb):
INFO: 2021/05/07 21:28:55.432923 LATEST backup is: 'base_00000001000001C700000011_D_00000001000001C200000059'
INFO: 2021/05/07 21:28:55.596283 Delta from base_00000001000001C200000059_D_00000001000001BD00000092 at LSN 1c259000028
INFO: 2021/05/07 21:28:55.660371 Delta from base_00000001000001BD00000092_D_00000001000001B8000000B6 at LSN 1bd92000028
INFO: 2021/05/07 21:28:55.743961 Delta from base_00000001000001B8000000B6_D_00000001000001B3000000E9 at LSN 1b8b6000028
INFO: 2021/05/07 21:28:55.805089 Delta from base_00000001000001B3000000E9 at LSN 1b3e9000028
INFO: 2021/05/07 21:29:08.465294 Finished extraction of part_001.tar.br
INFO: 2021/05/07 21:29:08.465639 Finished decompression of part_001.tar.br
INFO: 2021/05/07 21:29:12.200141 Finished decompression of part_002.tar.br
INFO: 2021/05/07 21:29:12.200174 Finished extraction of part_002.tar.br
INFO: 2021/05/07 21:29:12.218050 Finished extraction of part_004.tar.br
INFO: 2021/05/07 21:29:12.218177 Finished decompression of part_004.tar.br
INFO: 2021/05/07 21:29:12.238381 Finished extraction of pg_control.tar.br
INFO: 2021/05/07 21:29:12.238399
Ceph disk
INFO: 2021/05/07 21:29:30.196952 LATEST backup is: 'base_00000001000001C700000011_D_00000001000001C200000059'
INFO: 2021/05/07 21:29:30.359867 Delta from base_00000001000001C200000059_D_00000001000001BD00000092 at LSN 1c259000028
INFO: 2021/05/07 21:29:30.447245 Delta from base_00000001000001BD00000092_D_00000001000001B8000000B6 at LSN 1bd92000028
INFO: 2021/05/07 21:29:30.534594 Delta from base_00000001000001B8000000B6_D_00000001000001B3000000E9 at LSN 1b8b6000028
INFO: 2021/05/07 21:29:30.613034 Delta from base_00000001000001B3000000E9 at LSN 1b3e9000028
INFO: 2021/05/07 21:30:12.745096 Finished decompression of part_001.tar.br
ERROR: 2021/05/07 21:30:12.745133 part_001.tar.br DecryptAndDecompressTar: br decompress failed. Is archive encrypted?: unexpected EOF
INFO: 2021/05/07 21:30:12.746438 Finished extraction of part_001.tar.br
ERROR: 2021/05/07 21:30:12.746463 Extraction error in part_001.tar.br: extractOne: Interpret failed: Interpret: copy failed: unexpected EOF
INFO: 2021/05/07 21:30:28.290893 Finished decompression of part_002.tar.br
INFO: 2021/05/07 21:30:28.290924 Finished extraction of part_002.tar.br
INFO: 2021/05/07 21:30:28.341604 Finished extraction of part_004.tar.br
INFO: 2021/05/07 21:30:28.341634 Finished decompression of part_004.tar.br
INFO: 2021/05/07 21:32:25.764028 Finished decompression of part_001.tar.br
ERROR: 2021/05/07 21:32:25.764073 part_001.tar.br DecryptAndDecompressTar: br decompress failed. Is archive encrypted?: unexpected EOF
INFO: 2021/05/07 21:32:25.766421 Finished extraction of part_001.tar.br
ERROR: 2021/05/07 21:32:25.766474 Extraction error in part_001.tar.br: extractOne: Interpret failed: Interpret: copy failed: unexpected EOF
ERROR: 2021/05/07 21:32:25.766517 Failed to fetch backup: failed to extract files:
part_001.tar.br
I have a feeling that it might be something about Stream which is taken all the way from s3 client to decompressor, and how buffers work in Ceph and how decompressed files are written to disk.
Wondering if saving to temporary file instead of stream processing right away will change the situation..?
A bit more information:
If I generate some load (writing to disk) with fio
fio --filename=/mnt/volume_lon1_01/fio.test --direct=1 --rw=randwrite --ioengine=libaio --bs=4K --iodepth=128 --name=rand-w --runtime=120s --size=10000000000
Then backup-fetch will fail with rather long time waiting to get part_001.tar.br
wal-g backup-fetch bak LATEST --config /var/lib/postgresql/.walg.json
INFO: 2021/05/09 08:46:20.822339 LATEST backup is: 'base_00000001000001D00000003E'
INFO: 2021/05/09 08:47:50.093221 Finished decompression of part_001.tar.br
ERROR: 2021/05/09 08:47:50.093262 part_001.tar.br DecryptAndDecompressTar: br decompress failed. Is archive encrypted?: unexpected EOF
INFO: 2021/05/09 08:47:50.093979 Finished extraction of part_001.tar.br
ERROR: 2021/05/09 08:47:50.093998 Extraction error in part_001.tar.br: extractOne: Interpret failed: Interpret: copy failed: unexpected EOF
Without write heavy operation I can still restore to /mnt/volume_lon1_01, but even couple apps writing log file to that disk will affect the process...
But if I do the same on another disk (not an attached block storage device), then everything works:
wal-g backup-fetch /tmp/bak LATEST --config /var/lib/postgresql/.walg.json
INFO: 2021/05/09 08:50:20.756232 LATEST backup is: 'base_00000001000001D00000003E'
INFO: 2021/05/09 08:50:37.397772 Finished extraction of part_001.tar.br
INFO: 2021/05/09 08:50:37.397971 Finished decompression of part_001.tar.br
INFO: 2021/05/09 08:50:43.933277 Finished decompression of part_002.tar.br
INFO: 2021/05/09 08:50:43.933305 Finished extraction of part_002.tar.br
INFO: 2021/05/09 08:50:43.964077 Finished decompression of part_004.tar.br
INFO: 2021/05/09 08:50:43.964106 Finished extraction of part_004.tar.br
INFO: 2021/05/09 08:50:43.997018 Finished decompression of pg_control.tar.br
INFO: 2021/05/09 08:50:43.997035 Finished extraction of pg_control.tar.br
INFO: 2021/05/09 08:50:43.997143
Backup extraction complete.
root@debug-lab:/mnt/volume_lon1_01# df
Filesystem 1K-blocks Used Available Use% Mounted on
udev 4071096 0 4071096 0% /dev
tmpfs 816800 644 816156 1% /run
/dev/vda1 25226960 13098068 12112508 52% /
tmpfs 4083992 0 4083992 0% /dev/shm
tmpfs 5120 0 5120 0% /run/lock
tmpfs 4083992 0 4083992 0% /sys/fs/cgroup
/dev/sda 103900088 29595500 69045324 31% /mnt/volume_lon1_01
/dev/vda15 106858 6170 100688 6% /boot/efi
tmpfs 816796 0 816796 0% /run/user/0
But at the same time I can use https://github.com/s3tools/s3cmd and download the file without any issues.
@x4m I'm wondering if that can make any difference?
Decompression occurs on the fly right after receiving bytes from the network.
Maybe "under" load the buffer size is changed and amount of bytes fetched is not enough to decompress contents? What if we download contents to a tempFile first (https://golang.org/pkg/io/ioutil/#example_TempFile) and then try decompress/extract?
I did a quick test and simply dumped readCloser
stream into file.
https://github.com/wal-g/wal-g/blob/0261341f03bafb48960dbb5835d2b921cbb92b14/internal/extract.go#L102-L103
Files are there and were created without any problems (no EOF in 10 tries).
Hi,
My DB server: CentOS 7.7 PostgreSQL 12.5
I'm using WAL-G v0.2.19 (Latest). Backups and WALs go to S3 storage.
It's interesting that during restoration processes I'm getting some error messages:
Sometimes restore is OK, but sometimes it takes very long and finally crashed.
I create backups in a regular way with backup-push (no encryption or other special elements).
The creation of backups and uploads to S3 storage is always normal and successful. Problems occur only with the restore process.
What does it mean "DecryptAndDecompressTar: lz4 decompress failed. Is archive encrypted?: DecompressLz4: lz4 write failed: unexpected EOF"?
Why does it happen?