Tar stream closes early on large stream (44+Gb)

MorningLightMountain713 commented 1 month ago

Hi there,

I'm having an issue with tar-fs and I don't even know where to start debugging it. Any pointer on how I can debug this would be great.

It could even have something to do with the underlying system running out of memory and swapping. (Plenty of swap space though) It quite often ends up swapping, but this shouldn't cause the stream to end.

It's pretty consistent, I get around 30-40Gb transferred, then the stream just ends - with no errors raised.

Here is how I am using it:

const workflow = [];

workflow.push(tar.pack(base, {
  entries: folders,
}));

workflow.push(res); # res is an express writeableStream

const pipeline = promisify(stream.pipeline);
const error = await pipeline(...workflow).catch((err) => err);

I addd a stream.PassThrough() in there to see how many bytes are being written - and when the stream ends, it hasn't written all the bytes.

mafintosh commented 1 month ago

does the pipeline end with no error?

MorningLightMountain713 commented 1 month ago

does the pipeline end with no error?

Yes - no error when the pipeline ends. Also the tar is not corrupt - It's just missing files. Is there a way to get errors out per file? If there is an issue does the file get skipped?

Edit - also worth noting, quite often it seems like it is a subdirectory that gets skipped, which in this case - is 11Gb.

Thanks

mafintosh commented 1 month ago

If you only tar that one folder does it work? ie anything you can do to reduce the problem down helps us fix it. try not using any http stream also, just tar-fs on that folder and see if thats broken also

MorningLightMountain713 commented 1 month ago

If you only tar that one folder does it work? ie anything you can do to reduce the problem down helps us fix it. try not using any http stream also, just tar-fs on that folder and see if thats broken also

Yes - that folder does tar successfully.

Okay so I've narrowed it down somewhat.

I have a feeling it is some sort of timing issue and something to do with the subdirectory getting walked and stat ran for each file, which I'm sure takes time - being that it is 5.5k files and 11Gb.

If I add the subdirectory as an entry before the main directory, the entire tar get sent successfully. (I've only tested it once so far, but will continue testing)

Some examples:

Here is how I'm adding the directories:

    const folders = [
      'blocks',
      'chainstate',
      'determ_zelnodes',
    ];

    workflow.push(tar.pack(base, {
      entries: folders,
    }));

If the entries are added in this order (they are all directories with files) then quite often, but not always, the index dir is empty - note there is an index directory inside the blocks directory:

(.venv) davew@beetroot:~/zelflux$ tar tf flux_out.tar
blocks
chainstate
determ_zelnodes

If the entries are added in this order, it works fine and files / dirs are only added once:

(.venv) davew@beetroot:~/zelflux$ tar tf flux_out.tar
blocks/index
blocks
chainstate
determ_zelnodes

if the entries are added in this order, the files can sometimes be doubled up, depending on if the blocks entry loads the files from the index subdir:

(.venv) davew@beetroot:~/zelflux$ tar tf flux_out.tar
blocks
blocks/index
chainstate
determ_zelnodes

I'll try eliminate the http stream and see what happens

MorningLightMountain713 commented 1 month ago

Some info on what is being tarred

root@banana:/home/davew/.flux# du -h blocks chainstate determ_zelnodes
11G blocks/index
34G blocks
415M    chainstate
7.6G    determ_zelnodes
root@banana:/home/davew/.flux# find . -type f | wc -l
10191
root@banana:/home/davew/.flux#

mafintosh commented 1 month ago

nice, thanks, so you are saying its when the folder with tons of small files are at the end it becomes an issue?

MorningLightMountain713 commented 1 month ago

Yes, it seems the ordering is important, I get a good result if I put folders with lots of small files, before folders with big files.

As well as explictly including child folders as entries with lots of files before the parent folder.

Tested this several times now - getting good results by changing the order.

Edit: This is the ordering I'm using now:

    const folders = [
      'determ_zelnodes',
      'blocks/index',
      'chainstate',
      'blocks',
    ];

MorningLightMountain713 commented 1 month ago

I was wrong with the above statement. If I add both the blocks/index and blocks folders - it sometimes adds blocks/index twice.

So as far as I can tell I'm back to the initial timing issue... sometimes it adds the index folder, sometimes it doesn't.

Here is more detail about the blocks folder.

175 dat files, each 128Mb index folder ~5500 2.1Mb files 175 dat files, ranging from 4Mb -12Mb

<snip>
-rw------- 1 davew davew 128M Jul 19 02:56 blk00167.dat
-rw------- 1 davew davew 128M Jul 27 01:53 blk00168.dat
-rw------- 1 davew davew 128M Aug  4 06:36 blk00169.dat
-rw------- 1 davew davew 128M Aug 12 13:01 blk00170.dat
-rw------- 1 davew davew 128M Aug 20 16:40 blk00171.dat
-rw------- 1 davew davew 128M Aug 28 20:28 blk00172.dat
-rw------- 1 davew davew 128M Sep  5 21:38 blk00173.dat
-rw------- 1 davew davew 128M Sep 13 23:51 blk00174.dat
-rw------- 1 davew davew  48M Sep 16 21:46 blk00175.dat
drwx------ 2 davew davew 188K Sep 16 21:51 index
-rw------- 1 davew davew 6.3M May 24  2022 rev00000.dat
-rw------- 1 davew davew 5.6M May 24  2022 rev00001.dat
-rw------- 1 davew davew 7.2M May 24  2022 rev00002.dat
-rw------- 1 davew davew  11M May 24  2022 rev00003.dat
-rw------- 1 davew davew 7.5M May 24  2022 rev00004.dat
-rw------- 1 davew davew 7.9M May 24  2022 rev00005.dat
-rw------- 1 davew davew 9.0M May 24  2022 rev00006.dat
-rw------- 1 davew davew 8.8M May 24  2022 rev00007.dat
<snip>

mafintosh commented 1 month ago

If you run this

const stream = tar.pack(your-options)

let n = 
stream.on('data', function (data) {
  n += data.byteLength
})
stream.on('end', function () {
  console.log(n)
})

Does it give a different result everytime?

MorningLightMountain713 commented 1 month ago

I get the same result each time.

davew@banana:~/zelflux$ node test_stream.js
44020073472
davew@banana:~/zelflux$ node test_stream.js
44020073472
davew@banana:~/zelflux$ node test_stream.js
44020073472
davew@banana:~/zelflux$ node test_stream.js
44020073472
davew@banana:~/zelflux$ node test_stream.js
44020073472

However, something that is important that I forgot to mention, is that there is a service running while the tar is in progress, which can change the size of a few of the files (they have the latest timestamps in the leveldb)

This is what it looks like while the process is running, it's still picking up the index folder.

davew@banana:~/zelflux$ node test_stream.js
44018606592
davew@banana:~/zelflux$ node test_stream.js
44018614784
davew@banana:~/zelflux$ node test_stream.js
44027934720
davew@banana:~/zelflux$ node test_stream.js
44027935744

mafintosh commented 1 month ago

ohhhhhhh thats prop the issue, you are prop corrupting the tar as you are producing it then cause the header is written first in tar so if the file shrinks during the stat that creates issues

mafintosh / tar-fs

Tar stream closes early on large stream (44+Gb) #111