jonyg80 opened this issue 3 years ago
Hey @jonyg80
Thanks for reporting this. Can you give me more information?
Which pack function are you using? It would be great to have a small snippet showing how you are using this module to pack, as well as your environment details.
@vasco-santos I am using it to pack a git repo.
git clone --bare https://gitlab.com/gitlab-org/gitlab.git gitlab
cd gitlab
git update-server-info
mv objects/pack/*.pack .
git unpack-objects < *.pack
rm -f *.pack objects/pack/*
cd ..
ipfs-car --pack ./gitlab
Environment details
vCPUs: 2
RAM: 7.5 GB

@jonyg80 sorry for taking so long to answer, but I was away. This might have fixed the issue, but I am trying your script at the moment.
@jonyg80 I could replicate your issue, thanks for reporting. I need to see where the problem might be
Giving an update on this: I have been trying to narrow the problem down to the problematic component.
It seems that the problem is in our FsBlockstore. I replaced it with https://github.com/ipfs/js-datastore-fs for testing and it looks good so far. It is extremely slow though: I have been running it for ~6 hours and it is still packing the 19GB GitLab repo.
I am waiting for the entire pack to finish gracefully to confirm whether that is the case.

This is not a problem with FsBlockstore after all. I kept the pack running for ~6 hours, but it got to the memory crash too. So my current inclination is that this is related to unixfs-importer, but I need to do more tests. unixfs-importer seems to have problems with the 3223073 files being added at the same time, needing to consume a lot of memory to build the DAG.
possibly related: https://github.com/web3-storage/web3.storage/issues/318
After long tests taking days to run on my machine, I got to a minimal reproducible case: a subset of how ipfs-car packs the given files:
import os from 'os'
import process from 'process'
import DatastoreFS from 'datastore-fs'
import BlockstoreDatastoreAdapter from 'blockstore-datastore-adapter'
import { importer } from 'ipfs-unixfs-importer'
import { normaliseInput } from 'ipfs-core-utils/src/files/normalise-input/index.js'
import globSource from 'ipfs-utils/src/files/glob-source.js'
import last from 'it-last'
import pipe from 'it-pipe'
async function main () {
const input = process.argv[2]
if (!input) {
throw new Error('no input provided')
}
const location = `${os.tmpdir()}/${(parseInt(String(Math.random() * 1e9), 10)).toString() + Date.now()}`
console.log('blockstore path', location)
const blockstore = new BlockstoreDatastoreAdapter(
new DatastoreFS(`${location}/blocks`, {
extension: '.data'
})
)
await last(pipe(
globSource(input, {
recursive: true
}),
(source) => normaliseInput(source),
(source) => importer(source, blockstore, {})
))
}
main()
This ends up failing with excessive memory consumption, as mentioned in the original post.
However, if we change the pipe to run the importer once per normaliseInput entry, it works as expected without any memory issues:
let res
for await (const source of normaliseInput(globSource(input, { recursive: true }))) {
res = await last(importer(source, blockstore, {}))
}
This does not run into excessive memory consumption, nor the too-many-open-files error. After some debugging with the original code, I noticed that the importer seemed slow to consume what the generators yielded (per my logging). This seems to cause readable streams to be opened for more and more files over time, each needing a longer and longer wait before being imported.
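To illustrate the backpressure point, here is a minimal, hypothetical sketch (not the actual ipfs-car code): with plain pull-based async iteration, the producer only runs when the consumer asks for the next item, so at most one "resource" is live at a time. Any stage that buffers ahead of the consumer breaks this guarantee and lets open resources pile up.

```javascript
// Hypothetical sketch: onOpen() stands in for opening a read stream per file.
async function * producer (n, onOpen) {
  for (let i = 0; i < n; i++) {
    onOpen()
    yield i
  }
}

async function run (n) {
  let open = 0
  let maxOpen = 0
  for await (const item of producer(n, () => {
    open++
    maxOpen = Math.max(maxOpen, open)
  })) {
    // The consumer "closes" the resource before pulling the next item,
    // so pull-based iteration keeps at most one resource live at a time.
    open--
  }
  return maxOpen
}

run(1000).then((maxOpen) => console.log('max live resources:', maxOpen)) // → 1
```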
I also tried:
await last(pipe(
globSource(input, {
recursive: true
}),
(source) => normaliseInput(source),
(source) => importer(source, blockstore, {
fileImportConcurrency: 1
})
))
but it did not seem to make a difference, unlike the previous change where memory was not an issue.
I am not super familiar with the codebases of globSource, normaliseInput and unixfs-importer, and I could not make further progress identifying why more files are yielded than the consumer can consume.
@achingbrain could I get some help on possible causes for this?
I'm lacking a bit of context here but some observations...
Looking into js-datastore-fs: it uses fast-write-atomic for writing (we use fs.writeFile) and fs.readFile for reading (like we do). We don't really do anything besides the fs.* operations.
fs.* operations are non-atomic, so if the process crashes during writing you'll end up with a corrupt blockstore as files will only be half-written. That's why datastore-fs uses fast-write-atomic.
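For context, the atomic-write pattern that fast-write-atomic relies on can be sketched roughly like this (a simplified assumption of what it does; the real module also handles fsync and unique temp names):

```javascript
import fs from 'fs/promises'
import path from 'path'
import process from 'process'

// Simplified sketch of atomic writes: write to a temp file, then rename.
async function writeAtomic (dest, data) {
  const tmp = path.join(path.dirname(dest), `.${path.basename(dest)}.${process.pid}.tmp`)
  await fs.writeFile(tmp, data)
  // rename(2) is atomic on POSIX filesystems: readers see either the old
  // file or the complete new one, never a half-written block.
  await fs.rename(tmp, dest)
}
```

If the process crashes mid-write, only the temp file is corrupt; the destination is never left half-written.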
With this approach you'll invoke the importer repeatedly, once for every piece of input:
let res
for await (const source of normaliseInput(globSource(input, { recursive: true }))) {
res = await last(importer(source, blockstore, {}))
}
This changes the output, e.g. the final CID will not be the same as for:
const res = await pipe(
globSource(input, {
recursive: true
}),
(source) => normaliseInput(source),
(source) => importer(source, blockstore, {}),
(source) => last(source)
)
It also takes about 5x longer in my (admittedly unscientific) testing.
Extremely slow though
Off the top of my head there's a bottleneck in the importer whereby it writes directory contents out sequentially, eliminating that will speed it up quite a bit, though if the directory is big it'll overwhelm the runtime.
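One generic way to lift such a bottleneck is bounded-concurrency writing. A hypothetical sketch, not the importer's actual code, where `write` is a made-up stand-in for a blockstore put:

```javascript
// Hypothetical sketch: write items with bounded concurrency instead of
// strictly sequentially, so a big directory cannot overwhelm the runtime.
async function writeAll (items, write, concurrency = 8) {
  const queue = [...items]
  const workers = Array.from({ length: concurrency }, async () => {
    // queue.shift() is synchronous, so no two workers take the same item
    while (queue.length > 0) {
      await write(queue.shift())
    }
  })
  await Promise.all(workers)
}
```

The concurrency limit is the knob: 1 reproduces today's sequential behaviour, while a small fixed number keeps memory bounded for wide directories.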
TBH I've been meaning to rewrite the whole thing so the importer passes a stream to the blockstore's .putMany
method which can then pull blocks out of the stream as fast as it can write them. It could even auto-tune the parallelisation of that method based on current throughput, that'd be fun.
Need to profile this though.
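The streaming putMany idea could look roughly like this (purely illustrative; this is not the actual interface-blockstore API):

```javascript
// Illustrative sketch: a putMany that pulls blocks from an async iterable
// and writes each one as fast as the store allows, so backpressure
// propagates from the store all the way back to the producer.
async function * putMany (source, store) {
  for await (const { key, value } of source) {
    await store.put(key, value) // producer is paused until this resolves
    yield key
  }
}
```

Because the generator only pulls the next block after the previous write resolves, the importer can never race ahead of the disk.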
I'm running a modified version of your repro case in a directory with a node_modules
folder that includes all of the deps necessary to run the repro case:
import process from 'process'
import DatastoreFS from 'datastore-fs'
import BlockstoreDatastoreAdapter from 'blockstore-datastore-adapter'
import { importer } from 'ipfs-unixfs-importer'
import { normaliseInput } from 'ipfs-core-utils/src/files/normalise-input/index.js'
import globSource from 'ipfs-utils/src/files/glob-source.js'
import last from 'it-last'
import pipe from 'it-pipe'
import pretty from 'pretty-bytes'
async function main () {
const input = process.argv[2]
if (!input) {
throw new Error('no input provided')
}
const location = `${(parseInt(String(Math.random() * 1e9), 10)).toString() + Date.now()}`
console.log('blockstore path', location)
const blockstore = new BlockstoreDatastoreAdapter(
new DatastoreFS(`${location}/blocks`, {
extension: '.data'
})
)
let heapUsed = 0
const interval = setInterval(() => {
globalThis.gc()
const stats = process.memoryUsage()
if (stats.heapUsed > heapUsed) {
heapUsed = stats.heapUsed
}
}, 100)
const start = Date.now()
globalThis.gc()
console.info('Heap before', pretty(process.memoryUsage().heapUsed))
const res = await pipe(
globSource(input, {
recursive: true
}),
(source) => normaliseInput(source),
(source) => importer(source, blockstore, {}),
(source) => last(source)
)
console.info('Took', Date.now() - start, 'ms')
clearInterval(interval)
globalThis.gc()
console.info('Heap after', pretty(process.memoryUsage().heapUsed))
console.info('Max heap', pretty(heapUsed))
console.info(res)
}
main()
I see:
$ node --expose-gc index.js node_modules
blockstore path 1287063661630500222542
Heap before 5.82 MB
Took 44536 ms
Heap after 6.2 MB
Max heap 9.18 MB
{
cid: CID(QmQcjTmAk1PVATCmzqfMdbYmoFvY1U9jrkzqwY9QSSAa13),
path: 'node_modules',
unixfs: UnixFS {
type: 'directory',
data: undefined,
hashType: undefined,
fanout: undefined,
blockSizes: [],
_originalMode: 0,
_mode: 493
},
size: 12390787
}
So I don't think there's a memory leak, since most of the heap gets reclaimed, but the in-flight memory usage might not be the most efficient.
We should profile this properly but there are a few things that spring to mind:
importer([{
path: '/foo/bar', content
}, {
path: '/foo/baz', content // Path is outside /foo/bar, flush /foo/bar to reclaim the memory
}])
importer([{
path: '/foo/bar', content
}, {
path: '/foo/baz', content // Path is outside /foo/bar, flush /foo/bar to reclaim the memory
}, {
path: '/foo/bar/qux', content // Oh no, path is under /foo/bar, re-create representation to calculate new CID of sub-tree
}])
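If I understand the flushing behaviour above correctly, pre-sorting the input so that all paths under a directory arrive contiguously would avoid the re-create-subtree case. A small hypothetical sketch (entry contents omitted for brevity):

```javascript
// Hypothetical sketch: sort entries so each subtree is contiguous and can
// be flushed exactly once, instead of being re-created when a late entry
// lands back under an already-flushed directory.
const entries = [
  { path: '/foo/bar' },
  { path: '/foo/baz' },
  { path: '/foo/bar/qux' }
]
entries.sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0))
console.log(entries.map((e) => e.path))
// → [ '/foo/bar', '/foo/bar/qux', '/foo/baz' ]
```

After sorting, the /foo/bar subtree is fully finished before /foo/baz begins, so its memory can be reclaimed once.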
it-glob (used by globSource) uses fs.readdir internally, which only returns whole arrays of files and has no way of doing pagination. This will cause excessive memory use for very wide trees.
it-glob was updated to no longer use fs.readdir (https://github.com/achingbrain/it/pull/16, thanks @achingbrain ❤️) and shipped in ipfs-car 0.5.9. It would be great to have this tested with 0.5.9 or newer versions to confirm whether it helped with this problem.
@achingbrain thanks for all your thoughts. We need to explore your idea for improving deep tree imports.
Crash when trying to pack 19GB folder