jonyg80 opened this issue 3 years ago
Hey @jonyg80
Thanks for reporting this. Can you give me more information?
Which pack function are you using? It would be great to have a small snippet showing how you are using this module to pack, as well as your environment details.
@vasco-santos I am using it to pack a git repo.
git clone --bare https://gitlab.com/gitlab-org/gitlab.git gitlab
cd gitlab
git update-server-info
mv objects/pack/*.pack .
git unpack-objects < *.pack
rm -f *.pack objects/pack/*
cd ..
ipfs-car --pack ./gitlab
Environment details
vCPUs: 2
RAM: 7.5 GB

@jonyg80 sorry for taking so long to answer, but I was away. This might have fixed the issue, but I am trying your script at the moment.
@jonyg80 I could replicate your issue, thanks for reporting. I need to see where the problem might be
Giving an update on this: I have been trying to narrow the problem down to the problematic component.
It seems that the problem is in our FsBlockstore. I replaced it with https://github.com/ipfs/js-datastore-fs for testing and it looks good so far. It is extremely slow though: I have been running it for ~6 hours and it is still packing the 19GB GitLab repo.
I am waiting for the entire pack to finish gracefully to confirm whether that is the case.

This is not a problem with FsBlockstore after all. I kept the pack running for ~6 hours, but it got to the memory crash too. So my current inclination is that this is related to unixfs-importer, but I need to do more tests. unixfs-importer seems to have problems with the 3223073 files being added at the same time, needing to consume a lot of memory to build the DAG.
possibly related: https://github.com/web3-storage/web3.storage/issues/318
After long tests taking days to run on my machine, I got to a minimal reproducible case: a subset of how ipfs-car packs the given files:
import os from 'os'
import process from 'process'
import DatastoreFS from 'datastore-fs'
import BlockstoreDatastoreAdapter from 'blockstore-datastore-adapter'
import { importer } from 'ipfs-unixfs-importer'
import { normaliseInput } from 'ipfs-core-utils/src/files/normalise-input/index.js'
import globSource from 'ipfs-utils/src/files/glob-source.js'
import last from 'it-last'
import pipe from 'it-pipe'
async function main () {
const input = process.argv[2]
if (!input) {
throw new Error('no input provided')
}
const location = `${os.tmpdir()}/${(parseInt(String(Math.random() * 1e9), 10)).toString() + Date.now()}`
console.log('blockstore path', location)
const blockstore = new BlockstoreDatastoreAdapter(
new DatastoreFS(`${location}/blocks`, {
extension: '.data'
})
)
await last(pipe(
globSource(input, {
recursive: true
}),
(source) => normaliseInput(source),
(source) => importer(source, blockstore, {})
))
}
main()
This ends up failing with excessive memory consumption, as mentioned in the original post.
However, if we change the pipe to run the importer once per normaliseInput entry, it works as expected without any memory issues:
let res
for await (const source of normaliseInput(globSource(input, { recursive: true }))) {
res = await last(importer(source, blockstore, {}))
}
This does not run into excessive memory consumption, nor the too-many-open-files error. After some debugging with the original code, I noticed that the importer seemed slow to consume what the generators yielded (per my logging). This seems to cause readable streams to be opened for more and more files over time, each needing a longer and longer wait before being imported.
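To illustrate the backpressure point, here is a minimal, hypothetical sketch (not the actual ipfs-car code): with plain pull-based async iteration, the producer only runs when the consumer asks for the next item, so at most one "resource" is live at a time. Any stage that buffers ahead of the consumer breaks this guarantee and lets open resources pile up.

```javascript
// Hypothetical sketch: onOpen() stands in for opening a read stream per file.
async function * producer (n, onOpen) {
  for (let i = 0; i < n; i++) {
    onOpen()
    yield i
  }
}

async function run (n) {
  let open = 0
  let maxOpen = 0
  for await (const item of producer(n, () => {
    open++
    maxOpen = Math.max(maxOpen, open)
  })) {
    // The consumer "closes" the resource before pulling the next item,
    // so pull-based iteration keeps at most one resource live at a time.
    open--
  }
  return maxOpen
}

run(1000).then((maxOpen) => console.log('max live resources:', maxOpen)) // → 1
```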
I also tried:
await last(pipe(
globSource(input, {
recursive: true
}),
(source) => normaliseInput(source),
(source) => importer(source, blockstore, {
fileImportConcurrency: 1
})
))
but it did not seem to make a difference, unlike the previous change where memory was not an issue.
I am not super familiar with the codebases of globSource, normaliseInput and unixfs-importer, and I could not make further progress identifying why more files are yielded than the consumer can consume.
@achingbrain could I get some help on possible causes for this?
I'm lacking a bit of context here but some observations...
Looking into js-datastore-fs: it uses fast-write-atomic for writing (we use fs.writeFile) and fs.readFile for reading (like we do). We don't really do anything besides the fs.* operations.
fs.* operations are non-atomic, so if the process crashes during writing you'll end up with a corrupt blockstore as files will only be half-written. That's why datastore-fs uses fast-write-atomic.
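For context, the atomic-write pattern that fast-write-atomic relies on can be sketched roughly like this (a simplified assumption of what it does; the real module also handles fsync and unique temp names):

```javascript
import fs from 'fs/promises'
import path from 'path'
import process from 'process'

// Simplified sketch of atomic writes: write to a temp file, then rename.
async function writeAtomic (dest, data) {
  const tmp = path.join(path.dirname(dest), `.${path.basename(dest)}.${process.pid}.tmp`)
  await fs.writeFile(tmp, data)
  // rename(2) is atomic on POSIX filesystems: readers see either the old
  // file or the complete new one, never a half-written block.
  await fs.rename(tmp, dest)
}
```

If the process crashes mid-write, only the temp file is corrupt; the destination is never left half-written.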
With this approach you'll invoke the importer repeatedly, once for every piece of input:
let res
for await (const source of normaliseInput(globSource(input, { recursive: true }))) {
res = await last(importer(source, blockstore, {}))
}
This changes the output, e.g. the final CID will not be the same as for:
const res = await pipe(
globSource(input, {
recursive: true
}),
(source) => normaliseInput(source),
(source) => importer(source, blockstore, {}),
(source) => last(source)
)
It also takes about 5x longer in my (admittedly unscientific) testing.
Extremely slow though
Off the top of my head there's a bottleneck in the importer whereby it writes directory contents out sequentially, eliminating that will speed it up quite a bit, though if the directory is big it'll overwhelm the runtime.
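One generic way to lift such a bottleneck is bounded-concurrency writing. A hypothetical sketch, not the importer's actual code, where `write` is a made-up stand-in for a blockstore put:

```javascript
// Hypothetical sketch: write items with bounded concurrency instead of
// strictly sequentially, so a big directory cannot overwhelm the runtime.
async function writeAll (items, write, concurrency = 8) {
  const queue = [...items]
  const workers = Array.from({ length: concurrency }, async () => {
    // queue.shift() is synchronous, so no two workers take the same item
    while (queue.length > 0) {
      await write(queue.shift())
    }
  })
  await Promise.all(workers)
}
```

The concurrency limit is the knob: 1 reproduces today's sequential behaviour, while a small fixed number keeps memory bounded for wide directories.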
TBH I've been meaning to rewrite the whole thing so the importer passes a stream to the blockstore's .putMany
method which can then pull blocks out of the stream as fast as it can write them. It could even auto-tune the parallelisation of that method based on current throughput, that'd be fun.
Need to profile this though.
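The streaming putMany idea could look roughly like this (purely illustrative; this is not the actual interface-blockstore API):

```javascript
// Illustrative sketch: a putMany that pulls blocks from an async iterable
// and writes each one as fast as the store allows, so backpressure
// propagates from the store all the way back to the producer.
async function * putMany (source, store) {
  for await (const { key, value } of source) {
    await store.put(key, value) // producer is paused until this resolves
    yield key
  }
}
```

Because the generator only pulls the next block after the previous write resolves, the importer can never race ahead of the disk.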
I'm running a modified version of your repro case in a directory with a node_modules
folder that includes all of the deps necessary to run the repro case:
import process from 'process'
import DatastoreFS from 'datastore-fs'
import BlockstoreDatastoreAdapter from 'blockstore-datastore-adapter'
import { importer } from 'ipfs-unixfs-importer'
import { normaliseInput } from 'ipfs-core-utils/src/files/normalise-input/index.js'
import globSource from 'ipfs-utils/src/files/glob-source.js'
import last from 'it-last'
import pipe from 'it-pipe'
import pretty from 'pretty-bytes'
async function main () {
const input = process.argv[2]
if (!input) {
throw new Error('no input provided')
}
const location = `${(parseInt(String(Math.random() * 1e9), 10)).toString() + Date.now()}`
console.log('blockstore path', location)
const blockstore = new BlockstoreDatastoreAdapter(
new DatastoreFS(`${location}/blocks`, {
extension: '.data'
})
)
let heapUsed = 0
const interval = setInterval(() => {
globalThis.gc()
const stats = process.memoryUsage()
if (stats.heapUsed > heapUsed) {
heapUsed = stats.heapUsed
}
}, 100)
const start = Date.now()
globalThis.gc()
console.info('Heap before', pretty(process.memoryUsage().heapUsed))
const res = await pipe(
globSource(input, {
recursive: true
}),
(source) => normaliseInput(source),
(source) => importer(source, blockstore, {}),
(source) => last(source)
)
console.info('Took', Date.now() - start, 'ms')
clearInterval(interval)
globalThis.gc()
console.info('Heap after', pretty(process.memoryUsage().heapUsed))
console.info('Max heap', pretty(heapUsed))
console.info(res)
}
main()
I see:
$ node --expose-gc index.js node_modules
blockstore path 1287063661630500222542
Heap before 5.82 MB
Took 44536 ms
Heap after 6.2 MB
Max heap 9.18 MB
{
cid: CID(QmQcjTmAk1PVATCmzqfMdbYmoFvY1U9jrkzqwY9QSSAa13),
path: 'node_modules',
unixfs: UnixFS {
type: 'directory',
data: undefined,
hashType: undefined,
fanout: undefined,
blockSizes: [],
_originalMode: 0,
_mode: 493
},
size: 12390787
}
So I don't think there's a memory leak, since most of the heap gets reclaimed, but the in-flight memory usage might not be the most efficient.
We should profile this properly but there are a few things that spring to mind:
importer([{
path: '/foo/bar', content
}, {
path: '/foo/baz', content // Path is outside /foo/bar, flush /foo/bar to reclaim the memory
}])
importer([{
path: '/foo/bar', content
}, {
path: '/foo/baz', content // Path is outside /foo/bar, flush /foo/bar to reclaim the memory
}, {
path: '/foo/bar/qux', content // Oh no, path is under /foo/bar, re-create representation to calculate new CID of sub-tree
}])
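If I understand the flushing behaviour above correctly, pre-sorting the input so that all paths under a directory arrive contiguously would avoid the re-create-subtree case. A small hypothetical sketch (entry contents omitted for brevity):

```javascript
// Hypothetical sketch: sort entries so each subtree is contiguous and can
// be flushed exactly once, instead of being re-created when a late entry
// lands back under an already-flushed directory.
const entries = [
  { path: '/foo/bar' },
  { path: '/foo/baz' },
  { path: '/foo/bar/qux' }
]
entries.sort((a, b) => (a.path < b.path ? -1 : a.path > b.path ? 1 : 0))
console.log(entries.map((e) => e.path))
// → [ '/foo/bar', '/foo/bar/qux', '/foo/baz' ]
```

After sorting, the /foo/bar subtree is fully finished before /foo/baz begins, so its memory can be reclaimed once.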
it-glob (used by globSource) uses fs.readdir internally, which only returns whole arrays of files and has no way of doing pagination. This will cause excessive memory use for very wide trees.
it-glob was updated to no longer use fs.readdir (https://github.com/achingbrain/it/pull/16, thanks @achingbrain ❤️) and shipped in ipfs-car 0.5.9. It would be great to have this tested with 0.5.9 or newer versions to confirm whether it helped with this problem.
@achingbrain thanks for all your thoughts. We need to explore your idea for improving deep tree imports.
Crash when trying to pack 19GB folder