jefft0 opened this issue 7 years ago
Hey @jefft0 could you try out #3756 and see if it solved the problem?
I see you merged the pull request. I pulled and re-built from master, but still get the same error.
@jefft0 interesting... can you help debug this for us?
There's a program I use for this called lsof
(it should be in most Linux distros' package repos).
Start up your daemon and get its pid. Then do whatever it is that you do to reproduce this problem, and when it's about to happen, grab the output of lsof -p $IPFS_PID.
It might help to grab it a few times: before, during (as close as you can get to the error), and after. With this info we should be able to figure out where those fds are coming from and fix the bug.
See the attached output files from lsof.
lsof-before.txt
lsof-during.txt
lsof-after.txt
LevelDB is using 500 FDs, this shouldn't happen.
what the hell...
Any ideas on why so many file descriptors?
@jefft0 I've tried to reproduce the issue several times now with no luck. What sort of disks do you have? And is there anything weird about your setup (like different filesystems, mount points, or OS tweaks that might affect this)?
My files are in an 8-drive JBOD enclosure over USB. I'm using the default mount point for Ubuntu, but accessed through symbolic links in the .ipfs folder (as I described elsewhere, to meet the security requirement).
@jefft0 and your .ipfs directory (the datastore and blocks directories) are stored on a normal disk? (ssd or spinner?)
Yes, they're on my SSD system drive.
(My default home directory)
@jefft0 Okay, I'll run some experiments with the data being on a mounted drive.
In the meantime, if you don't mind, could you grab a stack dump from your daemon while the file descriptor usage is really high?
curl localhost:5001/debug/pprof/goroutine\?debug=2 > ipfs.stacks
I'm curious to see if there are tons of concurrent processes hitting leveldb for some reason
Attached. ipfs.stacks.txt
@whyrusleeping Here's another data point. I tried the same setup on macOS 10.12. I'm not running the daemon as I add --nocopy each file. It's easy to get the PID and run lsof because adding gets slower and slower with each file; it now takes over a minute to add one 200MB file.
Anyway, attached is the lsof output when adding a file. Why is it necessary to open 363 LDB files at the same time? Is each one opened in a goroutine that maybe doesn't close and terminate?
lsof-no-daemon.txt
Is add --nocopy supported in js-ipfs on Ubuntu? Do you think it's worth trying to add lots of files with js-ipfs to see if its use of LevelDB doesn't run out of file descriptors?
I don't think it is.
@jefft0 I have no idea why leveldb is behaving that way; it doesn't do so on my machines.
Just seeing your lsof output now
@whyrusleeping On the Ubuntu machine you tested, what version of LevelDB? How was it installed?
I am able to reproduce the error in a fresh Ubuntu virtual machine (both VirtualBox and Parallels) with a fresh installation of the latest go-ipfs. I transfer a 700MB gzip file and expand it as the new installation's ~/.ipfs. Then, using ipfs add --nocopy --raw-leaves gives the error "too many open files".
So, if someone wants to reproduce and debug this issue, I can get you the 700MB gzip file. (You don't need the original files that were added.) Would that be useful?
... or better yet, make a virtual machine for you to log into?
@jefft0 Yeah, getting that gzip file would be great!
See ipfs-too-many-fds.tar.gz .
@jefft0 the archive you sent was an ipfs repo. Was your issue with adding an ipfs repo into ipfs? Or did you mean to zip up something else?
That's what I meant to send. (Don't worry about the private key. It was generated just for this test.) Steps to reproduce:
- An installation of Ubuntu 16.04. (Haven't tested with other Ubuntu versions.)
- Install the latest IPFS, but don't do ipfs init.
- Extract the gzip file as your ~/.ipfs
- Put a big file in ~ (for example, the ipfs executable).
- Do ipfs add --nocopy --raw-leaves <file>
For me, it gives the error "too many open files".
Ah, thanks! Will try that soon
On Mon, Apr 10, 2017, 11:15 Jeff Thompson notifications@github.com wrote:
That's what I meant to send. (Don't worry about the private key. It was generated just for this test.) Steps to reproduce
- An installation of Ubuntu 16.04. (Haven't tested with other Ubuntu versions.)
- Install the latest IPFS, but don't do ipfs init.
- Extract the gzip file as your ~/.ipfs
- Put a big file in ~ (for example, the ipfs executable).
- Do ipfs add --nocopy --raw-leaves
For me, it gives the error "too many open files".
— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/ipfs/go-ipfs/issues/3763#issuecomment-292980903
BTW, I tried ipfs-too-many-fds.tar.gz with the latest Ubuntu 17.04 (and the latest go-ipfs). Same result, "too many open files".
@jefft0 I will look into this
@jefft0 for reference could you output the results of:
ulimit -n
cat /proc/sys/fs/file-max
lsof | wc -l
just before you run a failed command.
I am trying to reproduce the problem using master on Debian GNU/Linux 8.5. @whyrusleeping, @Kubuxu: so far I can get this to fail when I add a 1GB file using a ulimit of 8192. What would you say would be a reasonable limit here? For a 128 MB file it is okay at 8192 but fails at 4096, so there might be a leak somewhere.
@kevina ideally, the ulimit would be at 1024. We shouldn't require a huge number of file descriptors just to add something.
@kevina I also wonder if the fd issues have to do with leveldb batching. I don't think we use the leveldb batch anywhere outside of the filestore.
$ ulimit -n
1024
$ cat /proc/sys/fs/file-max
1579052
$ lsof | wc -l
29656
@kevina also might be worth updating our leveldb package, this commit seems suspiciously related: https://github.com/syndtr/goleveldb/commit/62a36c4ec8d8f7f8bbf13599cba7ca5c61b96660
@jefft0 I will see if we can decrease the number of open file descriptors, but for now you can fix your problem by upping the limit. I'm not sure how the ulimit is normally set.
If you have root access you can up the hard limit on the number of open files using:
sudo prlimit -p $$ --nofile=8192
this should set it to 8k for the current shell; to check, do another ulimit -n after.
@whyrusleeping, okay, I will look into upgrading the leveldb package and see how low I can get the ulimit.
@kevina just wanted to test my theory about batching, so I disabled it by making the batch code write directly through to the db. It made the add take slightly longer to fail, but it still got a too-many-FDs error nonetheless.
It's all tons of leveldb files. This feels like a leveldb bug to me...
Okay, I found the problem. The leveldb was not being properly compacted and there were too many "*.ldb" files (I counted over 4000), each of them open. I was able to fix the problem by calling:
db.CompactRange(util.Range{nil,nil})
after the database is opened in go-ds-leveldb.NewDatastore. This took a while when first called but then greatly sped up the add operation (I would guess by at least an order of magnitude, but I am not sure).
There are a bunch of parameters related to compaction, and I'm not sure why it is not being called automatically in our case.
On a possibly related note, we disable compression. I am not sure how much speed benefit this gives us, and it may also relate to the fact that leveldb is not automatically compacting (but this is just a wild guess).
I remember we disabled compression back when we were writing all the block data into leveldb, we can probably try turning it back on and seeing how that works.
Though that has nothing to do with compaction as far as I can tell. It's worrying that compaction was never run... I think that happens as a background process. @jefft0 Do you run the ipfs daemon for long periods? Or do you mainly do ipfs add without the daemon running? It could be that the leveldb process just never gets time to run a compaction if the daemon is never on for long.
@kevina do you think we should just run compaction at daemon startup all the time?
Do you run the ipfs daemon for long periods? Or do you mainly do ipfs add without the daemon running?
In my use case, I am doing the initial add --nocopy of all my files without running the daemon.
@whyrusleeping we could run compaction when we open the database (as opposed to just when starting the daemon). But I see that as a bit of a hack. Note also that the first compaction could take a very long time (over 10 minutes in this case).
Hrm... How about we find a nice way to add it to 'ipfs repo fsck' ?
On Tue, May 16, 2017, 09:28 Kevin Atkinson notifications@github.com wrote:
@jefft0 Ah, that makes sense then. In your use case, leveldb was never able to successfully complete a compaction run.
@whyrusleeping Are you saying that if I start the daemon, it will do a compaction?
... I'll try it. But if memory serves, I was also getting the error when I was running the daemon, trying to serve the files that I had added.
@jefft0 well, in your rather special case, it will likely start a compaction after a few minutes, then fail because of too many open file descriptors. What I would try is raising your ulimit to something above 5000, then starting the daemon with --offline (so it doesn't use up any file descriptors making network connections) and waiting a bit.
It will likely take half an hour or so: the compaction call @kevina mentioned says "do nothing else, just compact", whereas the default background compaction tries not to get in the way too much.
Do I need to add this line of code?
db.CompactRange(util.Range{nil,nil})
@jefft0 no, you shouldn't have to. leveldb will do the compaction automatically in the background. Just make sure you run your daemon with a very high ulimit.
@whyrusleeping we could add it to ipfs fsck for when the compaction does not have a chance to run normally.
@kevina Yeah, I like that idea.
Really, we just need to move away from leveldb. It has many issues. I have high hopes for badger, but I'm waiting on a few issues to be resolved: https://github.com/dgraph-io/badger/issues/28
Version information:
Type: bug
Priority: P1
Description:
With a script, I used ipfs add --nocopy --raw-leaves to individually add 9110 webcam video files, each about 200 MB. (I did not use recursive add.) This is a fresh installation of the latest code in master. I did not start the daemon. Now, even after restarting my computer, when I try to add --nocopy another file, I get the error. The .ipfs/blocks folder has 73689 .data files, and the .ipfs/datastore folder has 4314 .ldb files. Maybe ipfs add --nocopy is trying to open all the .ldb files at once?