ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Slowly handling large number of files #3528

Open jdgcs opened 7 years ago

jdgcs commented 7 years ago

Version information:

```
% ./ipfs version --all
go-ipfs version: 0.4.4-
Repo version: 4
System version: amd64/freebsd
Golang version: go1.7
```

Type:

./ipfs add became very slow when handling a large number of files.

Priority: P1

Description:

./ipfs add became very slow when handling about 45K files (~300 GB); it took 3+ seconds of waiting after the progress bar finished.

As a workaround, we can run several IPFS instances on the same machine.

About the machine: CPU: E3-1230V2, RAM: 16 GB, storage: 8 TB with a 240 GB SSD cache on ZFS.

Thanks for the amazing project!

jdgcs commented 7 years ago

```
% ./ipfs repo stat
NumObjects 9996740
RepoSize   381559053765
RepoPath   /home/liu/.ipfs
Version    fs-repo@4
```

FortisFortuna commented 6 years ago

I encounter this too.

schomatis commented 6 years ago

Hey @FortisFortuna, yes, this is a common issue with the default flatfs datastore (it basically stores each 256K chunk of every added file as a separate file in the repository, which ends up overwhelming the filesystem). Could you try the badgerds datastore and see if it helps? (Initialize the repository with the --profile=badgerds option.)
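
(A minimal sketch of the suggested setup, assuming a brand-new repository; existing repos need the conversion discussed below.)

```sh
# New repositories only: select the Badger datastore when initializing.
ipfs init --profile=badgerds
```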

FortisFortuna commented 6 years ago

thank you

```
ipfs config profile apply badgerds
ipfs-ds-convert convert
```

I have about 18 GB and 500K files (Everipedia) on the default flatfs. Do these commands convert the blocks from flatfs to badgerds so I don't have to do everything over again?

Stebalien commented 6 years ago

Yes. However, it may be faster to do it over as this will still have to extract the blocks from flatfs and move them into badgerds.

schomatis commented 6 years ago

Yes, but keep in mind that the conversion tool will require free space roughly twice the size of the repo being converted.
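
(The conversion workflow being discussed, as a sketch pieced together from the commands quoted above; run it against an existing flatfs repo, with roughly 2x the repo size free.)

```sh
# Existing repositories: switch the config to badgerds, then migrate the blocks.
ipfs config profile apply badgerds   # updates the datastore spec in the repo config
ipfs-ds-convert convert              # copies the blocks from flatfs into badgerds
```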

Stebalien commented 6 years ago

Also, I'd be interested in how big your datastore gets with badgerds.

FortisFortuna commented 6 years ago

I am unable to build the conversion tool. It stalls for me on the make inside ipfs-ds-convert [0 / 22]

Stebalien commented 6 years ago

Looks like you're having trouble fetching the dependencies. Try building https://github.com/ipfs/ipfs-ds-convert/pull/11.

FortisFortuna commented 6 years ago

Ok thanks! Your pull request let me build it. I will follow the instructions in this thread and in #5013 and try to convert the db (I backed up the flatfs version just in case). Thanks for the quick reply.

FortisFortuna commented 6 years ago

Works, but still the same slow speed

schomatis commented 6 years ago

@FortisFortuna That's strange, I would definitely expect a speedup when using Badger instead of the flat datastore for adding files. I can't say it would be fast, but it should be noticeably faster than your previous setup.

Could you, as a test, initialize a new repo with the --profile=badgerds option and add a small sample of your data set (say 30 GB) to check whether writes are faster than with flatfs? (Badger's performance may degrade with bigger data sets, but not to the point of being comparable with flatfs, so this test should be representative enough to confirm that everything is set up properly on your end; if it is, we should investigate further on our side, or Badger's.)
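
(A sketch of the suggested test, assuming a throwaway repo; IPFS_PATH and the sample path are placeholders.)

```sh
# Point the CLI at a temporary repo so the existing one is left untouched.
export IPFS_PATH=~/ipfs-badger-test    # placeholder location
ipfs init --profile=badgerds
# Add a ~30GB sample and compare add throughput against the flatfs repo.
ipfs add -r /path/to/sample            # placeholder path
```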

Stebalien commented 6 years ago

Hm. Actually, this may be pins. Are you adding one large directory or a bunch of individual files? Our pin logic is not really optimized at the moment, so if you add all the files individually, you'll end up with many pins and performance will be terrible.

FortisFortuna commented 6 years ago

Everipedia has around 6 million pages, and I have IPFS'd about 710K of them in the past week on a 32-core, 252 GB RAM machine. Something is bottlenecking, because I am only getting about 5-10 hashes a second. I know for a fact the bottleneck is the ipfs add in the code; the machine isn't even running near full capacity. I am using this: https://github.com/ipfs/py-ipfs-api

```python
import ipfsapi

api = ipfsapi.connect('127.0.0.1', 5001)
res = api.add('test.txt')
```

Specifically, a gzipped html file of average size ~15 kB is being added each loop.

Stebalien commented 6 years ago

Ah. Yeah, that'd do it. We're trying to redesign how we do pins but that's currently under discussion.

So, the best way to deal with this is to just add the files all at once with ipfs add -r. Alternatively, you can disable garbage collection (don't run the daemon with the --enable-gc flag) and just add the files without pinning them (use pin=False).
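
(The two suggestions as a CLI sketch, with placeholder paths; per the comments below, the same pin option can also be passed to py-ipfs-api's add call.)

```sh
# Option 1: one recursive add, so the whole directory gets a single pin.
ipfs add -r /path/to/pages             # placeholder path

# Option 2: skip pinning entirely. Only safe while the daemon is NOT run with
# --enable-gc, otherwise unpinned blocks may be garbage-collected.
ipfs add --pin=false page.html.gz      # placeholder file
```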

FortisFortuna commented 6 years ago

i will try pin=False. I need to keep track of which files get which hashes though, so I don't think I can simply pre-generate the html files and then add them, unless you know a way. If I skip the pinning, will I still be able to ipfs cat them?

Stebalien commented 6 years ago

Once you've added a directory, you can get the hashes of the files in the directory by running either:

Note: If you're adding a massive directory, you'll need to enable [directory sharding](https://github.com/ipfs/go-ipfs/blob/master/docs/experimental-features.md#directory-sharding--hamt) (which is an experimental feature).
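
(One way to get the per-file hashes after a recursive add, as a sketch; the directory CID is a placeholder.)

```sh
# List the entries (hash, size, name) of a previously added directory.
ipfs ls QmYourDirectoryHash            # placeholder CID
```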

FortisFortuna commented 6 years ago

Thanks

FortisFortuna commented 6 years ago

So to clarify, if I put pin=False, I can still retrieve / cat the files right, as long as I keep garbage collection off? I noticed a gradual degradation in file addition speed as more files were added.

FortisFortuna commented 6 years ago

You are a god among men @Stebalien. Setting pin=False in the Python script did it! To summarize:

1) Using badgerds
2) Running the daemon with --offline
3) ipfs config Reprovider.Strategy roots
4) ipfs config profile apply server
5) Setting pin=False when ipfs add-ing in my Python script

Getting like 25 hashes a second now vs 3-5 before. ipfs cat works too
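
(The summary above as a CLI sketch; the commands and flags come from the comments in this thread, and only the file name is a placeholder.)

```sh
ipfs config profile apply badgerds      # 1) Badger datastore (or init a fresh repo with it)
ipfs config Reprovider.Strategy roots   # 3) only re-provide root hashes
ipfs config profile apply server        # 4) server profile
ipfs daemon --offline                   # 2) run the daemon without joining the network
ipfs add --pin=false page.html.gz       # 5) add without pinning (placeholder file name)
```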

FortisFortuna commented 5 years ago

Hey guys. The IPFS server was working fine with the above options until I needed to restart. When I did, the daemon tried to initialize but froze. I am attempting to update from 0.4.15 to 0.4.17 to see if that helps, but now it stalls on "applying 6-to-7 repo migration". I have over 1 million IPFS hashes (everipedia.org). Is there anything I am doing wrong?

FortisFortuna commented 5 years ago

I see this in the process list: "/tmp/ipfs-update-migrate590452412/fs-repo-migrations -to 7 -y". Could it be I/O limitations?

FortisFortuna commented 5 years ago

Ok, so the migration did eventually finish, but it took a while (~ 1 hr). Once the update went through, the daemon started fast. It is working now.

Stebalien commented 5 years ago

So, that migration should have been blazing fast. It may have been that "daemon freeze": the migration literally uses the 0.4.15 repo code to load the repo.

It may also have been the initial repo size computation. We've switched to memoizing the repo size as it's expensive to compute for large repos but we have to compute it up-front so that might have delayed your startup.

hoogw commented 5 years ago

ipfs add -r on 289 GB (average file size < 10 MB): after adding 70 GB the speed noticeably slowed down, and it took 2 days to reach 200 GB.

Do you mean I can speed this up (on go-ipfs v0.4.18) with ipfs add --pin=false -r xxxxxxxxx?

Is this right?

Stebalien commented 5 years ago

@hoogw please report a new issue.

hoogw commented 5 years ago

ipfs add pins by default (--pin=true).

To turn off pinning and speed things up:

```
D:\test>ipfs add --pin=false IMG_1427.jpg
 4.18 MiB / 4.18 MiB [========================================================================================] 100.00%
added QmekTFtiQqrhiqms8FXZqPD1TfMc9kQUoNF8WVUNBGJF8h IMG_1427.jpg
 4.18 MiB / 4.18 MiB [========================================================================================] 100.00%

D:\test>
```