ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

[WIP] Filestore Implementation #2634

Closed kevina closed 7 years ago

kevina commented 8 years ago

Closes Issue #875, avoid duplicating files added to ipfs

NOT READY FOR MERGE

Rebased #2600 on master.

Quicklinks: Code, README

TODO to get this merged:

Note: The filestore is very basic right now, but it is functional. I will likely continue to improve it and submit new pull requests for the enhanced functionality, but right now I feel it is important that a basic implementation gets in so that it will get used. It can be labeled as an experimental feature and disabled by default, but available for those that want to use it. I consider the code production ready.

Resolves #875

kevina commented 8 years ago

I imagine that this is going to take a while before it gets merged. So for now I think it will be best to maintain this as a separate fork, with the goal of eventually merging everything in.

@whyrusleeping I would still like to work with you on API issues and to get major API changes merged first to avoid stagnation.

I have created a README for the new filestore, available here: https://github.com/ipfs-filestore/go-ipfs/blob/kevina/filestore/filestore/README.md. Some notes on my fork are available here: https://github.com/ipfs-filestore/go-ipfs/wiki.

@jefft0 if you are interested, I could use some serious testing now. Others are of course welcome to test it out.

whyrusleeping commented 8 years ago

@kevina Yeah, there's a LOT here, so it's going to take some time. I don't disagree with you maintaining a fork in the meantime. That said, as I find time my plan is to pick out parts of the code that can be merged up easily. I think the next candidate for that is the commands/files code refactor to use absolute paths.

kevina commented 8 years ago

@whyrusleeping, one idea I had which might avoid some special casing is to replicate what is done with add --only-hash:

                if hash {
                        nilnode, err := core.NewNode(n.Context(), &core.BuildCfg{
                                //TODO: need this to be true or all files
                                // hashed will be stored in memory!
                                NilRepo: true,
                        })
                        ...
                        n = nilnode
                }

That is, create a special "node" for adding files to the filestore. This could help avoid special-case code in at least a few places. It will at minimum avoid the special-case code in the blockstore package, as I can use my own implementation.

Do you think this is an idea worth pursuing? I imagine the special node will share many of the data structures from the real node to avoid any data races or related issues.

whyrusleeping commented 8 years ago

@kevina I think that might be a good idea. My only concern is: in order to read the blocks out of the filestore, would you still need that custom created node?

kevina commented 8 years ago

@whyrusleeping, no, the custom node will be for adding only. We will still need some sort of multi-datastore for reading blocks. See my comments on #2747.

kevina commented 7 years ago

@whyrusleeping there is a lot going on in an IPFS Node. What I need is not a mock node like the code that --only-hash uses, but rather a new Node that acts like a view into the original node, except that adds are handled differently. I was not confident I could pull this off with a full IPFS Node.

With that in mind, I refactored the adder code a little to bundle all the needed services into the DataServices struct; see commit c41a47e. A proper interface that exports just the methods needed by the adder would be nice, but I think this is a good start. This new interface could be used to simplify a lot of code elsewhere, including the code for just computing the hashes, but I don't fully understand what is going on there so I left the code for add --hash-only alone.

I use this new interface to create a DataServices view in commit e6bfbfa that uses an alternative Blockstore and leaves the existing Blockstore and caching Blockstore wrapper alone. In commit b54f2f4 I factor out the filestore-specific code from the DagServices and instead install small hooks.

In commit f9a51b1 I completely eliminated the AdvReader code you didn't like and instead computed the offsets in the DAG builder code (the Balanced builder to be specific; the Trickle builder doesn't support the filestore just yet).

All in all I cleaned up a lot of the special case code.

If you are okay with the changes in c41a47e I can separate that out into its own pull request.

whyrusleeping commented 7 years ago

@kevina from IRC: we can move towards making the NewAdder signature be NewAdder(context.Context, bs bstore.Blockstore, pin pinning.Pinner, ds dag.DAGService), which eliminates the need for the partial initialization stuff

kevina commented 7 years ago

@whyrusleeping f5fb3c9 basically does what you asked for.

kevina commented 7 years ago

For anyone who has been using my code, I just pushed some major changes to the filestore. Have a look at the README. There is also a sample script to add the contents of a directory to the filestore and keep the directory in sync with the filestore.

At this point I could really use some testers.

jefft0 commented 7 years ago

Hi Kevin. I've just finished preparing my big store of webcam files. I should be able to pull your new code and try it tomorrow.

kevina commented 7 years ago

@jefft0 thanks. Let me know how it goes. You should be able to add files while the server is online so that should make using it easier.

jefft0 commented 7 years ago

@kevina, I'm on Windows 8.1 because my webcam software (and attached storage) is for Windows. I was able to "make install" in Cygwin. I tried the following in Cygwin:

ipfs filestore add -P -r /cygdrive/f/cameras/2016

It says "The system cannot find the path specified." I also tried the following in a Windows command prompt:

ipfs filestore add -P -r F:\cameras\2016

It says "File path must be absolute." Thinking that maybe it doesn't like Windows drive letters, I changed to the F drive and just tried:

ipfs filestore add -P -r \cameras\2016

with the same result "File path must be absolute."

Sorry to stick you with Windows problems, but this is the system where I can do serious stress testing. Any thoughts?

kevina commented 7 years ago

@jefft0 try dropping the -P; that is, use ipfs filestore add -r /cygdrive/f/cameras/2016 or try ipfs filestore add -r F:\cameras\2016. If that does not work, let's take this to an issue on my fork: https://github.com/ipfs-filestore/go-ipfs/issues

jefft0 commented 7 years ago

I tried that. Same problem. I opened an issue https://github.com/ipfs-filestore/go-ipfs/issues/6 . (I copied the webcam files to the C drive to remove doubts about using the external F drive.)

jefft0 commented 7 years ago

A UI issue: I did ipfs filestore add -r c:\public\cameras on a video collection of about 12 TB. The progress bar only counted 1.32 TB. Indeed, after about a day, the progress bar shows progress over 100%:

 1.45 TB / 1.34 TB [==============================] 108.61% -59m6s

So I think it is processing all the files, but something is wrong with how it computes the initial total. I tried it on other large collections of different sizes over 2 TB, and each time the total shows 1.34 TB.

jefft0 commented 7 years ago

Let's move discussion of the progress bar total to https://github.com/ipfs-filestore/go-ipfs/issues/8 .

kevina commented 7 years ago

@whyrusleeping, I think I figured out how to get rid of the DataPtr that you didn't like, but I am not 100% sure I like the changes. I pushed the changes (see https://github.com/ipfs-filestore/go-ipfs/compare/d3bf502...ipfs-filestore:no-altdata) to the ipfs-filestore:no-altdata branch and have not pushed them to this pull request yet.

The filestore requires an alternative encoding of the protocol buffer that doesn't contain the Unixfs Data field (i.e., the data from the file). Previously the DataPtr was used to build this alternative encoding. In the latest change I basically don't worry about it and instead just re-encode the protocol buffer and remove the Data field just before sending it to the filestore. Now the only thing stored in the merkledag.Node is the PosInfo; all the code related to DataPtr and AltData is gone.

I originally avoided this approach because I thought it would be slow. It might still be a little slower, but not noticeably so in my very informal tests.

Let me know what you think. Also, if you are okay with storing PosInfo in blocks.Block, I can eliminate the hooks in merkledag/merkledag.go. I will instead move all the logic to my alternative blockstore.

Thanks

whyrusleeping commented 7 years ago

@kevina That looks much cleaner. I think that moving the PosInfo into the blocks is a good idea, and I doubt that we will see a perf hit from doing things this way (especially since we're comparing perf relative to disk writes and syncs).

Would you like to schedule a group hangout sometime to talk more about this PR? We've had a lot of increased interest lately in this feature from a few different groups. I think it makes sense for us to catch up and get a good plan forward on getting this in :)

cc @jbenet @lgierth @Kubuxu

kevina commented 7 years ago

@whyrusleeping, okay, I pushed the changes from no-altdata onto this pull request. I also moved PosInfo into the blockstore and eliminated the hooks in merkledag/merkledag.go.

As we discussed over IRC I will write up some sort of document on the major issues that need to be resolved before I am comfortable merging this in and we can go from there.

kevina commented 7 years ago

@whyrusleeping I posted a draft document here: https://github.com/ipfs-filestore/go-ipfs/wiki/Landing. Please look it over and let me know what to do with it. In particular, I started to discuss the first issue, "Need Some Sort of Support For Multiple Datastores", in #2747. Should we keep the discussion there (I can update the description) or is there a better place? Thanks.

kevina commented 7 years ago

@whyrusleeping is there anything I can do at this point to move this forward?

whyrusleeping commented 7 years ago

@kevina I need to get some more review in on this. In the meantime, if you want to start hacking on the 'best effort' pins, that would be helpful.

jefft0 commented 7 years ago

Stress test results: On my Windows 8.1 desktop, I did ipfs filestore add -r on a video collection of 12.5 TB, for a total processing time of 115 hours. The sizes of my .ipfs folders are:

blocks       175 MB
datastore    916 KB
filestore-db 9.66 GB (5,333 ldb files)

RichardLitt commented 7 years ago

@kevina Thanks for being patient - just trying to make this conform to the rest of the codebase. :)

kevina commented 7 years ago

@whyrusleeping @Kubuxu since it looks like we are moving towards using base32 keys, I am thinking about killing the datastore level in my filestore and just directly implementing a blockstore, as I am not sure there is any benefit to implementing it as a datastore. The "datastore" I implement is special-cased for IPFS nodes anyway and will serve little use as a general-purpose datastore, so what's the point? Thoughts?

kevina commented 7 years ago

The merging of #2903 will create a non-backwards-compatible change to the filestore once I merge with master. I am undecided on how to handle the upgrade.

@jefft0 (and anyone else) are you using the filestore in a (semi) production environment or just for testing? The easiest path would be to require that all files simply be re-added, although with a little effort I could do something better that won't require rereading the file contents in most cases. Thanks.

jefft0 commented 7 years ago

@kevina, don't let me slow you down. I realize we're at a testing stage. I can take another 115 hours and re-add the files.

kevina commented 7 years ago

@whyrusleeping it turns out that @Kubuxu's bloom filter significantly complicates things. In order to even merge master I need to implement some form of multi-datastore support (or just disable the bloom filter in my branch as a temporary hack). The current hacks I have in there now won't cut it, as the bloom filter will not recognize the content in the filestore.

I realize that everyone is busy, but I would like to see some forward progress in working through the issues outlined at: https://github.com/ipfs-filestore/go-ipfs/wiki/Landing. If it will help move things forward we could do a group hangout with interested parties. Note that for me it will need to be voice only.

I am very tempted just to implement my plan for Multi-datastore support just so that I can merge with master, but I am hesitant to do so as I already made a number of design choices without really getting support from other core IPFS community members.

jefft0 commented 7 years ago

> Are you using the filestore in a (semi) production environment or just for testing? The easiest path would be to require that all files are simply readded, although with a little effort I could do something better that won't require rereading the file contents in most cases.

@kevina, When do you plan to merge the change to the filestore format?

kevina commented 7 years ago

@jefft0 right now I am waiting for feedback from @whyrusleeping on my last comment (concerning the bloom filter). If enough time passes without any feedback I could hack something together by disabling the bloom filter, but I'd rather not.

whyrusleeping commented 7 years ago

@kevina sorry for taking so long to get back to you. It has been a very hectic three weeks and I'm just now getting back into the swing of things on GitHub.

I think you are probably good to start implementing the 'multi datastore' idea. Take a look at the tiered datastore we have already. It's not quite what you want, but it might be a good model to start with.

The abstractions should be set up such that any layering of blockstores are able to correctly handle the blocks being stored.

Responding to some of your points in your filestore proposal doc:

kevina commented 7 years ago

@jefft0 I merged the changes in the filestore format and created a way to upgrade the filestore. The upgrade code turned out to be really simple.

First upgrade the repo (it can be triggered automatically if you use ipfs daemon). After the repo is upgraded, run ipfs filestore upgrade to fix the keys in the filestore that are in the original format. It is best to do this with the daemon shut down (it could in theory work with the daemon running, but this has not been tested).

If a key is mangled it will need to read the contents from the file to rehash. If it is unable to do so for some reason it will output a warning and use the mangled hash. You can inspect the mangled hash with ipfs filestore ls, and it can be manually removed using ipfs filestore rm. (ipfs clean ... will also likely remove it.)

If the process is interrupted, don't worry, just run it again and it will pick up from where it left off.

If you have any problems, please file an issue at https://github.com/ipfs-filestore/go-ipfs/issues.

kevina commented 7 years ago

Also note that I have not yet fixed the filestore to work with the Bloom filters. This should not be a problem though as the Bloom filters are now disabled by default.

jefft0 commented 7 years ago

ipfs filestore upgrade has been running for a day now. Is there any way to find out the progress?

kevina commented 7 years ago

@jefft0 I am really surprised it is taking that long; sorry, I didn't think to include some sort of progress indicator. You could try stopping it and then doing an ipfs filestore ls. If an entry is not converted, the key will start with '???'. You can then restart it and it should pick up from where it left off.

jefft0 commented 7 years ago

@kevina Checking for ??? is good enough. I stopped ipfs filestore upgrade and ipfs filestore ls showed only 2% remaining with ???. I restarted upgrade and it kept running, so I stopped it again and ls shows all converted with no ???. But when I start upgrade again, it keeps running anyway. I will stop it and use the filestore as normal to see if I come across any problems.

kevina commented 7 years ago

Sorry, ipfs filestore upgrade is very simple. It will always run; it just won't do anything if there is nothing to upgrade. As long as you have no '???' you should be okay.

jefft0 commented 7 years ago

@kevina, should I be testing with these latest commits, or wait for more commits?

kevina commented 7 years ago

@jefft0 I would wait, some more commits are pending

kevina commented 7 years ago

@whyrusleeping Thanks for your feedback. The filestore is now in a state where it has sensible semantics with regards to the rest of IPFS.

I decided to go with a multi-blockstore but to keep the filestore as a datastore.

Many of the details are in the commit logs, but I will spell them out in greater detail this week. Once the core developers are in agreement on the semantics, I will start to rebase and turn this into a nice set of commits and factor out what I can into separate pull requests.

@jefft0 (and others) now would be a good time for some more testing. There are still some rough edges, but things should be usable. The core functionality is the same, but a lot of the details have changed, so you might want to have another look at the README. (There are no changes to the repo.)

One thing of note: I no longer pin objects in the filestore. If you added the files using "-r", the directory object got pinned and hence everything under it is indirectly pinned. This will cause problems if you ever need to remove files, so if you don't care about the directory object you might want to unpin it.

jefft0 commented 7 years ago

> if you don't care about the directory object you might want to unpin it.

What are the best commands to find pinned directories and to unpin them?

kevina commented 7 years ago

@jefft0 Something like:

ipfs pin ls --type=recursive -q # to find pins
ipfs ls <HASH>                  # to discover the contents
ipfs pin rm -r <HASH>           # to remove the pins

Hopefully you don't have too many pins. If you do, you could do some basic scripting to dump the contents of all the pins in a readable fashion. For example, using the Unix sh I would do:

for hash in `ipfs pin ls -q --type=recursive`; do echo "*** $hash"; ipfs ls $hash; done

If any pins are a directory, it will show something like:

*** QmWczekSRtsGdqJGASEgb2WPMApFUY9Epa7HLhzfY3vNnP
QmQUZzkwP6aLEyuhZbzkNznfwhV59e4RmjN7PbH89Fc5pS 3316736  12Dicts/
QmSbBy1izdeC43KRBPK5jEiVRshS9GboLWhL44F2YVMYgY 15998763 AGID/
QmccKNwQc6VVxkCPVtM5hEZWbYt2dRdyspKAw4T7xLg7tP 8072754  Alt12Dicts/
...

jefft0 commented 7 years ago

Thanks for the help. I unpinned my files and directories. It makes sense that filestore blocks are not pinned. Is there a way to pin the directory that contains them (maybe recursively) without pinning the filestore files it contains?

kevina commented 7 years ago

@jefft0 you can use a direct pin for each directory, but it is not recursive, so it will involve a lot of pins. Anything under the mfs (see "ipfs files") will now be pinned on a best-effort basis, so you could copy the directory there using "ipfs files cp". Currently there is no way to directly create a best-effort pin.

kevina commented 7 years ago

@whyrusleeping @jbenet I have updated the description with an outline of how I see this pull request getting merged. We can start with #3119.

kevina commented 7 years ago

All: Now would be a good time for another round of testing. There is a new script, "add-dir", to aid in keeping a directory in sync.

kevina commented 7 years ago

Merged with master. This was a major PITA. I did it in pieces so I could be sure I handled all the conflicts correctly. There was one breaking change (see #3253), and the switch to CidV0 took some work to fix. @whyrusleeping it would be really nice if we could get the core functionality in ASAP.

kevina commented 7 years ago

All: Sometime within the next week I am likely to start redoing the commits. In the process I am likely to remove any backwards-compatibility bits in the code. If you are using this code, you should run filestore upgrade to make sure it will still work with the new version. Old versions of the code will still be available in the master branch of my fork.

P.S. @whyrusleeping @jbenet @Kubuxu (and anyone else): if any of you have tips on how to go about redoing this massive set of 100+ commits, let me know. I would say around 50% (or fewer) of these commits are still relevant. Some of the commits are now merged upstream, but unfortunately not in a way that git cherry will be of any use. Many of the others have effectively been reverted by later commits.

Interactive rebasing is likely to be conflict-ridden for many reasons, one of them being the many changes in the code base since May. The only thing I can think of is to work with the final diff and work backwards to pick out the commits that are needed for the end result. I hope we can agree that squashing is not something that should be done for such a major change. :smile:

If no one has any ideas, that's fine. I'm sure I will think of something. :smile:

kevina commented 7 years ago

@whyrusleeping (and others), I have separated out the important non-filestore bits into separate pull requests. Note that many of them will conflict with each other, but I can easily rebase the unmerged ones on master to fix the conflicts.

kevina commented 7 years ago

Note: to see what changes remain after all the outstanding pull requests are merged:

git remote add filestore https://github.com/ipfs-filestore/go-ipfs.git
git fetch filestore
git diff --stat 1872685..f4785481 -- . ':!*filestore*' ':!*t026*.sh'

This will also exclude any new files created. The results should be:

 blocks/blocks.go                   | 14 +++++++++++++-
 cmd/ipfs/ipfs.go                   |  2 ++
 commands/cli/cmd_suggestion.go     |  4 ++++
 commands/cli/parse.go              | 49 +++++++++++++++++++++++++++----------------------
 commands/files/file.go             | 11 +++++++++++
 core/builder.go                    | 12 +++++++++++-
 core/commands/add.go               | 22 +++++++++++++++++++++-
 core/commands/block.go             | 60 ++++++++++++++++++++++++++++++++----------------------------
 core/commands/root.go              |  1 +
 core/coreunix/add.go               | 26 +++++++++++++++++++++-----
 importer/balanced/builder.go       | 12 +++++++++---
 importer/chunk/rabin.go            | 10 ++++++++--
 importer/chunk/splitting.go        |  5 +++++
 importer/helpers/dagbuilder.go     | 43 ++++++++++++++++++++++++++++++++++++-------
 importer/helpers/helpers.go        | 17 +++++++++++++++--
 merkledag/node.go                  | 11 +++++++++++
 repo/config/config.go              |  1 +
 repo/config/datastore.go           |  5 +++++
 repo/fsrepo/defaultds.go           | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 test/sharness/t0235-cli-request.sh |  2 +-
 20 files changed, 280 insertions(+), 73 deletions(-)