Closed: Stebalien closed this pull request 5 years ago
This is beautiful, thank you. :heart:
Reading through this and thinking a bit, it seems like the filestore/urlstore logic should be self-contained entirely in a special chunker. The fact that it's leaking into the dagbuilderhelper importer stuff appears to be unnecessary.
Cleaning that up might make it easier to integrate both the filestore and the urlstore in the same codebase
-- more thinking --
For this to work, the importer would need to be changed so that it accepts a chunker with an interface that gives it just a cid and a length (the chunker would become responsible for putting the blocks to the datastore). Switching things up might actually make it easier for us to optimize the importing process by adding batched writes of the file being chunked in the chunker itself.
attempting to bring this up to date with master to potentially resolve some issues that IA is seeing (https://github.com/protocol/collab-internet-archive/issues/13)
most of the conflicts seem to be caused (or made worse) by unstable import sort order, is this not something gofmt takes care of for us?
I've rebased it.
Tests are failing, I didn't have time to check why.
not finding `gx/ipfs/QmWo8jYc19ppG7YoTsrr2kEtLRbARTJho5oNXFTR6B7Peq/go-ipfs-chunker`, can I get push access to the repo so I can diagnose without PRing from my fork?
side note: this is probably a (long) separate discussion, but the goimports behavior with gx seems to introduce a nontrivial overhead with constantly reordering imports based on the hash, possible to patch it to sort based on alias instead?
for example, this rebase could be done automatically (`--theirs` semantics) if not for non-deterministic sorting
still don't have push access, unbroken here: https://github.com/ipfs/go-ipfs/pull/5096
Sorry that I've missed it.
hi guys, in order to ensure readiness for the demo at DWeb summit (7/30-ish), we need to get this merged
the scenario I am particularly concerned about is breaking changes introduced at the last minute in master that would require panicked rebasing, a sure recipe for demofail (precedent: the rebase in https://github.com/ipfs/go-ipfs/pull/4896#issuecomment-394773004 was required because accessing unixfs objects broke, thus breaking serving the static frontend of the IA viewer)
by @bigs's estimate there's about an engineer-week of work to get this in good shape (probably less for minimum viable shape)
can we get some eng time allocated to this? 100% understand that go-ipfs team is stretched thin already, but this is a relatively small linchpin that holds together a lot of other work and not getting it done would jeopardize an important opportunity to showcase IPFS to the world
@Kubuxu @whyrusleeping
Could I get some background on what led to the PR and why it is needed?
If I understand correctly, this allows storing a URL in place of an actual object. When it comes time to retrieve the object, the URL is fetched instead. I can see how this can cause all sorts of problems, not the least of which is that URLs can be very unstable and there is no guarantee that the content will stay around. Retrieving the content could also be very slow.
> If I understand correctly, this allows storing a URL in place of an actual object. When it comes time to retrieve the object, the URL is fetched instead. I can see how this can cause all sorts of problems, not the least of which is that URLs can be very unstable and there is no guarantee that the content will stay around. Retrieving the content could also be very slow.
This is designed for cases where the user controls both the server and the IPFS node. Basically, it's treating a remote HTTP server as a filesystem.
Specifically, the internet archive needs this so they can serve their files over IPFS without moving their entire infrastructure over to IPFS.
@Stebalien if this is something we decide we want to get in I will be happy to take it over if it will help. It seems simple enough.
yep, this is not intended to be used with arbitrary remote URLs -- one of the merge readiness factors is probably to document that
the idea is to provide an on-demand bridge for lazily serving objects out of existing collections behind HTTP interfaces, like archives, libraries, museums, etc
it's theoretically possible to make this work with WebDAV/FUSE, but sparse accesses into a 20PB+ collection (in the case of IA) are not well served by that approach
@parkan does the current code work for you or did it at one point?
I am also curious whether you use plain http or https. If I was doing this I would not bother with https, as that just adds unnecessary overhead, especially if this is on a local network.
@kevina the current code works for me right now, the issue is getting the code quality (testing, documentation, etc) up to the point of being acceptable for merge
let me see about https -- agree with your concern but given that we have one primary user (IA) driving this right now it depends largely on how it's set up for them
Okay, but it's also not a big deal, as we just use the Go http library and let it handle the details.
@kevina thank you for taking up the initiative on this! 💯
@whyrusleeping @Kubuxu okay if I take this over? If no one objects in the next 12-24 hours I am going to start by rebasing it.
@kevina Yeah, go ahead. Let's get it rebased, add some docs, and do some mild cleanup to make things a bit more modular.
Rebased. This uses a gx dep of a branch in go-ipfs-posinfo: https://github.com/ipfs/go-ipfs-posinfo/tree/feat/ai-mirror
@whyrusleeping
> Cleaning that up might make it easier to integrate both the filestore and the urlstore in the same codebase
Actually right now the filestore and urlstore are essentially using the same code base.
By doing a prefix check I was able to minimize the changes required.
Since this is basically implemented using the filestore, I am thinking we should just add a `--url` flag to `ipfs add` to avoid the special command, but if we are under time pressure that can come later.
This needs sharness tests, but I am unsure how to proceed. To test it we need a simple web server that supports the `Range` header. Any suggestions?
Hmm, we are using some python here (in gencmdref) but unfortunately `SimpleHTTPServer` doesn't support `Range`.
The gateway server (based on net/http) seems to support it though (`Accept-Ranges: bytes`), so maybe worth writing a simple handler?
re: command syntax, are you imagining using `--url` in place of `--nocopy`?
@parkan, it won't replace `--nocopy`. `--nocopy` will be used when adding files and `--url` will be used when providing a url.
I can probably write an http server in go fairly quickly. I am asking in case we already have something in place. cc: @whyrusleeping @Stebalien
@kevina A `--url` flag sounds fine. However, for your consideration...
My ideal solution would be to continue using `--nocopy` but allow URL arguments. That way, `ipfs add https://ipfs.io` would mirror the ipfs homepage in IPFS and `ipfs add -r https://ipfs.io` would mirror the entire site. When `--nocopy` is supplied, we'd just use the filestore instead of adding the blocks to the datastore. For now, we could just return an error when a user passes a URL without specifying `--nocopy`.
However, I'm not sure how well the commands lib will deal with "file or URL parameters". Thoughts?
@Stebalien a url has no structure, we would have to reimplement a web crawler for a recursive add to work. Or am I missing something?
> a url has no structure, we would have to reimplement a web crawler for a recursive add to work. Or am I missing something?
You're not missing anything. Recursive adds of URLs would be a distant future feature. However, it would make a lot of users very happy.
Added test cases. @Stebalien, is this about what you had in mind from our IRC conversation?
> However, I'm not sure how well the commands lib will deal with "file or URL parameters". Thoughts?
Considering that @parkan wants to try to get this in ASAP, I think for this PR I will see if I can make the `--url` flag work quickly (and if not, keep the current special command) and save the rest for later.
Actually, adding a `--url` flag to `add` will be involved, since the add command expects file arguments. Thus a rework of the commands library, or at least the `add` command, will be needed. One option is to make `url` a subcommand of add, that is `ipfs add url <URL> ...`. This will allow the command to recognize the same flags, but it will still require some work.
Also, it seems the Blockservice has no way to return errors if it can't retrieve a block. This is causing one of my tests to fail and I am not sure how best to fix it.
> One option is to make url a subcommand of add. That is `ipfs add url <URL> ...`. This will allow the command to recognize the same flags, but this will still require some work.
That would break:
> touch url
> ipfs add url
We could add a separate `add-url` command but we might just want to add an `ipfs filestore add` command. We could also rename `filestore` to something more generic, but that may not be worth it.

For now, let's just add an `ipfs filestore add` command, fine if it only supports URLs for now. Later we can rename (alias) filestore to some better name. Augmenting `ipfs add` to support URLs will be really nice to have later.
> That would break:
I didn't think of that.
> We could add a separate add-url command but we might just want to add an ipfs filestore add command.
Any separate command to add files will require some refactoring of the add command, both to avoid duplicating too much code and to recognize the same flags. Since @parkan wants this in quickly, I am thinking we should just stick with what we have now (a very simple `ipfs urlstore add`), label the command as temporary, and explain that it is likely to go away or have its semantics change once a better solution is found.
A proper refactoring of the add command will likely take me a few days, but then it needs to be properly reviewed, which could take weeks unless we make this a high priority and I can get rapid feedback.
@whyrusleeping I was about to ask you the same thing. This should be ready for review then. The only concern is the failing test, but I am not sure how to fix that right now, as it is a problem with the blockservice.
@kevina what exactly is the problem with the blockservice?
@whyrusleeping it does not seem to return any sort of error when some blocks cannot be found. This test (https://github.com/ipfs/go-ipfs/pull/4896/files#diff-d911cc83ed0b333bf4b050510b334ba3R102) is failing when it should not. I have verified that the blocks are not available and yet the client is not reporting an error:
test_expect_failure "files can not be retrieved via the urlstore" '
test_must_fail ipfs get $HASH1 &&
test_must_fail ipfs get $HASH2
'
Interestingly, it also seems not to write any output; it just returns success without writing anything. I have not been able to track down the exact cause, but one problem is that the interface to the blockservice does not provide any way to return errors when blocks can't be found.
@kevina does it still fail that way if you use `ipfs cat`?
@whyrusleeping
> @kevina does it still fail that way if you use ipfs cat?
No, it seems `ipfs cat` does fail as it should. So somewhere some error code is getting dropped, but I have not figured out where yet. Any ideas?
@kevina I suspect you found a bug in `ipfs get`. Likely in how the unixfs/archive code handles batch get failures. I'd say just use cat for now, and let's look into the get issue separately.
nice, @kevina! will review this first thing tmw
@whyrusleeping can you sign off?
@whyrusleeping should I rebase this now or wait until 0.4.16 lands (to avoid having to rebase it a second time)?
@kevina I'd just wait until 0.4.16 lands. I don't think anything else is going to land between then and now, but just in case.
merging first thing after 0.4.16 sgtm -- ideally this would be before the dev days (planning to demo there and would prefer to do it from master)
thanks again @kevina!
@parkan you're welcome. I will have this rebased within a day after 0.4.16 lands. Whether that is before the dev conf, well, that depends on @whyrusleeping.
@whyrusleeping @parkan just rebased now that it looks like we finally got a release out.
This is an experimental URL store feature that needs some discussion/design work before we even consider merging it. I'm opening this PR at @diasdavid's request so we don't lose track of it.