ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/
Other
15.83k stars 2.96k forks source link

URL store #4896

Closed Stebalien closed 5 years ago

Stebalien commented 6 years ago

This is an experimental URL store feature that needs some discussion/design work before we even consider merging it. I'm opening this PR at @diasdavid's request so we don't lose track of it.

kyledrake commented 6 years ago

This is beautiful, thank you. :heart:

whyrusleeping commented 6 years ago

Reading through this and thinking a bit, It seems like the filestore/urlstore logic should be self-contained entirely in a special chunker. The fact that its leaking into the dagbuilderhelper importer stuff appears to be unnecessary.

Cleaning that up might make it easier to integrate both the filestore and the urlstore in the same codebase

-- more thinking --

For this to work, the import would need to be changed to so that it accepts a chunker with an interface that gives it just a cid and a length (and changing the chunker to be responsible for putting the blocks to the datastore. Switching things up might actually make it easier for us to optimize the importing process by adding batched writes of the file being chunked in the chunker itself.

parkan commented 5 years ago

attempting to bring this up to date with master to potentially resolve some issues that IA is seeing (https://github.com/protocol/collab-internet-archive/issues/13)

most of the conflicts seem to be caused (or made worse by) unstable import sort order, is this not something gofmt takes care of for us?

Kubuxu commented 5 years ago

I've rebased it.

Kubuxu commented 5 years ago

Tests are failing, I didn't have time to check why.

parkan commented 5 years ago

not finding gx/ipfs/QmWo8jYc19ppG7YoTsrr2kEtLRbARTJho5oNXFTR6B7Peq/go-ipfs-chunker, can I get push access to the repo so I can diagnose without PRing from my fork?

parkan commented 5 years ago

side note: this is probably a (long) separate discussion, but the goimports behavior with gx seems to introduce a nontrivial overhead with constantly reordering imports based on the hash, possible to patch it to sort based on alias instead?

for example, this rebase could be done automatically (--theirs semantics) if not for non-deterministic sorting

Stebalien commented 5 years ago

side note: this is probably a (long) separate discussion, but the goimports behavior with gx seems to introduce a nontrivial overhead with constantly reordering imports based on the hash, possible to patch it to sort based on alias instead?

Yeah... https://github.com/ipfs/go-ipfs/issues/4831

parkan commented 5 years ago

still don't have push access, unbroken here: https://github.com/ipfs/go-ipfs/pull/5096

Kubuxu commented 5 years ago

Sorry that I've missed it.

parkan commented 5 years ago

hi guys, in order to ensure readiness for the demo at DWeb summit (7/30-ish), we need to get this merged

the scenario I am particularly concerned about is breaking changes introduced at the last minute in master that would require panicked rebasing, a sure recipe for demofail (precedent: the rebase in https://github.com/ipfs/go-ipfs/pull/4896#issuecomment-394773004 was required because accessing unixfs objects broke, thus breaking serving the static frontend of the IA viewer)

by @bigs's estimate there's about an engineer-week of work to get this in good shape (probably less for minimum viable shape)

can we get some eng time allocated to this? 100% understand that go-ipfs team is stretched thin already, but this is a relatively small linchpin that holds together a lot of other work and not getting it done would jeopardize an important opportunity to showcase IPFS to the world

@Kubuxu @whyrusleeping

kevina commented 5 years ago

Could I get some background on what lead to the p.r. and why is it needed?

If I understand what is going allows storing a url in place on an actual object. When it comes time to retrieve the object the URL is fetched instead. I can see how this can cause all sorts of problems, not the least of which is that URL can be very unstable and there is no guarantee that the content will stay around. Retrieving the content could also be very slow.

Stebalien commented 5 years ago

If I understand what is going allows storing a url in place on an actual object. When it comes time to retrieve the object the URL is fetched instead. I can see how this can cause all sorts of problems, not the least of which is that URL can be very unstable and there is no guarantee that the content will stay around. Retrieving the content could also be very slow.

This is designed for cases where the user controls both the server and the IPFS node. Basically, it's treating a remote HTTP server as a filesystem.

Specifically, the internet archive needs this so they can serve their files over IPFS without moving their entire infrastructure over to IPFS.

kevina commented 5 years ago

@Stebalien if this is something we decide we want to get in I will be happy to take it over if it will help. It seams simple enough.

parkan commented 5 years ago

yep, this is not intended to be used with arbitrary remote URLs -- one of the merge readiness factors is probably to document that

the idea is to provide an on-demand bridge for lazily serving objects out of existing collections behind HTTP interfaces, like archives, libraries, museums, etc

it's theoretically possible to make this work with WebDAV/FUSE, but sparse accesses into a 20PB+ collection (in the case of IA) are not well served by that approach

kevina commented 5 years ago

@parkan does the current code work for you or did it at one point?

I am also curies on if you use plain http or use https. If I was doing this I would not bother with https as that just adds unnecessary overhead, especially if this is on a local network.

parkan commented 5 years ago

@kevina the current code wfm right now, the issue is getting the code quality (testing, documentation, etc) up to the point of being acceptable for merge

let me see about https -- agree with your concern but given that we have one primary user (IA) driving this right now it depends largely on how it's set up for them

kevina commented 5 years ago

let me see about https -- agree with your concern but given that we have one primary user (IA) driving this right now it depends largely on how it's set up for them

Okay, but it also not a big deal as we just use the go http library and let it handle the details.

parkan commented 5 years ago

@kevina thank you for taking up the initiative on this! 💯

kevina commented 5 years ago

@whyrusleeping @Kubuxu okay if I take this over? If no one objects in the next 12-24 hours I am going to start by rebasing it.

whyrusleeping commented 5 years ago

@kevina Yeah, go ahead. let's get it rebased, and add some docs, and do some mild cleanup to make things a bit more modular.

kevina commented 5 years ago

Rebased. This uses a a gx-dep of a branch in go-ipfs-posinfo: https://github.com/ipfs/go-ipfs-posinfo/tree/feat/ai-mirror

kevina commented 5 years ago

@whyrusleeping

Cleaning that up might make it easier to integrate both the filestore and the urlstore in the same codebase

Actually right now the filestore and urlstore are essentially using the same code base.

kevina commented 5 years ago

By doing a prefix check I was able to minimize the changes required.

Since this is basically implemented using the filestore I am thinking we should just use add a --url flag to ipfs add to avoid the special command, but if we are under time pressure that can come later.

This needs sharness tests, but I am unsure how to proceed. To test it we need a simple web server that supports the Range header. Any suggestions?

parkan commented 5 years ago

Hmm, we are using some python here (in gencmdref) but unfortunately simplehttpserver doesn't support Range.

The gateway server (based on net/http) seems to support it though (Accept-Ranges: bytes), so maybe worth writing a simple handler?

parkan commented 5 years ago

re: command syntax, are you imagining using --url in place of --nocopy?

kevina commented 5 years ago

@parkan, it won't replace --nocopy. --nocopy will be used when adding files and --url will be used when providing a url.

I can probably write an http server in go fairly quickly. I am asking in case we already have something in place. cc: @whyrusleeping @Stebalien

Stebalien commented 5 years ago

@kevina A --url flag sounds fine. However, for your consideration...

My ideal solution would be to continue using --nocopy but allow URL arguments. That way, ipfs add https://ipfs.io would mirror the ipfs homepage in IFPS and ipfs add -r https://ipfs.io would mirror the entire site. When --nocopy is supplied, we'd just use the filestore instead of adding the blocks to the datastore. For now, we could just return an error when a user passes a URL without specifying --nocopy.

However, I'm not sure how well the commands lib will deal with "file or URL parameters". Thoughts?

kevina commented 5 years ago

@Stebalien a url has no structure, we would have to reimplemented a web crawler for a recursive add to work. Or am I missing something?

Stebalien commented 5 years ago

a url has no structure, we should have to reimplemented a web crawler for a recursive add to work. Or am I missing something?

You're not missing anything. Recursive adds of URLs would be a distant future feature. However, it would make a lot of users very happy.

kevina commented 5 years ago

Added test cases, @Stebalien is this about what you had in mind from out IRC conversation.

kevina commented 5 years ago

However, I'm not sure how well the commands lib will deal with "file or URL parameters". Thoughts?

Considering that @parkan wants to try to get this in ASAP I think, for this p.r. I will see if I can make the --url flag work quickly (and if not keep the current special command) and save the rest for latter.

kevina commented 5 years ago

Actually adding a --url to add will be involved since the add command expects file arguments. Thus a rework of the commands library or at least the add command will be needed. One option is to make url a sub command of add. That is ipfs add url <URL> .... This will allow the command to recognize the same flags, but this will still require some work.

kevina commented 5 years ago

Also it seams that the Blockservice has no way to return errors if it can't retrieve a block. This is causing one of my tests to fail and I am not sure how best to fix it.

Stebalien commented 5 years ago

One option is to make url a sub command of add. That is ipfs add url .... This will allow the command to recognize the same flags, but this will still require some work.

That would break:

> touch url
> ipfs add url

We could add a separate add-url command but we might just want to add a ipfs filestore add command. We could also rename filestore to something more generic but that may not be worth it.

whyrusleeping commented 5 years ago

For now, let's just add an ipfs filestore add command, fine if it only supports URLs for now. Later we can rename (alias) filestore to some better name. Augmenting ipfs add to support URLs will be really nice to have later.

kevina commented 5 years ago

That would break:

I didn't think of that.

We could add a separate add-url command but we might just want to add a ipfs filestore add command.

Any separate command to add files will require some refactoring of the add command in order to avoid duplicating too much code and to recognize the same flags. Since @parkan wants this in quickly I am thinking we should just stick with what we have now (a very simple ipfs urlstore add) and label the command as temporary and explain it is likely to go away or the semantics change once a better solution is found.

A proper refactoring of the add command will likely take me a few days, but them its needs to properly reviewed which will could take weeks unless we make this a high priority and I can get rapid feedback.

kevina commented 5 years ago

@whyrusleeping I was about to ask you the same thing. This should be ready for review then. The only concern is the failing test, but I am not sure how to fix that right now as it a problem with the blockservice.

whyrusleeping commented 5 years ago

@kevina what exactly is the problem with the blockservice?

kevina commented 5 years ago

@whyrusleeping it does not seam to return any sort errors when some blocks can not be found. This test (https://github.com/ipfs/go-ipfs/pull/4896/files#diff-d911cc83ed0b333bf4b050510b334ba3R102) is failing when it should not. I have verified that the blocks are not available and yet the client is not reporting an error:

test_expect_failure "files can not be retrieved via the urlstore" '
  test_must_fail ipfs get $HASH1 &&
  test_must_fail ipfs get $HASH2
'

Interesting it also seams to not write any output. Just returns success without writing anything. I have not been able to track down the exact cause, but one problem is the interface to the blockservice does not provide any way to return errors when blocks can't be found.

whyrusleeping commented 5 years ago

@kevina does it still fail that way if you use ipfs cat?

kevina commented 5 years ago

@whyrusleeping

@kevina does it still fail that way if you use ipfs cat?

No it seams ipfs cat fails. So somewhere some error code is getting dropped but I have not figured out where yet. Any ideas?

whyrusleeping commented 5 years ago

@kevina I suspect you found a bug in ipfs get. Likely in how the unixfs/archive code handles batch get failures. I'd say just use cat for now, and let's look into the get issue separately

parkan commented 5 years ago

nice, @kevina! will review this first thing tmw

parkan commented 5 years ago

@whyrusleeping can you sign off?

kevina commented 5 years ago

@whyrusleeping should I rebase this now or wait until 0.4.16 lands (to avoid having to rebase it a second time)

whyrusleeping commented 5 years ago

@kevina I'd just wait until 0.4.16 lands. I don't think anything else is going to land between then and now, but just in case.

parkan commented 5 years ago

merging first thing after 0.4.16 sgtm -- ideally this would be before the dev days (planning to demo there and would prefer to do it from master)

thanks again @kevina!

kevina commented 5 years ago

@parkan your welcome. I have this rebased within a day after 0.4.16 lands. If it is before the dev conf, well that depends on @whyrusleeping.

kevina commented 5 years ago

@whyrusleeping @parkan just rebased now that it looks like we finally got a release out.