internetarchive / dweb-transports

GNU Affero General Public License v3.0
Add DAT #6

Open mitra42 opened 6 years ago

mitra42 commented 6 years ago

We should support the DAT protocol in dweb-transports; this should be relatively straightforward.

Note, this won't (currently) work in the browser due to WebRTC issues, but should work in Node (e.g. in dweb-mirror).

See DAT meta: https://github.com/mitra42/dweb-universal/issues/1

RangerMauve commented 5 years ago

Would you like help with this effort? I've been doing a lot of work in getting Dat to run in browsers.

WebRTC isn't going to be a viable way forward since there aren't any hybrid clients that support it. You're pretty much stuck with relying on using a gateway of some sort.

RangerMauve commented 5 years ago

I've got a bunch of transports working in my webrun repo which allows me to load JS modules out of IPFS, Dat, and soon WebTorrent. Might be relevant to your efforts. :)

mitra42 commented 5 years ago

Coincidentally I was just emailing Karissa about the Dat interface.

Can you say more about the WebRTC issue? I thought (but could easily be wrong) that DAT used WebRTC, and we already use WebRTC for webtorrent, so I'm not sure where the issue is.

RangerMauve commented 5 years ago

So, I don't know if WebRTC performance has gotten better, but it's still not that great if you have a lot of peer connections. Dat could replicate over DataChannels just fine for what it's worth and it wouldn't be too bad if you limited the number of peers.

The main problem is that there's no discovery mechanism that would work with both the browser and the rest of the Dat ecosystem. You'd need to do what WebTorrent did and create hybrid clients.

Also, Dat doesn't have anything similar to HTTP trackers and seeds so discovering peers and content would require additional work.

I did a bunch of work with my fork of dat-gateway and dat-archive-web, but it's still pretty rough.

@jimpick has done a bunch with regards to getting Dat to work in the browser so maybe he can give more insight

mitra42 commented 5 years ago

Yes - I think that was the problem in IPFS's use of WebRTC: it didn't distinguish between browser and node, and it uses a DHT that relies on a lot of connections (rather than what I think is needed, which is a few connections from low-bandwidth/power nodes to well-connected nodes that can open lots of connections).

RangerMauve commented 5 years ago

Dat is moving to a new networking layer called hyperswarm soon and that's also DHT based. So I'm not sure what to do for browsers with relation to that. Maybe @mafintosh or @pfrazee would have better ideas, but I don't think browser support is really on the agenda at the moment.

Gateways seem like the most surefire way to get things to work.

okdistribute commented 5 years ago

Hey, thanks @RangerMauve for the input! You're right, WebRTC isn't the best, but for the use case here I think it'll be just as good as any other of these transports (as @mitra42 pointed out, IPFS runs up against the same issue), which seems good enough for now. So it's worth giving it a shot.

The transports also support a websocket gateway connection, which also is supported in dat-daemon and there's a roll-your-own example using discovery-swarm and websocket-stream here: https://github.com/jimpick/dat-shopping-list

okdistribute commented 5 years ago

I recently stubbed it out, but I am not sure that I really have time this week to take a hard look at it. If anyone else wants to get involved, please do. https://github.com/datproject/dat-js/pull/13

RangerMauve commented 5 years ago

Yeah, I'm interested in helping out Wednesday some time after 18:00 EST (GMT -5).

Just to note, I've got the DatArchive API from Beaker working in the browser using dat-archive-web and a gateway server. It's limited in that it needs to work with a Dat archive rather than just a hypercore, and you end up sharing all your data with the gateway.

I've also done some work on discovery-swarm-stream which implements a subset of the discovery-swarm module that works over websockets to a gateway. This is more experimental, though.

RangerMauve commented 5 years ago

I see that you have a GUN superpeer running on dweb.me, would we be able to do the same thing for Dat with discovery-swarm-stream?

okdistribute commented 5 years ago

@RangerMauve yeah I think that's the idea.

RangerMauve commented 5 years ago

So, should we focus on getting dat-js working with the gateway, or would it make sense to get right into making the transport work for dweb-transports?

RangerMauve commented 5 years ago

My reasoning is that focusing on the transport would be faster since it's more constrained, though getting it to work with dat-js first would be better for the ecosystem.

mitra42 commented 5 years ago

Sorry for the slow response, I’m on semi-vacation till tomorrow.

We have a Gun super-peer, but the browsers talk Gun to it, then it talks directly to the Archive’s servers. This way we can map the archive items into the Gun namespace. I think that is the right approach for Dat, especially since it reduces the single point of failure, but I don’t know enough about current Dat limitations to be sure.

RangerMauve commented 5 years ago

In the dat ecosystem dat-gateway uses a thing called dat-librarian (by @garbados) to keep track of a set of archives. We could adapt that to work with whatever method the internet archive uses for storing these things. Basically, you just need a server speaking websockets, then a browser connects to a WS URL with the dat public key or DNS domain, then they both start a replication stream for the archive. The server will also reach out to the P2P network to find peers that have the archive in order to get updates.

The WS code handling that is here. There's an example of using a hyperdrive (in the browser with browserify) to load data in the README
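
As a rough illustration of the "public key or DNS domain" step described above, here is a minimal sketch of how a gateway might classify the identifier in an incoming websocket path before starting replication. The path layout and the function name `classify_archive_id` are assumptions made for illustration, not dat-gateway's actual routing; the only fact relied on is that a Dat public key is 32 bytes, i.e. 64 hex characters.

```python
import re
from typing import Tuple

# A Dat public key is 32 bytes, written as 64 hex characters.
_KEY_RE = re.compile(r"^[0-9a-f]{64}$")
# Deliberately simple hostname check; a real gateway would be stricter.
_DOMAIN_RE = re.compile(
    r"^(?=.{1,253}$)([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}$"
)


def classify_archive_id(path: str) -> Tuple[str, str]:
    """Return ("key", id) or ("dns", id) for the first path segment.

    Hypothetical helper: assumes the archive identifier is the first
    segment of the websocket URL path, e.g. /<key> or /<domain>.
    """
    ident = path.strip("/").split("/", 1)[0].lower()
    if _KEY_RE.match(ident):
        return ("key", ident)
    if _DOMAIN_RE.match(ident):
        return ("dns", ident)  # would still need resolving to a key
    raise ValueError(f"not a Dat key or DNS name: {ident!r}")
```

Once the identifier is known, the server side would look up (or start tracking) that archive and pipe the websocket into a replication stream, as described in the comment above.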

RangerMauve commented 5 years ago

K, looking into this now.

I'm going to:

After this is working and I've got some reviews, we should get started on the transport and get an instance of discovery-swarm-stream running on the Internet Archive's servers.

RangerMauve commented 5 years ago

Actually discovery-swarm-stream might be overkill. 😅 I'll use dat-gateway instead.

okdistribute commented 5 years ago

Hey @RangerMauve ! I think it would be nice to separate the dat-js code from the server code.

okdistribute commented 5 years ago

Thanks for doing this work, would be really slick to have something simple working with dat-js!

mitra42 commented 5 years ago

Karissa - I'm trying to relate what @RangerMauve is saying with our conversation earlier, could you do the mapping :-)

okdistribute commented 5 years ago

Yeah, will do @mitra42 !

RangerMauve commented 5 years ago

Roger on not including the server in dat-js.

It's actually super simple with dat-gateway. Just opening a websocket inside _createWebsocket to the gateway and piping it into a replication stream.

Gonna get my public gateway running again tonight to test it out.

okdistribute commented 5 years ago

awesome thanks! excited to see this work being done. It seems like in dat-js we would like to have both webrtc and websocket discovery methods -- to the user of dat-js, they can give a url and perhaps a list of peer introduction ip addresses (websockets or webrtc), and get back the data.

We can think about how to make it easier to deploy the right kind of server software that would be needed as a separate repository that dat-js links to. I like the idea of building off of existing stuff, with dat-gateway and dat-librarian.

RangerMauve commented 5 years ago

Sounds good. Right now I'm assuming there's a single server URL. Were you thinking of passing in a list of WS servers in the future?

Also, you don't necessarily need all of dat-gateway to get the ws replication going. Just dat-librarian or something similar with a way to route websockets would work.

okdistribute commented 5 years ago

Great to hear! I guess webrtc takes a list of signalhubs, but we can just start with one websocket. I think in the future it might be nice to be able to provide fallbacks in case one is unreachable, but that is out of scope at this point. It could also be done in application code.

mitra42 commented 5 years ago

For the other transports, I separate it into two steps: a "connection" step, which would get passed one or more webrtc or wss addresses, and then separately the request, which is the URL of the file being requested.
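
To make the two-step split concrete, here is a minimal sketch. `SketchTransport` and its methods are hypothetical stand-ins written for this illustration; they are not the dweb-transports API, and nothing here touches the network.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchTransport:
    """Hypothetical transport used only to show connect-then-request."""

    name: str
    addresses: List[str] = field(default_factory=list)
    connected: bool = False

    def connect(self, addresses: List[str]) -> None:
        # Step 1: take one or more webrtc or wss addresses and get ready.
        if not addresses:
            raise ValueError("need at least one webrtc or wss address")
        self.addresses = list(addresses)
        self.connected = True

    def fetch(self, url: str) -> str:
        # Step 2: a request names only the resource; it relies on step 1.
        if not self.connected:
            raise RuntimeError("connect() must be called before fetch()")
        return f"{self.name} would fetch {url} via {self.addresses[0]}"
```

Usage would look like `t = SketchTransport("dat")`, then `t.connect(["wss://gateway.example"])`, then `t.fetch("dat://<key>/file")`. Note that the connect step here takes no archive key, which is where a per-archive protocol may not fit this shape cleanly.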

RangerMauve commented 5 years ago

That might be hard to work with depending on how the data you're trying to access is structured.

dat-js needs to have a dat archive key before it can reach out to the network since it needs that to establish connections. If the Internet Archive is storing all its content within a single dat archive, this shouldn't affect things much.

Regarding the websocket thing, what about having it act as a fallback for WebRTC the way WebTorrent has HTTP seeds? I propose that you attempt to find peers through WebRTC, and if you don't get them within a timeout, then you reach out to a gateway to get the data there.

That way you have less load on gateways since most of the data will be P2P and you also avoid having to do the WebRTC dance on the servers acting as seeds. Getting WebRTC to run in Node is a bit of a pain last I checked.

RangerMauve commented 5 years ago

I don't think that it's worth going all that way for the MVP though. 😅 This is more an idea for higher scalability in the future if this stuff is used more often.

mitra42 commented 5 years ago

FYI WebRTC in node is working fine for me. I don't quite understand what you are saying - the steps are to do that dance of connecting to WebRTC or WebSockets at the first step, which could take a key for the Archive's DAT if that's the right way to find the connections.

RangerMauve commented 5 years ago

That's good about the WebRTC. I haven't used it in node for a few years so I guess it's gotten a lot better. 😅

Looking up WebRTC peers and connecting to them is a bit more computationally expensive and harder to load balance compared to websockets.

Yeah, if you have the key when connecting it works well. It's just that having multiple keys requires setting up multiple connections. Like, in IPFS you bootstrap into the network and you're good to go, but with dat-js you essentially have a separate network per archive.

mitra42 commented 5 years ago

@RangerMauve - and that creates another reason why sharding at the item level is a fairly bad choice: every item will require a new set of connections. For example, in an ideal case you'd get thumbnails from DAT as well, which would mean connecting to approx 70 DATs just to display the Collection page, since their thumbnails are in the Items.
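
A back-of-the-envelope sketch of that cost, assuming (as discussed above) one separate network per archive key. The function name and the placeholder keys are made up for illustration; the figure of 70 comes from the comment above.

```python
from typing import Iterable


def swarms_needed(archive_keys: Iterable[str]) -> int:
    """One swarm (set of connections) per distinct archive key."""
    return len(set(archive_keys))


# Item-level sharding: each thumbnail lives in its own item's Dat.
per_item = [f"item-key-{i}" for i in range(70)]
# Single-archive layout: every thumbnail comes from the same Dat.
single = ["archive-key"] * 70

print(swarms_needed(per_item), swarms_needed(single))
```

Under this assumption the collection page needs 70 swarms with item-level sharding and 1 with a single archive.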