restic / rest-server

Rest Server is a high performance HTTP server that implements restic's REST backend API.
BSD 2-Clause "Simplified" License

Broken deduplication with REST --private-repos #172

Closed Gaibhne closed 2 years ago

Gaibhne commented 2 years ago

Output of rest-server --version

[joerg@braemar ~]$ docker images restic/rest-server:latest -q
3598be715f32

However, I do not believe the above command shows anything applicable to my case; I built the rest-server image from a fresh repository checkout.

How did you run rest-server exactly?

Via docker-compose:

---

version: '3'

services:
  restserver:
    build: rest-server #contains this repository with no changes as of the day of posting this
    volumes:
      - /mnt/usb-01/restic:/data
    environment:
      OPTIONS: "--private-repos --path /data"
    ports:
      - "3000:8000"

The reason I build my own is because I run on Raspberry Pi 4 and there are no arm images on Docker hub.

What backend/server/service did you use to store the repository?

Regular USB HDD with btrfs.

Expected behavior

I would expect deduplication to happen when two users back up the same data to the repository.

Actual behavior

No deduplication happens. After the second user backs up their data, the root folder is twice the size.

Steps to reproduce the behavior

I followed the instructions from the repo to set up Restic REST with multi user support with no special steps or deviations.

Do you have any idea what may have caused this?

No. It looks like instead of having one repo with multiple users, I just have multiple repos as if I was running individual servers.

Do you have an idea how to solve the issue?

No

Did rest-server help you today? Did it make you happy in any way?

Cool idea, but I am still in the initial steps of figuring out how to set it up before ever actually using it, so I have yet to experience anything exciting :(

rawtaz commented 2 years ago

If you have, e.g., two users each backing up to their own repository, then there will naturally not be any deduplication between those two repositories. Each repository is its own "world", and the repositories have nothing to do with each other. Deduplication works within a repository. Can you verify that this is what you're doing (that is, each of your two users is backing up to a different repository URI)?

Gaibhne commented 2 years ago

I'm not sure how to tell whether it is one or two repositories. I wouldn't think there would be a reason for two users backing up to the same REST server to not be using the same repository? What would be the point of that? User A already cannot see snapshots from other users, so why would storage also be split up, breaking deduplication on top of that?

My test users used commands like so:

restic -r rest:http://userA:abcd@raspi:3000/userA/ stats
restic -r rest:http://userB:efgh@raspi:3000/userB/ stats

rawtaz commented 2 years ago

I think you are not entirely clear on what REST-server actually does or how it's designed. Basically speaking, it's just a REST frontend between your filesystem and restic.

So when you use the two repository URLs you wrote above, it's like you'd use two different folders in your filesystem, for the repositories - e.g. -r /data/userA/ and -r /data/userB. In this example the same thing would apply - they're two separate repositories and thereby they don't have any deduplication between each other. Same thing with REST-server - it's just adding a REST API between restic and your filesystem.

Put another way, REST-server just adds a transport mechanism so that your repositories can live in a remote filesystem instead of a local one. I'm simplifying a bit here, but trying to convey the situation in your case.
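As a rough illustration of the mapping described above, here is a minimal Python sketch. It is not rest-server's actual code; `DATA_ROOT`, `repo_folder` and `allowed` are hypothetical names, and the access rule is a simplified model of `--private-repos` (the first path segment has to match the authenticated user name).

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumption: mirrors the "--path /data" option from the compose file.
DATA_ROOT = PurePosixPath("/data")


def _split(repo_url):
    """Drop restic's "rest:" prefix and parse the remaining HTTP URL."""
    if repo_url.startswith("rest:"):
        repo_url = repo_url[len("rest:"):]
    return urlsplit(repo_url)


def repo_folder(repo_url):
    """Return the server-side folder a repository URL would address."""
    return DATA_ROOT / _split(repo_url).path.strip("/")


def allowed(repo_url):
    """Toy private-repos rule: first path segment must equal the user name."""
    parts = _split(repo_url)
    segments = [s for s in parts.path.split("/") if s]
    return bool(segments) and segments[0] == parts.username


url_a = "rest:http://userA:abcd@raspi:3000/userA/"
url_b = "rest:http://userB:efgh@raspi:3000/userB/"
print(repo_folder(url_a), repo_folder(url_b))
```

Under these assumptions the two URLs resolve to two different folders, which is why they behave as two independent repositories.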

> I'm not sure how to tell whether it is one or two repositories.

Given your repository URIs, they are two separate repositories. You'd only access the same repository if the URI were the same. You can also check the filesystem on the server: you'll see the userA and userB folders, and that they each have their own repository files in them.
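The filesystem check can also be scripted. The following is a small sketch, assuming the usual restic repository layout (a `config` file next to `data`, `keys` and `snapshots` directories); `find_repos` is a hypothetical helper, not part of restic or rest-server.

```python
from pathlib import Path


def find_repos(root):
    """Return every directory under root that looks like a restic repository."""
    repos = []
    for config in Path(root).rglob("config"):
        repo = config.parent
        # A repository keeps a "config" file beside these directories.
        if config.is_file() and all(
            (repo / name).is_dir() for name in ("data", "keys", "snapshots")
        ):
            repos.append(repo)
    return sorted(repos)
```

Calling `find_repos("/mnt/usb-01/restic")` on the server from this thread would be expected to list one entry per user folder, each a repository of its own.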

> I wouldn't think there would be a reason for two users backing up to the same REST server to not be using the same repository? What would be the point of that?

If you have multiple users, why would you set up multiple servers? It makes no sense. You don't set up multiple SSH servers just because you have multiple users accessing your system over SSH.

> User A already cannot see snapshots from other users

Because you are using separate repositories. If you want deduplication to apply to multiple users' backups, you need to store those users' backups in one and the same repository.

> why would storage be also split up and break deduplication on top of that?

Because deduplication happens within a repository. If you have multiple repositories, deduplication doesn't happen between those repositories. Try to think of how it works when you have multiple separate repositories in a local filesystem.
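To make that concrete, here is a toy model in Python of a content-addressed store that keeps one blob per distinct SHA-256 digest. It is only a sketch: `ToyRepo` and the sample data are made up for illustration, and restic's chunking and encryption are left out entirely.

```python
import hashlib


class ToyRepo:
    """Toy content-addressed store: one blob per distinct SHA-256 digest."""

    def __init__(self):
        self.blobs = {}

    def backup(self, chunks):
        for chunk in chunks:
            blob_id = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this repository has not seen it before.
            self.blobs.setdefault(blob_id, chunk)

    def size(self):
        return sum(len(blob) for blob in self.blobs.values())


shared_data = [b"kernel" * 100, b"office" * 100]  # data both users have
user_a = shared_data + [b"photos of A" * 10]
user_b = shared_data + [b"photos of B" * 10]

# Two repositories, as with /userA/ and /userB/: shared data is stored twice.
repo_a, repo_b = ToyRepo(), ToyRepo()
repo_a.backup(user_a)
repo_b.backup(user_b)
separate = repo_a.size() + repo_b.size()

# One common repository: shared data is stored once.
common = ToyRepo()
common.backup(user_a)
common.backup(user_b)
shared = common.size()

print(separate, shared)
```

Each `ToyRepo` only knows its own blobs, so the two separate repositories together take close to twice the space of the common one.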

Gaibhne commented 2 years ago

Ah, that is very disappointing to hear. I had hoped that REST with its user system through .htpasswd would allow me to control access to repositories, but it sounds like it merely allows me to serve different repositories to different users, like chrooting or jailing with SSH, instead of showing each user only their own files, but within the same file system.

This seems like a huge missed opportunity if I understood correctly; deduplication really shines when it comes to multiple users (i.e. everyone backing up their drive containing a near-identical Windows install), but obviously with no restrictions at all, allowing every user to simply download everyone else's files, that isn't really a reasonable setup for anything but the most close and intimate family, and even then I would be hesitant.

I have no insight into the internals, but I would think that, with the REST server knowing the user, it should be relatively trivial for the server to expose only that user's snapshots within a single repository, allowing both user access control and deduplication across multiple users.

It sounds like what I thought was a bug turned out to be a feature request, though with a project this established I suspect that if it was a feature that was being considered it would have been added a long time ago; if anything, I would have thought that kind of multi-user support would have been a feature from day one. If I am wrong, please let me know if I should make a separate 'feature request' issue along these lines.

rawtaz commented 2 years ago

Yeah, it seems you are thinking more in terms of a "repository server" than just a transport mechanism. REST-server does have access control to some extent, but there are of course different types and levels of access control, and what you expected isn't what REST-server offers.

I think that what you are requesting isn't as simple as it may sound. All the deduplication and encryption happens on the client side, so for this to happen across multiple snapshots the client will have to be able to read information from e.g. other snapshots that it shouldn't have access to. Others can explain this much better than I can.
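A simplified sketch of that client-side decision, in Python: the client hashes each chunk and uploads only those whose ID is missing from the repository's index. `blob_id`, `chunks_to_upload` and the sample data are hypothetical, and the model ignores encryption. It rests on the assumption that the index is readable only with the repository's key, which also unlocks the snapshots and data.

```python
import hashlib


def blob_id(chunk):
    """Content address of a chunk: the SHA-256 digest of its plaintext."""
    return hashlib.sha256(chunk).hexdigest()


def chunks_to_upload(chunks, index):
    """Client-side decision: upload only chunks whose ID is not in the index."""
    return [c for c in chunks if blob_id(c) not in index]


# Index of a shared repository after user A's backup (made-up data).
index = {blob_id(b"windows system files"), blob_id(b"A's private letter")}

# User B's client can skip the shared chunk only because it can read the
# index, which covers user A's private chunk as well.
to_send = chunks_to_upload([b"windows system files", b"B's tax return"], index)
print(to_send)
```

The server never takes part in this decision, which is why it cannot simply hide other users' entries and still let the client deduplicate against them.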

There's the --append-only feature, which would let you have multiple users in the same repository, thereby making use of deduplication and not letting users delete snapshots (their own or other users'), but it still lets them read others' snapshots in the same repository.

Honestly I don't think there's a need for a feature request at this point.

rawtaz commented 2 years ago

What's your specific use case? Do you have a ton of clients with similar data or what's the situation?

Gaibhne commented 2 years ago

Yeah, bunch of family members who all have the same Windows, mostly the same Office, many the same games and a bunch of the same music. Obviously, there is absolutely no way to allow them to see each other's files, so --append-only doesn't really help. I can see how dedup happening client-side would make this problematic to implement, possibly even impossible.

rawtaz commented 2 years ago

Perhaps make two repos: one for the parents and one for the kids.