Improve docs for image proxy cache

josecelano commented 6 months ago

Relates to: https://github.com/torrust/torrust-index-gui/discussions/519

Images in the torrent description only show if the user is logged in. If the user is not logged in, you see something like:

Images are served using a proxy cache to preserve the user's privacy. If we served images directly from the original URL, a malicious user could track other users' activity by tracking served images.

Since the proxy takes a lot of resources from the server, we decided to assign a quota per user that you can config in the config file:

[image_cache]
max_request_timeout_ms = 1000
capacity = 128000000
entry_size_limit = 4000000
user_quota_period_seconds = 3600
user_quota_bytes = 64000000

We also need to identify the user to update the quota.

Docs for image configuration: https://docs.rs/torrust-index/3.0.0-alpha.2/torrust_index/config/struct.ImageCache.html

We should add this explanation to the User Guide and also to the crates docs.

Maybe we should also add a link to the User Guide in the application.

hungfnt commented 6 months ago

If we don't want unregistered users to see the images, should we handle it on the API side instead of the GUI side? The API will return sanitized descriptions.

If we still want to handle it on the GUI, I think we can sanitize all the descriptions in the list on page load or in the background, and display the sanitized descriptions.

josecelano commented 6 months ago

Hi @ngthhu,

Each option has its pros and cons.

If we don't want unregistered users to see the images, should we handle it on the API side instead of the GUI side? The API will return sanitized descriptions.

Cons:

The user would not know the original URL. We want to preserve privacy, but with the current solution, the users can just copy/paste the original URL and load it if they want.

The endpoint contains the original URL:

{
    "torrent_id": 2,
    "uploader": "admin",
    "info_hash": "443c7602b4fde83d1154d6d9da48808418b181b6",
    "title": "Ubuntu",
    "description": "Ubuntu\n\n![Ubuntu logo](https://logos-world.net/wp-content/uploads/2020/11/Ubuntu-Logo-2004-2010.png)",
    "category_id": 5,
    "date_uploaded": "2024-03-14 07:55:23",
    "file_size": 4932407296,
    "seeders": 4,
    "leechers": 0,
    "name": "ubuntu-23.04-desktop-amd64.iso",
    "comment": "Ubuntu CD releases.ubuntu.com",
    "creation_date": 1681992794,
    "created_by": "mktorrent 1.1",
    "encoding": null
}

The server must always sanitise the description field in any SQL query. If you forget to sanitize the field in one case, it could lead to some holes. That could also happen in the front end, but we use a Markdown component that always sanitizes the description. It could happen if we use the description with another component.

The proxy API context would be coupled to the rest of the API. This doesn't have to be a problem. Just to consider it. Now, the frontend is coupled to the image proxy.
The server has more work to do. For me this is the main advantage.

Pros:

The client must have JS disabled, and the browser could load the original URL.
Do you know any advantages of doing it on the backend?

If we still want to handle it on the GUI, I think we can sanitize all the descriptions in the list on page load or in the background, and display the sanitized descriptions.

We are sanitizing the description in the Markdown component.

Cons:

We have to sanitize even if the data we fetch could not be rendered at all.
I think that could be harder to maintain because the place where you need sanitization and the place where you do it are not the same place and are not done at the same time. I mean, if you add a new endpoint with the description, you need to remember to sanitize it even if it's not used in any component (it could in the future).

Pros:

Any advantages of doing it in other places rather than on the component that is showing it?

hungfnt commented 6 months ago

I apologize for the delayed response.

the users can just copy/paste the original URL and load it if they want.

So we allow unregistered users to access the images original URLs. I thought they should not access those URLs at all.

We have to sanitize even if the data we fetch could not be rendered at all.

I agree with this. However, as a user, I will often click on a list item to expand and collapse it multiple times. It would be more efficient to sanitize the descriptions beforehand.

I think that could be harder to maintain because the place where you need sanitization and the place where you do it are not the same place and are not done at the same time. I mean, if you add a new endpoint with the description, you need to remember to sanitize it even if it's not used in any component (it could in the future).

We can store the sanitized descriptions in the torrent object every time we fetch any data containing the descriptions from the API. I apologize if I have misunderstood your idea.

josecelano commented 6 months ago

I apologize for the delayed response.

No worries.

the users can just copy/paste the original URL and load it if they want.

So we allow unregistered users to access the images original URLs. I thought they should not access those URLs at all.

Not directly. I meant you can always see what comes from the API. But maybe that could be a good idea. Maybe we could add an option in the future to show the description as plain text like when you are editing. But for now It does not make sense.

We have to sanitize even if the data we fetch could not be rendered at all.

I agree with this. However, as a user, I will often click on a list item to expand and collapse it multiple times. It would be more efficient to sanitize the descriptions beforehand.

OK, I have not thought about that. That's a good reason to do it when we fetch. However, if the page size is big, the user could only open some rows.

I think that could be harder to maintain because the place where you need sanitization and the place where you do it are not the same place and are not done at the same time. I mean, if you add a new endpoint with the description, you need to remember to sanitize it even if it's not used in any component (it could in the future).

We can store the sanitized descriptions in the torrent object every time we fetch any data containing the descriptions from the API. I apologize if I have misunderstood your idea.

I understood your idea, but I did not like changing the torrent object just because, in one view, you need to sanitize it. But now I understand your reason. But I think we need a new object (SanitizedTorrent). I think we should not change the Torrent object because, in future features, we could need the original description content. And that's an object wrapping the API response. Maybe it's also a good idea not to depend directly on the API response.

Conclusion: I agree, we can sanitize in the object we pass around with the torrent indo. However I'm not sure yet if we should overwrite the value for description. I have the feeling that we have a different object with a different responsibility.

Torrent: DTO for API response. It should just contain what the API returns.
XTorrent: DTO for the views we are using in this UI. I don't have a name yet, maybe SanitizedTorrent or we can call it Torrent anyway but in a different namespace. We can have a module which is wrapper for the API.

I hope that does not sound too complicated :-)

hungfnt commented 6 months ago

We can store the sanitized descriptions in the torrent object every time we fetch any data containing the descriptions from the API. I apologize if I have misunderstood your idea.

My idea was to process and add a field called sanitizedDescription to the object on the UI side after receiving the object from the API, while retaining the original description field.

I don't know if it's redundant when that process is handled by the API side.

josecelano commented 6 months ago

We can store the sanitized descriptions in the torrent object every time we fetch any data containing the descriptions from the API. I apologize if I have misunderstood your idea.

My idea was to process and add a field called sanitizedDescription to the object on the UI side after receiving the object from the API, while retaining the original description field.

I don't know if it's redundant when that process is handled by the API side.

Okay, I'm starting to think we should do it in the backend and frontend, even if it's redundant. The client might have JS disabled, and it could load the original image. In that case, I would only use the description field. We could assume that the field is always sanitized.

If we sanitize on the backend we could do it:

Before persisting the torrent
After getting the torrent from the DB

We must replace image URLs with the proxied ones when we fetch the data because the proxy URL could change.

I guess I've changed my mind and now I totally agree with what you commented here :-)

The user would not know the original URL. We want to preserve privacy, but with the current solution, the users can just copy/paste the original URL and load it if they want.

What I said is not true. The proxied image URL contains the original image URL, so the user can see the image anyway if we sanitize it in the backend.

I like this option because we minimize variables containing wrong values and it's simpler. My problem was I had in mind the idea that sanitization should not be done in the backend. But I was probably confused with this other concept:

https://benhoyt.com/writings/dont-sanitize-do-escape/

What you should not do is validate the input. You should store the original input and sanitize depending on the output context.

Conclusion

Do validation in the frontend the way you proposed: after fetching the data.
Potentially also sanitize in the backend in the future.

NOTES:

We can use the same field since we could also add validation in the backend.
The images could already be proxied images (when we fetach data from the API) so we have to avoid double proxy.

hungfnt commented 5 months ago

Thank you for your response @josecelano. I'm a newbie to this proxy cache thing and sanitizing. My suggestion is mostly from the user's point of view.

josecelano commented 5 months ago

Thank you for your response @josecelano. I'm a newbie to this proxy cache thing and sanitizing. My suggestion is mostly from the user's point of view.

That was a good suggestion. UX is very important, and waiting for the sanitiser every time you expand an item it's a bad experience.

torrust / torrust-index-gui

Improve docs for image proxy cache #522

Conclusion