storj / edge

Storj edge services (including a multi-tenant, S3-compatible server to interact with the Storj network)
GNU Affero General Public License v3.0

Determine the number of unique files downloaded from the Linksharing service #342

Closed · ferristocrat closed this issue 11 months ago

ferristocrat commented 1 year ago

Goal

Acceptance Criteria

- How many unique files are downloaded in a given timeframe?

Links

amwolff commented 1 year ago

I think we should use eventkit for this.

amwolff commented 1 year ago

From sprint planning: we can do this right now by using the URI that linksharing currently still sends on object downloads (it will stop sending it in the next release) for filename-based uniqueness (name+size would give better uniqueness). After the next release, we should probably switch to linksharing calculating a hash of the object being downloaded and sending it to eventkit.
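For illustration, a minimal sketch of the reporting side, assuming the storj.io/eventkit scope/tag pattern (`Package`, `Event`, `String`, `Int64`) used across storj code; `reportDownload` and the event and tag names are placeholders, not existing edge code:

```go
package linksharing

import "storj.io/eventkit"

var ek = eventkit.Package()

// reportDownload is a hypothetical helper: it emits one eventkit event
// per object download carrying name and size, which is enough for
// filename-based uniqueness estimates until a checksum-based approach
// replaces it.
func reportDownload(bucket, objectKey string, size int64) {
	ek.Event("download",
		eventkit.String("bucket", bucket),
		eventkit.String("object-key", objectKey),
		eventkit.Int64("size", size),
	)
}
```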

jtolio commented 1 year ago

FYI: all uplink clients report a path checksum through eventkit, so we do have a unique ID that is path-specific. That isn't going away next release, to my knowledge.

In terms of hashing the content of the data and then preserving it with eventkit, that feels like a pretty sizable privacy violation. I'm okay if we check hashes with VirusTotal for certain linksharing uses and then throw those hashes away, but I don't think we want to preserve the content hash for all possible accesses in something durable like BigQuery. We've been pretty careful not to keep that type of data anywhere.

jtolio commented 1 year ago

If we do want to hash the content, I would recommend salting it in some way with the project salt, or the combination of the project salt, bucket name, and encrypted path.

The reason for this is that we don't want the same content across multiple user accounts to be comparable; otherwise, all I need to do to find out who has a specific file is upload it myself and then breach the database that records who else has it.
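A minimal sketch of that salting idea, assuming a per-project salt is available to linksharing; `saltedContentHash` is a hypothetical helper, not an existing edge API:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"strings"
)

// saltedContentHash keys the content hash with per-project values, so the
// same bytes uploaded under two different projects produce unrelated
// digests. Hypothetical helper, not an existing edge API.
func saltedContentHash(projectSalt []byte, bucket, encryptedPath string, content io.Reader) (string, error) {
	// Derive the HMAC key from project salt + bucket name + encrypted path.
	key := sha256.New()
	key.Write(projectSalt)
	key.Write([]byte(bucket))
	key.Write([]byte(encryptedPath))

	mac := hmac.New(sha256.New, key.Sum(nil))
	if _, err := io.Copy(mac, content); err != nil {
		return "", err
	}
	return hex.EncodeToString(mac.Sum(nil)), nil
}

func main() {
	digest, err := saltedContentHash([]byte("project-salt"), "bucket", "encrypted/path", strings.NewReader("object bytes"))
	if err != nil {
		panic(err)
	}
	fmt.Println(digest)
}
```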

However, if we don't have cross-project comparability (I don't think we should? or at least, not that we preserve anywhere, even in analytics), then I'm not sure content-based hashes gain us much beyond what a path-based hash would do.

Is the path checksum sufficient?

amwolff commented 1 year ago

> In terms of hashing the content of the data and then preserving it with eventkit, that feels like a pretty sizable privacy violation. I'm okay if we check hashes with VirusTotal for certain linksharing uses and then throw those hashes away, but I don't think we want to preserve the content hash for all possible accesses in something durable like BigQuery. We've been pretty careful not to keep that type of data anywhere.

I agree. I think the path checksum is sufficient for an estimation. I don't think we need the precise value.

BTW, I didn't know we had the path checksum (let's use it instead!). The thing that's going away in the next release is linksharing reporting URIs.
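For a sense of how the acceptance criterion would then be answered downstream, a sketch of counting distinct path checksums over a timeframe with the cloud.google.com/go/bigquery client; the dataset, table, and column names below are placeholders, not the real eventkit schema:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"cloud.google.com/go/bigquery"
	"google.golang.org/api/iterator"
)

func main() {
	ctx := context.Background()
	client, err := bigquery.NewClient(ctx, "storj-data-science-249814")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Placeholder table/column names; the real eventkit schema may differ.
	q := client.Query(`
		SELECT COUNT(DISTINCT path_checksum) AS unique_files
		FROM ` + "`eventkit.linksharing_downloads`" + `
		WHERE received_at BETWEEN @start AND @end`)
	q.Parameters = []bigquery.QueryParameter{
		{Name: "start", Value: time.Date(2023, 1, 1, 0, 0, 0, 0, time.UTC)},
		{Name: "end", Value: time.Date(2023, 2, 1, 0, 0, 0, 0, time.UTC)},
	}

	it, err := q.Read(ctx)
	if err != nil {
		log.Fatal(err)
	}
	var row []bigquery.Value
	for {
		err := it.Next(&row)
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println("unique files downloaded:", row[0])
	}
}
```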

cam-a commented 12 months ago

https://console.cloud.google.com/bigquery?project=storj-data-science-249814&ws=!1m4!1m3!8m2!1s904335197794!2sb397436dbe904635beace2a13db6337c