Discussion: Nostrhash

degenrocket commented 1 year ago

nostrhash1 is a default hash used to identify media files in the Nostr ecosystem.

nostrhash1 is the SHA1 hash over the info part of a torrent file, computed with a pre-agreed algorithm for the piece length and a pre-agreed file name so that the generated hash is consistent.

nostrhash256 uses SHA256 instead of SHA1.

Since the original idea of using one infohash to identify a media file has been expanded to support multiple hashes (see #3), it's important not only to list all existing supported hashes, but also to achieve consensus on a naming convention.

As mentioned in the original documentation, torrent's infohash doesn't generate a consistent hash because the computed hash depends on a file name and torrent file piece length. The solution is to achieve consensus on a file name and on the algorithm used to generate the piece length to ensure that infohashes stay consistent (see documentation).
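To make the dependency concrete, here is a minimal Python sketch of how a BitTorrent v1 infohash is computed for a single-file torrent. It is an illustration only, not the code from the documentation: `bencode` and `infohash_v1` are names made up for this example, and the file names and piece lengths in the usage note are arbitrary.

```python
import hashlib


def bencode(value):
    """Minimal bencoder for ints, bytes, str, lists and dicts."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings, emitted in sorted order.
        items = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        items.sort(key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def infohash_v1(data: bytes, name: str, piece_length: int) -> str:
    """SHA1 over the bencoded info dictionary of a single-file v1 torrent."""
    # "pieces" is the concatenation of the SHA1 digest of every piece.
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {
        "length": len(data),
        "name": name,
        "piece length": piece_length,
        "pieces": pieces,
    }
    return hashlib.sha1(bencode(info)).hexdigest()
```

With the same bytes, `infohash_v1(data, "video.mp4", 32768)` and `infohash_v1(data, "clip.mp4", 32768)` give different hashes, and so does changing only the piece length. Both the name and the piece length are part of the hashed dictionary, which is why they have to be agreed upon.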

Thus, it makes sense to use another name for the hashing algorithm that will produce a consistent hash even if a file has different names.

The proposed names are nostrhash1 for SHA1 and nostrhash256 for SHA256 versions of the algorithm.

To discuss:

Should we create a name for the hashing algorithm used to identify media files or should we just keep infohash name?
What's the best name for the new hashing algorithm?
Should a new hashing algorithm support SHA256 or only SHA1?
What's the best approach to deal with a file name to ensure consistency of generated hashes?

lovvtide commented 1 year ago

I think there are two issues here.

The first issue is standardizing the filename and piece length. The second issue is distinguishing between the original BitTorrent and BitTorrent v2 (https://blog.libtorrent.org/2020/09/bittorrent-v2/).

There are actually a few differences in the v2 spec that I think will need careful consideration to support. But even with those changes, the piece length and the filename still need to be standardized.

So for the first issue, yes, we need to define what the filename and the piece length should be but I don't see why we need to give that standardization scheme a name. We can just say in the NIP, "the piece length should be computed according to this algorithm" and "the filename should always be " (and that applies to both v1 and v2 torrents)

For the second issue (distinguishing between v1 and v2) my idea was to just use the marker torrent to refer to a regular BitTorrentV1 torrent (since those are most widely supported) and if/when we want to support v2 we can define a marker like torrent_v2. For the sake of simplicity, I suggest that we just support torrent v1 for now and call it torrent

lovvtide commented 1 year ago

Another point is that the client does not actually need to know how to construct torrents to be able to use the infohash to download the data. It's only relevant when creating torrents. And the whole point of standardizing filename and piece length is just so that the same file will not be unnecessarily duplicated. When the client connects to trackers, it will download the metadata for the torrent (including filename and piece length, whatever it is) so actually the piece length does not need to be in the event at all, it just needs to be in the NIP so that developers who are building clients that can create torrents end up generating the same infohash for the same file.

lovvtide commented 1 year ago

On the last question

What's the best approach to deal with a file name to ensure consistency of generated hashes?

If we're going to standardize the filename, maybe it should just be the pubkey (hex encoded) of the author. One potential benefit of this is that now it might be possible to search filenames for that pubkey on other torrent trackers that don't know anything about nostr (but then again someone could lie and just create a torrent named for someone else's pubkey!)

One drawback I see of using the pubkey as the name is that now clients must know the pubkey in advance in order to create a torrent. This shouldn't be a problem for nostr clients, except it does make it impossible for non-nostr clients to create a nostr-standardized torrent because all their users may not have a nostr identity.

There is also a drawback that I see to standardizing the filename at all: non-nostr regular torrent clients who download the file will not know what to name it and won't be able to use the file extension like .mp4 in the filename to know how to open the file.

The only benefit to standardizing the filename is to avoid duplicating data. I wonder if that's really something we need to be worried about. What's the incentive for people to create duplicate torrents anyway? Won't people mainly just be publishing their own files?

Maybe it would be best to just require that the filename used to create the infohash is the exact same string as the value of the name tag in the event, but otherwise let the client choose the string. I think it would probably be the best UX even though it comes at a slight potential cost to efficiency.

degenrocket commented 1 year ago

For the second issue (distinguishing between v1 and v2) my idea was to just use the marker torrent to refer to a regular BitTorrentV1 torrent (since those are most widely supported) and if/when we want to support v2 we can define a marker like torrent_v2. For the sake of simplicity, I suggest that we just support torrent v1 for now and call it torrent

OK, let's use torrent for SHA1 and torrent_v2 for SHA256 instead of infohash.
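As a rough sketch of what that naming could mean for a client, the check below maps each marker to the hex length of its digest (SHA1 is 20 bytes, SHA256 is 32 bytes). The function name `is_valid_hash` and the idea of validating by length are assumptions for illustration, not something the NIP defines.

```python
import string

# Hex digest length per marker: SHA1 -> 40 chars, SHA256 -> 64 chars.
DIGEST_HEX_LENGTH = {"torrent": 40, "torrent_v2": 64}


def is_valid_hash(marker: str, value: str) -> bool:
    """Return True if value looks like a hex digest of the size the marker implies."""
    expected = DIGEST_HEX_LENGTH.get(marker)
    if expected is None:
        # Unknown marker (hypothetical policy: reject rather than guess).
        return False
    return len(value) == expected and all(c in string.hexdigits for c in value)
```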

When the client connects to trackers, it will download the metadata for the torrent (including filename and piece length, whatever it is) so actually the piece length does not need to be in the event at all, it just needs to be in the NIP so that developers who are building clients that can create torrents end up generating the same infohash for the same file.

Good point. However, it's important that both client and host verify an infohash, so can you think of any scenarios when adding piece length to an event is actually necessary for security reasons? For example, an advanced user might want to embed a torrent file that has already been hashed before using a different algorithm to generate the piece length.

The only benefit to standardizing the filename is to avoid duplicating data. I wonder if that's really something we need to be worried about. What's the incentive for people to create duplicate torrents anyway? Won't people mainly just be publishing their own files?

Some users will re-upload viral videos, so there might be many different torrents for the same exact media file if we don't standardize the file name.

How about setting a name to nostr-media-file? That way even non-nostr users will be able to search for wtf is nostr-media-file and learn how to open it.

Can you think of any serious drawbacks of having thousands/millions of torrent files with the same name?

lovvtide commented 1 year ago

Good point. However, it's important that both client and host verify an infohash, so can you think of any scenarios when adding piece length to an event is actually necessary for security reasons? For example, an advanced user might want to embed a torrent file that has already been hashed before using a different algorithm to generate the piece length.

The client will always verify the infohash, and the hash of each piece individually. If a user wants to recreate a torrent with a different piece length of course that will result in a different infohash (and different hashes for each piece) but the client will still verify those hashes. I don't think I fully understand what you're asking with regard to this being a security issue. I only see it as an issue of preventing unnecessary duplication.
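For reference, per-piece verification in a v1 torrent can be sketched roughly like this in Python. It is a simplified single-file illustration; `verify_pieces` is a made-up name and real clients verify pieces as they arrive rather than over the whole file at once.

```python
import hashlib

SHA1_SIZE = 20  # bytes per piece digest in a v1 torrent


def verify_pieces(data: bytes, piece_length: int, pieces: bytes) -> bool:
    """Check data against the concatenated SHA1 piece digests from the metadata."""
    if len(pieces) % SHA1_SIZE != 0:
        return False
    expected = [pieces[i:i + SHA1_SIZE] for i in range(0, len(pieces), SHA1_SIZE)]
    actual = [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]
    return expected == actual
```

Whatever piece length the torrent creator picked, the check uses the piece length and digests from the downloaded metadata, so a non-standard piece length changes the infohash but does not weaken verification.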

Some users will re-upload viral videos, so there might be many different torrents for the same exact media file if we don't standardize the file name.

True... but realistically, the vast majority of clients that are going to be uploading "nostrified" torrents are going to be following the NIP. So if the NIP requires that the filename used to compute the infohash is equal to the value of the name tag, then any nostr client from which a user is republishing a torrent for some reason can be expected to automatically set the name to the same value.

Can you think of any serious drawbacks of having thousands/millions of torrent files with the same name?

The main drawback I see is precluding use cases where some app expects the filename from the torrent metadata to be meaningful/human-readable. For example, maybe someone wants to create an app that downloads podcast episodes into a local folder. We don't want the user to end up with a folder where every file is named the same thing. The client could rename the file upon downloading it to be consistent with the name in the nostr event, but what if the client only knows about the torrent based on the metadata it receives from the tracker?

Also, this dependency on the filename being a certain value is only a peculiarity of how torrent infohashes are computed. IPFS hashes, for example, don't care about the filename at all. Assuming that in the future there may be many more options for content-addressing the same file, it seems like a bad idea to restrict the filename just because of how infohashes work.

We could make it very clear in the NIP that clients who are generating torrents SHOULD follow the convention that name tag === filename, and if some client does not want to follow that then, well, the torrent will still work, they just won't be able to take advantage of existing peers/webseeds. There's nothing we can do to force clients to follow any naming convention: if they want to ignore the name tag equality requirement they could just as easily ignore the nostr-media-file requirement. So I think we should just go with making it equal to the name tag because that's going to be better UX.

degenrocket commented 1 year ago

:heavy_check_mark: OK, let's keep the torrent name equal to the name tag in the media event (kind 2001).
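A client-side check for this convention might look like the Python sketch below. The event is assumed to be a plain dict with a `tags` list of string arrays, as in regular nostr events; `get_tag` and `follows_name_convention` are hypothetical helper names, and the sample file name is made up.

```python
def get_tag(event: dict, key: str):
    """Return the first value of the tag named key, or None if it is absent."""
    for tag in event.get("tags", []):
        if len(tag) >= 2 and tag[0] == key:
            return tag[1]
    return None


def follows_name_convention(event: dict, torrent_info_name: str) -> bool:
    """True if the torrent's name equals the event's name tag."""
    return get_tag(event, "name") == torrent_info_name


# Example media event (kind 2001) with an illustrative name tag.
event = {"kind": 2001, "tags": [["name", "episode-01.mp3"]]}
print(follows_name_convention(event, "episode-01.mp3"))
```

A torrent that fails this check would still download fine; it just would not share peers with torrents created under the convention.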

We can also provide an alternative approach of having a hardcoded nostr-media-file name in the discussion to the NIP in order to get feedback.

degenrocket commented 1 year ago

The client will always verify the infohash, and the hash of each piece individually. If a user wants to recreate a torrent with a different piece length of course that will result in a different infohash (and different hashes for each piece) but the client will still verify those hashes. I don't think I fully understand what you're asking with regard to this being a security issue. I only see it as an issue of preventing unnecessary duplication.

What about backward compatibility if the algorithm to generate an optimal piece length changes in the future? Can it cause any problems? As I remember, that was a major reason to add a plen tag to a media event.

If we introduce a new hash name like nostrhash or nostrhash1, then we can omit a plen tag, since the algorithm will be a part of a nostrhash spec. Any change to the algorithm will lead to a new name like nostrhash_vXX.

However, if we just keep a torrent hash, wouldn't it be better to keep an optional plen tag inside a media event as it was specified in the original documentation to ensure backward compatibility?

lovvtide commented 1 year ago

However, if we just keep a torrent hash, wouldn't it be better to keep an optional plen tag inside a media event as it was specified in the original documentation to ensure backward compatibility?

Yeah, let's keep the optional plen tag. If the tag is omitted, the client can assume that the piece length is given by the algorithm in the spec. That way preexisting torrents with a non-standard piece length can be added to nostr easily.
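The fallback rule could be sketched in Python as follows. Two things here are assumptions: the plen value is taken to be a decimal string of bytes, and `default_piece_length` is only a placeholder, NOT the algorithm from the spec (which is not reproduced in this thread).

```python
def default_piece_length(file_size: int) -> int:
    """HYPOTHETICAL stand-in for the spec's piece length algorithm.

    Doubles from 16 KiB until the file fits in at most 2048 pieces.
    """
    length = 16 * 1024
    while file_size / length > 2048:
        length *= 2
    return length


def resolve_piece_length(tags: list, file_size: int) -> int:
    """Use the optional plen tag if present, else fall back to the default."""
    for tag in tags:
        if len(tag) >= 2 and tag[0] == "plen":
            # Assumed format: decimal number of bytes as a string.
            return int(tag[1])
    return default_piece_length(file_size)
```

An event for a preexisting torrent would carry its own plen, while a torrent created by a spec-following client could omit the tag entirely.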

lovvtide / nostr-torrent

Discussion: Nostrhash #4