go-ipfs does not store filesize on symlinks

Gozala commented 2 years ago

Looks like go-ipfs omits filesize in the UnixFS protobuf when you add a symlink with ipfs add mysymlink (e.g. see QmPZ1CTc5fYErTH2XXDGrfsPsHicYXtkZeVojGycwAfm3v), but UnixFS.prototype.marshal includes it, which results in different hashes.
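To illustrate why the extra field matters, here is a minimal sketch that hand-encodes a UnixFS symlink message with and without filesize and compares the hashes. It relies on the field numbers in unixfs.proto (Type = 1, Data = 2, filesize = 3, Symlink = 4). The helper names (varint, encodeSymlink) are hypothetical, the filesize value is assumed to be the byte length of the link target, and the sketch hashes only the UnixFS payload rather than the full dag-pb node that an actual CID is derived from.

```javascript
const { createHash } = require('crypto')

// Encode a small unsigned integer as a protobuf varint
// (this sketch only handles values below 2^31).
function varint (n) {
  const out = []
  while (n > 0x7f) {
    out.push((n & 0x7f) | 0x80)
    n >>>= 7
  }
  out.push(n)
  return Buffer.from(out)
}

// Hypothetical helper: serialize a UnixFS symlink whose Data is the target path.
function encodeSymlink (target, { withFilesize }) {
  const data = Buffer.from(target, 'utf8')
  const parts = [
    Buffer.from([0x08, 0x04]), // field 1 (Type), varint, Symlink = 4
    Buffer.from([0x12]), varint(data.length), data // field 2 (Data), length-delimited
  ]
  if (withFilesize) {
    // field 3 (filesize), varint; assumed here to be the target path length
    parts.push(Buffer.from([0x18]), varint(data.length))
  }
  return Buffer.concat(parts)
}

const sha256 = (buf) => createHash('sha256').update(buf).digest('hex')

const without = encodeSymlink('target.txt', { withFilesize: false })
const withSize = encodeSymlink('target.txt', { withFilesize: true })

console.log(without.toString('hex'))  // 0804120a7461726765742e747874
console.log(withSize.toString('hex')) // 0804120a7461726765742e747874180a
console.log(sha256(without) === sha256(withSize)) // false
```

The two serializations differ by two trailing bytes, which is enough for implementations to disagree on the resulting hash.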

lidel commented 2 years ago

this looks like a bug, we probably don't need to store size:

size of symlink itself is meaningless
size of destination could lead to bugs, because destination could change

unless there is a rationale for keeping it, my vote is to fix js-ipfs to do what go-ipfs does: omit filesize

lidel commented 2 years ago

@Gozala mind opening PR to fix this?

john-heinnickel commented 2 years ago

I've not been able to figure out how to communicate the presence of symbolic links to the js-ipfs-unixfs importer, because there do not seem to be any examples of this in the README documentation. It sounds like there is an implementation to be found if I go looking through the source tree, but that is time-consuming and error-prone.

I am having a little trouble predicting how symbolic links will be formed in a way that maintains reference-semantics symmetry with a host system in light of changing content. In the native host filesystem, which the UnixFS view is patterned after, it is possible to change a file's contents without breaking links to that file, and it is possible to rename files such that symbolic links will break. Neither of these effects requires changing anything about the symlink itself.

The options for collecting the bits that differentiate one symlink from another with regard to hash computation would seem to require either using the original source filesystem's name path, or a name path in terms of CID traversal taken from the UnixFS analogs of such nodes. Here we have some apparent problems with either scenario:

UnixFS does not seem to include name information for files and folders at the root of a filesystem. If I import the same file multiple times, I get the same CID, regardless of what I've named it. The UI client does seem to be keeping the source names of these nodes somewhere, but apparently not in the model. The names of a directory's children are apparently of semantic value to the directory node, since they are used in its Link nodes: renaming children, from the root directories downward, changes the CID of the containing directory, but as with files in the root, it does not affect the content of individual files and therefore not their CIDs.
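The behaviour described above can be sketched with a toy content-addressing model. This is only an illustration of the idea, not the dag-pb encoding or real CIDs; fileId and dirId are hypothetical helpers built on plain SHA-256.

```javascript
const { createHash } = require('crypto')

const hash = (s) => createHash('sha256').update(s).digest('hex')

// A file's identifier depends only on its bytes, never on a name.
const fileId = (content) => hash('file:' + content)

// A directory's identifier depends on each link's name and the child's identifier.
const dirId = (entries) => hash(
  'dir:' + Object.keys(entries).sort().map((name) => name + '=' + entries[name]).join(';')
)

const readme = fileId('hello')
const dirBefore = dirId({ 'README.md': readme })
const dirAfter = dirId({ 'readme.txt': readme }) // same child, renamed

console.log(fileId('hello') === readme) // true: same content, same identifier
console.log(dirBefore === dirAfter)     // false: the rename changed the directory
```

In this model a name only exists as a label on the link from parent to child, which is why a file imported at the root carries no name of its own.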

Is it possible to break symlinks to files in the root by renaming the linked files?

The alternative to storing symlinks with their "native" filesystem tokens would involve translating those tokens to CIDs. However, not every node in the linked path is necessarily imported, and as just discussed, moving, renaming, adding, or removing children of a directory will change its CID, breaking links that those operations would not affect on the host unless they involved the direct target of such a link. Likewise, changing the content of a linked file would change its CID even if the file was modified in place on the native host.
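A rough sketch of the two options, again using a toy hash model rather than real UnixFS nodes. The helpers symlinkByPath and symlinkByTargetId are hypothetical; the second one models the "translate tokens to CIDs" alternative and is not something either implementation is known to do.

```javascript
const { createHash } = require('crypto')

const hash = (s) => createHash('sha256').update(s).digest('hex')
const fileId = (content) => hash('file:' + content)

// Option A: the symlink records only the target path string.
const symlinkByPath = (targetPath) => hash('symlink:path:' + targetPath)

// Option B (hypothetical): the symlink records the target's content identifier.
const symlinkByTargetId = (targetId) => hash('symlink:id:' + targetId)

// The target file is edited in place: same path, new content.
const targetV1 = fileId('version 1')
const targetV2 = fileId('version 2')

const linkA1 = symlinkByPath('docs/notes.txt')
const linkA2 = symlinkByPath('docs/notes.txt')
const linkB1 = symlinkByTargetId(targetV1)
const linkB2 = symlinkByTargetId(targetV2)

console.log(linkA1 === linkA2) // true: a path-based link ignores the edit
console.log(linkB1 === linkB2) // false: an identifier-based link changes with the content
```

Under option A the link behaves like a host symlink, at the cost of possibly pointing at a path that was never imported; under option B every edit to the target produces a new link node.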

There seems to be an impedance mismatch with symlinks here. Links by reference in a source file system work precisely because filesystem names are a labeling technique that is orthogonal to file content, which is the antithesis of what IPLD's semantic model for naming is. Can these realistically co-exist?

ipfs / js-ipfs-unixfs

go-ipfs does not store filesize on symlinks #195