ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

IPFS permits undesirable paths #1710

Open willglynn opened 9 years ago

willglynn commented 9 years ago

IPFS path components (that is, IPFS merkledag node link names) can contain undesirable things: NUL bytes, newlines and other control characters, and byte sequences that are not valid UTF-8.

Path components can also be strings that are commonly understood as path components with a special meaning, namely "." and "..".

I propose that path components (link names) be restricted both by specification and implementation, preferably by defining "valid" path components in a way that excludes the above. go-ipfs should refuse to create invalid links. go-ipfs should also refuse to process any invalid links it sees, either by discarding the link or discarding the node, and that behavior should be specified as well.

jbenet commented 9 years ago

Agreed. perhaps we should follow what linux/bsd do.

> I propose that path components (link names) be restricted both by specification and implementation, preferably by defining "valid" path components in a way that excludes the above. go-ipfs should refuse to create invalid links. go-ipfs should also refuse to process any invalid links it sees, either by discarding the link or discarding the node, and that behavior should be specified as well.

Agreed.

whyrusleeping commented 9 years ago

I'm not so sure how sold i am on limiting what we can put in links. Applications (like unixfs) can and should restrict them, but it seems like we may lose out on some flexibility by doing so globally.

willglynn commented 9 years ago

POSIX filenames can contain everything except NUL and /, noting that . and .. have special meaning. POSIX also defines a portable filename character set, matching only [A-Za-z0-9_\.\-], which is obviously far more restrictive. (Edit: see also the Filename Portability blurb.) POSIX pathnames are filenames joined by slashes with a length restriction, noting (emphasis mine):

> If a pathname consists of only bytes corresponding to characters from the portable filename character set (see Portable Filename Character Set), &lt;slash&gt; characters, and a single terminating &lt;NUL&gt; character, the pathname will be usable as a character string in all supported locales; otherwise, the pathname might only be a string (rather than a character string).

Linux permits all characters except NUL and / in filesystem interfaces. Certain filesystems add additional restrictions (returning EINVAL as appropriate), but in general Linux-native filesystems treat filenames as a NUL-terminated sequence of bytes. There's a big long (shell-centric) essay describing why this flexibility is unwanted, but again, we have it.

Besides stuff like newline and backspace and escape, the fact that Linux (and POSIX) does not require filenames to use any particular character encoding is a source of continuing struggle. See e.g. Python and Rust hammering out how to map OS-provided byte sequences to encoding-aware strings in the least terrible way possible. Requiring IPFS link names (and thus paths) to be well-formed UTF-8 would be a win, even if it means some filenames representable by the OS are not permitted in IPFS.

Speaking of length restrictions, link names should probably also be constrained to be non-empty, with a maximum length chosen on purpose rather than incidentally imposed by the network transport. Linux PATH_MAX is 4096; allowing 4096 bytes per link name (path segment) seems plenty generous.

@whyrusleeping Would you permit arbitrary byte sequences in link names? That would have interesting properties (i.e. permitting protobuf messages to be used as link names directly), but it would also mean a sequence of links could not be combined into a path using / without another layer of encoding.

jbenet commented 9 years ago

(thanks @willglynn for bringing this up, making the case, and links, etc)

willglynn commented 9 years ago
> previously, some users wanted arbitrary data in link names … so we can ensure that paths are human readable now.

This is exactly the discussion I was hoping to have. If link names need to be binary-friendly, then let's make them so; catch is every component would need to treat paths as [][]bytes. If not, then we've decided that paths are for humans, and machines wishing to use them to contain binary data must encode things. If paths are for humans, let's come up with rules that make it easy for humans to do the right thing with paths. This probably means making them valid Unicode (since they'll get used in contexts that use a specific Unicode text encoding) without containing surprises (like NULs and newlines).

> I'm not yet sure the proposed limitations are the right ones-- i have not thought about this yet. if others can fight this out, it would save me time, and we could move faster, but if not, i'll do the research eventually.

nods

> i think the names ought to be human readable, as the goal is to make a path traversal that is writable through most text (not just binary) systems, including printed paper.

In that case, we might need more restrictive rules. Leading and/or trailing whitespace might have to go. Unicode canonicalization might need to be specified – i.e. á can be encoded as one character, or as a with a combining accent – so that characters can be entered in any format but resolve to the same name. HFS+ does this, but it's easy to get wrong, prompting some vocal opponents.
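
To make the canonicalization point concrete, here is a minimal Go sketch (illustrative only; it assumes the golang.org/x/text/unicode/norm package, which is outside the standard library): the two encodings of á are different byte strings, yet normalize to the same NFC form.

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	precomposed := "\u00e1" // á as a single codepoint
	decomposed := "a\u0301" // a followed by COMBINING ACUTE ACCENT
	fmt.Println(precomposed == decomposed) // false: different byte strings
	// After NFC normalization both spellings compare equal.
	fmt.Println(norm.NFC.String(precomposed) == norm.NFC.String(decomposed)) // true
}
```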

> we would have to address what happens to OS paths that do not make sense to import; we should have a way of escaping or representing those bytes, instead of "breaking". (we have an explicit goal of making data ingestion extremely easy, and adding from the filesystem is one such path)

The basic problem is that major platforms permit filenames that are outside of Unicode. Declaring "IPFS paths shall be UTF-8 without pointy edges" is good for IPFS users and developers, even if that means some things would need a translation layer.

As for platforms: Linux is entirely laissez faire, permitting any byte string that excludes NUL and /. In practice, tools blindly assume that filenames a) are UTF-8 and b) don't contain control characters, even though the OS doesn't enforce this, resulting in many different failure modes and lots of ¯\_(ツ)_/¯. Windows is ~UTF-16 and Windows filesystems share most of my proposed character exclusions, but Windows also excludes many other characters, excludes certain filenames, and can rewrite certain filenames because of MS-DOS. Also, Windows predates the invention of Unicode supplementary planes, so it's possible to end up with surrogate halves which would be excluded in my proposal as invalid codepoints. Mac OS X works in terms of valid Unicode excluding NUL and /, though the Unicode canonicalization process can silently rewrite filenames, and some of the higher layers are allergic to :.

IPFS paths should permit every filename that a reasonable person would use, while avoiding the complexities introduced by being too permissive. This is a balancing act and there's definitely room for discussion. (And again, POSIX filename portability demands only [A-Za-z0-9_\.\-], so everything past that is gravy from a UNIX compatibility standpoint.) The rules I listed initially are my best guess.

> i support a max size for a component, but I do not support a PATH_MAX.

Agreed. I was advocating a maximum length per path component, rather than restricting the path in its totality, for exactly the reason that a node might be found using many arbitrarily long paths. I picked a limit of 4 KB above for comparison with PATH_MAX, but that's probably too generous; typical filesystems restrict individual filenames to 255 bytes or characters. (Edit: Linux NAME_MAX is 255.)

jbenet commented 9 years ago

@willglynn agreed throughout.

RichardLitt commented 9 years ago

Another option would be to just disallow anything in the big list of naughty strings. I'm not sure that's right, but it hasn't been mentioned yet and we don't know what people will want to do with IPFS paths. If we allow a subset of this list, we should have a public sanitizer module that allows people to check their proposed strings easily.

willglynn commented 9 years ago

That list appears to be intended to stress-test input validation, and is suitable for e.g. testing tools that automatically enter data that might cause undesirable behavior; incorporating it into an IPFS test suite somewhere might have value. However, using it as a blacklist would prohibit things like 0, 1, true, --help, mocha, _, ", ', and (space), while allowing e.g. illegal Unicode of every form that doesn't explicitly appear in that list.

Most of the rules I proposed are readily enforced by e.g. utf8.Valid() from Go's unicode/utf8 or by utf8_check.c, and the rest can be implemented using a similar algorithm (e.g. successive utf8.DecodeRune() calls checking against a list of excluded characters and validating length). This seems much simpler and much more effective to me.
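
For illustration, a minimal sketch of that algorithm (a guess at what such a validator could look like, not the actual go-ipfs code; the exclusion set and the 255-byte limit are assumptions drawn from this thread):

```go
package pathcheck

import "unicode/utf8"

const maxNameBytes = 255 // assumed limit, per the NAME_MAX discussion above

// ValidLinkName reports whether a link name is non-empty, within the length
// limit, valid UTF-8, not "." or "..", and free of control characters and /.
func ValidLinkName(name string) bool {
	if len(name) == 0 || len(name) > maxNameBytes {
		return false
	}
	if name == "." || name == ".." {
		return false
	}
	for i := 0; i < len(name); {
		r, size := utf8.DecodeRuneInString(name[i:])
		if r == utf8.RuneError && size == 1 {
			return false // malformed UTF-8
		}
		if r < 0x20 || r == 0x7f || r == '/' {
			return false // control characters, DEL, and the path separator
		}
		i += size
	}
	return true
}
```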

RichardLitt commented 9 years ago

Cool. Yeah, yours is simpler. Let's go with that.

willglynn commented 9 years ago

The experience of writing some path validation code and corresponding tests prompts me to raise three new topics:

Length limits

POSIX limitations (NAME_MAX/PATH_MAX) are byte-centric, matching the byte-centric nature of the rest of the filesystem API. I'm inclined to make IPFS path segment length limitation character-centric instead, because that seems less surprising to me from a user perspective, but I can see arguments both ways. Thoughts?
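
To make the distinction concrete, a small Go illustration (not from the thread) of how the two ways of counting differ for the same name:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	name := "héllo wörld" // é and ö each take two bytes in UTF-8
	fmt.Println("bytes:", len(name))                          // 13
	fmt.Println("characters:", utf8.RuneCountInString(name)) // 11
}
```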

Two special noncharacters

Do U+FFFE and U+FFFF warrant special treatment? These codepoints exist (and can be encoded in UTF-8 without incident), but they are defined as "not a character", along with several other codepoints both in the BMP and in other planes (e.g. U+10FFFF). A brief history on non-characters:

Versions of the Unicode standard from 3.1.0 to 6.3.0 claimed that noncharacters "should never be interchanged". Corrigendum # 9 of the standard later stated that this was leading to "inappropriate over-rejection", clarifying that "[Noncharacters] are not illegal in interchange nor do they cause ill-formed Unicode text", and removing the original claim.

Some sources state that these two non-character codepoints in particular can cause problems: U+FFFF might be interpreted as an error condition in non-specific 16-bit character I/O interfaces, and U+FFFE might be interpreted as a reversed byte order mark (U+FEFF) and prompt an undesired endianness change.

These two codepoints are specifically excluded from XML. I note that XML dates to the period when noncharacters "should never be interchanged", but I note also that only these two particular non-characters are excluded despite having the opportunity to exclude other non-characters. (Edit: HTML5 prohibits these and all other non-characters, along with most C0 and C1 control characters.)

Mostly there's a lot of discussion and no clear consensus: D, SpiderMonkey, Perl.

Official word is:

> Because of this complicated history and confusing changes of wording in the standard over the years regarding what are now known as noncharacters, there is still considerable disagreement about their use and whether they should be considered "illegal" or "invalid" in various contexts. Particularly for implementations prior to Unicode 3.1, it should not be surprising to find legacy behavior treating U+FFFE and U+FFFF as invalid in Unicode 16-bit strings. And U+FFFF and U+10FFFF are, indeed, known to be used in various implementations as sentinels. For example, the value FFFF is used for WEOF in Windows implementations.

> For up-to-date Unicode implementations, however, one should use caution when choosing sentinel values. U+FFFF and U+10FFFF still have interesting numerical properties which render them likely choices for internal use as sentinels, but implementers should be aware of the fact that those values, as for all noncharacters in the standard, are also valid in Unicode strings, must be converted between UTFs, and may be encountered in Unicode data—not necessarily used with the same interpretation as for one's own sentinel use. Just be careful out there!

"Just be careful out there!" isn't really great guidance.

I'm inclined to not exclude these values, since they are valid Unicode codepoints with a valid encoding and are permitted for interchange, even if they're not characters. Typical tools handling UTF-8 won't explode when handed these codepoints, so… that's okay, right?
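
For reference only, if such an exclusion were ever wanted, a sketch of the check (the ranges are the 66 noncharacters defined by the Unicode standard):

```go
package main

import "fmt"

// isNoncharacter reports whether r is one of the 66 Unicode noncharacters:
// U+FDD0..U+FDEF plus the last two code points of every plane.
func isNoncharacter(r rune) bool {
	if r >= 0xFDD0 && r <= 0xFDEF {
		return true
	}
	return r&0xFFFF == 0xFFFE || r&0xFFFF == 0xFFFF
}

func main() {
	for _, r := range []rune{0xFFFE, 0xFFFF, 0x10FFFF, 'A'} {
		fmt.Printf("U+%04X noncharacter: %v\n", r, isNoncharacter(r))
	}
}
```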

UTF-8 has multiple definitions

UTF-8 as originally specified in ISO/IEC 10646 and in RFC 2044 can encode characters up to 31 bits over 6 bytes. UTF-8 as specified in RFC 3629 limits it to 21 bits and 4 bytes, matching the current definition of Unicode (having planes 0-16) and the range of codepoints encodable with UTF-16. Go's unicode/utf8 uses this second definition. I am in favor of restricting things to 21-bit UTF-8, and of making this explicit in the definition of what constitutes a "valid" path.
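
As a quick illustration of that narrower definition (Go's unicode/utf8 already enforces the RFC 3629 limits):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// utf8.MaxRune is U+10FFFF and utf8.UTFMax is 4: the RFC 3629 limits.
	fmt.Printf("MaxRune: U+%X, UTFMax: %d bytes\n", utf8.MaxRune, utf8.UTFMax)

	// A 5-byte sequence (legal under the original 31-bit definition, encoding
	// a codepoint above U+10FFFF) is rejected under RFC 3629 rules.
	old := []byte{0xF8, 0x88, 0x80, 0x80, 0x80}
	fmt.Println("5-byte sequence valid:", utf8.Valid(old)) // false

	// Surrogate halves (U+D800..U+DFFF) are likewise not valid scalar values.
	fmt.Println("surrogate valid:", utf8.ValidRune(0xD800)) // false
}
```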

mildred commented 9 years ago

I think forcing a Unicode semantic on IPFS paths is not a good idea. The reasons are:

Byte streams are much more simple to handle.

Parsing UTF-8 is of no use, especially if we must encode it back for consumption in filesystem paths or HTTP URLs.

willglynn commented 9 years ago

UNIX paths are arbitrary byte sequences, and that's confusing, because programs – and programmers – usually treat them as if they're encoded strings.

Byte sequences are simple to handle until you need to do something with them, like display them in a user interface or print them to a log file, since essentially every context where text occurs outside a UNIX filesystem itself has a defined encoding that relates to Unicode in some specific way.

UTF-8 strings are exactly as simple to handle as byte sequences, and they're much simpler than byte sequences to handle when you need a UTF-8 string, or a UTF-16 string, or a string in some other encoding. Processing potentially-malformed UTF-8 strings adds one additional step – you have to validate that your string is UTF-8 (which is not hard), but once you've done that, you're left with a UTF-8 string that's valid everywhere, including contexts that blindly accept bytes like FUSE.

URLs are US-ASCII with a percent encoding system allowing arbitrary bytes. However, programs and programmers tend to work in terms of encoded strings, so just like UNIX filenames, many things assume that percent-encoded URLs correspond to valid UTF-8 and fail when they aren't. In the case of HTTP, this specifically includes ECMAScript.
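
A small Go illustration of that mismatch (the %FF%FE bytes are arbitrary, chosen only for the example): the percent-encoding is perfectly legal, but the decoded bytes are not valid UTF-8, so anything that assumes "URL path segments are UTF-8" will mishandle them.

```go
package main

import (
	"fmt"
	"net/url"
	"unicode/utf8"
)

func main() {
	decoded, err := url.PathUnescape("%FF%FE") // legal percent-encoding of two raw bytes
	fmt.Println("decode error:", err)          // <nil>
	fmt.Println("valid UTF-8:", utf8.ValidString(decoded)) // false
}
```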

Looping back, yes, UNIX paths are arbitrary byte sequences… but these bytes are to be interpreted according to the user's C locale. Typical Linux users today run UTF-8 locales, so typical Linux filenames are already UTF-8. If a user is not running UTF-8, they'll have problems: common things like glib assume filenames are UTF-8 regardless of locale, while others like KDE do honor the locale by default. Worse, even this behavior isn't consistent because both behaviors can be overridden by user-level configuration.

This is a mess. Paths should not be subject to re-interpretation depending on a matrix of each user's environment variables, least of all paths that are shared in one global namespace.

Filesystems on major non-Linux platforms – including Windows and at least one UNIX – avoid this mess by requiring their filenames to contain Unicode strings. NFS avoids this mess by declaring that paths are UTF-8. IPFS should do the same.

I think it's better to add a small amount of complexity to IPFS (validate that paths are UTF-8 and prohibit the most problematic characters) than to ignore the issue and shift complexity to everything else (forcing every other application to deal with IPFS paths that are not necessarily Unicode, that contain NULs, that contain newlines, etc.).

jbenet commented 9 years ago

Yes, I agree with UTF-8. I think it's time for the Path System to be native UTF-8. After all, the focus of paths is print (screens, humans, paper, etc), including societies which have different alphabets, and would want different characters in their paths.


chriscool commented 9 years ago

If paths are UTF-8, I think there might be issues around whether or not to normalize the paths. If we normalize before hashing, people might not understand that what they get out will not be binary-identical with what they put in. If we don't normalize, people might not understand why UTF-8 paths that appear identical produce different hashes.

mildred commented 9 years ago

I agree that we should consider paths to be UTF-8 if they are to be interpreted as Unicode, or even require that the string be valid UTF-8, but the primary representation of a path should be a byte string and not a Unicode string. Otherwise we have all the normalisation problems to deal with, as @chriscool points out.

As such, IPFS should require path names to be valid UTF-8 but should also ensure that the byte string is left unmodified and can be byte-compared for identity.

Also, this is true for unixfs, but can't we imagine applications that would like to have binary keys? Or will applications that want to use binary strings have to take a similar approach to lz-string and convert binary data to Unicode?

I can imagine, for example, that an application would want to store a public key as a link name. Would this application have to convert it using Base64 or similar? Isn't it ironic that we can't store binary data when the wire format is binary and would let us do that easily?

willglynn commented 9 years ago

> If paths are UTF-8, I think there might be issues around whether or not to normalize the paths. If we normalize before hashing, people might not understand that what they get out will not be binary-identical with what they put in. If we don't normalize, people might not understand why UTF-8 paths that appear identical produce different hashes.

Yep – damned if you do, damned if you don't.

Normalizing means the "same" path gets the same hash even when it's encoded differently, but yep, it has the disadvantage of making the path you put in different from the path you get out. This can cause problems which seem more serious than the gains from normalizing.

Not normalizing means that the same character can be encoded using different sets of codepoints, which probably will cause problems where text as entered won't match text as encoded in IPFS. On the other hand, round-tripping paths through a human is a problem even with most canonicalization schemes, since many different characters are rendered using identical glyphs, e.g. Ω OHM SIGN U+2126 and Ω GREEK CAPITAL LETTER OMEGA U+03A9. Any scheme that intends to solve this problem will require a very complex set of logic, and it will always be incomplete. Unicode is designed to represent human language, but not to give that language a distinguished encoding.

HFS+ rewrites Unicode filenames in a way that supports their desired case-insensitivity behaviors. Specifically, it performs Unicode decomposition using a fixed table that originally dated to Unicode 2.1. They updated this table in 10.3 to Unicode 3.2, requiring a special fsck to re-canonicalize filenames. IPFS doesn't have that option – if we normalize strings today, we must never change that logic.

I favor validating names but leaving them alone.

> As such, IPFS should require path names to be valid UTF-8 but should also ensure that the byte string is left unmodified and can be byte-compared for identity.

Agreed. This is my approach in #1740.

> Isn't it ironic that we can't store binary data when the wire format is binary and would let us do that easily?

The wire format would permit it, but enabling []byte names would undermine the greater utility of paths, which is that they can be used by humans: displayed, selected from lists, passed to programs as parameters, etc. I saw []byte as the other option, but @jbenet is firmly on the side of making paths useful to humans, which IMO means a) requiring Unicode and b) excluding specific codepoints that cause problems.

mildred commented 9 years ago

> The wire format would permit it, but enabling []byte names would undermine the greater utility of paths, which is that they can be used by humans: displayed, selected from lists, passed to programs as parameters, etc. I saw []byte as the other option, but @jbenet is firmly on the side of making paths useful to humans, which IMO means a) requiring Unicode and b) excluding specific codepoints that cause problems.

For unixfs paths, I fully understand why we want them human readable, but there are many applications that could use ipfs besides unixfs.

I'm in favor of allowing only UTF-8 paths for unixfs, but not of restricting paths for other data models, unless there is a good reason to.

whyrusleeping commented 9 years ago

> I'm in favor of allowing only UTF-8 paths for unixfs, but not of restricting paths for other data models, unless there is a good reason to.

:+1:

jbenet commented 9 years ago

@mildred the path name space is already limited, and not arbitrary-byte: / is disallowed -- or at best, has a specific meaning of drilling down a hierarchical structure. Once IPLD lands, you won't need to use the standard paths to traverse the graph; you could have merkle links there without the standard pathing, and with arbitrary byte keys. However, the native ipfs path system has the restriction of being printable; that is, able to be pasted into filesystems, url bars, emails, newspapers, and so on. At this time, i strongly believe UTF-8 is the right thing to do.

FWIW:

jbenet commented 9 years ago

On normalizing, yeah this is a difficult decision. I'm not yet sure what to do. Changing the user's data is -- to me, in general -- a source of problems. But as @willglynn mentions, damned if you do, damned if you don't.

It may be that in some implementation or mode, objects which are not normalized -- i.e. where normalizing them would change the bytes -- are not silently changed but instead rejected with a meaningful error. This would bypass the uncertainty. (It sounds like that's what you're suggesting, @willglynn?)

HFS+

I usually see HFS+ as a source of examples of what not to do. Though indeed useful to study.

djdv commented 8 years ago

@willglynn @jbenet I've encountered cross-platform path-related problems that may need to be considered; see #2013. I'm posting about them here just to ensure awareness. They should be uncommon edge cases, but problems nonetheless. The biggest offender in that issue may be the case-sensitivity case. I think cases like these have to be permitted, but tools like ipfs get should have methods to resolve them, such as converting problematic characters on filesystems/platforms that do not allow them.
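
To sketch what such a conversion step could look like (a hypothetical helper, not an existing ipfs get option; the replacement character and the exact exclusion set are assumptions):

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeForWindows replaces characters that Windows filesystems reject
// with '_' so that an exported file can actually be written.
func sanitizeForWindows(name string) string {
	return strings.Map(func(r rune) rune {
		switch r {
		case '<', '>', ':', '"', '/', '\\', '|', '?', '*':
			return '_'
		}
		if r < 0x20 {
			return '_' // control characters
		}
		return r
	}, name)
}

func main() {
	fmt.Println(sanitizeForWindows(`report: draft?.txt`)) // report_ draft_.txt
}
```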

devonartis commented 3 years ago

Is this still an issue?