jborean93 / smbprotocol

Python SMBv2 and v3 Client
MIT License
322 stars 73 forks source link

smbclient: xattr commands set Extended Attributes. Should they use Alternate Data Streams instead? #292

Closed mon closed 1 month ago

mon commented 1 month ago

I've fallen into a bit of a rabbit hole today regarding xattrs.

There are currently two ways (connecting to a Windows SMB server, in this case my computer set to share a folder) to write extra attributes to files.

Alternate Data Streams, which uses the FileStreamInformation type

with smbclient.open_file(path + ":ntfs_stream", "wb") as f:
    f.write(b"Using FileStreamInformation")

Also seen in the smbprotocol tests.

Extended Attributes, which uses the FileFullEaInformation type

smbclient.setxattr(path, "extended_attribute", "Using FileFullEaInformation")

The former is what ntfs-3g appears to map the concept of linux xattrs to (seen here), and is viewable using commandline tools on Windows:

D:\>dir /r test.txt
 Directory of D:\

11/10/2024  08:48 PM                12 test.txt
                                    27 test.txt:ntfs_stream:$DATA
               1 File(s)             12 bytes
               0 Dir(s)  xxx bytes free

The latter, far as I can tell, is only retrievable by Windows C APIs such as NtSetEaFile (seen here).

Anecdotally, I can also confirm that macOS maps xattr calls over mounted SMB shares to the FileStreamInformation type.

In general, it's really very hard to find information on "NTFS Extended Attributes" because most people call Alternate Data Streams "Extended attributes" when they write about them. The best summary I found was part of an addendum to a SO question.

Based on what other clients and servers do, would it be reasonable to move smbclient's xattr implementation to instead use FileStreamInformation instead of FileFullEaInformation? Or did I miss something obvious!

jborean93 commented 1 month ago

NTFS Extended Attributes are certainly a niche topic and is typically only really used by Windows components but I do not equate them to Alternate Data Streams. EA are a lot more limited than ADS and is thus reflected by the API whereas ADS can store large amounts of data and are essentially another file in that path.

Even if I wanted to change it, this would be a breaking change.

mon commented 1 month ago

Understandable re: it being a breaking change, and thanks for the clarification. I discovered this morning that Cygwin treats xattrs as EAs as well, so there's not quite the consistency I thought there was.

Would you accept a PR adding a nicer API for reading ADSs? I already discovered a few odd footguns, such as attribute names with ':' in their title being transparently mapped to \uff20.

I can think of three ways to implement it

jborean93 commented 1 month ago

I'm on the fence about this, to me I think the normal Windows path format with \\server\share\file:ads_name is consistent with how Windows itself treats ADS entries. You should be able to create/edit ADS entries with the smbclient.open_file and smbclient.remove should be able to delete them. The only API that's really missing is one to enumerate ADS names which we can certainly add if needed.

I already discovered a few odd footguns, such as attribute names with ':' in their title being transparently mapped to \uff20

Are you able to expand on this problem?

mon commented 1 month ago

I think the normal Windows path format ... is consistent with how Windows itself treats ADS entries

Yeah that's fair, I'd be fine with that continuing to be the way to open ADS stuff.

Are you able to expand on this problem?

I spent an hour yesterday trying to find some source code to point at (especially because I want to know if it includes other characters), but came up empty. I'll look again tomorrow. I don't want to propose anything without having some code to point to.

mon commented 1 month ago

Ok, I finally found the document explaining the character conversions, from an old Microsoft KB article that disappeared. https://mskb.pkisolutions.com/kb/117258

You'll note this doesn't mention the colon. That's explained by looking at the macOS NTFS source.

jborean93 commented 1 month ago

I'm unsure what the problem is sorry, I can potentially see the macOS client converting : to a unicode characters mentioned but smbprotocol/SMB itself doesn't do anything special here. Doing : in the filename will separate the filename and ADS entry as expected. Are you saying smbprotocol will also do the conversion?

mon commented 1 month ago

SMB doesn't do anything special here, you're right - what happens at the moment is it'll error when creating illegal filenames (assuming it's an NTFS server).

I finally found the right config option (mapchars) used in the Linux cifs mount driver (which implements SMB support), so it's not just macOS that implements this mapping. They started it though, software such as TrueNAS calls it "Apple Style Character Encoding".

Honestly though it's a bit of a mess (there's also a horrible 8.3 mapping option), and I understand if this isn't something you want to support. I would be totally happy with just implementing a "list streams for file" function and leave all this madness sitting in this issue.

jborean93 commented 1 month ago

I think adding the list makes sense to me. The client will work directly with the SMB protocol and doesn’t deal with any of the path char mappings other clients do. I don’t think trying to support the same magic behaviour here makes any sense.

Ultimately we provide a path to the server as is without any changes, it’s up to the SMB server to decide how to treat the chars in the path. For Windows the : is valid in the filename if the underlying volume supports ADS (which NTFS does).

mon commented 1 month ago

Alright, I'll get on that! An example list of streams:

:test_ads_stream:$DATA
:com.apple.provenance:$DATA
::$DATA

Should I return the names as-is, or split off the $DATA at the end?

jborean93 commented 1 month ago

As is, IIRC you can specify custom tags in the ADS (default is $DATA) so just giving back the values that SMB returns makes more sense to me.