jborean93 / smbprotocol

Python SMBv2 and v3 Client
MIT License
320 stars 73 forks source link

Difference between SMBDirEntry._dir_info and .stat() #250

Closed rikroe closed 10 months ago

rikroe commented 10 months ago

Hi,

many thanks for your library! I am exploring to use it to connect to Windows Server SMB shares. This lead to some questions on the difference of the time objects retrieved from scandir() vs. stat().

Use case

  1. Listing all files in a directory
  2. Filtering new or modified files based on modified date against a checkpoint
  3. Parsing the final set of files

I'm looking for a very efficient way to do this, as sometimes we have >50k files in a single folder (we try to reduce this, but unfortunately this is not always possible).

Test

I started working with a smaller directory of ~150 files/folders and already see a large (order of magnitude) runtime difference, depending on the source of modified date access I am using.

For both tries, I am using scandir(). It is very fast (122ms) if I extract the last_write_time via SMBDirEntry._dir_info["last_write_time"]. If use SMBDirEntry.stat().st_mtime, it takes more than an order of magnitude longer (1.92s, I have also seen 2.5s). My guess is that this will even increase as more files have to be checked, as the main issue is probably the network latency due to additional stat calls.

Speed-wise, I am leaning to just use _dir_info, because it is fast. However, it is a private attribute and there are also some slight differences of 1 ms (?!?) on some files/folders (see below).

Question

The main question would be if it is safe to use _dir_info to get (at least second level granularity) modification dates in an efficient way.

Example

(The data was only stored in a dict to make easier comparisons on single files later on. It does not have any performance impact.)

image

Thank you!

jborean93 commented 10 months ago

The main question would be if it is safe to use _dir_info to get (at least second level granularity) modification dates in an efficient way.

I don't have any plans on changing it but as you said it's not part of the public API so I can't guarantee it'll always stay the same. I can see the benefits of exposing some of the fields in the FileIdFullDirectoryInformation object that exposes but I would probably just make them their own read only property on the SMBDirEntry object or at least expose a new custom dataclass/namedtuple with the data it returns rather than expose the underlying low level API object.

Unfortunately the stat() call requires a bit more information than what the full directory info one returns. It could omit at least one of the query operations it requires but it's still going to be the same round trips just with a tiny bit less work than what the server needs to do.

As for the time differences I'm honestly not sure why that would be. The value is derived from the raw FILETIME integer value which is 100s of nanoseconds since 1601-01-01. It would be curious if there was a bug in smbprotocol or whether the server is actually returning a different value here. To see that you'll have to either capture the network traffic yourself and compare or enable debug logs and see what is contained for these fields.

jborean93 commented 10 months ago

I think the time differences is a product of how division works at this precision. I cannot reason why but you can compare the following values

# From dir_info
time = datetime.datetime(2023, 11, 24, 16, 40, 25, 284090)

# From stat, basically the raw value * 100
time_ns = 1700844025284090400

# From stat, time_ns / 1000000000
time = 1700844025.2840905

We can see that time is just time_ns divided by number of nanoseconds in a second. At this level of precision it is Python who is changing that value (for a reason I don't know). I don't think I can do much about this sorry.

In the meantime here is a PR to add the new dir_info results in a more public API https://github.com/jborean93/smbprotocol/pull/251.

adiroiban commented 10 months ago

Just some info.

The time difference can be an issue if you do sync oprerations

I don't know how this can be fixed with float.

Python 3.11.6 (b5ae34b, Oct  9 2023, 10:31:15) [GCC 7.3.1 20180712 (Red Hat 7.3.1-17)] on linux
>>> 1700844025284090400 / 1000000000
1700844025.2840905
>>> 1700844025284090400 / 1000000000.0
1700844025.2840903
>>> test = 1700844025.2840904
>>> test
1700844025.2840905
>>> from decimal import Decimal
>>> Decimal('1700844025284090400')/ 1000000000
Decimal('1700844025.2840904')
adiroiban commented 10 months ago

Maybe one option is to also keep the values as integers in SMBStatResult

jborean93 commented 10 months ago

Maybe one option is to also keep the values as integers in SMBStatResult

The raw value from SMB is the FILETIME values which is essentially the st_atime_ns value which is an integer. Making the non nanoseconds ones a whole integer rather than decimal means we loose even more precision plus would be a breaking change. If you need precision at that level then use the *_ns variants.

adiroiban commented 10 months ago

Stupid me. I now seen the *_ns variants. All good. Thanks and sorry for the noise.

rikroe commented 10 months ago

Thank you for the quick and thorough answer!

Just a quick question: How is the conversion done for scandir()'s raw_dir_info? It seems it is already converted in query_directory() but I cannot find out how the conversion is done.

jborean93 commented 10 months ago

Thanks and sorry for the noise.

No need to apologise, always good to have another perspective.

Just a quick question: How is the conversion done for scandir()'s raw_dir_info? It seems it is already converted in query_directory() but I cannot find out how the conversion is done.

It happens a few layers down when unpacking the response https://github.com/jborean93/smbprotocol/blob/master/src/smbprotocol/open.py#L735. Essentially the request and the associated file class id is used when unpacking the response to determine the structure it should be.