jborean93 / smbprotocol

Python SMBv2 and v3 Client
MIT License

Example/Documentation unclear for low level reading of file size. #255

Open heijligers opened 10 months ago

heijligers commented 10 months ago

I'm trying to use GPT4 to implement a Python SMB crawler that has to connect over a VERY SLOW connection to a Synology NAS with MILLIONS of files. Luckily I only need a subset of the folders and of the file types. Can someone help me get a basic version up and running? Using various GPT tools and trying to parse the low-level source code myself, I haven't managed to get the following software design and reference implementation to work:

Prototype 3:

Yaml.conf:

top_folder_filter: P100*
file_copy_extention_filter:

def main():
    # Load configuration from YAML file
    with open("config.yaml", "r") as file:
        config = yaml.safe_load(file)

    # Samba client configuration
    server_ip = config['server_ip']
    username = config['server_user']
    password = config['server_password']
    share_name = config['share_name']
    top_folder_filter = config['top_folder_filter']
    file_copy_extention_filter = config['file_copy_extention_filter']
    try:
        guid = uuid.uuid4()
        connection = Connection(guid, server_ip)
        connection.connect()
        session = Session(connection, username, password)
        session.connect()

        # Ensure the share name is correctly formatted as '\\server\share' before passing it to TreeConnect
        formatted_share_name = rf"\\{server_ip}\{share_name}"
        logging.info(f"Formatted Share Name: {formatted_share_name}")

        tree = TreeConnect(session, formatted_share_name)

        try:
            tree.connect()
        except SMBResponseException as e:
            logging.error(f"Error connecting to share: {e}")
            raise

        # Retry strategy for file download
        @retry(stop=stop_after_attempt(3), wait=wait_fixed(2))
        def download_file(tree, file_path, local_path):
            try:
                with Open(tree, file_path) as file:
                    file.read(local_path)
                    logging.info(f"Downloaded file: {file_path}")
                    track_matched_file(file_path)
            except Exception as e:
                logging.error(f"Error downloading file {file_path}: {e}")
                raise

        # Function to check file extension
        def should_copy_file(file_name):
            return any(fnmatch.fnmatch(file_name, '*' + ext) for ext in file_copy_extention_filter)

        # Function to crawl directories
        def crawl_directory(tree, path=""):
            try:
                with Open(tree, path, desired_access=DirectoryAccessMask.FILE_LIST_DIRECTORY) as dir:
                    for file_info in dir.query_directory("*"):
                        file_path = os.path.join(path, file_info['file_name'])
                        if file_info['file_attributes'] & FileAttributes.FILE_ATTRIBUTE_DIRECTORY:
                            if fnmatch.fnmatch(file_info['file_name'], top_folder_filter):
                                crawl_directory(tree, file_path)
                        elif should_copy_file(file_info['file_name']):
                            download_file(tree, file_path, f"local_directory/{file_info['file_name']}")
            except Exception as e:
                logging.error(f"Error crawling directory {path}: {e}")

        # Function to track matching files
        matched_files = []

        def track_matched_file(file_path):
            matched_files.append(file_path)
            logging.info(f"Tracking file: {file_path}")

        crawl_directory(tree)
    except Exception as e:
        logging.error(f"Error in main: {e}")

if __name__ == "__main__":
    main()

attempt 2 (incomplete):

import sys
import uuid
from contextlib import contextmanager
from io import BytesIO

import yaml
from loguru import logger
from tenacity import retry, stop_after_attempt, wait_exponential

from smbprotocol.connection import Connection
from smbprotocol.open import CreateDisposition, CreateOptions, DirectoryAccessMask, FileAttributes, \
    FileInformationClass, FilePipePrinterAccessMask, ImpersonationLevel, Open, ShareAccess
from smbprotocol.session import Session
from smbprotocol.tree import TreeConnect

@contextmanager
def smb_b_open(tree, file_path, mode='r', share='r', username=None, password=None, encrypt=True):
    """
    Functions similar to the builtin open() method where it will create an open handle to a file over SMB. This can
    be used to read and/or write data to the file using the methods exposed by the Open() class in smbprotocol. Read
    and write operations only support bytes and not text strings.

    :param tree: smbprotocol tree object
    :param file_path: Path of the file to open, relative to the root of the share
    :param mode: The mode in which the file is to be opened, can be set to one of the following;
        'r': Opens the file for reading (default)
        'w': Opens the file for writing, truncating first
        'x': Create a new file and open it for writing, fail if the file already exists
    :param share: The SMB sharing mode to set for the opened file handle, can be set to one or more of the following:
        'r': Allows other handles to read from the file (default)
        'w': Allows other handles to write to the file
        'd': Allows other handles to delete the file
    :param username: Optional username to use for authentication, required if Kerberos is not used.
    :param password: Optional password to use for authentication, required if Kerberos is not used.
    :param encrypt: Whether to use encryption or not, Must be set to False if using an older SMB Dialect.
    :return: The opened smbprotocol Open() obj that has a read, write, and flush functions.
    """
    # username, password and encrypt are not used here because the tree passed in is already connected.
    if mode == 'r':
        create_disposition = CreateDisposition.FILE_OPEN
        access_mask = FilePipePrinterAccessMask.GENERIC_READ
    elif mode == 'w':
        create_disposition = CreateDisposition.FILE_OVERWRITE_IF
        access_mask = FilePipePrinterAccessMask.GENERIC_WRITE
    elif mode == 'x':
        create_disposition = CreateDisposition.FILE_CREATE
        access_mask = FilePipePrinterAccessMask.GENERIC_WRITE
    else:
        raise ValueError("Invalid mode value specified.")

    share_map = {
        'r': ShareAccess.FILE_SHARE_READ,
        'w': ShareAccess.FILE_SHARE_WRITE,
        'd': ShareAccess.FILE_SHARE_DELETE,
    }
    share_access = 0
    for s in share:
        share_access |= share_map[s]

    obj = Open(tree, file_path)
    obj.create(
        ImpersonationLevel.Impersonation,
        access_mask,
        FileAttributes.FILE_ATTRIBUTE_NORMAL,
        share_access,
        create_disposition,
        0,
    )

    # Hand the open handle to the caller and always close it afterwards.
    try:
        yield obj
    finally:
        obj.close()

class FileEntry(object):

    def __init__(self, path, file_directory_info):
        self.name = file_directory_info['file_name'].value.decode('utf-16-le')
        self.path = r"%s\%s" % (path, self.name)
        self.ctime = file_directory_info['creation_time'].value
        self.atime = file_directory_info['last_access_time'].value
        self.wtime = file_directory_info['last_write_time'].value
        # allocation_size is the space reserved on disk; the logical file length is in the 'end_of_file' field.
        self.size = file_directory_info['allocation_size'].value
        self.attributes = file_directory_info['file_attributes'].value

        self.is_archive = self._flag_set(FileAttributes.FILE_ATTRIBUTE_ARCHIVE)
        self.is_compressed = self._flag_set(FileAttributes.FILE_ATTRIBUTE_COMPRESSED)
        self.is_directory = self._flag_set(FileAttributes.FILE_ATTRIBUTE_DIRECTORY)
        self.is_hidden = self._flag_set(FileAttributes.FILE_ATTRIBUTE_HIDDEN)
        self.is_normal = self._flag_set(FileAttributes.FILE_ATTRIBUTE_NORMAL)
        self.is_readonly = self._flag_set(FileAttributes.FILE_ATTRIBUTE_READONLY)
        self.is_reparse_point = self._flag_set(FileAttributes.FILE_ATTRIBUTE_REPARSE_POINT)
        self.is_system = self._flag_set(FileAttributes.FILE_ATTRIBUTE_SYSTEM)
        self.is_temporary = self._flag_set(FileAttributes.FILE_ATTRIBUTE_TEMPORARY)

    def _flag_set(self, attribute):
        return self.attributes & attribute == attribute

# Define _listdir helper function for applying a filter pattern and recursion to listing the content of a samba share,
# specified by the tree variable
def _listdir(tree, path, pattern, recurse):
    full_path = tree.share_name
    if path != "":
        full_path += r"\%s" % path

    # We create a compound request that does the following;
    #     1. Opens a handle to the directory
    #     2. Runs a query on the directory to list all the files
    #     3. Closes the handle of the directory
    # This is done in a compound request so we send 1 packet instead of 3 at the expense of more complex code.
    directory = Open(tree, path)
    query = [
        directory.create(
            ImpersonationLevel.Impersonation,
            DirectoryAccessMask.FILE_LIST_DIRECTORY,
            FileAttributes.FILE_ATTRIBUTE_DIRECTORY,
            ShareAccess.FILE_SHARE_READ | ShareAccess.FILE_SHARE_WRITE,
            CreateDisposition.FILE_OPEN,
            CreateOptions.FILE_DIRECTORY_FILE,
            send=False
        ),
        directory.query_directory(
            pattern,
            FileInformationClass.FILE_DIRECTORY_INFORMATION,
            send=False
        ),
        directory.close(False, send=False)
    ]

    query_reqs = tree.session.connection.send_compound(
        [x[0] for x in query],
        tree.session.session_id,
        tree.tree_connect_id,
        related=True
    )

    # Process the result of the create and close request before parsing the files.
    query[0][1](query_reqs[0])
    query[2][1](query_reqs[2])

    # Parse the queried files and repeat if the entry is a directory and recurse=True. We ignore . and .. as they are
    # not directories inside the queried dir.
    entries = []
    ignore_entries = [".".encode('utf-16-le'), "..".encode('utf-16-le')]
    for file_entry in query[1][1](query_reqs[1]):
        if file_entry['file_name'].value in ignore_entries:
            continue

        fe = FileEntry(full_path, file_entry)
        entries.append(fe)

        if fe.is_directory and recurse:
            dir_path = r"%s\%s" % (path, fe.name) if path != "" else fe.name
            entries += _listdir(tree, dir_path, pattern, recurse)

    return entries

def main1():
    # Load configuration
    with open('config.yaml', 'r') as file:
        config = yaml.safe_load(file)
    # Samba client configuration
    server_ip = config['server_ip']
    username = config['server_user']
    password = config['server_password']
    share_name = config['share_name']
    top_folder_filter = config['top_folder_filter']
    file_copy_extention_filter = config['file_copy_extention_filter']

    # Initialize logging
    logger.add("file.log")
    logger.add(sys.stderr, format="{time} {level} {message}", filter="my_module", level="INFO")

    # Establish connection to Samba server
    # Here we will use the Connection, Session, and TreeConnect classes from smbprotocol to establish a connection to the Samba server.
    # Initialize connection
    connection = Connection(uuid.uuid4(), config['server_ip'], 445)
    connection.connect()
    session = Session(connection, config['server_user'], config['server_password'])
    session.connect()
    tree = TreeConnect(session, rf"\\{config['server_ip']}\{config['share_name']}")
    tree.connect()

    # Query the Samba share for top level folders qualifying the top_folder_filter
    entries = _listdir(tree, "", top_folder_filter, False)
    # prepare downloading file
    create_disposition = CreateDisposition.FILE_OPEN
    access_mask = FilePipePrinterAccessMask.GENERIC_READ
    share_map = {
        'r': ShareAccess.FILE_SHARE_READ,
        'w': ShareAccess.FILE_SHARE_WRITE,
        'd': ShareAccess.FILE_SHARE_DELETE,
    }
    share_access = 0
    share = 'r'
    for s in share:
        share_access |= share_map[s]

    # For each folder returned from the query, call a recursive function to crawl the folder.
    # Here we will define a recursive function that takes a folder as an argument.
    # This function will query the current folder using the method _listdir.
    # For each file, it will download it using smb_b_open and log the download status.
    for entry in entries:
        subentries = _listdir(tree, entry.name, "*", True)
        for subentry in subentries:
            if subentry.name.split('.')[-1] in file_copy_extention_filter:
                # Open() expects a path relative to the share root, while FileEntry.path starts with the
                # share name (\\server\share), so strip that prefix first.
                relative_path = subentry.path[len(tree.share_name) + 1:]
                obj = Open(tree, relative_path)
                obj.create(
                    ImpersonationLevel.Impersonation,
                    access_mask,
                    FileAttributes.FILE_ATTRIBUTE_NORMAL,
                    share_access,
                    create_disposition,
                    0,
                )
                # The create response already carries the file size; smbprotocol stores it on the
                # Open object as end_of_file, so no separate query is needed.
                file_size = obj.end_of_file
                # A single read cannot exceed the negotiated max_read_size, so read in chunks
                # (capped at 64 KiB here as a conservative choice).
                chunk_size = min(tree.session.connection.max_read_size, 65536)
                with open(subentry.name, 'wb') as local_file:
                    offset = 0
                    while offset < file_size:
                        data = obj.read(offset, min(chunk_size, file_size - offset))
                        if not data:
                            break
                        local_file.write(data)
                        offset += len(data)
                obj.close()

    # If an error occurs, log the error and if it's recoverable, retry the operation using tenacity.
    # Here we will use the retry decorator from tenacity to automatically retry operations in case of recoverable errors. We can customize the retry logic by specifying the number of attempts, wait time, etc.

    # Finally, close the connection to the Samba server and log the final state.
    # Here we will use the disconnect method of the Connection class from smbprotocol to close the connection to the Samba server. We will also log the final state, which could include the number of files downloaded, the number of errors encountered, and the last directory or file that was processed.

if __name__ == "__main__":
    main1()

Software Design Specification for a Remote Samba Share Crawler

Overview

The Remote Samba Share Crawler is designed to connect to a Samba share, crawl through its directories and files, and download specified files to a local directory. It supports various features like recursive crawling, threading, logging, and error handling.

Functional Requirements

  1. Connection Management: Establish and manage a connection to a Samba share using server IP, user credentials, and share name.
  2. Directory Crawling: Recursively list directories and files in the Samba share, starting from a specified base directory.
  3. File Downloading: Download files from the Samba share to a local directory, with support for retries and throttling.
  4. Logging: Log various operations and errors for debugging and monitoring.
  5. State Management: Maintain and save the state of crawling and downloading operations, allowing resumption from the last state in case of interruption.
  6. Configuration Management: Load and use configuration from an external file, allowing easy modification of parameters.
  7. Error Handling: Handle and log errors, particularly in connection establishment, file listing, and file downloading.
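The state management requirement (item 5) could be sketched roughly as follows. This is only an illustration, not part of either prototype: the `CrawlState` class, its method names, and the JSON file layout are all hypothetical, and it uses nothing beyond the Python standard library.

```python
import json
import os
import tempfile


class CrawlState:
    """Hypothetical helper that remembers which remote paths are already downloaded."""

    def __init__(self, state_file):
        self.state_file = state_file
        self.done = set()
        # Resume from a previous run if a state file is present
        if os.path.exists(state_file):
            with open(state_file, "r", encoding="utf-8") as fh:
                self.done = set(json.load(fh))

    def is_done(self, remote_path):
        return remote_path in self.done

    def mark_done(self, remote_path):
        self.done.add(remote_path)
        self._save()

    def _save(self):
        # Write to a temporary file in the same directory, then swap it in
        # so an interruption never leaves a half-written state file behind
        directory = os.path.dirname(os.path.abspath(self.state_file))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(sorted(self.done), fh)
        os.replace(tmp_path, self.state_file)
```

A crawler would call `is_done(path)` before each download and `mark_done(path)` after it. With millions of files, rewriting the whole file on every download gets expensive, so saving every N files or using SQLite would probably be a better fit.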

Non-functional Requirements

  1. Modularity: Code should be structured into distinct classes and functions for ease of maintenance and scalability.
  2. Performance: Efficient crawling and downloading, with the option to use threading to improve performance.
  3. Security: Secure handling of credentials and encryption of the connection where necessary.
  4. Flexibility: Ability to easily change the underlying Samba client library or logging framework.

Proposed Architecture

1. Classes and Modules

2. External Libraries

3. Configuration

4. Logging

5. Error Handling and Retry Logic

6. Threading and Concurrency

jborean93 commented 10 months ago

I would highly recommend you use the high level API, specifically smbclient.scandir to enumerate entries on a directory. There's not too much that you really gain by using the low level API here as I've tried to make the high level one as efficient as possible for the operations needed. Even just things like opening a file/directory can be done with the high level API and then using the raw file open object can be used for low level operations that might not be exposed in the high level API.

Ultimately I can't help you write your actual application, I can help if you have specific questions about smbprotocol that you may have but that's about it. If you don't have a specific question or query then I'll close this issue tomorrow.
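For readers landing here, a minimal sketch of what a crawl with the high-level API might look like. It assumes `smbclient.scandir` yields entries with `name`, `path`, `is_dir()` and `stat()` in the style of `os.scandir`; the `wanted` and `crawl` helpers and the injectable `scandir` argument are made up for illustration and are not part of the library.

```python
import fnmatch


def wanted(name, patterns):
    """Return True when the file name matches any glob pattern, ignoring case."""
    return any(fnmatch.fnmatch(name.lower(), p.lower()) for p in patterns)


def crawl(root, patterns, scandir=None):
    """Yield (path, size_in_bytes) for matching files below an SMB directory.

    scandir defaults to smbclient.scandir; it is a parameter only so the
    walk can be tried out without a server (an assumption of this sketch).
    """
    if scandir is None:
        import smbclient  # provided by the smbprotocol package
        scandir = smbclient.scandir
    for entry in scandir(root):
        if entry.is_dir():
            # Recurse into subdirectories
            yield from crawl(entry.path, patterns, scandir)
        elif wanted(entry.name, patterns):
            # The size is taken from the stat result of the directory entry
            yield entry.path, entry.stat().st_size
```

Usage would be something like `smbclient.register_session("server", username="user", password="pass")` followed by `for path, size in crawl(r"\\server\share\P1001", ["*.pdf"]): ...`, where the server, credentials and folder names are placeholders.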

heijligers commented 10 months ago

Thanks for your response. Does the high level api support using a filter pattern? Getting the top level folder share listing takes 30+ minutes as it contains tens of thousands of folders.

Thank you


jborean93 commented 10 months ago

Yep, the search_pattern kwarg https://github.com/jborean93/smbprotocol/blob/37512ee0648ad64f98755833382fea790d9b2df6/src/smbclient/_os.py#L526 supports the normal server side filtering with * and ? that the underlying SMB server supports.
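As a rough illustration of the `search_pattern` kwarg mentioned above, the sketch below asks the server for only the top-level folders matching a pattern instead of listing everything and filtering locally. The `matching_dirs` helper and the injectable `scandir` argument are hypothetical; only `smbclient.scandir` and its `search_pattern` kwarg come from the thread.

```python
def matching_dirs(share_root, pattern, scandir=None):
    """Return the names of directories in share_root that match pattern.

    The pattern is passed to the server via search_pattern, so entries
    that do not match should never cross the slow link.
    """
    if scandir is None:
        import smbclient  # provided by the smbprotocol package
        scandir = smbclient.scandir
    return [
        entry.name
        for entry in scandir(share_root, search_pattern=pattern)
        if entry.is_dir()
    ]
```

For the config in this issue the call might be `matching_dirs(r"\\server\share", "P100*")`, with the server and share names as placeholders.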

heijligers commented 10 months ago

Awesome! thanks! I am quite proud that I actually managed to get my first version using the smbprotocol to work well enough for my purposes. In the future I'll rely on smbclient for sure!

One last question, you might easily be able to answer for me. Is there a record of the username or owner who uploaded/created the file in the samba protocol?

Thanks again!


jborean93 commented 10 months ago

The closest there is is the "Owner" of the file in the security descriptor. Unfortunately it's not reliable as on Windows this could be the Administrators group or whatever is set in the user's group sids as the owner. Plus getting that value will only give you the SID string in python, you still need a separate process to translate that to an account name which this library does not do.

heijligers commented 9 months ago

Thanks! SID might actually be enough. I'm only interested in knowing which files were created by the same users, not necessarily the name of the user.
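Grouping files by owner does not need the account name at all. A small sketch, assuming the owner SID strings have already been collected by some other means (retrieving them is outside what is shown here); the `group_by_owner` helper and the SID values in the usage note are made up.

```python
from collections import defaultdict


def group_by_owner(path_sid_pairs):
    """Map each owner SID string to the list of file paths it owns."""
    groups = defaultdict(list)
    for path, sid in path_sid_pairs:
        groups[sid].append(path)
    return dict(groups)
```

For example, `group_by_owner([("a.pdf", "S-1-5-21-1-2-3-1001"), ("b.pdf", "S-1-5-21-1-2-3-1001")])` would put both files under the same key. As noted above, the owner may be a group such as Administrators, so identical SIDs do not always mean the same person.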
