github / codeql

CodeQL: the libraries and queries that power security researchers around the world, as well as code scanning in GitHub Advanced Security
https://codeql.github.com
MIT License
7.52k stars 1.5k forks source link

Potential false positive for "Uncontrolled data used in path expression" alert #17226

Open tieneupin opened 1 month ago

tieneupin commented 1 month ago

Description of the false positive

I'm writing functions to add files to an SQL database, and CodeQL has flagged that the file paths are a potential security risk.

I have constructed a basic validator to make sure that only file paths that exist and start with a given base path when resolved are accepted for subsequent processing. However, CodeQL still views this as being insufficient.

Is this a false positive, or can my validation check be further enhanced? Note that I use resolve so that I can get and compare the start of the file paths.

Code samples or links to source code

# Python version: 3.9.19
# OS: GNU/Linux RHEL8 4.18.0-553.5.1.el8_10.x86_64
from pathlib import Path

# The storage path is defined internally by an environment variable; here is a placeholder
storage_path = Path("/path/to/where/files/are/stored")

# Define a function to validate the file path provided
def validate_file(file: Path) -> bool:
    file = Path(file) if isinstance(file, str) else file  # Pre-empt accidental string inputs
    file = file.resolve()  # Get full path for validation

    # Fail if file doesn't exist
    if not file.exists():
        return False

    # Use path to storage location as reference
    basepath = list(storage_path.parents)[-2]  # This can be made stricter eventually
    if str(file).startswith(str(basepath)):
        return True
    else:
        return False

# How it's used in the script
incoming_file = Path("some_file.txt")  # Can be partial path, or full path on system

if validate_file(incoming_file) is True:
    """
    Run code here to store the file in the database
    """
    return True
else:
    raise Exception("This file failed the validation check")

URL to the alert on GitHub code scanning (optional) https://github.com/DiamondLightSource/python-murfey/pull/321/checks?check_run_id=28770940998

If this isn't a false positive, and CodeQL is working as intended, advice on mitigating the security issue in this context would be much appreciated. Thanks!

tieneupin commented 1 month ago

Update: I have attempted to improve the validation I'm using for the file paths I'm allowing into the database, and my current validation function looks like this:

from pathlib import Path
from os.path import normpath

def validate_file_path(file: Path) -> Path:
    """
    Validates the input file paths used for the workflow, making sure that they are
    safe to be executed and passed on to subsequent stages of the workflow. Returns
    a validated, fully resolved version of the file path.
    """

    file = Path(file) if isinstance(file, str) else file
    file = file.resolve()  # Get full path for inspection

    # Validate file types
    if not file.exists():
        raise Exception(f"{file!r} doesn't exist")
    if not file.is_file():
        raise Exception(f"{file!r} is not a file")
    if file.suffix not in valid_file_types:
        raise Exception(f"{file!r} is not a valid file type")

    # Try validating with os.path functions
    full_path = normpath(str(file))

    # Use path to storage location as reference to verify basepath is correct
    base_path = normpath("/path/to/where/data/should/be/stored")
    if str(full_path).startswith(str(base_path)):
        return Path(full_path).resolve()
    else:
        raise Exception(f"{file!r} points to a directory that is not permitted")

# The code block then becomes:
incoming_file = validate_file(file)
"""
Code block to register file in the database goes here
"""
return True

As per previous, would greatly appreciate your advice on whether this can be considered a false positive, or if I should further validate the file path. Thanks!

tieneupin commented 1 month ago

I've made another attempt at resolving this error, with my validation + sanitisation function looking like this now:

import re
from pathlib import Path

def validate_and_sanitise(file: Path) -> Path:
    """
    Validates the input file paths used for the CLEM workflow, making sure that they
    are safe to be executed and passed on to subsequent stages of the workflow. Returns
    a validated, sanitised version of the file path.
    """

    file = Path(file) if isinstance(file, str) else file

    # Convert to Posix str and check that no forbidden characters are therre
    full_path = file.resolve().as_posix()
    if bool(re.fullmatch(r"^[\w\s\.\-/]+$", full_path)) is False:
        raise ValueError(f"Unallowed characters present in {file!r}")

    # Check that it's not accessing somehwere it's not allowed
    base_path = Path("/path/to/where/data/should/be/stored")
    if not str(full_path).startswith(str(list(base_path.parents)[-3])):
        raise ValueError(f"{file!r} points to a directory that is not permitted")

    valid_file = Path(full_path)

    # Additional checks for file path given
    if not valid_file.is_file():
        raise ValueError(f"{file!r} is not a file")
    if not valid_file.exists():
        raise FileNotFoundError(f"{file!r} doesn't exist")
    if valid_file.suffix not in valid_file_types:
        raise ValueError(f"{file!r} is not a valid file format")

    return valid_file

# How it's used in the database function
def get_db_entry(...):
    # Validate file path if provided
    if file_path is not None:
        try:
            file_path = validate_and_sanitise(file_path)
        except Exception:
            raise Exception

    """
    Register file with database here
    """
    return db_entry

Unfortunately, this attempt also still fails to satisfy the CodeQL warning, so as previous, advice would be appreciated on how I could improve this check if it is indeed not a false positive.

tieneupin commented 4 weeks ago

Just another update on this situation:

I've finally been able to write a version of the sanitisation check that satisfies the CodeQL alert:

import re
from os import path
from pathlib import Path

def validate_and_sanitise(file: Path) -> Path:
    """
    Performs validation and sanitisation on the incoming file paths, ensuring that
    no forbidden characters are present and that the the path points only to allowed
    sections of the file server.

    Returns the file path as a sanitised string that can be converted into a Path
    object again.
    """
    full_path = path.normpath(path.realpath(file))
    if bool(re.fullmatch(r"^[\w\s\.\-/]+$", full_path)) is False:
        raise ValueError(f"Unallowed characters present in {file!r}")

    # Check that it's not accessing somehwere it's not allowed
    storage_path = Path("/path/to/where/data/is/stored")
    base_path = list(storage_path.parents)[-3].as_posix()
    if not str(full_path).startswith(str(base_path)):
        raise ValueError(f"{file!r} points to a directory that is not permitted")

    # Additional checks
    if f".{full_path.rsplit('.', 1)[-1]}" not in valid_file_types:
        raise ValueError("File is not a permitted file format")

    return Path(full_path)

# The rest of the function proceeds as previous
def get_db_entry(...):
    # Validate file path if provided
    if file_path is not None:
        try:
            file_path = validate_and_sanitise(file_path)
        except Exception:
            raise Exception

    """
    Register file with database here
    """
    return db_entry

Trying to resolve this issue revealed some interesting things about what CodeQL flags as an alert:

  1. I had to remove the file.exists() check from the list of validation + sanitisation checks run, as CodeQL flags it as "uncontrolled data use in a path expression"
  2. Resolving symlinks an getting the full file path using os.path worked, whereas doing the same with pathlib.Path got flagged as a security risk, even though both achieved the same end result of getting the full file path for inspection.

I've resolved this alert for my project now, but it might be worth considering discussing supporting the use of pathlib.Path in sanitisation and validation checks.

Looking forward to hearing your team's thoughts on this matter going forward. Thanks!