target / strelka

Real-time, container-based file scanning at enterprise scale

ScanEmail: Restrict Access in WeasyPrint #459

Closed phutelmyer closed 4 months ago

phutelmyer commented 5 months ago

Add local_fetch_only Function to Restrict External Network Access in WeasyPrint

Description

This PR introduces a custom URL fetcher function, local_fetch_only, for WeasyPrint. Its purpose is to prevent any external network access while WeasyPrint fetches resources. Only local file paths, base64-encoded data URLs, and relative URLs are allowed. All other URLs, including HTTP, HTTPS, FTP, and IP addresses, are blocked. Previously, external calls were observed for resources such as CSS files; these should be restricted.

Implementation

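The local_fetch_only function is designed to pass URLs with the data or file scheme, or with no scheme at all (relative URLs), to WeasyPrint's default URL fetcher.

For all other URL schemes (e.g., http, https, ftp), the function returns an empty response, effectively blocking the request.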

Code

from urllib.parse import urlparse
from weasyprint import default_url_fetcher

def local_fetch_only(url, *args, **kwargs):
    """
    Custom URL fetcher for WeasyPrint that prevents any external network access.

    This function allows only local file paths, base64 encoded data, and relative URLs. It blocks all other URLs,
    including HTTP, HTTPS, FTP, and IP addresses, ensuring that no external network access occurs during the fetching
    process.

    Args:
        url (str): The URL to fetch.
        *args: Additional positional arguments.
        **kwargs: Additional keyword arguments.

    Returns:
        dict: A dictionary containing an empty string for 'string', 'text/plain' for 'mime_type', and 'utf8' for 'encoding'
              if the URL is blocked. Otherwise, it uses the default fetcher for local resources.
    """
    parsed_url = urlparse(url)

    # Allow base64 encoded data, local file paths, or relative URLs
    if parsed_url.scheme in ('data', 'file', ''):
        return default_url_fetcher(url, *args, **kwargs)

    # Block all other URLs (http, https, ftp, IP addresses, etc.)
    return {
        'string': '',
        'mime_type': 'text/plain',
        'encoding': 'utf8'
    }
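
The PR text does not show how the fetcher is handed to WeasyPrint. A minimal sketch, assuming the rendering code builds a weasyprint.HTML object directly: WeasyPrint accepts a url_fetcher argument on HTML, and every resource request made while rendering goes through it. The helper name render_pdf_offline is hypothetical and is not taken from ScanEmail.

```python
def render_pdf_offline(html_source, fetcher):
    """Render an HTML string to PDF bytes, routing resource loads through `fetcher`.

    `render_pdf_offline` is a hypothetical helper for illustration only;
    how ScanEmail actually calls WeasyPrint is not shown in this PR.
    """
    # Imported lazily so this module can be loaded where WeasyPrint is absent.
    from weasyprint import HTML

    # url_fetcher is WeasyPrint's hook for custom resource loading.
    document = HTML(string=html_source, url_fetcher=fetcher)

    # With no target argument, write_pdf() returns the PDF as bytes.
    return document.write_pdf()


# Example call (requires WeasyPrint and the local_fetch_only function above):
# pdf_bytes = render_pdf_offline('<img src="http://example.com/a.png">', local_fetch_only)
```

With this wiring, the remote image in the example call would be answered with the empty response instead of a network request.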

Reasoning

The primary motivation for this implementation is security. By blocking external network requests, we ensure that WeasyPrint cannot inadvertently leak data or fetch resources from untrusted sources. Some remote resources will no longer load, but the result is ultimately safer.

Control

Allowing only local file paths and base64 encoded data provides fine-grained control over the resources that can be accessed. Relative URLs are permitted to ensure that internal resources can still be referenced without specifying the full URL.

Use Cases

  1. Base64 Data URL

    <img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mP8/wcAAwAB/gtLZ4cAAAAASUVORK5CYII=">
  2. Local File URL

    <img src="file:///path/to/local/image.png">
  3. Relative URL

    <img src="/images/local_image.png">
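
The use cases above all come down to what urlparse reports as the scheme. The following sketch repeats the PR's scheme check in isolation, using only the standard library, to show how each kind of URL is classified. The is_allowed helper, the shortened data URL, and the example.com URLs are illustrative assumptions.

```python
from urllib.parse import urlparse

# Same scheme allow-list as local_fetch_only: data URLs, file URLs, and
# URLs with no scheme (relative URLs).
ALLOWED_SCHEMES = ('data', 'file', '')


def is_allowed(url):
    """Return True if the URL would be passed on to the default fetcher."""
    return urlparse(url).scheme in ALLOWED_SCHEMES


samples = [
    "data:image/png;base64,iVBORw0KGgo=",   # shortened data URL
    "file:///path/to/local/image.png",
    "/images/local_image.png",
    "https://example.com/style.css",        # illustrative remote URLs
    "ftp://example.com/logo.png",
]

for url in samples:
    print(urlparse(url).scheme or "(none)", is_allowed(url), url)
```

One caveat, noted as an observation about urlparse rather than about the PR: a scheme-relative URL such as //example.com/style.css also parses with an empty scheme, so this check alone does not tell it apart from a plain relative path.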

Describe testing procedures

An additional test with an eml file was created to exercise the retrieval of external (fake) resources. The test produces a thumbnail of an image without additional tags.
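
The eml-based test itself is not shown here. As a complementary sketch, the blocking logic can also be checked without WeasyPrint or network access by wrapping a stub in place of the default fetcher. Everything in this block (make_local_fetcher, recording_fetcher, run_check, and the .invalid host name) is a hypothetical illustration, not the test added by this PR.

```python
from urllib.parse import urlparse

# Empty response returned for blocked URLs, matching local_fetch_only.
BLOCKED = {'string': '', 'mime_type': 'text/plain', 'encoding': 'utf8'}


def make_local_fetcher(default_fetcher):
    """Wrap `default_fetcher` with the same scheme check as local_fetch_only."""
    def fetch(url, *args, **kwargs):
        if urlparse(url).scheme in ('data', 'file', ''):
            return default_fetcher(url, *args, **kwargs)
        return dict(BLOCKED)
    return fetch


def run_check():
    """Call the wrapped fetcher and record which URLs reach the stub."""
    seen = []

    def recording_fetcher(url, *args, **kwargs):
        # Stand-in for WeasyPrint's default fetcher; performs no I/O.
        seen.append(url)
        return {'string': b'stub', 'mime_type': 'application/octet-stream'}

    fetch = make_local_fetcher(recording_fetcher)
    blocked = fetch('http://tracker.invalid/pixel.png')
    allowed = fetch('file:///tmp/logo.png')
    return seen, blocked, allowed


seen, blocked, allowed = run_check()
print(seen)
print(blocked)
```

Injecting the default fetcher is a design choice made only for this sketch; it lets the check confirm that blocked URLs never reach the underlying fetcher at all.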

phutelmyer commented 4 months ago

Closing in favor of #462.