hc-sc-ocdo-bdpd / file-processing

A metadata extraction tool for various file types
https://hc-sc-ocdo-bdpd.github.io/file-processing-tools/
MIT License

file-processing

file-processing is a Python library that generalizes and streamlines the processing of diverse file types. It provides a unified File class that uses the Strategy Pattern to select the appropriate processor for each file based on its extension. The library includes over 20 file processors and is designed to be easy to extend.



Features


Installation

The file-processing library is not packaged yet, so install it directly from the GitHub repository by passing the repository URL to pip:

pip install git+https://github.com/hc-sc-ocdo-bdpd/file-processing.git

Note: Optional OCR and transcription support is provided separately through file-processing-ocr and file-processing-transcription.


Quick Start

Here's how to get started with file-processing:

from file_processing import File

# Initialize a File object
file = File('path/to/your/file.pdf')

# Access metadata
print(f"File Name: {file.file_name}")
print(f"File Size: {file.size} bytes")
print(f"Owner: {file.owner}")

# Access extracted text (if applicable)
print(f"Text Content: {file.metadata.get('text', 'No text extracted')}")
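
The same attributes can be collected for several files at once. The helper below is a minimal sketch rather than part of the library: `summarize_files` and its `file_cls` parameter are hypothetical names, and the sketch assumes only the `file_name`, `size`, and `metadata` attributes shown above.

```python
from typing import Any, Callable, Dict, Iterable, List


def summarize_files(
    paths: Iterable[str], file_cls: Callable[[str], Any]
) -> List[Dict[str, Any]]:
    """Build one summary row per path from the Quick Start attributes.

    `file_cls` is a hypothetical parameter: pass the library's File class,
    or any callable returning an object with the same attributes.
    """
    rows = []
    for path in paths:
        file = file_cls(path)
        rows.append(
            {
                "file_name": file.file_name,
                "size": file.size,
                # True when the processor extracted a 'text' entry.
                "has_text": "text" in file.metadata,
            }
        )
    return rows
```

With the library installed, a call such as `summarize_files(['report.pdf', 'notes.docx'], File)` would return one dictionary per file, assuming both paths exist.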

Supported File Types

The library supports a wide range of file types:


Optional Features

The file-processing library can be extended with OCR and transcription capabilities by installing additional packages:


Architecture

The library utilizes the Strategy Pattern to select the appropriate processor based on the file extension. Here's how it works:
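
As an illustration only, extension-based dispatch with the Strategy Pattern can be sketched as follows. This is not the library's actual code: `SketchFile`, `TextProcessor`, and `GenericProcessor` are hypothetical stand-ins, and the fallback to a generic processor for unknown extensions is an assumption.

```python
from pathlib import Path


class TextProcessor:
    """Hypothetical strategy for plain-text files."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path
        self.metadata = {}

    def process(self) -> None:
        self.metadata["kind"] = "text"


class GenericProcessor:
    """Hypothetical fallback strategy for unrecognized extensions."""

    def __init__(self, file_path: str) -> None:
        self.file_path = file_path
        self.metadata = {}

    def process(self) -> None:
        self.metadata["kind"] = "generic"


class SketchFile:
    """Stand-in for a unified file class; not the library's File."""

    # Maps a lowercase extension to the strategy class that handles it.
    PROCESSORS = {".txt": TextProcessor, ".md": TextProcessor}

    def __init__(self, path: str) -> None:
        extension = Path(path).suffix.lower()
        # Pick the strategy for this extension, or fall back (assumption).
        processor_cls = self.PROCESSORS.get(extension, GenericProcessor)
        self.processor = processor_cls(path)
        self.processor.process()

    @property
    def metadata(self) -> dict:
        return self.processor.metadata
```

The caller only ever touches `SketchFile`; which processor runs is decided by a dictionary lookup, so supporting a new extension means adding one entry rather than changing the calling code.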


Extending the Library

To add support for a new file type:

  1. Create a New Processor Class:

    from typing import Optional

    from file_processing.file_processor_strategy import FileProcessorStrategy


    class CustomFileProcessor(FileProcessorStrategy):
        def __init__(self, file_path: str, open_file: bool = True) -> None:
            super().__init__(file_path, open_file)
            self.metadata = {}

        def process(self) -> None:
            # Implement processing logic
            pass

        def save(self, output_path: Optional[str] = None) -> None:
            # Implement save logic
            pass

  2. Register the New Processor in file.py:

    Add your new processor to the PROCESSORS dictionary in file_processing/file.py:

    File.PROCESSORS['.custom_extension'] = CustomFileProcessor

  3. Update the __init__.py File:

    Add an import statement for your new processor in file_processing/processors/__init__.py:

    from .custom_processor import CustomFileProcessor

Following these steps ensures your new processor is correctly integrated with the file-processing library.
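
To make step 1 more concrete, here is a minimal sketch of what a filled-in processor might look like. It is an assumption-laden illustration: `SketchStrategy` is a stand-in base class so the example runs without the library installed, `LogFileProcessor`, the `num_lines` key, and `demo` are hypothetical, and treating `open_file=False` as "do not read the file" is an assumption about the intended meaning of that flag.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Optional


class SketchStrategy(ABC):
    """Stand-in base class; the real one is FileProcessorStrategy."""

    def __init__(self, file_path: str, open_file: bool = True) -> None:
        self.file_path = file_path
        self.open_file = open_file

    @abstractmethod
    def process(self) -> None: ...

    @abstractmethod
    def save(self, output_path: Optional[str] = None) -> None: ...


class LogFileProcessor(SketchStrategy):
    """Hypothetical processor for '.log' files."""

    def __init__(self, file_path: str, open_file: bool = True) -> None:
        super().__init__(file_path, open_file)
        self.metadata = {}

    def process(self) -> None:
        # Assumption: open_file=False means the content should not be read.
        if not self.open_file:
            return
        with open(self.file_path, encoding="utf-8") as handle:
            text = handle.read()
        self.metadata = {"text": text, "num_lines": len(text.splitlines())}

    def save(self, output_path: Optional[str] = None) -> None:
        # Nothing to write if process() never extracted any text.
        if "text" not in self.metadata:
            return
        target = output_path or self.file_path
        with open(target, "w", encoding="utf-8") as handle:
            handle.write(self.metadata["text"])


def demo() -> dict:
    """Process a temporary '.log' file and return the resulting metadata."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "example.log")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("first line\nsecond line\n")
        processor = LogFileProcessor(path)
        processor.process()
        return processor.metadata
```

In the real library the class would inherit from FileProcessorStrategy and be registered under its extension as described in steps 2 and 3.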


Contributing

We welcome contributions from the community. If you'd like to contribute:


License

This project is licensed under the MIT License.


Contact

For questions or support, please contact:


We are committed to fostering collaboration and innovation within Health Canada and beyond. Explore our repository, contribute, or get in touch to learn more about our work.