A versatile Python library for generalizing and streamlining the processing of diverse file types. It provides a unified File class that uses the Strategy Pattern to select appropriate processors based on file extensions. The library supports over 20 unique file processors and is designed for easy extensibility.
To install the file-processing library from GitHub (since it's not packaged yet), use pip with the repository URL:
pip install git+https://github.com/hc-sc-ocdo-bdpd/file-processing.git
Note: Optional dependencies for OCR and transcription are available through file-processing-ocr and file-processing-transcription.
Here's how to get started with file-processing:
from file_processing import File
# Initialize a File object
file = File('path/to/your/file.pdf')
# Access metadata
print(f"File Name: {file.file_name}")
print(f"File Size: {file.size} bytes")
print(f"Owner: {file.owner}")
# Access extracted text (if applicable)
print(f"Text Content: {file.metadata.get('text', 'No text extracted')}")
The library supports a wide range of file types:
.txt, .csv, .json, .xml, .html, .py, .ipynb, .gitignore
.pdf, .docx, .rtf, .xlsx, .pptx, .msg
.png, .jpg, .jpeg, .gif, .tif, .tiff, .heic, .heif
.mp3, .wav, .mp4, .flac, .aiff, .ogg
.zip
.whl, .exe
.gguf (used with file-processing-models)
The file-processing library can be extended with OCR and transcription capabilities by installing the additional packages file-processing-ocr and file-processing-transcription.
The library utilizes the Strategy Pattern to select the appropriate processor based on the file extension. Here's how it works:
The File class acts as a context that delegates the processing to a specific FileProcessorStrategy.
Each processor implements the FileProcessorStrategy interface.
When no specific processor applies, GenericFileProcessor is used as a fallback.
To add support for a new file type:
Create a New Processor Class:
from file_processing.file_processor_strategy import FileProcessorStrategy

class CustomFileProcessor(FileProcessorStrategy):
    def __init__(self, file_path: str, open_file: bool = True) -> None:
        super().__init__(file_path, open_file)
        self.metadata = {}

    def process(self) -> None:
        # Implement processing logic
        pass

    def save(self, output_path: str = None) -> None:
        # Implement save logic
        pass
Register the New Processor in file.py:
Add your new processor to the PROCESSORS dictionary in file_processing/file.py:
File.PROCESSORS['.custom_extension'] = CustomFileProcessor
Update the __init__.py File:
Add an import statement for your new processor in file_processing/processors/__init__.py:
from .custom_processor import CustomFileProcessor
Following these steps ensures your new processor is correctly integrated with the file-processing library.
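To make the dispatch and registration steps more concrete, here is a minimal, self-contained sketch of extension-based strategy selection with a fallback. It does not import file-processing and is not the library's actual implementation: MiniFile, ProcessorStrategy, TextProcessor, GenericProcessor, CustomProcessor, and demo are hypothetical stand-ins, and the metadata keys are invented for illustration. Only the general shape (a PROCESSORS dictionary keyed by extension, a context class that delegates, and a generic fallback) follows the description above.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Type


class ProcessorStrategy(ABC):
    """Hypothetical stand-in for the library's FileProcessorStrategy."""

    def __init__(self, file_path: str) -> None:
        self.file_path = Path(file_path)
        self.metadata: dict = {}

    @abstractmethod
    def process(self) -> None:
        """Populate self.metadata for this file type."""


class TextProcessor(ProcessorStrategy):
    def process(self) -> None:
        # Read the file as UTF-8 text and record a simple statistic.
        text = self.file_path.read_text(encoding="utf-8")
        self.metadata = {"text": text, "num_lines": len(text.splitlines())}


class GenericProcessor(ProcessorStrategy):
    def process(self) -> None:
        # Fallback: no type-specific extraction, so metadata stays empty.
        self.metadata = {}


class MiniFile:
    """Context object: picks a strategy from the extension, then delegates."""

    PROCESSORS: Dict[str, Type[ProcessorStrategy]] = {".txt": TextProcessor}

    def __init__(self, path: str) -> None:
        extension = Path(path).suffix.lower()
        # Unknown extensions fall back to the generic processor.
        processor_class = self.PROCESSORS.get(extension, GenericProcessor)
        self.processor = processor_class(path)
        self.processor.process()

    @property
    def metadata(self) -> dict:
        return self.processor.metadata


class CustomProcessor(ProcessorStrategy):
    def process(self) -> None:
        # Illustrative logic only: record the file size in bytes.
        self.metadata = {"size_bytes": self.file_path.stat().st_size}


# Registering a new extension is a single dictionary assignment.
MiniFile.PROCESSORS[".custom_extension"] = CustomProcessor


def demo() -> dict:
    """Run the three dispatch cases against throwaway files."""
    with tempfile.TemporaryDirectory() as tmp:
        txt = Path(tmp) / "notes.txt"
        txt.write_text("first line\nsecond line\n", encoding="utf-8")
        custom = Path(tmp) / "data.custom_extension"
        custom.write_bytes(b"12345")
        unknown = Path(tmp) / "blob.xyz"
        unknown.write_bytes(b"")
        return {
            "txt": MiniFile(str(txt)).metadata["num_lines"],
            "custom": MiniFile(str(custom)).metadata["size_bytes"],
            "unknown": MiniFile(str(unknown)).metadata,
        }
```

Calling demo() exercises a built-in mapping, a newly registered extension, and the fallback path. Keeping the mapping in one dictionary means a new file type needs no change to the context class itself, which is the main appeal of the Strategy Pattern here.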
We welcome contributions from the community. If you'd like to contribute:
This project is licensed under the MIT License.
For questions or support, please contact:
We are committed to fostering collaboration and innovation within Health Canada and beyond. Explore our repository, contribute, or get in touch to learn more about our work.