swarmauri / swarmauri-sdk

a modular multimodal framework for ai applications
https://swarmauri.com
Apache License 2.0
81 stars 42 forks

[Feature Request]: TikaPDFParser #502

Open cobycloud opened 2 months ago

cobycloud commented 2 months ago

Feature Name

swarmauri_community/parsers/concrete/TikaPDFParser.py

Feature Description

Using Tika, extract text from PDF files

Motivation

To enable parsing of PDF documents.

Potential Solutions

# pip install tika  (tika-python also needs a Java runtime for the Tika server)
from typing import List, Literal

from tika import parser

from swarmauri.parsers.base.ParserBase import ParserBase
from swarmauri.core.documents.IDocument import IDocument
from swarmauri.standard.documents.concrete.Document import Document


class TikaPDFParser(ParserBase):
    """
    Parser for reading and extracting text from PDF files using Tika.
    """
    type: Literal['TikaPDFParser'] = 'TikaPDFParser'

    def parse(self, source: str) -> List[IDocument]:
        # `source` is the path to a PDF file on disk.
        parsed = parser.from_file(source)
        # Tika sets 'content' to None when no text can be extracted
        # (for example, an image-only PDF), so fall back to ''.
        text = parsed.get('content') or ''
        return [Document(content=text)]
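
The `parsed['content']` lookup is the step that most needs care, since Tika can return `None` there when a PDF has no extractable text. Below is a minimal sketch of that handling on its own, assuming the result is a dict shaped like the one `parser.from_file` returns (a `content` key that may be `None`). The helper name `extract_text` is hypothetical and not part of tika or swarmauri, and plain dicts stand in for real Tika results so the snippet runs without a Tika server.

```python
from typing import Any, Dict, Optional


def extract_text(parsed: Optional[Dict[str, Any]]) -> str:
    """Return the text from a Tika-style result dict, or '' if there is none.

    Hypothetical helper for illustration; not part of tika or swarmauri.
    """
    if not parsed:
        return ''
    content = parsed.get('content')
    if content is None:
        return ''
    # Tika output often carries leading and trailing blank lines; trim them.
    return content.strip()


# Stand-ins for real Tika results (assumed shape: 'content' plus 'metadata').
with_text = {'content': '\n\n\nHello PDF\n', 'metadata': {}}
without_text = {'content': None, 'metadata': {}}

print(repr(extract_text(with_text)))
print(repr(extract_text(without_text)))
```

If something like this were adopted, `parse` could pass the result of `parser.from_file(source)` through the helper before building the `Document`, which would also make the empty-PDF case easy to unit test without starting Tika.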

Additional Context (optional)

No response

Affected Areas

None

Priority

Low

Required Files

vatsalrathod16 commented 2 months ago

I can work on this feature.