ispras / dedoc

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
Apache License 2.0
109 stars 15 forks

LLM compatible json #429

Closed arslan1510 closed 1 month ago

arslan1510 commented 2 months ago

Hey guys, great stuff you have here. I just wanted to know: is there any way to feed the parsed output to an LLM? I would need to make chunks that don't exceed a specific size and that keep their sections, like llmsherpa does.

NastyBoget commented 2 months ago

Hello! We are planning to build something like that for langchain (https://python.langchain.com/docs), a library that also simplifies working with LLMs. In the future, we want to implement our own custom Document Loader (https://python.langchain.com/docs/modules/data_connection/document_loaders/), and then you can use a Text Splitter (https://python.langchain.com/docs/modules/data_connection/document_transformers/) to make chunks of fixed length.

Right now dedoc doesn't support making chunks. It only builds a TreeNode for each paragraph of the text, and the length of a node's text isn't limited to any specific size.

arslan1510 commented 2 months ago

Ahh cool, will for sure contribute to this whenever I can. Closing this issue, and thanks for replying.

tejeshbhalla commented 1 month ago

> Ahh cool, will for sure contribute to this whenever I can. Closing this issue, and thanks for replying.

Where's your contribution?

NastyBoget commented 1 month ago

We are in the process of writing the code for langchain (https://python.langchain.com/docs). It will be available there if they approve our pull request (we haven't opened the PR yet).