run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License
1.79k stars 157 forks

Modify `_get_sub_docs` to use Custom Separator #254

Open adreichert opened 6 days ago


Summary

This PR modifies `_get_sub_docs` to use the separator passed into the `LlamaParse` constructor. I'm making this change because the string `\n---\n` occasionally occurs in our documents. When pagination matters, we need a separator that is less likely to appear in our documents, such as `\n$$$$$$$$\n`.
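Roughly, the change can be sketched as follows. This is a simplified illustration, not the library's actual implementation: the class, the default-separator constant, and the string-based signature are assumptions; the real `_get_sub_docs` operates on parsed job results inside `llama_parse`.

```python
DEFAULT_SEPARATOR = "\n---\n"  # assumed previous hard-coded page separator


class ParserSketch:
    """Toy stand-in for LlamaParse, showing only the separator wiring."""

    def __init__(self, page_separator=None):
        # Mirror the constructor argument this PR threads through.
        self.page_separator = page_separator

    def _get_sub_docs(self, text):
        # Use the user-supplied separator when one is given; otherwise
        # fall back to the old hard-coded default.
        sep = self.page_separator if self.page_separator is not None else DEFAULT_SEPARATOR
        return text.split(sep)
```

With a rare separator such as `\n$$$$$$$$\n`, a literal `\n---\n` inside a page's content no longer produces a spurious page boundary.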

Testing

Automated tests passed via the CLI:

```
% export LLAMA_CLOUD_API_KEY=llx-[...]
% make test
pytest tests
====================================================================== test session starts =======================================================================
platform darwin -- Python 3.11.7, pytest-8.2.2, pluggy-1.5.0
rootdir: /Users/areichert/Documents/llama_parse
configfile: pyproject.toml
plugins: anyio-4.4.0
collected 3 items

tests/test_reader.py ...                                                                                                                                   [100%]

======================================================================= 3 passed in 15.01s =======================================================================
```

Test Document

We parsed this two-page document, which contains a `\n---\n` where the background color changes:

```
[...]
**TO HELP ENGAGE EMPLOYEES**

---

Fulkrum has been providing inspection, [...]
```
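To see why the default separator is fragile, here is a toy reproduction. The text is abridged from the excerpt above; the `---` is part of the page content, not a boundary inserted by the parser:

```python
# One page of content that happens to contain the default separator.
page = "**TO HELP ENGAGE EMPLOYEES**\n\n---\n\nFulkrum has been providing inspection, ..."

# Splitting on the default separator turns this single page into two chunks.
default_chunks = page.split("\n---\n")
print(len(default_chunks))  # 2

# A less common separator leaves the page intact.
rare_chunks = page.split("\n$$$$$$$$\n")
print(len(rare_chunks))  # 1
```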

Test Script

```python
import llama_parse

LLAMAPARSE_API_KEY = '[...]'
OPENAI_API_KEY = '[...]'

def parse(split_by_page, page_separator):
    print(f"{split_by_page=}, {page_separator=}")
    parser = llama_parse.LlamaParse(
        result_type='markdown',
        api_key=LLAMAPARSE_API_KEY,
        verbose=False,
        invalidate_cache=True,
        gpt4o_mode=True,
        gpt4o_api_key=OPENAI_API_KEY,
        ignore_errors=True,
        split_by_page=split_by_page,
        page_separator=page_separator,
    )
    result = parser.load_data('fulkrum.pdf')
    print(f"{len(result)} pages")

if __name__ == '__main__':
    parse(False, None)
    parse(True, None)
    parse(True, "\n$$$$$$$$\n")
```

Results