run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License
1.79k stars 157 forks

Modify `_get_sub_docs` to use Custom Separator #254

Open adreichert opened 6 days ago


Summary

This PR modifies `_get_sub_docs` to use the separator passed into the `LlamaParse` constructor. I'm making this change because the string `\n---\n` occasionally occurs in our documents. When pagination matters, we need a separator that is less likely to appear in our documents, such as `\n$$$$$$$$\n`.
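Roughly, the change can be sketched as follows. This is a simplified illustration, not the library's actual implementation: the class, the default-separator constant, and the string-based signature are assumptions; the real `_get_sub_docs` operates on parsed job results inside `llama_parse`.

```python
DEFAULT_SEPARATOR = "\n---\n"  # assumed previous hard-coded page separator


class ParserSketch:
    """Toy stand-in for LlamaParse, showing only the separator wiring."""

    def __init__(self, page_separator=None):
        # Mirror the constructor argument this PR threads through.
        self.page_separator = page_separator

    def _get_sub_docs(self, text):
        # Use the user-supplied separator when one is given; otherwise
        # fall back to the old hard-coded default.
        sep = self.page_separator if self.page_separator is not None else DEFAULT_SEPARATOR
        return text.split(sep)
```

With a rare separator such as `\n$$$$$$$$\n`, a literal `\n---\n` inside a page's content no longer produces a spurious page boundary.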

Testing

Automated tests passed via the CLI:

```
% export LLAMA_CLOUD_API_KEY=llx-[...]
% make test
pytest tests
====================================================================== test session starts =======================================================================
platform darwin -- Python 3.11.7, pytest-8.2.2, pluggy-1.5.0
rootdir: /Users/areichert/Documents/llama_parse
configfile: pyproject.toml
plugins: anyio-4.4.0
collected 3 items

tests/test_reader.py ...                                                                                                                                   [100%]

======================================================================= 3 passed in 15.01s =======================================================================
```

Test Document

We parsed this two-page document, which contains a `\n---\n` where the background color changes:

```
[...]
**TO HELP ENGAGE EMPLOYEES**

---

Fulkrum has been providing inspection, [...]
```
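To see why the default separator is fragile, here is a toy reproduction. The text is abridged from the excerpt above; the `---` is part of the page content, not a boundary inserted by the parser:

```python
# One page of content that happens to contain the default separator.
page = "**TO HELP ENGAGE EMPLOYEES**\n\n---\n\nFulkrum has been providing inspection, ..."

# Splitting on the default separator turns this single page into two chunks.
default_chunks = page.split("\n---\n")
print(len(default_chunks))  # 2

# A less common separator leaves the page intact.
rare_chunks = page.split("\n$$$$$$$$\n")
print(len(rare_chunks))  # 1
```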

Test Script

```python
import llama_parse

LLAMAPARSE_API_KEY = '[...]'
OPENAI_API_KEY = '[...]'

def parse(split_by_page, page_separator):
    print(f"{split_by_page=}, {page_separator=}")
    parser = llama_parse.LlamaParse(
        result_type='markdown',
        api_key=LLAMAPARSE_API_KEY,
        verbose=False,
        invalidate_cache=True,
        gpt4o_mode=True,
        gpt4o_api_key=OPENAI_API_KEY,
        ignore_errors=True,
        split_by_page=split_by_page,
        page_separator=page_separator,
    )
    result = parser.load_data('fulkrum.pdf')
    print(f"{len(result)} pages")

if __name__ == '__main__':
    parse(False, None)
    parse(True, None)
    parse(True, "\n$$$$$$$$\n")
```

Results