run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

Parsing PDFs with grey text on a white background produces a large number of recognition errors (garbled output). #217

Open yebarryallen opened 3 weeks ago

yebarryallen commented 3 weeks ago

Using:

    parser = LlamaParse(
        result_type=ResultType.MD,
        language=Language.SIMPLIFIED_CHINESE,
        verbose=True,
        num_workers=1,
    )

Some of the incorrectly recognized results are shown below:

[image]

The parsing process itself doesn't raise any errors, so the output has to be checked manually to find these cases. If parsing did report an error for them, perhaps Fast Mode could be used as a fallback?
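To make the manual check less tedious, here is a rough sketch of a heuristic that flags output likely to be garbled. It is plain Python and does not call any LlamaParse API. The names `garbled_ratio` and `needs_manual_review`, the list of "expected" Unicode ranges, and the 5% threshold are all assumptions for illustration. It only catches characters from unexpected scripts or replacement characters, so it will miss cases where the text is misread as valid but wrong Chinese characters.

```python
# Hypothetical helper: estimate how much of a parsed string looks garbled.
# The ranges below are an assumption about what a Simplified Chinese
# Markdown document normally contains; adjust them for other content.
EXPECTED_RANGES = (
    (0x0020, 0x007E),  # printable ASCII (Markdown syntax, digits, Latin text)
    (0x2000, 0x206F),  # general punctuation (curly quotes, dashes, ellipsis)
    (0x3000, 0x303F),  # CJK symbols and punctuation
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xFF00, 0xFFEF),  # halfwidth and fullwidth forms
)


def is_expected(ch: str) -> bool:
    """Return True if the character falls in one of the expected ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EXPECTED_RANGES)


def garbled_ratio(text: str) -> float:
    """Fraction of non-whitespace characters outside the expected ranges."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    unexpected = sum(1 for ch in chars if not is_expected(ch))
    return unexpected / len(chars)


def needs_manual_review(text: str, threshold: float = 0.05) -> bool:
    """Flag text whose share of unexpected characters exceeds the threshold."""
    return garbled_ratio(text) > threshold
```

Running `needs_manual_review` on each parsed page's text would at least narrow down which pages to inspect by hand, or which ones to retry with a different mode.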

This problem may be caused by the font colour being too close to the background colour and by the font being too small. It would be nice to have a parameter specifically for handling these cases.
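To put a number on "too close to the background colour", one option is the WCAG contrast ratio between the text colour and the background colour. The sketch below is only an illustration: the function names are made up here, and whether a low ratio actually predicts recognition errors in LlamaParse is an assumption, not something confirmed.

```python
# Hypothetical helpers: WCAG 2.x contrast ratio between two sRGB colours.
# Colours are (R, G, B) tuples with 8-bit channels (0-255).

def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to its linear-light value."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance as defined by WCAG 2.x."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """Contrast ratio in the range 1 (identical) to 21 (black on white)."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    white = (255, 255, 255)
    light_grey = (200, 200, 200)  # example value, not taken from the PDF
    print(f"grey on white: {contrast_ratio(light_grey, white):.2f}")
```

WCAG recommends at least 4.5 for normal-size text meant for human readers; light grey on white falls well below that, which matches the kind of pages that fail here.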