When GPT-4o mode is enabled, LlamaParse "transforms the document into an image per page and uses OpenAI GPT-4o to convert it into Markdown." When no custom instruction is used, this causes hyperlinks to be handled incorrectly:
Converting each page to an image discards the destination URLs of the links.
GPT-4o then assumes all underlined text is a hyperlink and inserts a destination URL, which is usually incorrect.
We observed the following issues:
Links were not parsed correctly. For example, underlined hyperlinks pointing to sign-in pages were replaced by links pointing to the site's top-level page.
Any underlined text was turned into a link with a hallucinated URL. For example, the underlined word Amazon was parsed as [Amazon](https://www.amazon.com).
Some links were rendered as empty Markdown links, such as [text](#).
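To check parsed output for these symptoms, a small script can list every inline Markdown link and flag the ones with an empty or `#` destination. This is a minimal sketch that uses only the Python standard library; the function names `find_links` and `find_empty_links` are hypothetical and not part of LlamaParse, and the regular expression covers only simple inline links.

```python
import re
from typing import List, Tuple

# Matches simple inline Markdown links such as [text](destination).
# The negative lookbehind skips image syntax like ![alt](src).
LINK_RE = re.compile(r"(?<!!)\[([^\]]*)\]\(([^)\s]*)\)")


def find_links(markdown: str) -> List[Tuple[str, str]]:
    """Return (text, destination) pairs for every inline link found."""
    return [(m.group(1), m.group(2)) for m in LINK_RE.finditer(markdown)]


def find_empty_links(markdown: str) -> List[Tuple[str, str]]:
    """Return only the links whose destination is empty or just '#'."""
    return [(text, dest) for text, dest in find_links(markdown) if dest in ("", "#")]
```

Links that are not flagged as empty still need a manual comparison against the original PDF, because a hallucinated URL such as https://www.amazon.com is syntactically valid.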
Workaround
In testing, adding the custom instruction "Don't render Markdown links" prevented links from appearing in the parsed Markdown.
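If some links still appear in the output, they can also be removed after parsing. The following is a sketch of that post-processing step and is an assumption, not something tested against LlamaParse output. The helper name `strip_links` is hypothetical, and it handles only simple inline links, keeping the visible text and dropping the destination.

```python
import re

# Matches simple inline Markdown links such as [text](destination).
# The negative lookbehind leaves image syntax like ![alt](src) untouched.
LINK_RE = re.compile(r"(?<!!)\[([^\]]*)\]\([^)]*\)")


def strip_links(markdown: str) -> str:
    """Replace each inline link with its visible text only."""
    return LINK_RE.sub(r"\1", markdown)


if __name__ == "__main__":
    parsed = "Shop at [Amazon](https://www.amazon.com) or [text](#)."
    print(strip_links(parsed))
```

This removes genuine links along with hallucinated ones, which matches the effect of the instruction above: the parsed Markdown ends up with no links at all.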
Example
I parsed the following PDF using the Web UI at https://cloud.llamaindex.ai/parse: demo.pdf
In the original PDF, the last word of this sentence is a link.
Underlined text is turned into links.
In the original PDF, the last word of this sentence is a link.
Underlined text is turned into links.