Open soyeb-PQ opened 23 hours ago
Currently, we use HTML to represent tables because native Markdown tables do not support advanced operations such as merging cells. If you need to convert the table to Markdown format, you can try using an HTML parsing library like BeautifulSoup to parse and convert the HTML table.
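As a rough illustration of the conversion suggested above, here is a minimal sketch. It uses Python's standard-library `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. The names `TableParser` and `html_table_to_markdown` are hypothetical and not part of this project. It assumes a single, non-nested table whose `<tr>`, `<td>`, and `<th>` tags are explicitly closed. Since Markdown cannot merge cells, a cell with `colspan` is padded with empty cells, and `rowspan` is ignored.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell strings
        self._row = None     # row currently being built
        self._cell = None    # text fragments of the cell being built
        self._colspan = 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            span = dict(attrs).get("colspan") or "1"
            self._colspan = int(span) if span.isdigit() else 1

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            text = " ".join("".join(self._cell).split())
            # Keep the text once, then pad merged columns with blanks.
            self._row.append(text)
            self._row.extend([""] * (self._colspan - 1))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_markdown(html):
    """Convert one simple HTML table to a Markdown table string."""
    parser = TableParser()
    parser.feed(html)
    rows = [r for r in parser.rows if r]
    if not rows:
        return ""
    # Pad short rows so every row has the same number of columns.
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]

    def fmt(cells):
        # Escape pipes so cell text cannot break the table layout.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    # The first row is treated as the header (an assumption).
    lines = [fmt(rows[0]), fmt(["---"] * width)]
    lines += [fmt(r) for r in rows[1:]]
    return "\n".join(lines)
```

Calling `html_table_to_markdown(table_html)` on the HTML string of one extracted table returns the Markdown text. Tables that rely heavily on merged cells will lose that structure, which is the limitation described above.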
Is your feature request related to a problem? Please describe. Currently, table data is extracted only in HTML format, and there is no option to extract it as plain text or Markdown. This limitation makes the text hard to work with, because I cannot directly access table data in a simpler format.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] I was working on a PDF where I needed to extract text and later highlight specific sections within an application. To achieve this, I would prefer the table data to be in plain text format rather than HTML. This applies to both tablemaster and struct_eqtable, since both return HTML.
Describe the solution you'd like. I would like the option to extract table data as plain text or Markdown, similar to how LlamaIndex provides this feature. The attached screenshots illustrate the differences between the original PDF and the table data extraction methods used by LlamaIndex and MinerU.
A clear and concise description of what you want to happen. I would like the model to provide an option to output table data in plain text or Markdown, similar to LlamaIndex.
Additional context. The LlamaIndex (https://cloud.llamaindex.ai/) extraction option is available in their Premium mode (with a high accuracy level at 15 credits per page). While LlamaIndex provides good results, it can be costly for processing a large number of PDFs. For this reason, I am currently exploring MinerU as a more feasible alternative.