soilwise-he / soil-health-knowledge-graph

Repository for the soil health knowledge graph
MIT License

Parse the PDF report to a machine-readable format #7

Closed by wbcbugfree 1 week ago

wbcbugfree commented 5 months ago

We need to parse the PDF report into a machine-readable format so that the Python script can read it. Here are some existing tools:

One foreseeable problem is paragraphs that span a page break. Such a paragraph may be extracted as two separate paragraphs, and the vast majority of current tools cannot rejoin them, so the output may need manual inspection.
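As a rough illustration of how the page-break problem might be reduced before manual inspection, here is a minimal heuristic sketch. It assumes some extraction step has already produced a list of pages, each a list of paragraph strings; that input shape, the function name `merge_split_paragraphs`, and the rule itself (merge when the previous page ends without sentence-final punctuation and the next page starts with a lowercase letter) are assumptions made for this example, not something any of the tools provides.

```python
import re

# Assumption: a complete paragraph ends with ., ! or ?, optionally followed
# by closing quotes or brackets.
SENTENCE_END = re.compile(r"[.!?]\W*$")


def merge_split_paragraphs(pages):
    """Join paragraphs that were probably cut in two by a page break.

    `pages` is a list of pages; each page is a list of paragraph strings
    (a hypothetical output shape of the PDF extraction step).
    """
    merged = []
    for page in pages:
        at_page_start = True
        for para in page:
            text = para.strip()
            if not text:
                continue
            prev = merged[-1] if merged else None
            looks_like_continuation = (
                at_page_start
                and prev is not None
                and not SENTENCE_END.search(prev)
                and text[0].islower()
            )
            if looks_like_continuation:
                if prev.endswith("-"):
                    # Word hyphenated across the break; this also removes
                    # genuine hyphens, so such joins still need a check.
                    merged[-1] = prev[:-1] + text
                else:
                    merged[-1] = prev + " " + text
            else:
                merged.append(text)
            at_page_start = False
    return merged


pages = [
    ["Intro paragraph.", "Soil health depends on"],
    ["many factors.", "Another paragraph."],
]
print(merge_split_paragraphs(pages))
```

A heuristic like this will miss some cases (for example, a continuation that starts with a proper noun) and may wrongly merge others, so it would only narrow down what has to be checked by hand.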

wbcbugfree commented 1 month ago

See report.