kaixindelele / ChatPaper

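Use ChatGPT to summarize the arXiv papers. Speed up the whole research workflow: use ChatGPT for full-paper summarization, professional translation, polishing, reviewing, and review responses.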
18.19k stars 1.91k forks

A PR to fix some bugs when dealing with PDF files: headers and footers, missing information when splitting chapters, etc. #193

Open Wall-ee opened 1 year ago

Wall-ee commented 1 year ago

1. Add a function to remove headers and footers.
2. Fix a bug where the chapter that is the final key of the chapter list was not processed.
3. Update the replace method to strip some useless UTF-8 characters that appear in some papers.
4. Fix a bug where text was merged incorrectly when a chapter spans more than one page.
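
To illustrate item 1, here is a minimal sketch of one common way to drop headers and footers. It is not the code in this PR. It assumes each page is already available as a list of text lines, and the names `strip_repeated_margins` and `min_ratio` are hypothetical. A first or last line is treated as a header or footer when it recurs on most pages once digits (page numbers) are masked.

```python
import re
from collections import Counter


def _normalize(line):
    """Mask digit runs and lowercase, so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", line).strip().lower()


def strip_repeated_margins(pages, min_ratio=0.6):
    """Drop first/last lines that recur on at least min_ratio of the pages.

    pages: list of pages, each a list of text lines (hypothetical input shape).
    Returns new lists; the input is not modified.
    """
    if len(pages) < 2:
        # With a single page there is nothing to compare against.
        return [list(p) for p in pages]

    heads = Counter(_normalize(p[0]) for p in pages if p)
    tails = Counter(_normalize(p[-1]) for p in pages if p)
    threshold = min_ratio * len(pages)

    cleaned = []
    for page in pages:
        lines = list(page)
        if lines and heads[_normalize(lines[0])] >= threshold:
            lines = lines[1:]   # repeated header
        if lines and tails[_normalize(lines[-1])] >= threshold:
            lines = lines[:-1]  # repeated footer
        cleaned.append(lines)
    return cleaned
```

This only inspects one line at the top and bottom of each page, so multi-line headers would need the check applied repeatedly.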

Submitting a PR to fix some tricky problems in PDF processing. These issues matter more for papers that are heavier on theory.

kaixindelele commented 1 year ago


Wall-ee commented 1 year ago

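That's a good idea, and I can take care of it. That said, some of the examples I just mentioned are general problems, so I'll look for a sample PDF. I mostly work with biomedical papers, whose layouts tend to be rather unusual. pymupdf's default ordering is sometimes wrong, so when a chapter spans pages, the text gets concatenated incorrectly.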
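
A rough sketch of the ordering idea, not the actual change in this PR: sort each page's text blocks by position before joining them, then join pages in order. It assumes blocks are tuples starting with `(x0, y0, x1, y1, text)`, which is the shape I would expect from pymupdf's `page.get_text("blocks")`. The function names and `y_tolerance` are hypothetical, and multi-column layouts would still need column detection on top of this.

```python
def page_text_in_reading_order(blocks, y_tolerance=3.0):
    """Join one page's blocks top-to-bottom, then left-to-right.

    blocks: iterable of tuples whose first five items are
    (x0, y0, x1, y1, text); extra trailing items are ignored.
    """
    # Group y0 into bands of height y_tolerance so that blocks on roughly
    # the same row are ordered by x0 instead of by tiny vertical offsets.
    ordered = sorted(blocks, key=lambda b: (int(b[1] // y_tolerance), b[0]))
    return "\n".join(b[4].strip() for b in ordered if b[4].strip())


def chapter_text(pages_blocks):
    """Concatenate a chapter that spans several pages, page by page."""
    texts = (page_text_in_reading_order(blocks) for blocks in pages_blocks)
    return "\n".join(t for t in texts if t)
```

Sorting per page first keeps a block from page 2 from ever landing before text from page 1, which is the failure described above.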