kaixindelele / ChatPaper

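Use ChatGPT to summarize the arXiv papers. Speed up the whole research workflow: use ChatGPT for full-paper summarization, professional translation, polishing, reviewing, and replying to reviews.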
https://chatwithpaper.org
Other
17.93k stars 1.9k forks

"Introduction" error when processing papers that have no Introduction section #42

Closed Hessen525 closed 1 year ago

Hessen525 commented 1 year ago

Traceback (most recent call last):
  File "C:\Users\admin\Documents\GitHub\ChatPaper\chat_paper.py", line 471, in <module>
    main(args=args)
  File "C:\Users\admin\Documents\GitHub\ChatPaper\chat_paper.py", line 436, in main
    paper_list.append(Paper(path=os.path.join(root, filename)))
  File "C:\Users\admin\Documents\GitHub\ChatPaper\get_paper_from_pdf.py", line 17, in __init__
    self.parse_pdf()
  File "C:\Users\admin\Documents\GitHub\ChatPaper\get_paper_from_pdf.py", line 33, in parse_pdf
    self.section_text_dict.update({"paper_info": self.get_paper_info()})
  File "C:\Users\admin\Documents\GitHub\ChatPaper\get_paper_from_pdf.py", line 42, in get_paper_info
    introduction_text = self.section_text_dict['Introduction']
KeyError: 'Introduction'

kaixindelele commented 1 year ago

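This one is a bit awkward: papers generally do have an Introduction section. Could you post the title of your paper? I'd like to see what kind of unusual paper this is. Also, we don't have time to handle such rare cases for now. Once GPT-4 is out, we plan to feed in the full text directly, which should solve this problem.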

Hessen525 commented 1 year ago

@kaixindelele None of the articles in COMMUNICATIONS OF THE ACM have one, e.g. https://cacm.acm.org/magazines/2023/3/270210-ai-and-neurotechnology/fulltext

Some Nature sub-journals have no Introduction either.
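
For papers like these, one possible workaround is to stop indexing the section dictionary directly. The sketch below is not ChatPaper's actual code: the helper name `pick_intro_text` and the fallback to `'Abstract'` are assumptions made for illustration. Only the `'Introduction'` key and the idea of a dictionary of section texts come from the traceback above.

```python
def pick_intro_text(section_text_dict, fallbacks=("Abstract",)):
    """Return the Introduction text if present, else the first non-empty fallback, else ''."""
    # Hypothetical helper: the name and the fallback order are assumptions,
    # not part of the ChatPaper code base.
    for key in ("Introduction", *fallbacks):
        text = section_text_dict.get(key)  # .get() returns None instead of raising KeyError
        if text:
            return text
    return ""
```

With a lookup like this, a paper without an Introduction heading would yield its abstract (or an empty string) rather than raising `KeyError: 'Introduction'`.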

Hessen525 commented 1 year ago

I put in 20 papers that all have an Introduction, but the "Introduction" error still occurred... From the traceback I can't tell which specific paper failed to parse.
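
One way to find out which file fails is to catch the exception per file and record the path. This is a minimal sketch under stated assumptions: `load_papers` is a hypothetical helper, and `loader` stands in for whatever builds a paper object from a path (in the traceback above that is `Paper(path=...)`).

```python
def load_papers(paths, loader):
    """Try to load every path; return (loaded, failures) instead of stopping at the first error."""
    loaded, failures = [], []
    for path in paths:
        try:
            loaded.append(loader(path))
        except Exception as exc:  # broad on purpose: any parse failure should be reported
            failures.append((path, repr(exc)))
            print(f"failed to parse {path}: {exc!r}")
    return loaded, failures
```

Each entry in `failures` pairs a file path with the error it raised, so the problematic PDFs can be inspected one by one while the remaining papers are still processed.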

beankin commented 1 year ago

+1, I'm hitting the same problem. When I run

python chat_paper.py --query "ferroptosis" --filter_keys "ferroptosis" --max_results 5

the output is:

Key word: reinforcement learning
Query: ferroptosis
Sort: SortCriterion.Relevance
all search:
0 Ferroptosis as a Biological Phase Transition I: avascular and vascular tumor growth 2021-08-26 18:35:46+00:00
1 DCcov: Repositioning of Drugs and Drug Combinations for SARS-CoV-2 Infected Lung through Constraint-Based Modelling 2021-03-25 13:51:37+00:00
filter_keys: ferroptosis
筛选后剩下的论文数量:
filter_results: 2
filter_papers:
0 Ferroptosis as a Biological Phase Transition I: avascular and vascular tumor growth 2021-08-26 18:35:46+00:00
1 DCcov: Repositioning of Drugs and Drug Combinations for SARS-CoV-2 Infected Lung through Constraint-Based Modelling 2021-03-25 13:51:37+00:00
All_paper: 2
paper_path: ./pdf_files/ferroptosis-2023-03-17-00/Ferroptosis as a Biological Phase Transition I_ avascular and vascular tumor growth.pdf
section_page_dict {'Abstract': 0}
0 Abstract 0
download_error: 'Introduction'
paper_path: ./pdf_files/ferroptosis-2023-03-17-00/DCcov_ Repositioning of Drugs and Drug Combinations for SARS-CoV-2 Infected Lung through Constraint-Based Modelling.pdf
section_page_dict {'Abstract': 0}
0 Abstract 0
download_error: 'Introduction'
summary time: 33.19153451919556

Every paper ends with download_error: 'Introduction'. I then fed the PDF folder to it directly:

python chat_paper.py --pdf_path "./pdf_files/ferroptosis-2023-03-16-23"

Still the 'Introduction' error. This time the output is:

Key word: reinforcement learning
Query: all: ChatGPT robot
Sort: SortCriterion.Relevance
root: ./pdf_files/ferroptosis-2023-03-17-00 dirs: [] files: ['Ferroptosis as a Biological Phase Transition I_ avascular and vascular tumor growth.pdf', 'DCcov_ Repositioning of Drugs and Drug Combinations for SARS-CoV-2 Infected Lung through Constraint-Based Modelling.pdf']
max_font_sizes [14.039999961853027, 14.039999961853027, 14.039999961853027, 14.039999961853027, 14.039999961853027, 14.039999961853027, 14.039999961853027, 14.039999961853027, 15.960000038146973, 15.960000038146973]
section_page_dict {'Abstract': 0}
0 Abstract 0
start_page, end_page: 0 13
Traceback (most recent call last):
  File "chat_paper.py", line 468, in <module>
    main(args=args)
  File "chat_paper.py", line 433, in main
    paper_list.append(Paper(path=os.path.join(root, filename)))
  File "/home/cyril/git/ChatPaper/get_paper_from_pdf.py", line 17, in __init__
    self.parse_pdf()
  File "/home/cyril/git/ChatPaper/get_paper_from_pdf.py", line 33, in parse_pdf
    self.section_text_dict.update({"paper_info": self.get_paper_info()})
  File "/home/cyril/git/ChatPaper/get_paper_from_pdf.py", line 42, in get_paper_info
    introduction_text = self.section_text_dict['Introduction']
KeyError: 'Introduction'

kaixindelele commented 1 year ago

If this is with the latest code, I'll rerun it right away. Sorry about that! There may well be a bug we haven't found yet!

kaixindelele commented 1 year ago

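Hi everyone, I've found the cause: in these papers the Introduction heading is not preceded by a section number, so the Introduction section was never parsed! This is a gap in our logic, but we'll leave it as is for now. Let's all wait for GPT-4, haha.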
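
A heading match that treats the section number as optional could look roughly like this. It is only a sketch: the pattern and the name `is_intro_heading` are assumptions, not the parser ChatPaper actually uses, and the set of numbering styles covered here (Arabic, dotted, and Roman numerals) is a guess at what papers commonly use.

```python
import re

# Optional numbering ("1", "1.", "1.1", "I.") followed by the heading word alone on the line.
INTRO_HEADING = re.compile(
    r"^\s*(?:(?:\d+(?:\.\d+)*|[IVXLC]+)\.?\s+)?introduction\s*$",
    re.IGNORECASE,
)

def is_intro_heading(line):
    """Return True if the line looks like an Introduction heading, numbered or not."""
    return INTRO_HEADING.match(line) is not None
```

Anchoring the pattern to the whole line keeps ordinary sentences that merely mention the word "introduction" from being mistaken for a heading.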

william-swl commented 1 year ago

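Would it be possible to let users customize the matching pattern for each section, for example with regular expressions? I've found that many papers in my library either have no Introduction heading at all, or have one without a section number.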
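
As an illustration of what such a feature might look like, here is a small sketch in which user-supplied regular expressions override built-in ones. Everything in it is hypothetical: `DEFAULT_PATTERNS`, `find_sections`, and the `user_patterns` argument are names invented for this example and do not exist in ChatPaper.

```python
import re

# Hypothetical built-in patterns; a real tool would cover more sections.
DEFAULT_PATTERNS = {
    "Abstract": r"^\s*abstract\s*$",
    "Introduction": r"^\s*(?:\d+\.?\s+)?introduction\s*$",
}

def find_sections(lines, user_patterns=None):
    """Map each section name to the index of the first line matching its pattern."""
    # User patterns take precedence over the defaults for the same section name.
    patterns = {**DEFAULT_PATTERNS, **(user_patterns or {})}
    compiled = {name: re.compile(p, re.IGNORECASE) for name, p in patterns.items()}
    found = {}
    for idx, line in enumerate(lines):
        for name, rx in compiled.items():
            if name not in found and rx.search(line):
                found[name] = idx
    return found
```

For a journal that opens with a "Background" heading, a user could pass `{"Introduction": r"^\s*background\s*$"}` so that section is treated as the introduction.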

kaixindelele commented 1 year ago

It can be done, but right now we don't have time to tune it carefully. There is indeed no technical difficulty here.