Closed Hessen525 closed 1 year ago
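This is an awkward case. Papers generally have an Introduction section. Could you post the title of your paper? I'd like to see what kind of unusual article it is. Also, we don't have time right now to handle these rare cases; once GPT-4 is available we plan to feed in the full long text directly, which should solve this problem.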
@kaixindelele None of the articles in COMMUNICATIONS OF THE ACM have one, e.g. https://cacm.acm.org/magazines/2023/3/270210-ai-and-neurotechnology/fulltext
Some Nature sub-journals also have no Introduction section.
I fed in 20 papers that all have an Introduction, but the "Introduction" error still appeared. From the error output I couldn't tell which specific paper failed to parse.
+1, I'm hitting the same problem. When I run
python chat_paper.py --query "ferroptosis" --filter_keys "ferroptosis" --max_results 5
I get this output:
Key word: reinforcement learning
Query: ferroptosis
Sort: SortCriterion.Relevance
all search:
0 Ferroptosis as a Biological Phase Transition I: avascular and vascular tumor growth 2021-08-26 18:35:46+00:00
1 DCcov: Repositioning of Drugs and Drug Combinations for SARS-CoV-2 Infected Lung through Constraint-Based Modelling 2021-03-25 13:51:37+00:00
filter_keys: ferroptosis
筛选后剩下的论文数量:
filter_results: 2
filter_papers:
0 Ferroptosis as a Biological Phase Transition I: avascular and vascular tumor growth 2021-08-26 18:35:46+00:00
1 DCcov: Repositioning of Drugs and Drug Combinations for SARS-CoV-2 Infected Lung through Constraint-Based Modelling 2021-03-25 13:51:37+00:00
All_paper: 2
paper_path: ./pdf_files/ferroptosis-2023-03-17-00/Ferroptosis as a Biological Phase Transition I_ avascular and vascular tumor growth.pdf
section_page_dict {'Abstract': 0}
0 Abstract 0
download_error: 'Introduction'
paper_path: ./pdf_files/ferroptosis-2023-03-17-00/DCcov_ Repositioning of Drugs and Drug Combinations for SARS-CoV-2 Infected Lung through Constraint-Based Modelling.pdf
section_page_dict {'Abstract': 0}
0 Abstract 0
download_error: 'Introduction'
summary time: 33.19153451919556
Every paper fails with download_error: 'Introduction'. I then fed it the PDF folder directly:
python chat_paper.py --pdf_path "./pdf_files/ferroptosis-2023-03-16-23"
It still fails with the 'Introduction' error; this time the output is:
Key word: reinforcement learning
Query: all: ChatGPT robot
Sort: SortCriterion.Relevance
root: ./pdf_files/ferroptosis-2023-03-17-00 dirs: [] files: ['Ferroptosis as a Biological Phase Transition I_ avascular and vascular tumor growth.pdf', 'DCcov_ Repositioning of Drugs and Drug Combinations for SARS-CoV-2 Infected Lung through Constraint-Based Modelling.pdf']
max_font_sizes [14.039999961853027, 14.039999961853027, 14.039999961853027, 14.039999961853027, 14.039999961853027, 14.039999961853027, 14.039999961853027, 14.039999961853027, 15.960000038146973, 15.960000038146973]
section_page_dict {'Abstract': 0}
0 Abstract 0
start_page, end_page: 0 13
Traceback (most recent call last):
File "chat_paper.py", line 468, in <module>
main(args=args)
File "chat_paper.py", line 433, in main
paper_list.append(Paper(path=os.path.join(root, filename)))
File "/home/cyril/git/ChatPaper/get_paper_from_pdf.py", line 17, in __init__
self.parse_pdf()
File "/home/cyril/git/ChatPaper/get_paper_from_pdf.py", line 33, in parse_pdf
self.section_text_dict.update({"paper_info": self.get_paper_info()})
File "/home/cyril/git/ChatPaper/get_paper_from_pdf.py", line 42, in get_paper_info
introduction_text = self.section_text_dict['Introduction']
KeyError: 'Introduction'
If you're on the latest code, I'll rerun it right away. Sorry about that! There may well be a bug we haven't found yet!
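Hi everyone, I've found where the problem comes from: in these papers the Introduction heading is not preceded by a section number, so the Introduction section was never parsed! This is an oversight in our logic, but we'll leave it for now; let's wait for the far-off GPT-4, haha.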
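For anyone who wants to experiment in the meantime, here is a minimal sketch of a heading matcher that accepts an Introduction heading with or without a leading number. It is not ChatPaper's actual parsing code; `INTRO_HEADING` and `find_heading_line` are hypothetical names, and the pattern assumes the heading sits on a line by itself.

```python
import re

# Hypothetical pattern: an optional section number ("1", "1.", "1.1", "I.")
# followed by the word "Introduction" alone on the line, case-insensitive.
INTRO_HEADING = re.compile(
    r"^\s*(?:(?:\d+(?:\.\d+)*|[IVX]+)\.?\s+)?introduction\s*$",
    re.IGNORECASE,
)


def find_heading_line(lines):
    """Return the index of the first Introduction-like heading line, or None."""
    for index, line in enumerate(lines):
        if INTRO_HEADING.match(line):
            return index
    return None
```

Anchoring the pattern at both ends keeps ordinary sentences that merely mention the word "introduction" from being mistaken for a heading.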
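Would it be possible to let users define their own matching pattern for each section, for example with regular expressions? Many papers in my library either have no Introduction heading at all, or have one without a section number.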
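To make the request concrete, here is a rough sketch of what user-defined section patterns could look like. Everything in it is an assumption for illustration: `DEFAULT_SECTION_PATTERNS` and `locate_sections` are hypothetical and do not exist in ChatPaper, and the input is assumed to be the PDF text already split into lines.

```python
import re

# Hypothetical defaults: section name -> regex for that section's heading line.
DEFAULT_SECTION_PATTERNS = {
    "Abstract": r"^\s*abstract\s*$",
    "Introduction": r"^\s*(?:\d+\.?\s+)?introduction\s*$",
}


def locate_sections(lines, user_patterns=None):
    """Map each section name to the index of the first line matching its pattern.

    Entries in user_patterns override or extend the defaults.
    """
    patterns = {**DEFAULT_SECTION_PATTERNS, **(user_patterns or {})}
    found = {}
    for name, pattern in patterns.items():
        regex = re.compile(pattern, re.IGNORECASE)
        for index, line in enumerate(lines):
            if regex.search(line):
                found[name] = index
                break
    return found
```

A user whose papers open with a "Background" heading could then call `locate_sections(lines, {"Introduction": r"^\s*background\s*$"})` to treat that heading as the Introduction.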
It can be done, but we don't have time to tune it carefully right now. There is no real technical difficulty here.
Traceback (most recent call last):
File "C:\Users\admin\Documents\GitHub\ChatPaper\chat_paper.py", line 471, in <module>
main(args=args)
File "C:\Users\admin\Documents\GitHub\ChatPaper\chat_paper.py", line 436, in main
paper_list.append(Paper(path=os.path.join(root, filename)))
File "C:\Users\admin\Documents\GitHub\ChatPaper\get_paper_from_pdf.py", line 17, in __init__
self.parse_pdf()
File "C:\Users\admin\Documents\GitHub\ChatPaper\get_paper_from_pdf.py", line 33, in parse_pdf
self.section_text_dict.update({"paper_info": self.get_paper_info()})
File "C:\Users\admin\Documents\GitHub\ChatPaper\get_paper_from_pdf.py", line 42, in get_paper_info
introduction_text = self.section_text_dict['Introduction']
KeyError: 'Introduction'
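Until the section parsing is improved, one possible workaround is to avoid the hard dictionary lookup that raises the KeyError in the traceback above. This is only a sketch: `get_introduction_text` is a hypothetical helper, and falling back to the Abstract text is an assumption about what would be acceptable, not how ChatPaper behaves.

```python
def get_introduction_text(section_text_dict):
    """Return the Introduction text, or a fallback if that section was not parsed."""
    # dict.get returns None instead of raising KeyError when the key is missing.
    introduction_text = section_text_dict.get("Introduction")
    if introduction_text:
        return introduction_text
    # Assumed fallback: use the Abstract if present, otherwise an empty string.
    return section_text_dict.get("Abstract", "")
```

With a dictionary like the `{'Abstract': 0}` case in the logs, where no Introduction key exists, this returns the Abstract entry instead of crashing.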