Open abhishek7997 opened 1 year ago
Hi @abhishek7997, thank you for the feedback! I'm not totally sure what could be happening here 🤔 Could you provide some example code so we can try to reproduce this on our side?
Pinging @ccordoba12 and @impact27 (maybe they have some ideas on how to debug this further)
Any other info that helps us reproduce this is greatly appreciated. Let us know!
In my case, here is the code setup that I work with. The debugger first freezes, then crashes. I don't think it is specific to this code. I am using a separate environment in Spyder (created with Miniconda).
main.py:
from constants import CONSTANTS
from file_utils import FileUtils

ITEMS = {
    'text0': "./papers/research-paper.pdf"
}

for text in ITEMS.values():
    pages = FileUtils.extract_paragraphs(text)  # <- I have added breakpoint here
    # rest of the code
utils.py:
import re
from io import StringIO
from pdfminer.high_level import extract_text
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextBox
from text_utils import TextUtils
from typing import List


class FileUtils:
    @classmethod
    def extract_paragraphs(cls, file_name: str):
        paragraphs = []
        with open(file_name, 'rb') as file:
            resource_manager = PDFResourceManager()
            output_stream = StringIO()
            codec = 'utf-8'
            laparams = LAParams()
            converter = TextConverter(resource_manager, output_stream, codec=codec, laparams=laparams)
            interpreter = PDFPageInterpreter(resource_manager, converter)
            for page in PDFPage.get_pages(file, check_extractable=True):
                interpreter.process_page(page)
                extracted_text = output_stream.getvalue()
                # Sanitize the extracted text
                sanitized_text = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]', ' ', extracted_text)
                print(sanitized_text)  # <- I have added breakpoint here
                paragraphs.extend(sanitized_text.splitlines())
                output_stream.truncate(0)
                output_stream.seek(0)
            converter.close()
            output_stream.close()
        return paragraphs
@abhishek7997, unfortunately your code is not reproducible because it depends on PDFResourceManager, LAParams, etc., which are not imported from anywhere in your utils.py module.
I had removed the import statements before. The library I am using is pdfminer.six. Here are the import statements of file_utils.py:
import re
from io import StringIO
from pdfminer.high_level import extract_text
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextBox
from text_utils import TextUtils
from typing import List
Description
Debugging a Python program in the Spyder IDE using pdb. pdb crashes with the following error: Windows fatal exception: access violation
What steps will reproduce the problem?
No idea. I am just setting breakpoints in my main file and in another file that is imported by the main file. pdb crashes with the error: Windows fatal exception: access violation
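Since the steps to reproduce are unknown, one way to collect more detail is Python's standard faulthandler module, which prints messages of this form on Windows and can write the Python traceback of every thread to a file when the interpreter crashes. The following is only a minimal sketch: the helper name enable_crash_log and the file name crash_traceback.log are hypothetical, and it has not been tried against this particular crash or inside the Spyder debugger.

```python
import faulthandler


def enable_crash_log(path="crash_traceback.log"):
    """Hypothetical helper: write fatal-error tracebacks to `path`.

    The returned file object must stay open for as long as
    faulthandler is enabled, so the caller should keep a reference.
    """
    log_file = open(path, "w")
    # all_threads=True dumps the traceback of every running thread.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Assumed usage, at the very top of main.py before any breakpoint is hit:
#     crash_log = enable_crash_log()
```

If the crash happens again, the contents of the log file could be posted in the Traceback section below.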
Traceback
Another error
Versions
Dependencies