confusing html layout using pdftohtml

Myfootnotsmelly commented 1 year ago

I am using xpdf-tools-win-4.04 to execute the following command:

std_out = subprocess.check_output(["./xpdf-tools-win-4.04/bin64/pdftohtml", input_path+'/'+pdf, xpdf_path+pdf[:-4]+'/'])

But it generates HTML with a confusing layout, which causes image and caption extraction to fail. How can I handle this problem?
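For reference, here is a minimal sketch of the same call with the argument list built separately, so a stray quote or comma is easier to spot. It assumes the Xpdf usage `pdftohtml <PDF-file> <html-dir>`; the helper names `build_pdftohtml_cmd` and `run_pdftohtml` and their parameters are hypothetical, not part of the repository.

```python
import subprocess
from pathlib import Path


def build_pdftohtml_cmd(exe, input_dir, pdf_name, out_root):
    """Return the argument list for: pdftohtml <PDF-file> <html-dir>."""
    pdf_path = Path(input_dir) / pdf_name
    # One output directory per PDF, named after the file without its extension.
    html_dir = Path(out_root) / Path(pdf_name).stem
    return [str(exe), str(pdf_path), str(html_dir)]


def run_pdftohtml(exe, input_dir, pdf_name, out_root):
    """Run pdftohtml and return its standard output as bytes."""
    cmd = build_pdftohtml_cmd(exe, input_dir, pdf_name, out_root)
    # Passing a list (no shell) keeps paths that contain spaces intact.
    return subprocess.check_output(cmd)
```

Printing the result of `build_pdftohtml_cmd(...)` before running it shows exactly which paths are handed to the tool.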

Guo-SY commented 10 months ago

Hi, sorry to bother you. Have you solved this issue yet? Please feel free to message me. Thank you very much.

Myfootnotsmelly commented 10 months ago

Some functions have changed because of library version updates since the author released the initial code. However, I made the following changes to get the program working:

First, I made sure the versions of Google Chrome and ChromeDriver are compatible, and altered line 40 in code/pdf_info.py:
```
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
browser = webdriver.Chrome(options=chrome_options)
```
referring to https://stackoverflow.com/questions/53073411/selenium-webdriverexceptionchrome-failed-to-start-crashed-as-google-chrome-is
I changed the "selenium" version to 4.15.2:
```
selenium==4.15.2
```
Afterwards, I replaced line 46 in code/pdf_info.py
```
page_layout = browser.find_element_by_xpath("/html/body/img")
```
with
```
page_layout = browser.find_element("xpath","/html/body/ing")
```
because Selenium removed that method in version 4.3.0, referring to https://stackoverflow.com/questions/72754651/attributeerror-webdriver-object-has-no-attribute-find-element-by-xpath
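Put together, the Selenium 4 style looks roughly like the sketch below. It passes the options through the `options` keyword and locates the page's `<img>` element, so the XPath ends in `/img`. The function name `read_page_image_size` and the `html_url` parameter are hypothetical and only for illustration; this is not the code in pdf_info.py.

```python
CHROME_FLAGS = ["--headless", "--no-sandbox", "--disable-dev-shm-usage"]


def read_page_image_size(html_url):
    """Open one page in headless Chrome and return the size of its <img>."""
    # Imported here so the module still loads where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    chrome_options = Options()
    for flag in CHROME_FLAGS:
        chrome_options.add_argument(flag)

    # Selenium 4 takes the options object through the `options` keyword.
    browser = webdriver.Chrome(options=chrome_options)
    try:
        browser.get(html_url)
        # find_element(By.XPATH, ...) replaces find_element_by_xpath(...).
        page_layout = browser.find_element(By.XPATH, "/html/body/img")
        return page_layout.size  # dict with "width" and "height"
    finally:
        # Always close the browser, even if the element is not found.
        browser.quit()
```

The `try`/`finally` is a design choice of this sketch: it prevents headless Chrome processes from piling up when a page fails to load.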

Guo-SY commented 10 months ago

Thank you very much for your answer, it is very clear now. I have one more small question. In the file renderer.py I already replaced the original imagemagickPath with mine: imagemagickPath = '/usr/pengyuan/others/ImageMagick-7.0.3-5-portable-Q16-x86/convert.exe'.

And then I always get the error:

    0 1 2 3 4 5 6 7 8 9 10

    IndexError                                Traceback (most recent call last)
    in <cell line: 1>()
         60
         61 # print(images)
    ---> 62 page_fig = images[page_no-1]
         63 rendered_size = page_fig.size
         64

    IndexError: list index out of range

Do you have any idea about the solution? I appreciate the assistance.
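One way to narrow this down is to check how many page images exist before indexing. The sketch below is only a diagnostic aid under the assumption that `images` is a list with one entry per rendered page and `page_no` is 1-based, as the traceback suggests; the helper name `get_page_image` is hypothetical.

```python
def get_page_image(images, page_no):
    """Return the image for a 1-based page number, or raise a clear error."""
    if not 1 <= page_no <= len(images):
        raise IndexError(
            f"page {page_no} requested but only {len(images)} page image(s) "
            "were rendered; check that the rendering step produced output"
        )
    return images[page_no - 1]
```

If the message reports 0 page images, the problem is in the rendering step that fills `images`, not in the indexing line itself.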

Myfootnotsmelly commented 10 months ago

It works for me on Linux because the code takes the "else" branch there, so I did not need to adjust the code you mentioned. Sorry, I have no idea about your question.

diazr04 commented 8 months ago

I am trying to get selenium to work but I get the error:

selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/ing"}

There is no XPath named /html/body/ing.

I tried both ing and img.

Do you have any idea?

SohamTolwala commented 8 months ago

What are the required dependencies for this? ChromeDriver, the XpdfReader tools, and what else?

Myfootnotsmelly commented 8 months ago

> I am trying to get selenium to work but I get the error:
>
> selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"/html/body/ing"}
>
> There is no XPath named /html/body/ing.
>
> I tried both ing and img.
>
> Do you have any idea?

Sorry, a mistake. It should be 'img', and it works in my case.

Myfootnotsmelly commented 8 months ago

> What are the required dependencies for this? ChromeDriver, the XpdfReader tools, and what else?

The version of xpdf-tools is 4.04 and google-chrome is 89.0.4389.114; any chromedriver version close to the google-chrome version is OK.
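As a rough check of what "close" means here, the sketch below compares only the major version numbers of the two version strings. That matching major versions is sufficient is an assumption of this sketch, and the helper names `major_version` and `same_major_version` are hypothetical.

```python
def major_version(version):
    """Return the leading number of a dotted version string as an int."""
    return int(version.strip().split(".")[0])


def same_major_version(chrome_version, driver_version):
    """True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)
```

For example, `same_major_version("89.0.4389.114", driver_version)` can be called with whatever version string your chromedriver reports.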

diazr04 commented 5 months ago

Can someone share working code? I tried and got the image, but the colors are different (like a photo negative) and there are no axes in the image. Can someone help me?

diazr04 commented 5 months ago

@Myfootnotsmelly Hello. Do you have any ideas about my code?

Myfootnotsmelly commented 5 months ago

Sorry, your code is not available yet. By the way, I also get some errors when I use this repo, such as failing to capture figures and confusing layouts, so your issue may be something this repo does not support.

diazr04 commented 4 weeks ago

Thanks for your message, I created a wrapper that actually combines this code with another (figsplit). Thanks anyway.

pengyuanli / PDFigCapX

confusing html layout using pdftohtml #4
