pdftables / python-pdftables-api

Python library to interact with https://pdftables.com API
https://pdftables.com/api
BSD 3-Clause "New" or "Revised" License

API not working for files over 100KB #36

Open jmbanda opened 10 months ago

jmbanda commented 10 months ago

Greetings,

I am submitting a large set of files, and only the smaller ones (under 100KB) are getting processed. All the others are skipped without raising an error or producing any error message. I have adjusted the timeout parameter, and this does not fix the issue.

Thanks!

StevenMaude commented 10 months ago

@jmbanda: sorry for the delayed reply, you caught us over the holiday period.

Is this still an issue? If so, is it possible to provide us with the example code and PDFs to try and reproduce the error?

jmbanda commented 10 months ago

Greetings, yes, this continues to be an issue. I can't provide the PDF as it is private, but any PDF above 100KB was failing with the following code:

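    import pdftables_api

    # timeout=(60, 3600) is a (connect, read) timeout pair, in seconds
    c = pdftables_api.Client('my-api-key', timeout=(60, 3600))
    # Convert input.pdf and write the result to output.xlsx
    c.xlsx('input.pdf', 'output.xlsx')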

The same happens with or without the timeout parameter. We still have plenty of pages left in our paid bundle, so that is not the issue. No error is thrown; the documents are just skipped. If we upload the same document manually through the web UI, it converts fine.
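One way to narrow down a silent skip like this is to record, for each file, its size, any exception raised, and whether an output file actually appeared. The following is only a minimal sketch under stated assumptions: convert_batch is a hypothetical helper, not part of pdftables_api, and it takes the conversion function as an argument so it makes no claims about how the library behaves internally.

```python
import os


def convert_batch(pdf_paths, convert, out_ext=".xlsx"):
    """Call convert(src, dst) for each path and report what happened.

    Hypothetical helper for debugging; `convert` is any callable that
    takes an input path and an output path.
    Returns a list of (src, size_in_bytes, status) tuples.
    """
    results = []
    for src in pdf_paths:
        dst = os.path.splitext(src)[0] + out_ext
        size = None
        try:
            size = os.path.getsize(src)
            convert(src, dst)
        except Exception as exc:
            # Keep going, but record the failure instead of losing it.
            results.append((src, size, f"error: {type(exc).__name__}: {exc}"))
            continue
        # No exception does not guarantee that a file was written.
        if os.path.isfile(dst) and os.path.getsize(dst) > 0:
            results.append((src, size, "ok"))
        else:
            results.append((src, size, "no output written"))
    return results
```

With the client from the snippet above, this could be called as convert_batch(['input.pdf'], c.xlsx), then the results sorted by size to check whether the failures really line up with the 100KB boundary.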

StevenMaude commented 10 months ago

Thanks; we'll add it to our issue queue and take a look, then report back (it may be a few days).

StevenMaude commented 9 months ago

Just to follow up, I've tested the code here on a fresh Ubuntu 22.04 virtual machine and can't reproduce the issue. This was using the Python 3.10 that comes bundled with the operating system.

I did the following:

  1. Created a virtualenv with python3 -m venv api
  2. Activated the virtualenv with source api/bin/activate
  3. Ran pip install git+https://github.com/pdftables/python-pdftables-api.git to install the API code.
  4. Converted a test PDF named input.pdf of size 360 KB with the following code (edited to include my actual API key):

    import pdftables_api
    
    c = pdftables_api.Client('my-api-key', timeout=(60, 3600))
    c.xlsx('input.pdf', 'output.xlsx')

This produced an output Excel file named output.xlsx.

If you can give any more details about the environment in which the code was failing, we can try and reproduce further. It's tricky to fix without encountering the problem, unfortunately.
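For gathering the environment details requested above, a short script along these lines could be run inside the failing environment and its output pasted into the issue. This is a sketch, not an official diagnostic: environment_report is a hypothetical helper, and the distribution names "pdftables-api" and "requests" are assumptions about what is installed; a package that is not found is reported as such rather than raising.

```python
import platform
import sys
from importlib import metadata


def environment_report(packages=("pdftables-api", "requests")):
    """Return a short text report of the Python and package versions.

    The default package names are assumptions; pass your own tuple
    if your environment uses different distributions.
    """
    lines = [
        f"Python: {sys.version.split()[0]}",
        f"Platform: {platform.platform()}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            version = "not installed"
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


print(environment_report())
```

Running it in the same virtualenv (or system Python) that performs the conversions matters, since the versions may differ from those of a fresh install.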