revolunet / pypdftk

Python module to drive the awesome pdftk binary.
Other
147 stars 61 forks source link

Filename arguments with spaces don't work #31

Open aslehigh opened 5 years ago

aslehigh commented 5 years ago

If my filename contains one or more spaces (or, I suppose, any character the shell considers special), the program throws an error:

In [19]: fieldlist = pypdftk.dump_data_fields("test file.pdf")
Error: Unable to find file.
Error: Failed to open PDF file:
   test
Error: Unable to find file.
Error: Failed to open PDF file:
   file.pdf
Done.  Input errors, so no output created.
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-19-7f19f024450d> in <module>()
----> 1 fieldlist = p.dump_data_fields("test file.pdf")

/usr/local/lib/python3.5/dist-packages/pypdftk.py in dump_data_fields(pdf_path)
     94     # Or return bytes with : (will break tests)
     95     #    field_data = map(lambda x: x.split(b': ', 1), run_command(cmd, True))
---> 96     field_data = map(lambda x: x.decode("utf-8").split(': ', 1), run_command(cmd, True))
     97     fields = [list(group) for k, group in itertools.groupby(field_data, lambda x: len(x) == 1) if not k]
     98     return [dict(f) for f in fields]

/usr/local/lib/python3.5/dist-packages/pypdftk.py in run_command(command, shell)
     41 def run_command(command, shell=False):
     42     ''' run a system command and yield output '''
---> 43     p = check_output(command, shell=shell)
     44     return p.split(b'\n')
     45

/usr/local/lib/python3.5/dist-packages/pypdftk.py in check_output(*popenargs, **kwargs)
     35         if cmd is None:
     36             cmd = popenargs[0]
---> 37         raise subprocess.CalledProcessError(retcode, cmd, output=output)
     38     return output
     39

CalledProcessError: Command '/usr/bin/pdftk test file.pdf dump_data_fields' returned non-zero exit status 1

From the last line there I could see what the problem was: the filename is plopped into the command line unquoted. So I can work around the problem by including quote characters in the argument:

In [19]: fieldlist = pypdftk.dump_data_fields("'test file.pdf'")

I think this really should not be necessary. This behaviour would be confusing to new users, and makes the library more difficult to use.

After skimming through the source code, it looks like a minimal solution could be to simply add quote characters around the user-supplied filename arguments everywhere they are used. Quite likely there are other sorts of arguments that should be quoted on the command line also.

But I have another suggestion: why not just quote everything that goes on the command line? The run_command() function could take a list of items or tokens, and simply join them with a space between after adding the quote characters. Furthermore, it appears to me that the run_command() function is called with the same first argument every time. If this is true, why not include it in the function instead of requiring the caller to pass it? Obviously this would involve a lot more modification to the code, but I think it would simplify it on the whole, and potentially eliminate a whole class of possible bugs/user errors. I guess that would be a breaking change though for existing code that already adds quotes.

revolunet commented 5 years ago

Yes, indeed, this is confusing

revolunet commented 5 years ago

Tried to add quotes from the code but this doesnt seem to work :/

Curious how other libs manage this

aslehigh commented 5 years ago

For the simplest possible solution I would suggest replacing command in the check_output function call in the run_command function (currently line 43 in pypdftk.py) with map(shlex.quote, command) ... and of course add the shlex import at the top.

This would accomplish my suggestion of quoting everything (and address a whole class of potentially security-related bugs in code using this library). It would be a breaking change though for anyone already quoting the arguments in their own code. But I'm guessing maybe not too many people have run into this or you would have heard about it before now........?

tomasagjr commented 3 years ago

Tried to add quotes from the code but this doesnt seem to work :/

Curious how other libs manage this

Possibly using f strings as follows... filename = f'"{os.path.join(path, "filename with spaces.pdf")}"'

Note, however, that if your f string starts (and ends) with single quotes, then any strings (incl. key names), inside the curly braces must be in double quotes, and vice versa. Otherwise, you'll raise a SyntaxError.