jgm / pandoc

Universal markup converter
https://pandoc.org

file: based urls do not work on windows #4613

Closed jankatins closed 6 years ago

jankatins commented 6 years ago

I am trying to make pypandoc usable with file-based URLs: https://github.com/bebraw/pypandoc/pull/157.

According to the tests in https://github.com/bebraw/pypandoc/pull/157, these URLs work on Linux but not on Windows. I can't debug this further because I do not have a Windows system, but this is what AppVeyor gives me: https://ci.appveyor.com/project/bebraw/pypandoc/build/job/jlbxo137luujc4cm

# printed file: based url
file:///C:/users/appveyor/appdata/local/temp/1/tmpxy_wob.md
[...]
# python exception because of pandoc failure
RuntimeError: Pandoc died with exitcode "1" during conversion: pandoc: /C:/users/appveyor/appdata/local/temp/1/tmpxy_wob.md: openBinaryFile: invalid argument (Invalid argument)

Earlier in the code path (in the call to _identify_path), that file URL was converted to a normal path, and the file was verified to exist.

(original addition of file: urls in pandoc: https://github.com/jgm/pandoc/issues/3196)

mb21 commented 6 years ago

Well, what exact pandoc command does pypandoc call that’s failing? And with what input? Also: pandoc version?

link2xt commented 6 years ago

Looks like you tried to open /C:/users/appveyor/... instead of C:/users/appveyor/.... Linux does not care about extra slashes at the beginning of absolute filenames, while Windows obviously does, so opening the file fails because of that slash before C:.

jankatins commented 6 years ago

The call should be pandoc --from=markdown --to=rst file:///C:/users/appveyor/appdata/local/temp/1/tmpxy_wob.md (which is a correct URL on Windows according to https://blogs.msdn.microsoft.com/ie/2006/12/06/file-uris-in-windows/). On Linux it is called with --to=rst file:///tmp/tmp6_jab5gc.md and that works.

Pandoc is version 2.2, installed from the MSI file on GitHub.

jankatins commented 6 years ago

According to MS that is a correct URL:

For the local Windows file path
C:\Documents and Settings\davris\FileSchemeURIs.doc
The corresponding valid file URI in Windows is:
file:///C:/Documents%20and%20Settings/davris/FileSchemeURIs.doc
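
The quoted example also shows that spaces in the Windows path appear as %20 in the URI, so a consumer has to undo percent-encoding before opening the file. A minimal sketch of that step follows, using only base. The name percentDecode is hypothetical and not part of pandoc; the sketch maps each %XX to a single character and does not handle multi-byte UTF-8 sequences.

```haskell
import Data.Char (chr, digitToInt, isHexDigit)

-- | Hypothetical helper: replace each %XX escape with the character whose
-- code is the hex value XX. Malformed escapes are left untouched.
-- Limitation: multi-byte UTF-8 escapes are not reassembled.
percentDecode :: String -> String
percentDecode ('%' : a : b : rest)
  | isHexDigit a && isHexDigit b =
      chr (16 * digitToInt a + digitToInt b) : percentDecode rest
percentDecode (c : rest) = c : percentDecode rest
percentDecode [] = []
```

Applied to "/C:/Documents%20and%20Settings/davris/FileSchemeURIs.doc", this yields the same path with literal spaces.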

link2xt commented 6 years ago

Hmm, looks like a bug in pandoc then. file:///C:/users/appveyor/appdata/local should be interpreted as file://localhost/C:/users/appveyor/appdata/local. The slash after localhost is part of the URI scheme syntax, not the path.
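
To make the two spellings concrete, here is a rough sketch that splits a file: URI into its authority and path using only base. The name splitFileURI is hypothetical, and the sketch ignores percent-encoding, queries, fragments, and scheme case. A generic split like this reports the path with its leading slash for both spellings, which matches the /C:/... form seen in the error message above.

```haskell
import Data.List (stripPrefix)

-- | Hypothetical helper: split a file: URI into (authority, path).
-- Returns Nothing when the input does not start with "file://".
splitFileURI :: String -> Maybe (String, String)
splitFileURI uri = do
  rest <- stripPrefix "file://" uri
  -- Everything up to the first '/' is the authority (possibly empty);
  -- the remainder, including that '/', is the path.
  pure (break (== '/') rest)
```

With this sketch, "file:///C:/users" gives an empty authority and "file://localhost/C:/users" gives "localhost", and the path is "/C:/users" in both cases.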

link2xt commented 6 years ago

Related issue #3196

jgm commented 6 years ago

We just try to open up the path returned by Network.URI.uriPath:

Prelude Network.URI> uriPath <$> parseURI "file:///c:/data/external/pypandoc/test.md"
Just "/c:/data/external/pypandoc/test.md"

When we're on Linux, we want to treat this initial / as part of the path, or we'll (incorrectly) get a relative path instead of an absolute one.

We could add something ad hoc so that it is stripped off on Windows (i.e., when it's followed by LETTER + COLON). Is there a better approach?
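
One way the ad hoc rule could look is sketched below. This is an illustration only, not pandoc's actual code: the name uriPathToFilePath is hypothetical, and the platform is passed in as a Bool so the function stays pure (real code might instead check System.Info.os).

```haskell
import Data.Char (isAsciiLower, isAsciiUpper)

-- | Hypothetical helper: map the path component of a file: URI to a local
-- file path. On Windows, a leading '/' followed by LETTER + COLON is
-- dropped; in every other case the path is returned unchanged, so
-- absolute POSIX paths keep their leading slash.
uriPathToFilePath :: Bool      -- ^ True when targeting Windows (assumption)
                  -> String    -- ^ path as returned by a URI parser
                  -> FilePath
uriPathToFilePath isWindows path =
  case path of
    ('/' : d : ':' : rest)
      | isWindows && isDriveLetter d -> d : ':' : rest
    _ -> path
  where
    isDriveLetter c = isAsciiUpper c || isAsciiLower c
```

Under these assumptions, uriPathToFilePath True "/c:/data/external/pypandoc/test.md" returns "c:/data/external/pypandoc/test.md", while uriPathToFilePath False "/foo" returns "/foo" unchanged.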

EDIT: Relevant function is readSource in Text.Pandoc.App. There might also be a similar issue with images and e.g. --self-contained. See l. 593 of Text.Pandoc.Class:

            Just u' | uriScheme u' == "file:" ->
                 readLocalFile $ dropWhile (=='/') (uriPath u')

Though, come to think of it, this looks wrong even for linux -- won't it replace an absolute path with a relative one?

mb21 commented 6 years ago

won't it replace an absolute path with a relative one?

yes:

λ> let (Just u) = parseURI "file:///foo"
λ> dropWhile (=='/') (uriPath u)
"foo"