regex errors when parsing image targets

mkoertgen commented 8 years ago

Running the current mkdocs2pandoc 0.2.4 gives me this error:

Traceback (most recent call last):
  File "C:\python\Scripts\mkdocs2pandoc-script.py", line 9, in <module>
    load_entry_point('mkdocs-pandoc==0.2.4', 'console_scripts', 'mkdocs2pandoc')()
  File "C:\python\lib\site-packages\mkdocs_pandoc\cli\mkdocs2pandoc.py", line 80, in main
    for line in pconv.convert():
  File "C:\python\lib\site-packages\mkdocs_pandoc\pandoc_converter.py", line 139, in convert
    lines_tmp = f_image.run(lines_tmp)
  File "C:\python\lib\site-packages\mkdocs_pandoc\filters\images.py", line 75, in run
    '![%s](%s)' % (alt, img_name), line)
  File "C:\python\lib\re.py", line 179, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\python\lib\re.py", line 326, in _subx
    template = _compile_repl(template, pattern)
  File "C:\python\lib\re.py", line 313, in _compile_repl
    p = sre_parse.parse_template(repl, pattern)
  File "C:\python\lib\sre_parse.py", line 812, in parse_template
    raise error("missing group name")
sre_constants.error: missing group name

which refers to images.py#L75.

According to this SO answer, the problem might be caused by either the image name or the alternative caption containing special characters that need escaping before being fed into the regex.
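
A minimal sketch of how a backslash in the replacement string can make re.sub fail, assuming a Windows-style path ends up in the replacement. The pattern, path, and line below are hypothetical stand-ins, not the actual values from images.py, and the exact error message depends on the Python version.

```python
import re

# Hypothetical values; the real ones are computed in images.py.
alt = ""
img_name = r"C:\docs\gfx\logo.png"   # "\g" is read as the start of a group reference
line = "![](gfx/logo.png)"
pattern = r"!\[.*?\]\(.*?\)"         # simplified image pattern, for illustration only

replacement = "![%s](%s)" % (alt, img_name)

try:
    re.sub(pattern, replacement, line)
    failed = False
except re.error:
    # Backslash sequences in the replacement are parsed as escapes.
    failed = True

# Passing a function instead of a string makes re.sub insert the
# return value literally, without parsing escapes.
fixed = re.sub(pattern, lambda m: replacement, line)
print(failed, fixed)
```

Using a function as the replacement sidesteps escape handling entirely, which is one option when the replacement text comes from file paths.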

I am not so much into Python, but I would be willing to work on this issue. Please let me know.

jgrassler commented 8 years ago

I'd probably have a hard time replicating this, so I'd appreciate if you could look into it, yes. Thanks! If you need help anywhere, just comment on this issue and/or send me an email. Just submit a pull request if you've got a working fix and I'll take a look at it.

Alternatively, you could provide me with the troublesome mkdocs documentation (if it's something you can share, that is) and/or a minimal example that provokes the problem and I can have a go at it.

mkoertgen commented 8 years ago

I added a quick print to see the values fed into the regex.

It's simple: img_name is calculated as the absolute file name. On Windows, img_name then typically contains some \ which need to be handled.

I didn't specify alt anywhere, so it's always empty.

Anyway, I added re.escape() for both arguments. Now I'm watching Python run for about 15 minutes, consuming 1.5GB and, so far, not writing a single character to output.

It seems that escaping just keeps adding \ to the regex, so the while loop never terminates. I'll get back to you ;-)
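
As a side note, re.escape() is meant for patterns rather than replacement strings. The sketch below uses hypothetical values (not the real ones from images.py) to show that it leaves stray backslashes in the output, whereas doubling only the backslashes yields the literal path. It does not attempt to reproduce the non-terminating loop.

```python
import re

# Hypothetical values, for illustration only.
img_name = r"C:\docs\gfx\logo.png"
line = "![](gfx/logo.png)"
pattern = r"!\[.*?\]\(.*?\)"

# re.escape also escapes "." (and more, on older Pythons); in a
# replacement string that "\." is kept as-is, backslash included.
over_escaped = re.sub(pattern, "![](%s)" % re.escape(img_name), line)

# Escaping only the backslashes is enough for a replacement string.
literal = re.sub(pattern, "![](%s)" % img_name.replace("\\", "\\\\"), line)

print(over_escaped)
print(literal)
```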

mkoertgen commented 8 years ago

I ended up replacing the Windows path separator. This works fine for me but adds a little unnecessary extra work on Unix systems, cf.: https://github.com/mkoertgen/mkdocs-pandoc/commit/5b5888788a0835a594d8e285a4504703d557ebc4
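
A rough sketch of what such a separator replacement could look like. The helper name to_forward_slashes and the sample values are assumptions for illustration; the linked commit may do this differently.

```python
import re

def to_forward_slashes(path):
    """Hypothetical helper: swap Windows separators for forward slashes."""
    # On Unix paths this is a no-op, apart from the extra string scan.
    return path.replace("\\", "/")

# Hypothetical values, for illustration only.
img_name = to_forward_slashes(r"C:\docs\gfx\logo.png")
line = "![](gfx/logo.png)"
pattern = r"!\[.*?\]\(.*?\)"

# With no backslashes left, the replacement string is safe for re.sub.
result = re.sub(pattern, "![%s](%s)" % ("", img_name), line)
print(result)
```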

jgrassler commented 8 years ago

The path separator adjustment code looks fine, apart from the inline comment I made regarding the debug statement. I don't mind that little bit of extra code, especially if that's all it takes to port mkdocs-pandoc to Windows. Did you get the expected Pandoc output with that fix in place? If so, and if you remove the debug statement I'm happy to merge that commit.

jgrassler commented 8 years ago

As for the changes in README.md: I'd prefer to retain the fenced code blocks the way they are. While that syntax is not supported in all markdown dialects, it is in most and it is more resilient against editing errors (tabs vs. spaces can break things with indented blocks).

I'm happy to include the --proxy thing and anything else I may have missed, though :-)

mkoertgen commented 8 years ago

Ah ok, good point. I was not aware of that. You should see the PR updated with https://github.com/mkoertgen/mkdocs-pandoc/commit/d27393900c8641ecb8150cc54e41efb3e46703f6

Yes, the generated .pd looks quite fine, so I think Windows compatibility is good.

One point was that I needed to insert some page breaks here and there to make the output look good. I didn't fully grok the code, but I guess that flattening the pages from mkdocs.yml and adding headers might need a preceding extra line break in some situations, probably here: chapterhead.py#L29. This worked for me:

    head = ['\n' + ('#' * self.headlevel) + ' ' + self.title, '']
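
To illustrate why the leading newline can matter: Pandoc's Markdown reader by default expects a blank line before a heading, so a heading appended directly after the last line of the previous page may be read as ordinary text. The sketch below uses a hypothetical chapter_head function, not the actual code in chapterhead.py.

```python
def chapter_head(title, headlevel):
    """Hypothetical helper: heading lines with a blank line before and after."""
    return ["", "#" * headlevel + " " + title, ""]

# Last line of the previous page, followed by the next chapter heading.
body = ["Last paragraph of the previous page."]
merged = "\n".join(body + chapter_head("Next chapter", 1))
print(merged)
```

Joined with newlines, this gives the same text as the one-element variant with a '\n' prefix.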

Also, when generating a PDF, it did not include the whole TOC. Skipping the head level filter kind of fixed this for me: pandoc_converter.py#L137

But this should probably go into another issue. I would be happy to help out there if I can.

Update: This seems more to be a pandoc issue. The output generated by mkdocs2pandoc looks quite good.

jgrassler commented 8 years ago

Ok, the pull request looks good now, thanks! One last thing: Could you rebase your topic branch down to just one commit so the commit log does not get so crowded? Once that's done I'll happily merge it.

Good to hear that Windows compatibility is almost there :-)

As for the extra newlines thing, yeah, that's definitely another bug. Let's create one and look at it in detail over there...

jgrassler commented 8 years ago

Merged, Thanks!

I'll hold off on cutting a new release until the missing linebreaks problem is resolved. Can you open an issue for that?

mkoertgen commented 8 years ago

Sure. I will try to reproduce this in a clean way when opening the issue. Might take some time, though.

jgrassler commented 8 years ago

No worries. Thanks!

jgrassler / mkdocs-pandoc

regex errors when parsing image targets #5