jgm / pandoc

Universal markup converter
https://pandoc.org

--embed-resources causes invalid stream error #10110

Closed fujohnwang closed 3 weeks ago

fujohnwang commented 3 weeks ago

Explain the problem.

First, I ran the command pandoc -s --toc --toc-depth=4 --self-contained -c ~/FuqiangWorks/templates/pandoc/style.css -A ~/FuqiangWorks/templates/pandoc/footer.html "posts/2024-08-22-shop-crowd.md" > "posts/2024-08-22-shop-crowd.md.html" and got this error:

pandoc: Cannot decode byte '\xff': Data.Text.Encoding: Invalid UTF-8 stream

When I switched to the newer options, as suggested:

pandoc -s --toc --toc-depth=4 --embed-resources --standalone -c ~/FuqiangWorks/templates/pandoc/style.css -A ~/FuqiangWorks/templates/pandoc/footer.html -f markdown -t html "posts/2024-08-22-shop-crowd.md" > "posts/2024-08-22-shop-crowd.md.html"

Same error.

I asked ChatGPT and other GPT models and searched Google, but found nothing helpful.

So I had to debug by removing options one by one (after first checking the file encoding):

(base) LuckyJohn💫 ➜  afoo.me git:(master) ✗ pandoc  "posts/2024-08-22-shop-crowd.md" > "posts/2024-08-22-shop-crowd.md.html"
(base) LuckyJohn💫 ➜  afoo.me git:(master) ✗ pandoc -s "posts/2024-08-22-shop-crowd.md" > "posts/2024-08-22-shop-crowd.md.html"
(base) LuckyJohn💫 ➜  afoo.me git:(master) ✗ pandoc -s --embed-resources "posts/2024-08-22-shop-crowd.md" > "posts/2024-08-22-shop-crowd.md.html"
pandoc: Cannot decode byte '\xff': Data.Text.Encoding: Invalid UTF-8 stream

So I think this is a bug.

This option is a blocking issue for me.

Pandoc version?

I first ran pandoc 2.14.2, but upgrading to the newest version did not help.

(base) LuckyJohn💫 ➜  afoo.me git:(master) ✗ pandoc -v
pandoc 3.3
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /Users/fq/.local/share/pandoc
Copyright (C) 2006-2024 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

jgm commented 3 weeks ago

We can't really test this without the input markdown file. Could you upload it? (or a reduced, anonymized version that is still sufficient to create the problem).

fujohnwang commented 3 weeks ago

2024-08-22-shop-crowd.md

Sure, this is the source file.

And my OS info:

image

jgm commented 3 weeks ago

Thank you. Here is a minimal way to reproduce the issue:

% pandoc --embed-resources
![](https://images.afoo.me/file/d8e6c3d189873e64a8cb3.jpg)
^D
pandoc: Cannot decode byte '\xff': Data.Text.Encoding: Invalid UTF-8 stream

jgm commented 3 weeks ago

This seems to be a problem with the way the image is being served by the web server. curl --verbose shows us the header:

< HTTP/2 200 
< date: Fri, 23 Aug 2024 17:40:00 GMT
< content-type: text/plain;charset=UTF-8
< cf-ray: 8b7ce29e2c46969b-SJC
< cf-cache-status: HIT
< access-control-allow-origin: https://afoo.me
< age: 1986
< cache-control: max-age=14400
< last-modified: Fri, 23 Aug 2024 17:06:54 GMT
< vary: Origin
< access-control-allow-credentials: true
< cf-placement: local-SJC

Note that the MIME type is being reported as text/plain;charset=UTF-8, when it should be image/jpeg. Pandoc thus tries to treat this content as plain UTF-8 encoded text, and that's why an error is raised.
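
The failure mode described above can be reproduced without pandoc. The Python sketch below is only an illustrative model, not pandoc's actual code: `to_data_uri` and `sniff_is_jpeg` are hypothetical helpers, and the assumption is that an embedder trusts the declared Content-Type and decodes textual types as UTF-8. Every JPEG file starts with the bytes FF D8 FF, and 0xFF can never appear in valid UTF-8, which matches the byte named in the error message.

```python
import base64
from urllib.parse import quote

# First bytes of any JPEG file: the SOI marker FF D8, followed by FF.
JPEG_MAGIC = b"\xff\xd8\xff"


def sniff_is_jpeg(body: bytes) -> bool:
    """Return True if the payload starts with the JPEG magic bytes."""
    return body.startswith(JPEG_MAGIC)


def to_data_uri(body: bytes, content_type: str) -> str:
    """Hypothetical embedder that trusts the declared Content-Type.

    Textual types are decoded as UTF-8 (which raises on binary data);
    everything else is base64-encoded.
    """
    mime = content_type.split(";")[0].strip().lower()
    if mime.startswith("text/"):
        text = body.decode("utf-8")  # raises UnicodeDecodeError on 0xff
        return f"data:{mime};charset=utf-8,{quote(text)}"
    encoded = base64.b64encode(body).decode("ascii")
    return f"data:{mime};base64,{encoded}"


# A tiny stand-in for the start of a JPEG file (not a complete image).
fake_jpeg = JPEG_MAGIC + b"\xe0\x00\x10JFIF\x00"

# Mislabeled as text: decoding fails on the very first byte.
try:
    to_data_uri(fake_jpeg, "text/plain;charset=UTF-8")
    mislabeled_error = None
except UnicodeDecodeError as exc:
    mislabeled_error = exc

# Correctly labeled: the bytes are base64-encoded without any decoding.
correct_uri = to_data_uri(fake_jpeg, "image/jpeg")
```

Comparing `sniff_is_jpeg(body)` against the declared Content-Type is one quick way to spot this kind of server misconfiguration before running the conversion.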

jgm commented 3 weeks ago

Closing this, as it is not a bug.

fujohnwang commented 2 weeks ago

This seems to be a problem with the way the image is being served by the web server. curl --verbose shows us the header: ...

I have fixed the wrong MIME type on the remote endpoint, which now returns the correct image/jpeg content type, but I still get the same error 😅

jgm commented 2 weeks ago

Works fine for me now. Perhaps you were getting a cached version.
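
The caching explanation is consistent with the headers in the curl output earlier in this thread (cf-cache-status: HIT, cache-control: max-age=14400, age: 1986). As a rough sketch, assuming a cache honors max-age and nothing else, the Python below estimates how long that stale response could still have been served. The helper names are hypothetical, and a CDN may apply its own TTL rules, so treat the result as an estimate only.

```python
def parse_max_age(cache_control: str):
    """Extract max-age (in seconds) from a Cache-Control value, or None."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def remaining_freshness(headers: dict) -> int:
    """Seconds a cache may still serve this response without revalidating."""
    max_age = parse_max_age(headers.get("cache-control", ""))
    if max_age is None:
        return 0
    age = int(headers.get("age", "0"))
    return max(0, max_age - age)


# Values copied from the curl output earlier in this thread.
observed = {"cache-control": "max-age=14400", "age": "1986"}
seconds_left = remaining_freshness(observed)  # 14400 - 1986 = 12414
```

Under that assumption, the mislabeled response could have been served from cache for about 3.4 more hours after the capture, so a fix on the origin server would not necessarily be visible right away.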

fujohnwang commented 2 weeks ago

OK, I will try later, thanks a million

fujohnwang commented 2 weeks ago

It works~ 🫰🫰🫰