gotson / komga

Media server for comics/mangas/BDs/magazines/eBooks with API, OPDS and Kobo Sync support
https://komga.org
MIT License

Epub files are not always parsed correctly #556

Closed steve1977 closed 3 years ago

steve1977 commented 3 years ago

Moving from Discord to GH request as you had suggested.

I have embedded metadata in my epub3 files, which are created by Calibre. The Calibre developer claims to comply with the epub3 specification.

As shared over Discord, the series tag is not picked up by Komga. Also, the book summary starts and ends with <div> <p> in Komga.

Below the full opf content for your info.

<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="uuid_id" prefix="calibre: https://calibre-ebook.com">
  <metadata xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:calibre="http://calibre.kovidgoyal.net/2009/metadata">
    <dc:title id="id">Panik im Paradies</dc:title>
    <dc:creator id="id-1">Ulf Blanck</dc:creator>
    <dc:identifier>goodreads:222735</dc:identifier>
    <dc:identifier>isbn:9783440077894</dc:identifier>
    <dc:identifier>calibre:255</dc:identifier>
    <dc:identifier>uuid:499def46-39dc-4e79-b474-d0ec12ea5dc5</dc:identifier>
    <dc:identifier id="uuid_id">uuid:499def46-39dc-4e79-b474-d0ec12ea5dc5</dc:identifier>
    <dc:language>de</dc:language>
    <dc:date>1999-07-31T16:00:00+00:00</dc:date>
    <dc:description>&lt;div&gt;
&lt;p&gt;Bereits im ersten Band "Panik im Paradies" machen die drei berühmten Detektive ihrem Namen alle Ehre. Eigentlich haben sie ja gerade Ferien. Doch dann treffen sie auf diesen schrulligen Kapitän Larsson, der sich einen kleinen Privatzoo mit exotischen Tieren hält. Als plötzlich alle Tiere an rätselhaften Infektionen erkranken und die Besucher ausbleiben, werden Justus, Peter und Bob neugierig. Schon bald merken sie, daß da jemand ein düsteres Geheimnis hütet...&lt;/p&gt;&lt;/div&gt;</dc:description>
    <dc:publisher>Kosmos</dc:publisher>
    <dc:subject>Kinder- und Jugendbücher</dc:subject>
    <opf:meta refines="#id" property="title-type">main</opf:meta>
    <opf:meta refines="#id" property="file-as">Panik im Paradies</opf:meta>
    <meta name="cover" content="cover"/>
    <meta property="calibre:timestamp" scheme="dcterms:W3CDTF">2020-08-09T08:40:58Z</meta>
    <meta property="dcterms:modified" scheme="dcterms:W3CDTF">2021-06-19T08:20:33Z</meta>
    <opf:meta refines="#id-1" property="role" scheme="marc:relators">aut</opf:meta>
    <opf:meta refines="#id-1" property="file-as">Blanck, Ulf</opf:meta>
    <opf:meta property="calibre:rating">6</opf:meta>
    <opf:meta property="belongs-to-collection" id="id-2">Die drei ??? Kids</opf:meta>
    <opf:meta refines="#id-2" property="collection-type">series</opf:meta>
    <opf:meta refines="#id-2" property="group-position">1</opf:meta>
    <opf:meta property="calibre:author_link_map">{"Ulf Blanck": ""}</opf:meta>
  </metadata>
  <manifest>
    <item id="titlepage" href="titlepage.xhtml" media-type="application/xhtml+xml" properties="svg calibre:title-page"/>
    <item id="TableOfContents_html" href="OPS/TableOfContents.html" media-type="application/xhtml+xml"/>
    <item id="section-0001_html" href="OPS/section-0001.html" media-type="application/xhtml+xml"/>
    <item id="section-0002_html" href="OPS/section-0002.html" media-type="application/xhtml+xml"/>
    <item id="section-0003_html" href="OPS/section-0003.html" media-type="application/xhtml+xml"/>
    <item id="section-0004_html" href="OPS/section-0004.html" media-type="application/xhtml+xml"/>
    <item id="section-0005_html" href="OPS/section-0005.html" media-type="application/xhtml+xml"/>
    <item id="section-0006_html" href="OPS/section-0006.html" media-type="application/xhtml+xml"/>
    <item id="section-0007_html" href="OPS/section-0007.html" media-type="application/xhtml+xml"/>
    <item id="section-0008_html" href="OPS/section-0008.html" media-type="application/xhtml+xml"/>
    <item id="section-0009_html" href="OPS/section-0009.html" media-type="application/xhtml+xml"/>
    <item id="section-0010_html" href="OPS/section-0010.html" media-type="application/xhtml+xml"/>
    <item id="section-0011_html" href="OPS/section-0011.html" media-type="application/xhtml+xml"/>
    <item id="section-0012_html" href="OPS/section-0012.html" media-type="application/xhtml+xml"/>
    <item id="section-0013_html" href="OPS/section-0013.html" media-type="application/xhtml+xml"/>
    <item id="section-0014_html" href="OPS/section-0014.html" media-type="application/xhtml+xml"/>
    <item id="section-0015_html" href="OPS/section-0015.html" media-type="application/xhtml+xml"/>
    <item id="section-0016_html" href="OPS/section-0016.html" media-type="application/xhtml+xml"/>
    <item id="section-0017_html" href="OPS/section-0017.html" media-type="application/xhtml+xml"/>
    <item id="section-0018_html" href="OPS/section-0018.html" media-type="application/xhtml+xml"/>
    <item id="section-0019_html" href="OPS/section-0019.html" media-type="application/xhtml+xml"/>
    <item id="section-0020_html" href="OPS/section-0020.html" media-type="application/xhtml+xml"/>
    <item id="section-0021_html" href="OPS/section-0021.html" media-type="application/xhtml+xml"/>
    <item id="section-0022_html" href="OPS/section-0022.html" media-type="application/xhtml+xml"/>
    <item id="section-0023_html" href="OPS/section-0023.html" media-type="application/xhtml+xml"/>
    <item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
    <item id="page_css" href="page_styles.css" media-type="text/css"/>
    <item id="css" href="stylesheet.css" media-type="text/css"/>
    <item id="cover" href="cover.jpeg" media-type="image/jpeg" properties="cover-image"/>
    <item id="image0_jpg" href="OPS/image0.jpg" media-type="image/jpeg"/>
    <item id="image1_jpg" href="OPS/image1.jpg" media-type="image/jpeg"/>
    <item id="image10_jpg" href="OPS/image10.jpg" media-type="image/jpeg"/>
    <item id="image11_jpg" href="OPS/image11.jpg" media-type="image/jpeg"/>
    <item id="image12_jpg" href="OPS/image12.jpg" media-type="image/jpeg"/>
    <item id="image13_jpg" href="OPS/image13.jpg" media-type="image/jpeg"/>
    <item id="image14_jpg" href="OPS/image14.jpg" media-type="image/jpeg"/>
    <item id="image15_jpg" href="OPS/image15.jpg" media-type="image/jpeg"/>
    <item id="image16_jpg" href="OPS/image16.jpg" media-type="image/jpeg"/>
    <item id="image17_jpg" href="OPS/image17.jpg" media-type="image/jpeg"/>
    <item id="image18_jpg" href="OPS/image18.jpg" media-type="image/jpeg"/>
    <item id="image19_jpg" href="OPS/image19.jpg" media-type="image/jpeg"/>
    <item id="image2_jpg" href="OPS/image2.jpg" media-type="image/jpeg"/>
    <item id="image20_jpg" href="OPS/image20.jpg" media-type="image/jpeg"/>
    <item id="image21_jpg" href="OPS/image21.jpg" media-type="image/jpeg"/>
    <item id="image22_jpg" href="OPS/image22.jpg" media-type="image/jpeg"/>
    <item id="image23_jpg" href="OPS/image23.jpg" media-type="image/jpeg"/>
    <item id="image24_jpg" href="OPS/image24.jpg" media-type="image/jpeg"/>
    <item id="image25_jpg" href="OPS/image25.jpg" media-type="image/jpeg"/>
    <item id="image26_jpg" href="OPS/image26.jpg" media-type="image/jpeg"/>
    <item id="image27_jpg" href="OPS/image27.jpg" media-type="image/jpeg"/>
    <item id="image28_jpg" href="OPS/image28.jpg" media-type="image/jpeg"/>
    <item id="image29_jpg" href="OPS/image29.jpg" media-type="image/jpeg"/>
    <item id="image3_jpg" href="OPS/image3.jpg" media-type="image/jpeg"/>
    <item id="image30_jpg" href="OPS/image30.jpg" media-type="image/jpeg"/>
    <item id="image31_jpg" href="OPS/image31.jpg" media-type="image/jpeg"/>
    <item id="image32_jpg" href="OPS/image32.jpg" media-type="image/jpeg"/>
    <item id="image33_jpg" href="OPS/image33.jpg" media-type="image/jpeg"/>
    <item id="image34_jpg" href="OPS/image34.jpg" media-type="image/jpeg"/>
    <item id="image35_jpg" href="OPS/image35.jpg" media-type="image/jpeg"/>
    <item id="image4_jpg" href="OPS/image4.jpg" media-type="image/jpeg"/>
    <item id="image5_jpg" href="OPS/image5.jpg" media-type="image/jpeg"/>
    <item id="image6_jpg" href="OPS/image6.jpg" media-type="image/jpeg"/>
    <item id="image7_jpg" href="OPS/image7.jpg" media-type="image/jpeg"/>
    <item id="image8_jpg" href="OPS/image8.jpg" media-type="image/jpeg"/>
    <item id="image9_jpg" href="OPS/image9.jpg" media-type="image/jpeg"/>
  </manifest>
  <spine>
    <itemref idref="titlepage"/>
    <itemref idref="TableOfContents_html"/>
    <itemref idref="section-0001_html"/>
    <itemref idref="section-0002_html"/>
    <itemref idref="section-0003_html"/>
    <itemref idref="section-0004_html"/>
    <itemref idref="section-0005_html"/>
    <itemref idref="section-0006_html"/>
    <itemref idref="section-0007_html"/>
    <itemref idref="section-0008_html"/>
    <itemref idref="section-0009_html"/>
    <itemref idref="section-0010_html"/>
    <itemref idref="section-0011_html"/>
    <itemref idref="section-0012_html"/>
    <itemref idref="section-0013_html"/>
    <itemref idref="section-0014_html"/>
    <itemref idref="section-0015_html"/>
    <itemref idref="section-0016_html"/>
    <itemref idref="section-0017_html"/>
    <itemref idref="section-0018_html"/>
    <itemref idref="section-0019_html"/>
    <itemref idref="section-0020_html"/>
    <itemref idref="section-0021_html"/>
    <itemref idref="section-0022_html"/>
    <itemref idref="section-0023_html"/>
  </spine>
</package>

gotson commented 3 years ago

Also, the book summary starts and ends with "

" in Komga.

Can you clarify this part? Does it start and end with a " character, or with a newline?

Can you get the raw json object from /api/v1/series/<seriesId> so it's clearer to me?

The description field in your epub contains html <div><p> which may be the problem.
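For anyone who wants to pull the same raw JSON, here is a minimal sketch. Only the `/api/v1/series/<seriesId>` path comes from the comment above; the base URL, the credentials, the use of HTTP Basic authentication, and the helper names `build_series_request` and `fetch_series_json` are assumptions for illustration.

```python
import base64
import json
import urllib.request


def build_series_request(base_url: str, series_id: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request for /api/v1/series/<seriesId>.

    Assumption: the server accepts HTTP Basic authentication.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    url = f"{base_url.rstrip('/')}/api/v1/series/{series_id}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )


def fetch_series_json(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Usage would look like `fetch_series_json(build_series_request("http://localhost:8080", "<seriesId>", "user", "password"))`, with all four arguments replaced by your own values.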

steve1977 commented 3 years ago

Here we go: https://pastebin.com/BZRaHEJv

steve1977 commented 3 years ago

Oh, sorry... The comment was incomplete. It starts with <div> <p> and it ends with </p></div>.

I pulled another ebook and this one starts with <div><div><font face="MS Shell Dlg 2, sans-serif"><span style="font-size: 14px;">.

All my comics (cbz) show well, but my calibre tagged epub3 files do not.

steve1977 commented 3 years ago

Oh... Not sure why, but my comments don't show up. Let me try without the "" below:

steve1977 commented 3 years ago

Doesn't work. See on pastebin: https://pastebin.com/CTiZGL2B

gotson commented 3 years ago

All my comics (cbz) show well, but my calibre tagged epub3 files do not.

Can you clarify what "show well" means for you? Are you referring to the newline character?

gotson commented 3 years ago

Can you take a screenshot and post it here?

gotson commented 3 years ago

Oh... Not sure why, but my comments don't show up. Let me try without the "" below:

You need to enclose html inside backticks "`" so they get rendered properly.

steve1977 commented 3 years ago

Comics (cbz): The collection is identified from the tag and I can sort comics by collection. When selecting an individual comic file, it shows the collection at the top and then the title. Underneath it shows the summary for the comic, and below that the writers, pencillers, etc. All nice and perfect!

Ebooks (epub): The series is not picked up from the tag; the folder name is used instead. Also, the summary of the book doesn't show up nicely, but has some <<>> things (see above). And the title image also doesn't show up. And for some ebooks, the author is shown as two pieces. Will upload screenshots.

steve1977 commented 3 years ago

The one we are troubleshooting so far.

image

Another one with a similar issue, including wrongly picked up writers (should be "Christian Kracht" instead of "Christian" and "Kracht" as two writers).

image

gotson commented 3 years ago

the title image also doesn't show up

What does that mean exactly?

gotson commented 3 years ago

wrongly picked up writers (should be "Christian Kracht" instead of "Christian" and "Kracht" as two writers.

Can you post the opf file for that book?

steve1977 commented 3 years ago

the title image also doesn't show up

What does that mean exactly?

The cover image of the ebook. You will see this from the two screenshots. The title page of the book doesn't show, but instead some other photo. Do you know what I mean?

gotson commented 3 years ago

the title image also doesn't show up

What does that mean exactly?

The cover image of the ebook. You will see this from the two screenshots. The title page of the book doesn't show, but instead some other photo. Do you know what I mean?

Is that an ebook or a comic? Komga doesn't handle ebooks, only comics in epub format, meaning only images are processed.

steve1977 commented 3 years ago

wrongly picked up writers (should be "Christian Kracht" instead of "Christian" and "Kracht" as two writers.

Can you post the opf file for that book?

Here we go with the embedded file attached (not the one in the same folder, but the one that is embedded in the epub file).

I had to zip it up for GH to allow me to upload it.

content.zip

steve1977 commented 3 years ago

the title image also doesn't show up

What does that mean exactly?

The cover image of the ebook. You will see this from the two screenshots. The title page of the book doesn't show, but instead some other photo. Do you know what I mean?

Is that an ebook or a comic? Komga doesn't handle ebooks, only comics in epub format, meaning only images are processed.

It's an ebook. I am not planning to use Komga as a reader, so I'd assume it should handle even ebooks fine for me (in epub3 format with tags?). The epub file includes a title page as cover.jpeg. Also attaching this below (extracted from the epub file).

cover

gotson commented 3 years ago

wrongly picked up writers (should be "Christian Kracht" instead of "Christian" and "Kracht" as two writers.

Can you post the opf file for that book?

Here we go with the embedded file attached (not the one in the same folder, but the one that is embedded in the epub file).

I had to zip it up for GH to allow me to upload it.

content.zip

You may have posted the wrong one; this seems to be the first one in your screenshots, where the author is Ulf Blanck and shows correctly as per your screenshot.

gotson commented 3 years ago

the title image also doesn't show up

What does that mean exactly?

The cover image of the ebook. You will see this from the two screenshots. The title page of the book doesn't show, but instead some other photo. Do you know what I mean?

Is that an ebook or a comic? Komga doesn't handle ebooks, only comics in epub format, meaning only images are processed.

It's an ebook. I am not planning to use Komga as a reader, so I'd assume it should handle even ebooks fine for me (in epub3 format with tags?). The epub file includes a title page as cover.jpeg. Also attaching this below (extracted from the epub file).

cover

Without the opf file I can't say which image will be picked up as the cover.

steve1977 commented 3 years ago

wrongly picked up writers (should be "Christian Kracht" instead of "Christian" and "Kracht" as two writers.

Can you post the opf file for that book?

Here we go with the embedded file attached (not the one in the same folder, but the one that is embedded in the epub file). I had to zip it up for GH to allow me to upload it. content.zip

You may have posted the wrong one; this seems to be the first one in your screenshots, where the author is Ulf Blanck and shows correctly as per your screenshot.

Ah, I thought you were asking for the first one. The first one has 3 issues:

1) Series not picked up
2) <<>> thing in the summary
3) Title photo not picked up (I posted the embedded jpeg as well)

I can also post the opf file of the second. The second is not part of a series, so this issue wouldn't happen. It also has the <<>> issue, the title photo missing, and a "split writer" (you had seen the screenshot).

Let me find the opf file and post it here.

steve1977 commented 3 years ago

Attached the opf file of the 2nd file. Issue may be because this one is epub2 rather than epub3? The first one is epub3 though...

I am attaching in zip the embedded opf and jpg.

1979 - Christian Kracht.zip

gotson commented 3 years ago

Attached the opf file of the 2nd file. Issue may be because this one is epub2 rather than epub3? The first one is epub3 though...

I am attaching in zip the embedded opf and jpg.

1979 - Christian Kracht.zip

For this book, could you provide the content of titlepage.xhtml ? Komga has to parse the HTML pages to find images inside. I checked an epub2 i have, and it seems it's using a <svg><image> instead of a plain HTML <img> tag, which throws the parser off. I'd like to confirm that's the same for your book.
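To illustrate the difference between the two markup styles, here is a minimal sketch that collects image references from both a plain `<img>` and an `<svg><image>`. It is not Komga's implementation: the function name `find_image_hrefs` and the sample markup are hypothetical, and it assumes the page is well-formed XHTML.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"
XLINK_NS = "http://www.w3.org/1999/xlink"


def find_image_hrefs(xhtml: str) -> list[str]:
    """Return image references in document order, from <img> and <svg><image>."""
    root = ET.fromstring(xhtml)
    hrefs = []
    for element in root.iter():
        if element.tag == f"{{{XHTML_NS}}}img":
            target = element.get("src")
        elif element.tag == f"{{{SVG_NS}}}image":
            # SVG 1.1 uses xlink:href; a plain href is accepted as a fallback.
            target = element.get(f"{{{XLINK_NS}}}href") or element.get("href")
        else:
            continue
        if target:
            hrefs.append(target)
    return hrefs


# Hypothetical title page mixing both styles (not the file attached in this thread).
SAMPLE_TITLEPAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><div>"
    '<svg xmlns="http://www.w3.org/2000/svg" '
    'xmlns:xlink="http://www.w3.org/1999/xlink" viewBox="0 0 600 800">'
    '<image width="600" height="800" xlink:href="cover.jpeg"/>'
    "</svg>"
    '<img src="OPS/image0.jpg" alt=""/>'
    "</div></body></html>"
)

print(find_image_hrefs(SAMPLE_TITLEPAGE))
```

A parser that only looks for `<img>` would miss `cover.jpeg` here, which matches the symptom described above.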

steve1977 commented 3 years ago

Thanks. The title page exists with both books (pls see the two screenshots). And as requested, please see below the file for the second book (epub2).

titlepage.zip

gotson commented 3 years ago

Thanks. The title page exists with both books (pls see the two screenshots). And as requested, please see below the file for the second book (epub2).

titlepage.zip

Thanks, it looks the same as the one I have, which confirms my hypothesis.

steve1977 commented 3 years ago

Great, thanks! Also found what's going on with the series tag?

gotson commented 3 years ago

Great, thanks! Also found what's going on with the series tag?

Yes, I mentioned it above: the XML namespacing in the files generated by Calibre is different from the ones I had before. It's valid XML, but the parser is not configured properly for that.
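As an illustration of the namespacing point, here is a minimal sketch of a namespace-aware lookup that treats `<meta>` and `<opf:meta>` the same way, since both resolve to the same namespace URI. It is not Komga's code: the function name `read_series` is hypothetical, the sample is trimmed from the OPF at the top of this issue, and treating a missing `collection-type` as a series is an assumption.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"


def read_series(opf_xml: str):
    """Return (series name, group position) or None if no series is declared."""
    root = ET.fromstring(opf_xml)
    # Matching on the namespace URI makes the prefix (opf: or none) irrelevant.
    metas = list(root.iter(f"{{{OPF_NS}}}meta"))
    for meta in metas:
        if meta.get("property") != "belongs-to-collection":
            continue
        meta_id = meta.get("id")
        refinements = {
            other.get("property"): (other.text or "").strip()
            for other in metas
            if meta_id and other.get("refines") == f"#{meta_id}"
        }
        # Assumption: a collection without an explicit type counts as a series.
        if refinements.get("collection-type", "series") != "series":
            continue
        return (meta.text or "").strip(), refinements.get("group-position")
    return None


PREFIXED = (
    '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
    '<metadata xmlns:opf="http://www.idpf.org/2007/opf">'
    '<opf:meta property="belongs-to-collection" id="id-2">Die drei ??? Kids</opf:meta>'
    '<opf:meta refines="#id-2" property="collection-type">series</opf:meta>'
    '<opf:meta refines="#id-2" property="group-position">1</opf:meta>'
    "</metadata></package>"
)
UNPREFIXED = PREFIXED.replace("opf:meta", "meta")

print(read_series(PREFIXED))
print(read_series(UNPREFIXED))
```

Both calls return the same result, whereas a lookup keyed on the literal tag name `meta` would only find the unprefixed form.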

The authors were also split by comma, but that doesn't make sense since an epub can have one tag per author, so splitting is unnecessary and can have this side effect.
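A minimal sketch of the "one tag per author" approach, assuming each `dc:creator` element holds exactly one name. The function name `read_creators` and the sample document are hypothetical, and this is not Komga's code.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def read_creators(opf_xml: str) -> list[str]:
    """Return one author per dc:creator element, without splitting on commas."""
    root = ET.fromstring(opf_xml)
    names = []
    for creator in root.iter(f"{{{DC_NS}}}creator"):
        # Only normalise whitespace; the text is never split further.
        name = " ".join((creator.text or "").split())
        if name:
            names.append(name)
    return names


# Hypothetical sample: the second name contains a comma and must stay whole.
SAMPLE_OPF = (
    '<package xmlns="http://www.idpf.org/2007/opf" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<metadata>"
    "<dc:creator>Ulf Blanck</dc:creator>"
    "<dc:creator>Kracht, Christian</dc:creator>"
    "</metadata></package>"
)

print(read_creators(SAMPLE_OPF))
```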

I need to add unit tests, as that part of the code doesn't have one, and do all the fixes.

For the html tags in the description I have to try and test. I need to strip them away.
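One possible way to strip the tags, sketched with Python's standard `html.parser`. It is only an illustration of the idea, not the fix that went into Komga; the helper name `strip_html` is hypothetical, and collapsing all whitespace into single spaces (which also flattens paragraph breaks) is a simplifying assumption.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(description: str) -> str:
    """Return the description with HTML tags removed and whitespace collapsed."""
    collector = _TextCollector()
    collector.feed(description)
    collector.close()
    return " ".join("".join(collector.parts).split())


print(strip_html('<div>\n<p>Bereits im ersten Band "Panik im Paradies" ...</p></div>'))
```

Note that the XML parser already turns `&lt;div&gt;` in the OPF into `<div>`, so the function receives real tags.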

gotson commented 3 years ago

I've pushed a lot of changes regarding epub handling:

  • Metadata: the description will be cleaned of all HTML tags, keeping only the text.
  • Metadata: the opf: prefix added by Calibre could confuse the parser; this affected several elements and attributes.
  • Metadata: authors are not split by , anymore
  • Metadata: ISBN could be of the form isbn:xxxxxxxxxx which would not be parsed. This is fixed.
  • Metadata: dates in ISO format with timestamp (added by Calibre) would not be parsed. This is fixed.
  • Analysis: if images were enclosed in a <svg><image> tag, they would not be detected. This is fixed.
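The ISBN and date items in the list above can be sketched as follows, using values from the OPF at the top of this issue. The helper names `parse_isbn` and `parse_release_date` are hypothetical, the ISBN check only looks at length and characters (no checksum), and none of this is Komga's actual code.

```python
from datetime import date, datetime
from typing import Optional


def parse_isbn(identifier: str) -> Optional[str]:
    """Accept 'isbn:9783440077894' as well as a bare ISBN; return None otherwise."""
    value = identifier.strip()
    if value.lower().startswith("isbn:"):
        value = value[len("isbn:"):]
    digits = value.replace("-", "").replace(" ", "")
    # Simplified shape check: 10 or 13 characters, last one may be X for ISBN-10.
    if len(digits) in (10, 13) and digits[:-1].isdigit() and digits[-1] in "0123456789xX":
        return digits.upper()
    return None


def parse_release_date(text: str) -> Optional[date]:
    """Accept a plain ISO date or an ISO timestamp such as Calibre writes."""
    cleaned = text.strip().replace("Z", "+00:00")
    try:
        # The date is taken as written, in the timestamp's own offset.
        return datetime.fromisoformat(cleaned).date()
    except ValueError:
        return None


print(parse_isbn("isbn:9783440077894"))
print(parse_release_date("1999-07-31T16:00:00+00:00"))
```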

Once the release is out and after you update, you will need to:

  • analyze the books again, so the Cover is properly retrieved and generated
  • analyzing will automatically trigger a refresh of the metadata

github-actions[bot] commented 3 years ago

:tada: This issue has been resolved in version 0.100.2 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

steve1977 commented 3 years ago

Thanks for the development and fix. This is great!

Both ebooks seem to work now. However, other epub3 files with a series tag (belongs-to-collection) still do not work and still show under the folder name. I analyzed and refreshed (and even added them anew), but all without success.

Shall I open a new ticket and provide xml data?

gotson commented 3 years ago

Shall I open a new ticket and provide xml data?

Yes please. You can post here, I reopened the issue.

steve1977 commented 3 years ago

Here is the embedded metadata file as attachment

content.zip

steve1977 commented 3 years ago

And a separate issue: "number of pages" does not display correctly for the second book that I had originally posted (and also not for the one just posted now). I don't mind too much if that's not possible, but thought worth reporting.

gotson commented 3 years ago

And a separate issue: "number of pages" does not display correctly for the second book that I had originally posted (and also not for the one just posted now). I don't mind too much if that's not possible, but thought worth reporting.

Number of pages will only show number of images, so will most likely always be wrong for ebooks. There's no plan to fix it for now. If ebook support (#221) ever lands, then it would be handled.

gotson commented 3 years ago

Here is the embedded metadata file as attachment

content.zip

That's the first book you posted, it's what I used for the unit tests. The series name is properly retrieved : https://github.com/gotson/komga/blob/7910273dfcbb7e2b61852d99f4efe17f7e8d6f21/komga/src/test/kotlin/org/gotson/komga/infrastructure/metadata/epub/EpubMetadataProviderTest.kt#L110

steve1977 commented 3 years ago

Thanks. Don't worry about the number of pages. Not important for me.

It shows correctly for the first book I posted. The third (posted today) was tagged the same way and also has the same series tag, but doesn't seem to be picked up. It's listed under "Die Drei Fragezeichen-Kids, Bd.3, Invasion Der Fliegen (247)" instead of "Die drei ??? Kids".

gotson commented 3 years ago

How are those books organized on disk?

gotson commented 3 years ago

The third (posted today) was tagged the same way and also has the same series tag, but doesn't seem to be picked up.

Sorry about that, I was looking at the wrong file (on my phone). I am running that last book you posted through tests now.

gotson commented 3 years ago

It's fixed. I also found a couple more bugs thanks to your file.

steve1977 commented 3 years ago

Amazing, thanks a lot! I will keep checking my library to see whether anything else comes up. And then the next step is to figure out how to have Calibre change the title of the book.

github-actions[bot] commented 3 years ago

:tada: This issue has been resolved in version 0.100.3 :tada:

The release is available on:

Your semantic-release bot :package::rocket:

gotson commented 3 years ago

Amazing, thanks a lot! I will keep checking my library to see whether anything else comes up. And then the next step is to figure out how to have Calibre change the title of the book.

This is handled via dc:title. For example here, it will be parsed as Panik im Paradies (test here).

steve1977 commented 3 years ago

Thanks. Yes. But I'd like to show the title as "series [series number]". There is some way to do this with Calibre, which I need to figure out. But this should be handled with Calibre rather than Komga.

I will do some testing with the new fix later tonight.

steve1977 commented 3 years ago

I've given it a go with a refresh / analyze. Unfortunately, it is still not working as I would have hoped. It seems that the "series" is now shown ("Die Drei ??? Kids" in this example). The overview list ("browse") shows the individual books (all called "Die Drei ??? Kids"), but doesn't group them as one "series". Anything I can provide for troubleshooting?

gotson commented 3 years ago

I've given it a go with a refresh / analyze. Unfortunately, it is still not working as I would have hoped. It seems that the "series" is now shown ("Die Drei ??? Kids" in this example). The overview list ("browse") shows the individual books (all called "Die Drei ??? Kids"), but doesn't group them as one "series". Anything I can provide for troubleshooting?

That's a different thing entirely. Because you are using Calibre, the books are in their own folder. But Komga expects books of the same series in the same folder. That's a basic tenet of Komga.

If you want books of the same series to show up as one series, you need to move the files into the same folder. Calibre unfortunately is not flexible at all with the folder structure, so using a Calibre library folder in Komga will not render nicely, and will duplicate series.
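A minimal sketch of how such a reorganisation could be planned, assuming you already know each file's series name (for example from its `belongs-to-collection` value). The function names, the `_no_series` folder, and the example paths are all hypothetical; the sketch only computes target paths and does not move anything. Replacing characters such as `?` is an assumption made for filesystem portability.

```python
from pathlib import Path
from typing import Dict, Optional


def safe_folder_name(name: str) -> str:
    """Replace characters that some filesystems reject in folder names."""
    cleaned = "".join("_" if ch in '<>:"/\\|?*' else ch for ch in name).strip(" .")
    return cleaned or "_unnamed"


def plan_series_folders(
    series_by_file: Dict[Path, Optional[str]],
    dest_root: Path,
    no_series_folder: str = "_no_series",
) -> Dict[Path, Path]:
    """Map each source file to <dest_root>/<series folder>/<file name>."""
    plan = {}
    for source, series in series_by_file.items():
        folder = safe_folder_name(series) if series else no_series_folder
        plan[source] = dest_root / folder / source.name
    return plan


# Hypothetical Calibre-style layout with one folder per book.
books = {
    Path("/calibre/Ulf Blanck/Panik im Paradies (255)/Panik im Paradies - Ulf Blanck.epub"): "Die drei ??? Kids",
    Path("/calibre/Christian Kracht/1979 (12)/1979 - Christian Kracht.epub"): None,
}
for source, target in plan_series_folders(books, Path("/komga")).items():
    print(source.name, "->", target)
```

Since books in one folder are treated as one series, everything placed in the catch-all folder would be grouped together, so books without a series tag may be better off in folders of their own.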

steve1977 commented 3 years ago

Got it. So, one work-around could be to copy them all into one folder. They could no longer be read with Calibre, but I could then edit the tags via Komga (after doing the initial tag with Calibre). This could work, right?

What if I place two ebooks from different series, or without a series tag, into the same folder? Will this split them?

gotson commented 3 years ago

Got it. So, one work-around could be to copy them all into one folder. They could no longer be read with Calibre, but I could then edit the tags via Komga (after doing the initial tag with Calibre). This could work, right?

yes, that would work.

What if I place two ebooks from different series, or without a series tag, into the same folder? Will this split them?

No, the metadata import rules will apply, especially: "if multiple books have the property belongs-to-collection set, the most frequent value will be used"
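The quoted rule can be sketched as a simple frequency count. The function name `pick_series_title` is hypothetical, the fallback to the folder name mirrors the behaviour reported earlier in this thread, and how ties are broken is an assumption (here the value seen first wins); this is not Komga's implementation.

```python
from collections import Counter
from typing import Iterable, Optional


def pick_series_title(folder_name: str, collection_values: Iterable[Optional[str]]) -> str:
    """Pick the most frequent belongs-to-collection value among a folder's books."""
    values = [value.strip() for value in collection_values if value and value.strip()]
    if not values:
        # No book declares a collection, so the folder name is kept.
        return folder_name
    # most_common keeps first-seen order for equal counts (assumed tie-break).
    return Counter(values).most_common(1)[0][0]


print(pick_series_title("Mixed folder", ["Die drei ??? Kids", "Other series", "Die drei ??? Kids", None]))
```

So two books from different series in one folder stay in a single series, named after whichever collection value occurs most often.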