go-shiori / go-epub

Go library for creating EPUB files
MIT License
44 stars 7 forks source link

fix: image with same URL added twice and more reliable image ID #3

Closed Monirzadeh closed 1 year ago

Monirzadeh commented 1 year ago

EmbededImage add same URL how many times it exist in HTML file that make unnecessary big file and some other problem. this PR download and add image to epub just once and reference all of same url image to the same file inside epub.

  1. not download image twice
  2. better id for images in opf file
  3. fix content of image tag inside HTML to show correctly in eBooks reader.
Monirzadeh commented 1 year ago

@fmartingr do you have idea how should i update unit test for this situation or i should just remove this for new logic?

Monirzadeh commented 1 year ago

id should pass some rules. if one of this not pass we can't show images correctly on all device.

in old method it use filename as id. before add file to opf pass filename to fixXMLid to check all rules and fix them if it is not suitable. for example add id to the first, if file name not start with English alphabet. but it is not done that prefect. for example if filename is something like this filename-with-حروف-غیر-انگلیسی.jpg old method can't handle that and not remove that character is not acceptable. if filename is just contain not acceptable character it can't handle that just removing them is not a good solution and detect that in an string can be challenging (for each language in the world that someone named a file). we need a more reliable way to handle that is to always create acceptable id for files.

Monirzadeh commented 1 year ago

why we change attribute? to show up correctly image in ebook file img should work in specific way ,if not we cant show images on all device device.

this method check each img tag individually and fix above problem.

Monirzadeh commented 1 year ago

@fmartingr my problem with update unit test is each time we generate new id so how should i test that in unit test. how can i get that from run to check at the end of test and if there same or not in final file?

fmartingr commented 1 year ago

@fmartingr my problem with update unit test is each time we generate new id so how should i test that in unit test. how can i get that from run to check at the end of test and if there same or not in final file?

Maybe you can use a v5 UUID? Instead of creating a random UUID, you can create one derivated from the filename:

// You need a namespace for V5 uuids, we can generate one based on the project URL
// Some namespaces are predefined in the UUIDs RFC, like the one used for URLs
namespace := uuid.NewV5(uuid.NameSpaceURL, "github.com/go-shiori/go-epub")
fileIdentifier := uuid.NewV5(namespace, filename)

More info: RFC 4122 - A Universally Unique IDentifier (UUID) URN Namespace - Some Name Space IDs

Monirzadeh commented 1 year ago

@fmartingr https://github.com/go-shiori/go-epub/pull/3/commits/383822551c1bea520aed4c84f986d239d84beded is it what you want?

do you have any idea how should i fix unit test?

fmartingr commented 1 year ago

@fmartingr 3838225 is it what you want?

do you have any idea how should i fix unit test?

I understood that the problem with the test was that you were creating random UUIDs for images. If you change that to use deterministic UUIDs (uuidv5) you can generate those at the end of your tests and the result will be the same UUID.

I don't see any test in this PR, is there any code I can check?

Monirzadeh commented 1 year ago

TestManifestItems not pass

--- FAIL: TestManifestItems (0.01s)
    epub_test.go:824: Package file manifest items don't match
        Got: 
        id="id5857b58e-9d4f-5b23-ad10-d2644bb66023" href="images/filename with space.png" media-type="image/png"></item>
        id="id72c19376-e8fb-584c-9dbe-abc447874747" href="images/testfromfile.png" media-type="image/png"></item>
        id="id971cabb4-8f27-530a-b2fc-24044798f8a7" href="images/image0005.png" media-type="image/png"></item>
        id="ide5dc5e63-8586-5ff2-94d9-037b0543ac24" href="images/gophercolor16x16.png" media-type="image/png"></item>
        id="idfbc147f0-b6fd-5f0f-9d96-69f1c80e2ec6" href="images/01filenametest.png" media-type="image/png"></item>
        id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"></item>

        Expected: 
        id="filenamewithspace.png" href="images/filename with space.png" media-type="image/png"></item>
        id="gophercolor16x16.png" href="images/gophercolor16x16.png" media-type="image/png"></item>
        id="id01filenametest.png" href="images/01filenametest.png" media-type="image/png"></item>
        id="image0005.png" href="images/image0005.png" media-type="image/png"></item>
        id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"></item>
        id="testfromfile.png" href="images/testfromfile.png" media-type="image/png"></item>

CI not run test on this one i don't know why matbe related to #4

fmartingr commented 1 year ago

you need to modify the test if you modify the ID generation.

Merge latest changes from master into this branch and let's check if theres a problem with #4

Monirzadeh commented 1 year ago

now you can see that test in CI but yet i don't have idea how should i fix that.

fmartingr commented 1 year ago

now you can see that test in CI but yet i don't have idea how should i fix that.

Since you are modifying the way IDs are created, you need to edit the test here: https://github.com/go-shiori/go-epub/blob/main/epub_test.go#L845-L851

That raises the question. Should we store the files with the same filename as the ID? images/{uuid}.extension ?

Monirzadeh commented 1 year ago

Since you are modifying the way IDs are created, you need to edit the test here: https://github.com/go-shiori/go-epub/blob/main/epub_test.go#L845-L851

i explain what is the problem to update this section here https://github.com/go-shiori/go-epub/pull/3#issuecomment-1703789380 how should i get UUID when it is generate random UUID in each test run? this is not a fix thing that i can compare at the end.

That raises the question. Should we store the files with the same filename as the ID? images/{uuid}.extension ?

go-epub automatically change name if it exist same filename before in images/ so we never have same filename in images/ i think it is unnecessary do you have any reason to do that?

Monirzadeh commented 1 year ago

i change that unit test:

  1. get any idUUID format in manifest
  2. count them and check with what we expected
  3. be sure nav is exist in manifest

not prefect but it is better than noting.

fmartingr commented 1 year ago

i change that unit test:

  1. get any idUUID format in manifest
  2. count them and check with what we expected
  3. be sure nav is exist in manifest

not prefect but it is better than noting.

I'm thinking that since the files are hardcoded, you can hardcode the IDs as well, since they are deterministic.

It would also be interesting to check if a duplicate <item id="x"> is allowed by the EPUB spec.

Monirzadeh commented 1 year ago

thanks i review them to impediment them

Monirzadeh commented 1 year ago

@fmartingr i back to old method just replace with new id. about duplicate id. when we check that with fix variable if any different show up duplicate similar etc this test faild (current method). do we need check duplication separately?

fmartingr commented 1 year ago

@fmartingr i back to old method just replace with new id. about duplicate id. when we check that with fix variable if any different show up duplicate similar etc this test faild (current method). do we need check duplication separately?

I was thinking on a separate test that adds the same image twice, to see how the generation behaves and confirm that everything works as expected. Do you think this would be useful?

Monirzadeh commented 1 year ago

@fmartingr i back to old method just replace with new id. about duplicate id. when we check that with fix variable if any different show up duplicate similar etc this test faild (current method). do we need check duplication separately?

I was thinking on a separate test that adds the same image twice, to see how the generation behaves and confirm that everything works as expected. Do you think this would be useful?

we have similar to that here

    _, err = e.AddImage(testImageFromFileSource, testImageFromFileFilename)
    if err != nil {
        t.Error(err)
    }
    _, err = e.AddImage(testImageFromFileSource, "")
    if err != nil {
        t.Error(err)
    }

in this part of unit test we add same image twice once with name and other without name (go epub automatically create name image0005.png) that we check exist of that in manifest.

i will add another one same file without naming that. that check generate file name twice

do you want it be a separate test or done that in this test with add images twice.

Monirzadeh commented 1 year ago

@fmartingr i add same file twice in https://github.com/go-shiori/go-epub/pull/3/commits/e57fc62a9a41f9330eb5eee9a7375e97c7b219f5 if you think it should be a separate test tell me to write a sperate test for that.

i think add one file twice is good to check go-epub can handle that (filename and identifire) :+1:

sidenote: go-epub not allow to save same file name inside epub it generate a new name if user not force filename when add files to epub or ignore that as i explain (force add file with same filename to epub) #6.

fmartingr commented 1 year ago

Oh ok I didn't notice that, if we are already handling that scenario I think we're good. Awesome work :D

Monirzadeh commented 1 year ago

yes we cover that. if it OK for you please approve this and other PR.

Monirzadeh commented 1 year ago

Hi. it is enough or do you want to done this functionality with V5 too?

    tempDir := uuid.Must(uuid.NewV4()).String()