go-pdf / fpdf

A PDF document generator with high level support for text, drawing and images
MIT License

Optimizing parsepngstream #73

Open ajstarks opened 2 months ago

ajstarks commented 2 months ago

Based on profiling my application, image processing is the most CPU-intensive part, with parsepngstream using a good bit of the time. Is there an opportunity to optimize this function, perhaps by using image/png from the standard library?

Here is my profile:

File: ppdfdeck
Type: cpu
Time: May 23, 2024 at 3:11pm (EDT)
Duration: 4.11s, Total samples = 3.91s (95.24%)
Showing nodes accounting for 3.83s, 97.95% of 3.91s total
Dropped 36 nodes (cum <= 0.02s)
      flat  flat%   sum%        cum   cum%
     0.64s 16.37% 16.37%      0.64s 16.37%  github.com/go-pdf/fpdf.(*wbuffer).u8 (inline)
     0.36s  9.21% 25.58%      0.36s  9.21%  compress/flate.(*deflateFast).matchLen
     0.36s  9.21% 34.78%      0.36s  9.21%  hash/adler32.update
     0.27s  6.91% 41.69%      0.27s  6.91%  runtime.memmove
     0.26s  6.65% 48.34%      0.77s 19.69%  compress/flate.(*deflateFast).encode
     0.24s  6.14% 54.48%      0.29s  7.42%  compress/flate.(*decompressor).huffSym
     0.23s  5.88% 60.36%      0.76s 19.44%  compress/flate.(*decompressor).huffmanBlock
     0.21s  5.37% 65.73%      0.21s  5.37%  runtime.memclrNoHeapPointers
     0.19s  4.86% 70.59%      0.37s  9.46%  compress/flate.(*huffmanBitWriter).writeTokens
     0.18s  4.60% 75.19%      3.67s 93.86%  github.com/go-pdf/fpdf.(*Fpdf).parsepngstream
     0.10s  2.56% 77.75%      0.11s  2.81%  compress/flate.(*huffmanBitWriter).writeCode
     0.08s  2.05% 79.80%      0.33s  8.44%  compress/flate.(*huffmanBitWriter).indexTokens
     0.07s  1.79% 81.59%      0.09s  2.30%  compress/flate.(*huffmanEncoder).bitCounts
     0.06s  1.53% 83.12%      0.14s  3.58%  compress/flate.(*dictDecoder).tryWriteCopy (inline)
     0.06s  1.53% 84.65%      0.06s  1.53%  compress/flate.load32 (inline)
     0.05s  1.28% 85.93%      0.05s  1.28%  bytes.(*Reader).ReadByte
     0.04s  1.02% 86.96%      0.04s  1.02%  compress/flate.(*decompressor).moreBits
     0.04s  1.02% 87.98%      0.04s  1.02%  compress/flate.(*dictDecoder).writeByte (inline)
     0.04s  1.02% 89.00%      0.04s  1.02%  compress/flate.(*huffmanBitWriter).writeBits
     0.04s  1.02% 90.03%      0.04s  1.02%  compress/flate.load64 (inline)
     0.04s  1.02% 91.05%      0.04s  1.02%  compress/flate.token.length (inline)
     0.04s  1.02% 92.07%      0.04s  1.02%  runtime/internal/syscall.Syscall6
     0.04s  1.02% 93.09%      0.05s  1.28%  sort.insertionSort
     0.03s  0.77% 93.86%      0.03s  0.77%  crypto/sha1.blockAVX2
     0.03s  0.77% 94.63%      0.04s  1.02%  sort.partition
     0.02s  0.51% 95.14%      0.02s  0.51%  compress/flate.(*dictDecoder).histSize (inline)
     0.02s  0.51% 95.65%      0.02s  0.51%  compress/flate.(*huffmanBitWriter).writeDynamicHeader
     0.02s  0.51% 96.16%      0.02s  0.51%  compress/flate.byLiteral.Less
     0.02s  0.51% 96.68%      0.02s  0.51%  compress/flate.lengthCode (inline)
     0.02s  0.51% 97.19%      0.02s  0.51%  compress/flate.token.offset (inline)
     0.02s  0.51% 97.70%      0.02s  0.51%  runtime.procyield
     0.01s  0.26% 97.95%      0.20s  5.12%  compress/flate.(*huffmanEncoder).generate
         0     0% 97.95%      1.10s 28.13%  bytes.(*Buffer).ReadFrom
         0     0% 97.95%      0.06s  1.53%  bytes.(*Buffer).WriteString
         0     0% 97.95%      0.16s  4.09%  bytes.(*Buffer).grow
         0     0% 97.95%      0.16s  4.09%  bytes.growSlice
         0     0% 97.95%      1.53s 39.13%  compress/flate.(*Writer).Write (inline)
         0     0% 97.95%      0.07s  1.79%  compress/flate.(*byFreq).sort (inline)
         0     0% 97.95%      0.03s  0.77%  compress/flate.(*byLiteral).sort (inline)
         0     0% 97.95%      1.51s 38.62%  compress/flate.(*compressor).encSpeed
         0     0% 97.95%      0.02s  0.51%  compress/flate.(*compressor).fillStore
         0     0% 97.95%      0.02s  0.51%  compress/flate.(*compressor).init
         0     0% 97.95%      1.53s 39.13%  compress/flate.(*compressor).write
         0     0% 97.95%      0.83s 21.23%  compress/flate.(*decompressor).Read
         0     0% 97.95%      0.14s  3.58%  compress/flate.(*decompressor).nextBlock
         0     0% 97.95%      0.74s 18.93%  compress/flate.(*huffmanBitWriter).writeBlockDynamic
         0     0% 97.95%      0.03s  0.77%  compress/flate.(*huffmanEncoder).assignEncodingAndSize
         0     0% 97.95%      0.04s  1.02%  compress/flate.NewWriter (inline)
         0     0% 97.95%      0.04s  1.02%  compress/flate.NewWriterDict
         0     0% 97.95%      1.78s 45.52%  compress/zlib.(*Writer).Write
         0     0% 97.95%      0.04s  1.02%  compress/zlib.(*Writer).writeHeader
         0     0% 97.95%      0.98s 25.06%  compress/zlib.(*reader).Read
         0     0% 97.95%      0.03s  0.77%  crypto/sha1.(*digest).Write
         0     0% 97.95%      0.03s  0.77%  crypto/sha1.block
         0     0% 97.95%      0.09s  2.30%  github.com/go-pdf/fpdf.(*Fpdf).Close
         0     0% 97.95%      3.77s 96.42%  github.com/go-pdf/fpdf.(*Fpdf).ImageOptions
         0     0% 97.95%      0.10s  2.56%  github.com/go-pdf/fpdf.(*Fpdf).Output
         0     0% 97.95%      3.77s 96.42%  github.com/go-pdf/fpdf.(*Fpdf).RegisterImageOptions
         0     0% 97.95%      3.77s 96.42%  github.com/go-pdf/fpdf.(*Fpdf).RegisterImageOptionsReader
         0     0% 97.95%      0.09s  2.30%  github.com/go-pdf/fpdf.(*Fpdf).enddoc
         0     0% 97.95%      0.06s  1.53%  github.com/go-pdf/fpdf.(*Fpdf).out
         0     0% 97.95%      3.73s 95.40%  github.com/go-pdf/fpdf.(*Fpdf).parsepng
         0     0% 97.95%      0.08s  2.05%  github.com/go-pdf/fpdf.(*Fpdf).putimage
         0     0% 97.95%      0.08s  2.05%  github.com/go-pdf/fpdf.(*Fpdf).putimages
         0     0% 97.95%      0.08s  2.05%  github.com/go-pdf/fpdf.(*Fpdf).putresources
         0     0% 97.95%      0.07s  1.79%  github.com/go-pdf/fpdf.(*Fpdf).putstream
         0     0% 97.95%      0.03s  0.77%  github.com/go-pdf/fpdf.(*idEncoder).bytes (inline)
         0     0% 97.95%      1.79s 45.78%  github.com/go-pdf/fpdf.(*xmempool).compress
         0     0% 97.95%      1.03s 26.34%  github.com/go-pdf/fpdf.(*xmempool).uncompress
         0     0% 97.95%      0.03s  0.77%  github.com/go-pdf/fpdf.generateImageID
         0     0% 97.95%      0.06s  1.53%  github.com/go-pdf/fpdf.newRBuffer
         0     0% 97.95%      0.36s  9.21%  hash/adler32.(*digest).Write
         0     0% 97.95%      0.03s  0.77%  internal/poll.(*FD).Read
         0     0% 97.95%      0.04s  1.02%  internal/poll.ignoringEINTRIO (inline)
         0     0% 97.95%      3.87s 98.98%  main.dodeck
         0     0% 97.95%      3.77s 96.42%  main.doslides
         0     0% 97.95%      3.87s 98.98%  main.main
         0     0% 97.95%      3.77s 96.42%  main.pdfslide
         0     0% 97.95%      0.03s  0.77%  os.(*File).Read
         0     0% 97.95%      0.03s  0.77%  os.(*File).read (inline)
         0     0% 97.95%      0.03s  0.77%  runtime.gcBgMarkWorker
         0     0% 97.95%      0.03s  0.77%  runtime.gcBgMarkWorker.func2
         0     0% 97.95%      0.03s  0.77%  runtime.gcDrain
         0     0% 97.95%      0.03s  0.77%  runtime.gcDrainMarkWorkerDedicated (inline)
         0     0% 97.95%      0.02s  0.51%  runtime.growslice
         0     0% 97.95%      3.87s 98.98%  runtime.main
         0     0% 97.95%      0.04s  1.02%  runtime.makeslice
         0     0% 97.95%      0.06s  1.53%  runtime.mallocgc
         0     0% 97.95%      0.03s  0.77%  runtime.markroot
         0     0% 97.95%      0.03s  0.77%  runtime.markroot.func1
         0     0% 97.95%      0.03s  0.77%  runtime.memclrNoHeapPointersChunked
         0     0% 97.95%      0.02s  0.51%  runtime.newobject
         0     0% 97.95%      0.03s  0.77%  runtime.suspendG
         0     0% 97.95%      0.04s  1.02%  runtime.systemstack
         0     0% 97.95%      0.10s  2.56%  sort.Sort
         0     0% 97.95%      0.10s  2.56%  sort.pdqsort
         0     0% 97.95%      0.04s  1.02%  syscall.RawSyscall6
         0     0% 97.95%      0.03s  0.77%  syscall.Read (inline)
         0     0% 97.95%      0.04s  1.02%  syscall.Syscall
         0     0% 97.95%      0.03s  0.77%  syscall.read
sbinet commented 2 months ago

hi Anthony,

do you have a simple reproducer? (so I don't optimize something irrelevant for your use case)


note to self: parsepngstream looks into the pHYs ancillary PNG chunk, which image/png.Decode ignores. here's some biblio about it:

ajstarks commented 2 months ago

Here is a reproducer. It makes 20 pages using copies of two files, "hello-world.png" and "hello-mars.png" (attached), so that each image processed is a separate file. I prepared the copies like this:

for i in `seq -f %02g 0 19`
do 
    cp hello-world.png hello-world-$i.png
    cp hello-mars.png hello-mars-$i.png
done

FYI, just processing the same two files repeatedly is a lot faster (200ms vs. 1.1s), but that does not match my use case.

package main

import (
    "fmt"
    "os"
    "runtime/pprof"

    "github.com/go-pdf/fpdf"
)

func main() {

    // PDF page is 1920x1080
    pw := 1920.0
    ph := 1080.0

    // set up the page
    pageconfig := fpdf.InitType{
        UnitStr:    "pt",
        SizeStr:    "Widescreen",
        Size:       fpdf.SizeType{Wd: pw, Ht: ph},
        FontDirStr: "",
    }

    pdf := fpdf.NewCustom(&pageconfig)
    imgopt := fpdf.ImageOptions{AllowNegativePosition: true}

    // use copies of files to exercise processing individual files
    f1 := []string{}
    f2 := []string{}
    for i := 0; i < 20; i++ {
        f1 = append(f1, fmt.Sprintf("hello-world-%02d.png", i))
        f2 = append(f2, fmt.Sprintf("hello-mars-%02d.png", i))
    }

    // Profiling...
    pf, err := os.Create("default.pgo")
    if err != nil {
        fmt.Fprintf(os.Stderr, "could not create CPU profile: %v\n", err)
        os.Exit(1)
    }
    defer pf.Close()

    if err := pprof.StartCPUProfile(pf); err != nil {
        fmt.Fprintf(os.Stderr, "could not start CPU profile: %v\n", err)
        os.Exit(1)
    }
    defer pprof.StopCPUProfile()

    // make pages: four images per page, scaled 50%
    // substitute a single pair of files for f1[i] and f2[i] to
    // test processing the same files in the loop
    scale := 50.0
    iw := pw * (scale / 100)
    ih := ph * (scale / 100)
    for i := 0; i < len(f1); i++ {
        pdf.AddPage()
        pdf.ImageOptions(f1[i], 0, 0, iw, ih, false, imgopt, 0, "")
        pdf.ImageOptions(f2[i], 960, 0, iw, ih, false, imgopt, 0, "")
        pdf.ImageOptions(f2[i], 0, 540, iw, ih, false, imgopt, 0, "")
        pdf.ImageOptions(f1[i], 960, 540, iw, ih, false, imgopt, 0, "")
    }
    // write the PDF
    if err := pdf.OutputFileAndClose("pdfimg.pdf"); err != nil {
        fmt.Fprintf(os.Stderr, "%v\n", err)
        os.Exit(1)
    }
}

here is its profile:

File: pdfimg
Type: cpu
Time: May 24, 2024 at 6:08pm (EDT)
Duration: 1.10s, Total samples = 990ms (89.86%)
Showing nodes accounting for 990ms, 100% of 990ms total
      flat  flat%   sum%        cum   cum%
     240ms 24.24% 24.24%      240ms 24.24%  github.com/go-pdf/fpdf.(*wbuffer).u8 (inline)
     150ms 15.15% 39.39%      150ms 15.15%  hash/adler32.update
     120ms 12.12% 51.52%      120ms 12.12%  compress/flate.(*deflateFast).matchLen
     100ms 10.10% 61.62%      100ms 10.10%  runtime.memmove
      80ms  8.08% 69.70%      960ms 96.97%  github.com/go-pdf/fpdf.(*Fpdf).parsepngstream
      80ms  8.08% 77.78%       80ms  8.08%  runtime.memclrNoHeapPointers
      50ms  5.05% 82.83%       50ms  5.05%  compress/flate.(*huffmanEncoder).bitCounts (inline)
      40ms  4.04% 86.87%      170ms 17.17%  compress/flate.(*deflateFast).encode (inline)
      20ms  2.02% 88.89%       20ms  2.02%  compress/flate.(*huffmanBitWriter).writeCode
      10ms  1.01% 89.90%       10ms  1.01%  bytes.(*Reader).ReadByte
      10ms  1.01% 90.91%       30ms  3.03%  compress/flate.(*dictDecoder).tryWriteCopy (inline)
      10ms  1.01% 91.92%       10ms  1.01%  compress/flate.(*dictDecoder).writeCopy
      10ms  1.01% 92.93%       10ms  1.01%  compress/flate.byFreq.Less
      10ms  1.01% 93.94%       10ms  1.01%  compress/flate.byFreq.Swap
      10ms  1.01% 94.95%       10ms  1.01%  compress/flate.hash (inline)
      10ms  1.01% 95.96%       10ms  1.01%  compress/flate.reverseBits (inline)
      10ms  1.01% 96.97%       10ms  1.01%  crypto/sha1.blockAVX2
      10ms  1.01% 97.98%      160ms 16.16%  hash/adler32.(*digest).Write (inline)
      10ms  1.01% 98.99%       10ms  1.01%  runtime.procyield
      10ms  1.01%   100%       20ms  2.02%  sort.insertionSort
         0     0%   100%       10ms  1.01%  bytes.(*Buffer).Grow (inline)
         0     0%   100%      230ms 23.23%  bytes.(*Buffer).ReadFrom
         0     0%   100%       30ms  3.03%  bytes.(*Buffer).grow
         0     0%   100%       30ms  3.03%  bytes.growSlice
         0     0%   100%      290ms 29.29%  compress/flate.(*Writer).Write (inline)
         0     0%   100%       30ms  3.03%  compress/flate.(*byFreq).sort (inline)
         0     0%   100%      280ms 28.28%  compress/flate.(*compressor).encSpeed
         0     0%   100%       10ms  1.01%  compress/flate.(*compressor).fillStore
         0     0%   100%      290ms 29.29%  compress/flate.(*compressor).write
         0     0%   100%      120ms 12.12%  compress/flate.(*decompressor).Read
         0     0%   100%       10ms  1.01%  compress/flate.(*decompressor).huffSym (inline)
         0     0%   100%       50ms  5.05%  compress/flate.(*decompressor).huffmanBlock
         0     0%   100%       90ms  9.09%  compress/flate.(*huffmanBitWriter).indexTokens
         0     0%   100%      110ms 11.11%  compress/flate.(*huffmanBitWriter).writeBlockDynamic
         0     0%   100%       20ms  2.02%  compress/flate.(*huffmanBitWriter).writeTokens (inline)
         0     0%   100%       10ms  1.01%  compress/flate.(*huffmanEncoder).assignEncodingAndSize
         0     0%   100%       90ms  9.09%  compress/flate.(*huffmanEncoder).generate
         0     0%   100%       40ms  4.04%  compress/flate.NewWriter (inline)
         0     0%   100%       40ms  4.04%  compress/flate.NewWriterDict
         0     0%   100%      400ms 40.40%  compress/zlib.(*Writer).Write
         0     0%   100%       40ms  4.04%  compress/zlib.(*Writer).writeHeader
         0     0%   100%      210ms 21.21%  compress/zlib.(*reader).Read
         0     0%   100%       10ms  1.01%  crypto/sha1.(*digest).Write
         0     0%   100%       10ms  1.01%  crypto/sha1.block
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.(*Fpdf).Close
         0     0%   100%      970ms 97.98%  github.com/go-pdf/fpdf.(*Fpdf).ImageOptions
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.(*Fpdf).Output
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.(*Fpdf).OutputFileAndClose
         0     0%   100%      970ms 97.98%  github.com/go-pdf/fpdf.(*Fpdf).RegisterImageOptions
         0     0%   100%      970ms 97.98%  github.com/go-pdf/fpdf.(*Fpdf).RegisterImageOptionsReader
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.(*Fpdf).enddoc
         0     0%   100%      960ms 96.97%  github.com/go-pdf/fpdf.(*Fpdf).parsepng
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.(*Fpdf).putpages
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.(*idEncoder).bytes (inline)
         0     0%   100%      410ms 41.41%  github.com/go-pdf/fpdf.(*xmempool).compress
         0     0%   100%      230ms 23.23%  github.com/go-pdf/fpdf.(*xmempool).uncompress
         0     0%   100%       10ms  1.01%  github.com/go-pdf/fpdf.generateImageID
         0     0%   100%      980ms 98.99%  main.main
         0     0%   100%       10ms  1.01%  runtime.gcBgMarkWorker
         0     0%   100%       10ms  1.01%  runtime.gcBgMarkWorker.func2
         0     0%   100%       10ms  1.01%  runtime.gcDrain
         0     0%   100%       10ms  1.01%  runtime.gcDrainMarkWorkerDedicated (inline)
         0     0%   100%      980ms 98.99%  runtime.main
         0     0%   100%       10ms  1.01%  runtime.makeslice
         0     0%   100%       50ms  5.05%  runtime.mallocgc
         0     0%   100%       10ms  1.01%  runtime.markroot
         0     0%   100%       10ms  1.01%  runtime.markroot.func1
         0     0%   100%       10ms  1.01%  runtime.memclrNoHeapPointersChunked
         0     0%   100%       40ms  4.04%  runtime.newobject
         0     0%   100%       10ms  1.01%  runtime.suspendG
         0     0%   100%       10ms  1.01%  runtime.systemstack
         0     0%   100%       30ms  3.03%  sort.Sort
         0     0%   100%       10ms  1.01%  sort.partition
         0     0%   100%       30ms  3.03%  sort.pdqsort

Here are the test images: hello-mars hello-world

ajstarks commented 2 months ago

I note that JPEG processing is about 5x faster than PNG (tested by converting the PNGs to JPEG via convert(1)).

ajstarks commented 2 months ago

It looks like the hot spots are these two nested loops, where the .u8 method is called on the image data before it's compressed:

https://github.com/go-pdf/fpdf/blob/504c6dd8cc916cd7f2097877efd52cae5f1d8b18/png.go#L190-L200

and

https://github.com/go-pdf/fpdf/blob/504c6dd8cc916cd7f2097877efd52cae5f1d8b18/png.go#L210-L223