Memory usage while reading workbook with large amount of data (大文件内存占用的问题)

aswjh commented 1 year ago

debian 12系统，excelize用的7.15号go get -u的版本，打开一个180M的xlsx（包200个sheet，其中2个sheet内容最多），设置比较大的UnzipXMLSizeLimit不解压到硬盘，在内存里处理，一开始内存占用近2G，逐渐增加到6G，持续一段时间后完成，耗时3分多钟。

主要代码如下：

xl, err := excelize.OpenFile(fn, excelize.Options {UnzipXMLSizeLimit: 1e18}); 
if err == nil {
    for _, sheetName := range xl.GetSheetList() {
        if rows, err2 := xl.Rows(sheetName); err2 == nil {
            for rows.Next() {
                if row, err3 := rows.Columns(); err3 == nil {
                    //proc
                }
            }
        }
    }
}

尝试用rust的calamine库比较了一下，一开始占用50M,逐渐增加到1.3G就是最高峰了，比wps占用还少点，处理时间也是3分多钟，和go差别不大。主要代码如下：

    let mut wb: Xlsx<_> = open_workbook(fn).unwrap();
    let sheets = wb.sheet_names().to_owned();
    for sheet_name in sheets {
        if let Some(Ok(r)) = wb.worksheet_range(&sheet_name) {
            let mut cnt = 0;
            for row in r.rows() {
                cnt += 1;
            }
            println!("fffff={:?}, cnt={:?}", sheet_name, cnt);
        }
    }

用wps office表格打开占用1.9G. 感觉excelize内存占用偏大，不知道有没有改进的空间

xuri commented 1 year ago

Thanks for your issue. Could you provide a reproducible demo and your input file attachment without confidential info? This issue was similar to #1096. Please decrease the value of UnzipXMLSizeLimit to avoid high memory usage.

aswjh commented 1 year ago

发现用wps表格保存过的大文件也存在同样的问题。这是测试代码：

package main

import (
    "fmt"
    "flag"
    "time"
    "math/rand"
    "github.com/qax-os/excelize"
)

//go build t.go && ./t -gen               //生成一个测试文件/tmp/gen.xlsx
//./t -name=/tmp/gen.xlsx                 //读取上面生成的文件，内存占用正常。若用wps表格打开gen.xlsx再保存，文件体积增大，再进行读测试，内存占用就会剧增。

var (
    genXlsx = flag.Bool("gen", false, "genXlsx")
    name = flag.String("name", "", "filename")

    letterRunes = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
)

//
func testRead(fn string) {
    if xl, err := excelize.OpenFile(fn, excelize.Options {UnzipXMLSizeLimit: 1e18}); err == nil {
        for _, sheetName := range xl.GetSheetList() {
            cnt := 0
            if rows, err2 := xl.Rows(sheetName); err2 == nil {
                for rows.Next() {
                    if _, err3 := rows.Columns(); err3 == nil {
                        cnt += 1
                    }
                }
            }
            fmt.Println(sheetName, cnt)
        }
    }
}

//
func GenXlsxTest(fn string) {
    rand.Seed(time.Now().UnixNano())
    xl := excelize.NewFile()
    for i := 0; i <= 22; i ++ {
        sheetname := fmt.Sprintf("测试表%v", i)
        xl.NewSheet(sheetname)

        streamWriter, _ := xl.NewStreamWriter(sheetname)
        header := []interface{} {"序号", "名", "名2", "cc", "ccci", "zzz", "ccb", "hjjj", "jfjfj"}
        cell, _ := excelize.CoordinatesToCellName(1, 1)
        streamWriter.SetRow(cell, header)

        num := 125
        switch i {
            case 3: num = 150000
            case 7: num = 450000
        }

        for rowi := 2; rowi <= num; rowi ++ {
            row := []interface{} {rowi-1, RandStr(5), RandStr(6), RandStr(7), RandStr(7), RandStr(7), RandStr(7), RandStr(8), RandStr(8), RandStr(8), RandStr(8), RandStr(9), RandStr(9), RandStr(9), RandStr(9), RandStr(10), RandStr(10), RandStr(10), RandStr(10)}
            cell, _ := excelize.CoordinatesToCellName(1, rowi)
            if err := streamWriter.SetRow(cell, row); err != nil {
                fmt.Println(rowi, err)
            }
        }
        if err := streamWriter.Flush(); err != nil {
            println(err)
        }
    }

    xl.SaveAs(fn)
    println("保存到", fn)
}

func RandStr(n int) string {
    b := make([]rune, n)
    for i := range b {
        b[i] = letterRunes[rand.Intn(len(letterRunes))]
    }
    return string(b)
}

func main() {
    flag.Parse()
    if *genXlsx {
        GenXlsxTest("/tmp/gen.xlsx")
    } else if *name != "" {
        testRead(*name)
    }
}

xuri commented 1 year ago

Thanks for your feedback. As the notice in the above reply, please decrease the value of UnzipXMLSizeLimit to avoid high memory usage. I have tested with following code with the workbook generated with your code. The RSS memory usage was about 254MB, so the performance was in expectation.

package main

import (
    "fmt"
    "runtime"
    "syscall"
    "time"

    "github.com/xuri/excelize/v2"
)

func main() {
    runtime.GC()
    startTime := time.Now()
    f, err := excelize.OpenFile("gen.xlsx")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer func() {
        if err := f.Close(); err != nil {
            fmt.Println(err)
        }
    }()
    for _, sheetName := range f.GetSheetList() {
        cnt := 0
        if rows, err2 := f.Rows(sheetName); err2 == nil {
            for rows.Next() {
                if _, err3 := rows.Columns(); err3 == nil {
                    cnt += 1
                }
            }
        }
        fmt.Println(sheetName, cnt)
    }
    printBenchmarkInfo("main", startTime)
}

func printBenchmarkInfo(fn string, startTime time.Time) {
    var memStats runtime.MemStats
    var rusage syscall.Rusage
    var bToMb = func(b uint64) uint64 {
        return b / 1024 / 1024
    }
    runtime.ReadMemStats(&memStats)
    syscall.Getrusage(syscall.RUSAGE_SELF, &rusage)
    fmt.Printf("Func: %s \tRSS = %v MB\tAlloc = %v MB\tTotalAlloc = %v MB\tSys = %v MB\tNumGC = %v \tCost = %s\n",
        fn, bToMb(uint64(rusage.Maxrss)), bToMb(memStats.Alloc), bToMb(memStats.TotalAlloc), bToMb(memStats.Sys), memStats.NumGC, time.Since(startTime))
}

Benchmark info:

Func: main      RSS = 254 MB    Alloc = 6 MB    TotalAlloc = 16921 MB   Sys = 273 MB    NumGC = 4383    Cost = 57.174368275s

aswjh commented 1 year ago

需要用wps表格打开第一步生成的gen.xlsx，测试wps et保存后的文件，内存占用会增加很多。

UnzipXMLSizeLimit设得比较大，一方面是想测试不解压到硬盘的内存占用，另外我的系统/tmp是挂载到ram的，只分配了6G，如果解压到/tmp空间不够

xuri commented 1 year ago

Thanks for your feedback. The stream writer writes string cell value as an inline string (please also reference #1377), which storage cell value in the worksheet instead of SST (shared string table) for get better read performance with row iterator, but the workbook resaved after Kingsoft WPS, the cell value will be storage to the SST, so the library needs to parse SST internal parts of the workbook, that's will take more memory usage. The library can read, parse, and validate many internal workbook structures to get better capability, more sure the generated workbook is not corrupted, and more features and support, that will use more memory. I suggest decreasing the value of the UnzipXMLSizeLimit options for opening the workbook with the amount of data or increase your disk storage or memory resource to get better performance. I've closed this issue, and if you have any questions, please let me know and reopen this anytime.

qax-os / excelize

Memory usage while reading workbook with large amount of data (大文件内存占用的问题) #1581