grailbio / base

A collection of Go utility packages used by GRAIL's tools
Apache License 2.0

Reading tsv file concurrently with multiple goroutines #38

Open Wkalmar opened 1 year ago

Wkalmar commented 1 year ago

Hello, I'm using your tsv package to read a .tsv file. The code below works fine:

type row struct {
    Tconst         string `tsv:"tconst"`
    TitleType      string `tsv:"titleType"`
    PrimaryTitle   string `tsv:"primaryTitle"`
    OriginalTitle  string `tsv:"originalTitle"`
    IsAdult        byte   `tsv:"isAdult"`
    StartYear      uint16 `tsv:"startYear"`
    EndYear        string `tsv:"endYear"`
    RuntimeMinutes uint16 `tsv:"runtimeMinutes"`
    Genres         string `tsv:"genres"`
}

func ReadFilePlain() {
    file, err := os.Open("/static/data.tsv")
    if err != nil {
        panic(err)
    }
    defer file.Close()
    r := tsv.NewReader(file)
    r.HasHeaderRow = true
    r.UseHeaderNames = true
    for i := 0; i < 1000; i++ {
        var v row
        err = r.Read(&v)
        if err == nil {
            fmt.Printf("%+v\n", v)
        } else {
            fmt.Println(err)
        }
    }
}

However, when I try to speed things up a bit by using goroutines like this:

func ReadFileGoRoutines() {
    file, err := os.Open("/static/data.tsv")
    if err != nil {
        panic(err)
    }
    defer file.Close()
    r := tsv.NewReader(file)
    r.HasHeaderRow = true
    r.UseHeaderNames = true
    var wg sync.WaitGroup
    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func() {
            var v row
            err = r.Read(&v)
            if err == nil {
                fmt.Printf("%+v\n", v)
            } else {
                fmt.Println(err)
            }
            wg.Done()
        }()
    }
    wg.Wait()
}

I get

column tconst does not appear in the header: map[0:4 1:7 1894:5 Carmencita:3 Documentary,Short:8 \N:6 short:1 tt0000001:0]
panic: runtime error: slice bounds out of range [60:42]

Is it me doing something non-idiomatic or is this some concurrency issue? For your convenience, I have the complete code here

Thank you in advance,
Bohdan

jcharum commented 1 year ago

(*tsv.Reader).Read is not safe to call concurrently. This is consistent with other "read" APIs, like (*encoding/csv.Reader).Read and (io.Reader).Read.