gonum / hdf5

hdf5 is a wrapper for the HDF5 library
BSD 3-Clause "New" or "Revised" License

Use of HDF5 1.12 #85

Open tomas-kucera opened 2 years ago

tomas-kucera commented 2 years ago

What are you trying to do?

I am trying to read an HDF5 database (version 1.12.1).

The database was populated using Python's h5py library. The data is a pandas DataFrame, but I guess that should be no problem, as h5ls and the HDFView app read the data without any issues.

What did you do?

I used the example from this repo for reading a table. I also tried to use DataSet instead. This is the code excerpt:

package main

import (
  "fmt"
  "gonum.org/v1/hdf5"
)

type ohlcv struct {
  Index      int64   `hdf5:"index"`
  Exchange   string  `hdf5:"exchange"`
  Pair       string  `hdf5:"pair"`
  Timestamp  int64   `hdf5:"timestamp"`
  PriceOpen  float64 `hdf5:"price_open"`
  PriceHigh  float64 `hdf5:"price_high"`
  PriceLow   float64 `hdf5:"price_low"`
  PriceClose float64 `hdf5:"price_close"`
  Volume     float64 `hdf5:"volume"`
}

func main() {
  version, _ := hdf5.LibVersion()
  fmt.Printf("HDF5 version: %s\n", version)

  file, _ := hdf5.OpenFile("tickers.h5", hdf5.F_ACC_RDONLY)
  month, _ := file.OpenGroup("M11")
  day, _ := month.OpenGroup("D07")
  table, _ := day.OpenTable("table")

  recs, _ := table.NumPackets()

  for i := 0; i != recs; i++ {
    p := make([]ohlcv, 1)
    if err := table.Next(&p); err != nil {
      panic(fmt.Errorf("next failed: %s", err))
    }
    fmt.Printf("data[%d]: O:%.2f H:%.2f L:%.2f C:%.2f V:%.2f \n", i, p[0].PriceOpen, p[0].PriceHigh, p[0].PriceLow, p[0].PriceClose, p[0].Volume)
  }

  file.Close()
}

What did you expect to happen?

I expected something like this:

HDF5 version: 1.12.1
data[0]: O:62829.33 H:62858.35 L:62829.32 C:62853.66 V:10.72221
data[1]: O:62853.66 H:62920.04 L:62851.32 C:62896.75 V:10.19546
...
data[1439]: O:63276.08 H:63286.35 L:63250.01 C:63273.59 V:43.11052

What actually happened?

What I get is:

HDF5 version: 1.12.1
data[0]: O:24533265083020748587221761909950877822199906846513430683666835688641707196344354649178734577047675756970784403964996179506865859538714624.00 H:153999479823021862704498665709509248968354775291789269717488570675195022731875416084608859555430794393831940365058635304153349319996889497485119259215244127082639950809210292371944342687481593856.00 L:0.00 C:0.00 V:0.00 
data[1]: O:-0.00 H:11485591669347015527702671166617436553216.00 L:0.00 C:0.00 V:0.00 
data[2]: O:16786184717166469080015018654342952761822206471285346540228583460481189475663073161248768.00 H:116860917747596761471525066204868691258239771993742452785440191
...
data[1433]: O:-9500707167603260.00 H:59636916704940832875429063464307788500085761805873313238334329889516158976.00 L:0.00 C:0.00 V:0.00 
data[1434]: O:5019141222517546172965875509332335194542251462150973230580526953620246797924906675853593091481301449281850864438229980645461301058991845935258992265931878573396108562393519846894098059723698237228341624032758439241216420110000323300523402260850195612038020143058501041685903495909081088.00 H:-14749955137625195020933306096366472509413755436838140703864665704181336175168284990050108620591141868165148740370361311351055003224895967047494566278255701287614003245498688726479296978595350966497450734598780051966490510138076888153184354240036864.00 L:0.00 C:0.00 V:0.00 
...
data[1438]: O:-0.00 H:40804893379413961024208896.00 L:0.00 C:0.00 V:0.00 
data[1439]: O:11485478191699172345758915201790495424512.00 H:-0.00 L:0.00 C:0.00 V:0.00 

Also, when trying to access the Exchange or Pair fields, I get the following error:

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0xb01dfacedebac1e pc=0x405fbc9]

What version of Go, Gonum, Gonum/netlib and libhdf5 are you using?

go version go1.17 darwin/amd64
gonum.org/v1/hdf5
h5cc -showconfig (excerpt)
General Information:
-------------------
HDF5 Version: 1.12.1
Configured on: Mon Jul 12 08:05:03 BST 2021
Configured by: brew@BigSur
Host system: x86_64-apple-darwin20.4.0
Uname information: Darwin BigSur 20.4.0 Darwin Kernel Version 20.4.0: Thu Apr 22 21:46:47 PDT 2021; root:xnu-7195.101.2~1/RELEASE_X86_64 x86_64
Byte sex: little-endian
Installation point: /usr/local/Cellar/hdf5/1.12.1

Does this issue reproduce with the current master?

Yes, it does!

sbinet commented 2 years ago

I surmise this:

p := make([]ohlcv, 1)

needs to read instead:

var p ohlcv

and thus:

fmt.Printf("data[%d]: O:%.2f H:%.2f L:%.2f C:%.2f V:%.2f \n", i, p.PriceOpen, p.PriceHigh, p.PriceLow, p.PriceClose, p.Volume)

tomas-kucera commented 2 years ago

Thanks for a quick response.

I took that code from the example, but I also tried what you are suggesting. Unfortunately, that leads to this error: panic: unsupported kind (struct), need slice or array

which makes sense, as the function definition is:

func (*hdf5.Table).Next(data interface{}) error
(hdf5.Table).Next on pkg.go.dev

Next reads packets from a packet table starting at the current index into the value pointed at by data. i.e. data is a pointer to an array or a slice.
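The panic message is consistent with a reflection check on the kind of value that data points to. The following is a minimal sketch of such a check, using only the standard library. checkTarget and row are hypothetical names for illustration and this is not the gonum/hdf5 implementation; it only shows why a pointer to a struct would be rejected while a pointer to a slice would be accepted.

```go
package main

import (
	"fmt"
	"reflect"
)

// checkTarget is a hypothetical helper (not gonum/hdf5 code). It accepts
// only a non-nil pointer to a slice or an array, as the documentation of
// Next describes.
func checkTarget(data interface{}) error {
	rv := reflect.ValueOf(data)
	if rv.Kind() != reflect.Ptr || rv.IsNil() {
		return fmt.Errorf("need a non-nil pointer, got %T", data)
	}
	switch k := rv.Elem().Kind(); k {
	case reflect.Slice, reflect.Array:
		return nil
	default:
		return fmt.Errorf("unsupported kind (%s), need slice or array", k)
	}
}

// row is a made-up record type used only for this illustration.
type row struct{ Open float64 }

func main() {
	var one row
	fmt.Println(checkTarget(&one))         // pointer to struct: rejected
	fmt.Println(checkTarget(&[]row{{}})) // pointer to slice: accepted
}
```

Under this reading, passing &p with p declared as a one-element slice is the form the function expects.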

Can it be that you are using some newer version?

EDIT: Just checked the implementations of the Next function in h5pt_table.go and they are identical.

EDIT 2: BTW, I also tried ReadPackets instead of Next with the same results.

tomas-kucera commented 2 years ago

Did some more research!

If I replace the printing line with

fmt.Printf("data[%d]: %v\n", i, p)

Then there are two possible results, depending on the definition of the struct:

  1. the full definition that includes the strings fails with this error coming from fmt.Printf(): panic: runtime error: growslice: cap out of range

  2. if the strings are commented out, then the result is this:

    data[1437]: [{1625183880000000000 7090182514096892258 1.814982667395619e-306 -3.79181233146521e-284 -4.643804396672689e-134 -9.500616071346912e+15 3.5854690526542615e+184}]
    data[1438]: [{1625183940000000000 7090182514096892258 1.814982667395619e-306 5.9896317349078915e+183 3.434212986107372e+237 3.58550892285317e+184 -9.919075148868785e-38}]
    data[1439]: [{1625184000000000000 7090182514096892258 1.814982667395619e-306 -3.3900115496356115e+111 -2.0293221659741413e+112 -3.177424435398634e-182 1.1485478191699172e+40}]

    where the first column (Index) is perfectly correct but the rest is just messed up.

This leads me to think that the read ignores the `hdf5:"column_name"` tags and reads the values in sequence, which messes up the data completely.
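One way to probe this hypothesis is to look at the raw bytes behind the second column of the output above, which holds the same value, 7090182514096892258, in every row. The sketch below reinterprets that int64 as eight little-endian bytes (the h5cc excerpt reports a little-endian host). asciiLE is a hypothetical helper name used for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// asciiLE returns the eight little-endian bytes of v as a string.
func asciiLE(v int64) string {
	var b [8]byte
	binary.LittleEndian.PutUint64(b[:], uint64(v))
	return string(b[:])
}

func main() {
	// Value that appears in the Timestamp position of each printed row.
	fmt.Printf("%q\n", asciiLE(7090182514096892258)) // prints "binanceb"
}
```

The bytes spell "binanceb", which looks like text rather than a timestamp. A plausible reading, which is an assumption and not something confirmed in this thread, is that the file stores a 7-byte fixed-length exchange string directly after the index, followed by the first byte of the pair string, and that the read fills the Go struct by byte offset rather than by column name.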

This hypothesis is somewhat undermined by the fact that even if I leave the struct definition complete (including the strings), Next succeeds, and as long as I do not attempt to print the string values (Exchange / Pair), the values are displayed, but wrong. That is the original output.

I am totally lost.

But I have a simple question: how does handling strings in structs work when reading from HDF5?

I have noticed that the master/cmd/test-go-table-01-readback/main.go file contains this struct definition:

type particle struct {
    // name        string  `hdf5:"Name"`      // FIXME(sbinet)
    Lati        int32   `hdf5:"Latitude"`
    Longi       int64   `hdf5:"Longitude"`
    Pressure    float32 `hdf5:"Pressure"`
    Temperature float64 `hdf5:"Temperature"`
    // isthep      []int                     // FIXME(sbinet)
    // jmohep [2][2]int64                    // FIXME(sbinet)
}

That suggests that strings may be an issue.
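For context on why a string field is awkward here: a Go string is a small header holding a pointer and a length, not the character data itself, so copying raw record bytes over it cannot yield a valid string. If the file stores the string columns as fixed-length byte fields (an assumption about this file, not something confirmed in this thread), one workaround is to mirror them as fixed-size byte arrays. The sketch below shows that idea with the standard library only. The record type, the field lengths 7 and 8, the sample values, and the helper names are all made up for illustration, and it does not use gonum/hdf5.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// record mirrors a hypothetical packed layout in which string columns are
// fixed-length byte fields. The lengths 7 and 8 are assumptions, not
// values read from tickers.h5.
type record struct {
	Index     int64
	Exchange  [7]byte
	Pair      [8]byte
	Timestamp int64
	PriceOpen float64
}

// fixedString converts a NUL-padded fixed-length field to a Go string.
func fixedString(b []byte) string {
	return string(bytes.TrimRight(b, "\x00"))
}

// decodeRecord reads one packed little-endian record from raw.
func decodeRecord(raw []byte) (record, error) {
	var r record
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &r)
	return r, err
}

// sampleRecord builds the raw bytes of one made-up record.
func sampleRecord() []byte {
	in := record{Index: 1, Timestamp: 2, PriceOpen: 62829.33}
	copy(in.Exchange[:], "binance")
	copy(in.Pair[:], "btcusdt")
	var buf bytes.Buffer
	if err := binary.Write(&buf, binary.LittleEndian, in); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	r, err := decodeRecord(sampleRecord())
	if err != nil {
		panic(err)
	}
	fmt.Println(fixedString(r.Exchange[:]), fixedString(r.Pair[:]), r.PriceOpen)
}
```

Whether gonum/hdf5 accepts byte-array fields like these in a packet table read is not established here; the real field lengths would have to come from the file's compound datatype, for example as reported by h5ls or HDFView.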