usnistgov / h5wasm

A WebAssembly HDF5 reader/writer library

Slice question #66

Open jpgu07 opened 9 months ago

jpgu07 commented 9 months ago

Hi, thanks in advance for the help with this library. I've got a quick question about using the slice function in a Dataset.

Is there a way to slice a Dataset by columns? I'm asking because when you slice multiple columns at once, the result is a TypedArray in row-major order: the values for the first row come first, then the second row, and so on.

// Let's say we have this dataSet
dataSet.shape; // [100, 10], which I interpret as 100 rows and 10 columns

// This gives me all the rows for column 0
dataSet.slice([[], [0, 1]]); // a TypedArray of length 100

// But this gives a TypedArray of length 200 in row-major order: the first 2 elements
// come from row 1, the next 2 from row 2, and so on through row 100.
// I was expecting the 100 elements of the first column followed by the 100 elements of the second.
dataSet.slice([[], [0, 2]]);

The issue is, I'm working with a massive Dataset and need to process it by column. Slicing it column by column is just too time-consuming (it's 4492 columns and 300k rows), so I've been slicing this Dataset into several arrays and then converting this data to a column format in memory. This approach isn't very efficient, and the logic for calculating the position of each element in a given column is pretty complex.
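
Roughly, my current workaround looks like this (a simplified sketch; processColumn just stands in for the per-column processing):

// Simplified sketch: slice a block of columns (all rows at once), then
// regroup the row-major result into per-column arrays.
const [numRows, numCols] = dataSet.shape;   // [300000, 4492]
const columnBlock = 100;                    // columns per slice (arbitrary)

for (let startCol = 0; startCol < numCols; startCol += columnBlock) {
  const endCol = Math.min(numCols, startCol + columnBlock);
  const width = endCol - startCol;
  // Row-major layout: element (row, col) sits at row * width + (col - startCol)
  const block = dataSet.slice([[], [startCol, endCol]]);
  for (let col = startCol; col < endCol; col++) {
    const column = new block.constructor(numRows);  // same TypedArray class as the slice
    for (let row = 0; row < numRows; row++) {
      column[row] = block[row * width + (col - startCol)];
    }
    processColumn(col, column);  // placeholder for the per-column processing
  }
}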

Any advice or suggestions would be really helpful. Thank you.

bmaranville commented 9 months ago

One quick question that has profound performance implications: is your data compressed, or chunked for some other reason? If so, what are the chunk dimensions?

jpgu07 commented 9 months ago

This data is in PRODML format, which means that it's just a couple of Groups and Datasets.

I just read it like this:

let h5File = new h5wasm.File(filePath, "r");
let dataSet = h5File.get("Acquisition/Raw[0]/RawData");
dataSet.shape; // (2) [300000, 4492]

There's nothing that suggests this data is compressed or chunked in any way. Two important notes: this is just an example file, but they all follow the same format, and I have no control over these data; I just have to process them. Thanks.

bmaranville commented 9 months ago

Can you also paste dataSet.metadata?

jpgu07 commented 9 months ago

Sure thing, this is the metadata for the main group and the dataset.

AcquisitionId: <redacted>
Count: 1347600000n
Dimensions: (2) ['time', 'locus']
MaximumFrequency: 2500
MeasurementStartTime: '2023-10-11T03:56:27.010050Z'
MinimumFrequency: 0
NumberOfLoci: 4492
OutputDataRate: 5000
PartEndTime: '2023-10-11T03:57:27.069650Z'
PartStartTime: '2023-10-11T03:56:27.010050Z'
PulseRate: 20000
PulseWidth: 20
PulseWidthUnit: 'ns'
RawDescription: 'Single Pulse, SR: 1.5, OCP: 1'
schemaVersion: '2.0'
SpatialSamplingInterval: 1.0209523439407349
SpatialSamplingIntervalUnit: 'm'
StartIndex: 0n
StartLocusIndex: 0
TriggeredMeasurement: 'false'
VendorCode: 'OptaSense IU Setup 1.7.4'

bmaranville commented 9 months ago

Ah, sorry - I meant the HDF5 metadata. In your snippet above where you pasted dataSet.shape, if you then follow that with dataSet.metadata, it will show extended metadata about the HDF5 properties of the dataset itself.

e.g., in a test file I made:

> f.get('uncompressed').shape
[ 300000, 4492 ]
> f.get('uncompressed').metadata
{
  signed: true,
  type: 0,
  cset: -1,
  vlen: false,
  littleEndian: true,
  size: 4,
  shape: [ 300000, 4492 ],
  maxshape: [ 300000, 4492 ],
  chunks: null,
  total_size: 1347600000
}
> 

jpgu07 commented 9 months ago

Alright

chunks: (2) [1846, 71]
cset: -1
littleEndian: true
maxshape: (2) [300000, 4492]
shape: (2) [300000, 4492]
signed: true
size: 2
total_size: 1347600000
type: 0
vlen: false

bmaranville commented 9 months ago

Ok - my first piece of advice is to retrieve data in slices that align with the chunk size as much as possible. Chunked data is stored non-contiguously on disk, so chunks are retrieved individually as needed. For the dataset you posted above, this would mean taking slices of 71 columns at a time. It will take about the same amount of time to get the slice [[], [3*71, 4*71]] as it would to get a single column. (There is a chunk cache, but I think it's 1MB by default and that won't really help you when your columns are 300K in length.)
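
For example, you can read the chunk shape straight from dataset.metadata instead of hardcoding it (a quick sketch, using the dataset path from your snippet above):

// Sketch: derive a chunk-aligned column block size from the dataset's metadata.
const dataset = h5File.get("Acquisition/Raw[0]/RawData");
const [numRows, numCols] = dataset.shape;        // [300000, 4492]
const chunks = dataset.metadata.chunks;          // [1846, 71] here; null if contiguous
const colBlock = chunks ? chunks[1] : numCols;   // 71 columns per read for this file

// Each of these reads touches a single column of chunks on disk, so it should
// cost about the same as reading just one column would.
const block0 = dataset.slice([[], [0, colBlock]]);
const block1 = dataset.slice([[], [colBlock, Math.min(2 * colBlock, numCols)]]);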

Second, I looked through the documentation for the HDF5 library and there is no simple way to retrieve data from an HDF5 Dataset in column-major order. Even for the Fortran HDF5 library, it looks like the data is reordered after it is retrieved. So unfortunately my advice is to continue doing what you're doing, mostly. Depending on what your processing looks like, you might be able to speed it up by hardcoding an indexing function for walking through the data, e.g.

// Map a (row, col) position within a slice to its index in the
// row-major TypedArray returned by slice().
function get_index(row, col, num_cols) {
  return row * num_cols + col;
}

const total_num_rows = 300000;
const total_num_cols = 4492;
const col_chunk_size = 71;   // matches the chunk width of the dataset above

const dataset = f.get('data');

for (let start_col = 0; start_col < total_num_cols; start_col += col_chunk_size) {
  const end_col = Math.min(total_num_cols, start_col + col_chunk_size);
  const num_cols = end_col - start_col;  // the last block may be narrower than 71
  // One chunk-aligned block of columns, all rows, in row-major order.
  const slice_data = dataset.slice([[], [start_col, end_col]]);
  for (let col = 0; col < num_cols; col++) {
    for (let row = 0; row < total_num_rows; row++) {
      process_data(slice_data[get_index(row, col, num_cols)]);
    }
  }
}

I inferred from what you wrote above that you want to process an entire column at a time... if you don't need to do that, it's still generally true that it will be faster to retrieve blocks of data that align with chunk boundaries.
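
For instance, with the [1846, 71] chunks above, a block-aligned read of a sub-block might look like this (just a sketch, reusing the dataset handle from the loop above):

// Sketch: read one block that lines up with chunk boundaries in both dimensions.
const [rowChunk, colChunk] = dataset.metadata.chunks;  // [1846, 71]
const block = dataset.slice([
  [2 * rowChunk, 3 * rowChunk],   // rows 3692..5538, exactly one chunk tall
  [3 * colChunk, 4 * colChunk],   // cols 213..284, exactly one chunk wide
]);
// block is a row-major TypedArray with rowChunk * colChunk elements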

jpgu07 commented 9 months ago

I will give it a try. Thank you so much for the chunk explanation.