libffcv / ffcv

FFCV: Fast Forward Computer Vision (and other ML workloads!)
https://ffcv.io
Apache License 2.0

Unexpected output with BytesField, question for genomics #370

Open d-laub opened 3 months ago


Hi there! I'm looking into utilizing FFCV for genomics applications. In the process, I tried using the BytesField with a simple dataset to familiarize myself with its behavior. Am I using the API incorrectly?

pip list | grep ffcv reports ffcv 1.0.2, but the package itself reports a different version:

>>> ffcv.__version__
'0.0.3rc1'

MRE

import torch as ch
from torch.utils.data import Dataset
import numpy as np
import ffcv
from ffcv.fields import BytesField

class FooDS(Dataset):
    def __init__(self):
        self.data = np.arange(5, dtype=np.uint8)

    def __len__(self):
        return 2

    def __getitem__(self, idx: int):
        if idx == 0:
            return self.data[:3]
        else:
            return self.data[3:]

ds = FooDS()
writer = ffcv.DatasetWriter('foo.beton', {'bytes': BytesField()})
writer.from_indexed_dataset(ds)

loader = ffcv.Loader(
    'foo.beton',
    batch_size=1,
    num_workers=1,
    order=ffcv.loader.OrderOption.SEQUENTIAL,
    pipelines={'bytes': [BytesField().get_decoder_class()()]}
)

for batch in loader:
    print(batch)

Expected

(array([[0, 1, 2]], dtype=uint8),)
(array([[3, 4]], dtype=uint8),) # or maybe (array([[3, 4, 0]], dtype=uint8),) if the data is automatically padded

Actual

(array([[0]], dtype=uint8),)
(array([[3]], dtype=uint8),)

For more context, I'm hoping to rapidly process DNA sequences with FFCV. To dramatically reduce the on-disk footprint, I want to store variable-length genotypes with FFCV, since these are sufficient to reconstruct the much larger DNA sequences on the fly. In this setting, each instance from the dataset passed to FFCV would have two fields, each with a final length dimension that varies across instances.

I'm hoping I can do this by implementing a dataset that views the data as uint8 and ravels it, and then adding a transform that decodes the data back to the intended shape and dtype. The same transform could also reconstruct the DNA sequences, which have uniform length across instances. Is this possible with FFCV? I would appreciate any recommendations, thank you!
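
To make the idea concrete, here is a minimal NumPy-only sketch of the round trip I have in mind: ravel an array and view it as uint8 before writing, then view it back as the original dtype and reshape after loading. The helper names to_bytes and from_bytes, the int16 dtype, and the (2 haplotypes x n variants) layout are assumptions for illustration only, not part of FFCV.

```python
import numpy as np


def to_bytes(arr: np.ndarray) -> np.ndarray:
    """Flatten an array and reinterpret its buffer as raw uint8 bytes."""
    return np.ascontiguousarray(arr).ravel().view(np.uint8)


def from_bytes(buf: np.ndarray, dtype, leading_shape: tuple) -> np.ndarray:
    """Reinterpret uint8 bytes as `dtype` and restore the shape.

    The final (variable-length) dimension is inferred from the buffer size.
    """
    return buf.view(dtype).reshape(*leading_shape, -1)


# Hypothetical example: int16 genotypes, 2 haplotypes x 3 variants.
genos = np.array([[0, 1, 0], [1, 1, 2]], dtype=np.int16)

raw = to_bytes(genos)  # 1-D uint8 array: 6 values x 2 bytes each = 12 bytes
restored = from_bytes(raw, np.int16, (2,))

assert np.array_equal(restored, genos)
```

If the loader pads each sample up to the longest one in the dataset, the true byte length would also have to be stored (for example in a separate integer field) so the padding can be sliced off before calling from_bytes.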