Hi there! I'm looking into utilizing FFCV for genomics applications. In the process, I tried using the BytesField with a simple dataset to familiarize myself with its behavior. Am I using the API incorrectly?
$ pip list | grep ffcv
ffcv 1.0.2
>>> ffcv.__version__
'0.0.3rc1'
MRE
import torch as ch
from torch.utils.data import Dataset
import numpy as np
import ffcv
from ffcv.fields import BytesField


class FooDS(Dataset):
    def __init__(self):
        self.data = np.arange(5, dtype=np.uint8)

    def __len__(self):
        return 2

    def __getitem__(self, idx: int):
        # Two instances of different lengths: [0, 1, 2] and [3, 4]
        if idx == 0:
            return self.data[:3]
        else:
            return self.data[3:]


ds = FooDS()
writer = ffcv.DatasetWriter('foo.beton', {'bytes': BytesField()})
writer.from_indexed_dataset(ds)

loader = ffcv.Loader(
    'foo.beton',
    batch_size=1,
    num_workers=1,
    order=ffcv.loader.OrderOption.SEQUENTIAL,
    pipelines={'bytes': [BytesField().get_decoder_class()()]}
)

for batch in loader:
    print(batch)
Expected
(array([[0, 1, 2]], dtype=uint8),)
(array([[3, 4]], dtype=uint8),) # or maybe (array([[3, 4, 0]], dtype=uint8),) if the data is automatically padded
For more context, I'm hoping to rapidly process DNA sequences with FFCV. To dramatically reduce the on-disk footprint, I want to store variable-length genotypes, which are sufficient to reconstruct the much larger DNA sequences on the fly. In this setting, each instance from the dataset passed to FFCV would have two fields whose final (length) dimension varies across instances:
"genotypes": shape = (2, length), dtype = int8
"positions": shape = (length,), dtype = uintp
I'm hoping to do this by implementing a dataset that views each array as uint8 and ravels it, and then adding a transform that decodes the bytes back to the intended shape and dtype. That transform could also reconstruct the DNA sequences, which have uniform length across instances. Is this possible with FFCV? I would appreciate any recommendations. Thank you!
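To make the idea concrete, here is a minimal NumPy-only sketch of the round trip I have in mind. The helper names `encode_instance` and `decode_instance` are hypothetical and not part of FFCV, and the sketch assumes the decoded bytes come back exactly as written, with no padding. If FFCV pads the buffer, the padding would have to be trimmed before decoding.

```python
import numpy as np


def encode_instance(genotypes: np.ndarray, positions: np.ndarray):
    """Flatten both fields into 1-D uint8 buffers (hypothetical helper)."""
    # int8 -> uint8 is a same-size reinterpretation, so the length is 2 * length.
    g_bytes = np.ascontiguousarray(genotypes, dtype=np.int8).ravel().view(np.uint8)
    # uintp -> uint8 expands each position into itemsize bytes.
    p_bytes = np.ascontiguousarray(positions, dtype=np.uintp).view(np.uint8)
    return g_bytes, p_bytes


def decode_instance(g_bytes: np.ndarray, p_bytes: np.ndarray):
    """Reinterpret the raw bytes as the original shapes and dtypes."""
    genotypes = g_bytes.view(np.int8).reshape(2, -1)
    positions = p_bytes.view(np.uintp)
    return genotypes, positions


# Example instance with length = 3
genotypes = np.array([[0, 1, -1], [1, 0, 1]], dtype=np.int8)
positions = np.array([10, 250, 100000], dtype=np.uintp)

g_bytes, p_bytes = encode_instance(genotypes, positions)
g_back, p_back = decode_instance(g_bytes, p_bytes)
```

In the setup described above, the dataset's `__getitem__` would return the two encoded buffers, and the decoding step would run as a transform after loading.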