Hi,
I'm trying to convert several TSV files from the C4 200M dataset into HDF5 format, and I based my conversion on your notebook.
The dataset consists of 10 files, each containing approximately 18 million records with 2 string columns.
Given the size of the dataset, I thought that converting it to HDF5 would be a significant benefit: it would let me know the shape of each file up front and give a significant performance boost when reading chunks of the dataset.
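For context, this is the kind of chunked read I am hoping to speed up once the conversion is done (just a minimal sketch; the file name is a placeholder and 'input' is one of the datasets created by the conversion code below):

import h5py

# minimal sketch of the read pattern I am after (file name is a placeholder)
with h5py.File('c4_200m_part0.hf5', 'r') as h5f:
    print(h5f['input'].shape)        # number of records is known up front
    batch = h5f['input'][0:100000]   # read one contiguous chunk of records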
In a first trial I converted 1 million records in about 3 minutes; however, when I tried to convert all 18 million records, it took more than 6 hours per file.
I am currently loading my TSV in the following way:
import pathlib as pl

import h5py
import pandas as pd
from tqdm import tqdm


def csv_to_hf5(csv_path, num_lines=1000000, chunksize=100000, columns=None):
    if columns is None:
        columns = ['input', 'labels']
    csv_path = pl.Path(csv_path)
    hdf_filename = csv_path.parent / csv_path.name.replace('.tsv', '.hf5')

    # Suppose this is a large CSV/TSV that does not fit into memory.
    # Get the number of lines in the file if it's on your hard drive:
    # num_lines = subprocess.check_output(['wc', '-l', csv_path])
    # num_lines = int(num_lines.split()[0])
    # use 10,000 or 100,000 or so for chunksize with large files

    # variable-length string dtype for the two text columns
    dt = h5py.special_dtype(vlen=str)

    # this is your HDF5 database:
    with h5py.File(hdf_filename, 'w') as h5f:
        # use num_lines - 1 if the file has a column header
        dset1 = h5f.create_dataset('input',
                                   shape=(num_lines,),
                                   compression=9,
                                   dtype=dt)
        dset2 = h5f.create_dataset('labels',
                                   shape=(num_lines,),
                                   compression=9,
                                   dtype=dt)

        # change the range start from 0 to 1 if the file contains a column header
        for i in tqdm(range(0, num_lines, chunksize)):
            df = pd.read_csv(csv_path,
                             sep='\t',
                             names=columns,
                             header=None,      # no header row; column names come from `names`
                             nrows=chunksize,  # number of rows to read per iteration
                             skiprows=i)       # skip the rows that were already read

            features = df.input.values.astype(str)
            labels = df.labels.values.astype(str)

            # use i - 1 and i - 1 + chunksize if the file has a column header
            dset1[i:i + chunksize] = features
            dset2[i:i + chunksize] = labels
where I set num_lines equal to the total number of lines in each file and chunksize = 10000.
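For completeness, this is roughly how I call the function on each file (the path is just illustrative):

# illustrative call: num_lines is the line count of the file,
# chunksize the number of rows read per iteration
csv_to_hf5('c4_200m_part0.tsv', num_lines=18000000, chunksize=10000)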
I did not expect this performance degradation. Have you ever tried using your code to convert a dataset of a similar size?
Thanks in advance.