Jourdelune opened this issue 3 months ago
Hi! I think the issue comes from the fact that you return `row` entirely, and therefore the dataset has to re-encode the audio data in `row`.

Can you try this instead?
```python
# map the dataset
def transcribe_audio(row):
    audio = row["audio"]  # get the audio but do nothing with it
    return {"transcribed": True}
```
PS: no need to iterate over the dataset to trigger the map function on a `Dataset` - `map` runs directly when it's called (contrary to `IterableDataset`, which you get when streaming, and which is lazy).
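To illustrate the eager-vs-lazy difference, here is a minimal sketch reusing the same dataset; the lambdas are placeholders, not code from this thread:

```python
from datasets import load_dataset

# Dataset: map() executes immediately when called
ds = load_dataset("WaveGenAI/audios2", split="train[:50]")
ds = ds.map(lambda row: {"transcribed": True})  # rows are processed right here

# IterableDataset (streaming=True): map() only registers the transform,
# which actually runs while you iterate
ids = load_dataset("WaveGenAI/audios2", split="train", streaming=True)
ids = ids.map(lambda row: {"transcribed": True})
for row in ids:  # the transform runs lazily here
    break
```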
No, that doesn't change anything. I managed to solve this problem by setting `with_indices=True` in the map function and directly retrieving the audio corresponding to the index:
```python
from datasets import load_dataset
import time

ds = load_dataset("WaveGenAI/audios2", split="train[:50]")

# map the dataset
def transcribe_audio(row, idx):
    audio = ds[idx]["audio"]  # get the audio but do nothing with it
    row["transcribed"] = True
    return row

time1 = time.time()
ds = ds.map(
    transcribe_audio, with_indices=True
)  # set low writer_batch_size to avoid memory issues
for row in ds:
    pass  # do nothing, just iterate to trigger the map function
print(f"Time taken: {time.time() - time1:.2f} seconds")
```
Hmm, maybe accessing `row["audio"]` makes `map()` re-encode what's inside `row["audio"]` in case there are in-place modifications.
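One way to test that hypothesis (not discussed in this thread; a sketch assuming the same dataset) would be to cast the column with `Audio(decode=False)`, so that accessing `row["audio"]` returns only the raw bytes/path instead of a decoded waveform:

```python
from datasets import load_dataset, Audio

ds = load_dataset("WaveGenAI/audios2", split="train[:50]")
ds = ds.cast_column("audio", Audio(decode=False))  # no decoding on access

def transcribe_audio(row):
    audio = row["audio"]  # a dict with "path"/"bytes", not a decoded array
    row["transcribed"] = True
    return row

ds = ds.map(transcribe_audio)  # if re-encoding was the culprit, this should stay fast
```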
Hello, I'm working with an audio dataset. I want to transcribe the audio that the dataset contains, and for that I use Whisper. My issue is that the dataset loads everything into RAM when I map it; obviously, when RAM usage gets too high, the program crashes.
To fix this issue, I'm using `writer_batch_size`, which I set to 10, but in this case the mapping of the dataset is extremely slow. To illustrate this, on 50 examples with `writer_batch_size` set to 10, it takes 123.24 seconds to process the dataset; without `writer_batch_size` set to 10, it takes about ten seconds, but then the process remains blocked (I assume it is writing the dataset and therefore suffers from the same problem as with `writer_batch_size`).

Steps to reproduce the bug
High RAM usage but fast (but actually slow when saving the dataset):
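The original snippet isn't reproduced here; a sketch of what this variant presumably looks like, based on the code shared later in the thread (default `writer_batch_size`, audio access as a placeholder for the Whisper call):

```python
from datasets import load_dataset
import time

ds = load_dataset("WaveGenAI/audios2", split="train[:50]")

def transcribe_audio(row):
    audio = row["audio"]  # decode the audio (placeholder for the Whisper call)
    row["transcribed"] = True
    return row

time1 = time.time()
ds = ds.map(transcribe_audio)  # default writer_batch_size: fast, but RAM usage grows
print(f"Time taken: {time.time() - time1:.2f} seconds")
```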
Low RAM usage but very, very slow:
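Likewise a sketch, reusing the definitions above but with a small writer batch:

```python
time1 = time.time()
ds = ds.map(transcribe_audio, writer_batch_size=10)  # low RAM, ~123 s for 50 examples
print(f"Time taken: {time.time() - time1:.2f} seconds")
```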
Expected behavior
I think the processing should be much faster; on only 50 audio examples, the mapping takes several minutes while nothing is actually done (just loading the audio).
Environment info
- `datasets` version: 2.21.0
- `huggingface_hub` version: 0.24.5
- `fsspec` version: 2024.6.1

Extra
The dataset has been generated using AudioFolder, so I don't think anything specific in my code is causing this problem.
Also, it's the combination of `audio = row["audio"]` and `row["transcribed"] = True` that causes problems: `row["transcribed"] = True` alone does nothing, and `audio = row["audio"]` alone sometimes causes problems, sometimes not.
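For reference, a minimal sketch of the three variants described above (assuming the same `ds` loaded earlier in this report; the function names are mine, not from the original code):

```python
# 1. Write only: no problem observed
def write_only(row):
    row["transcribed"] = True
    return row

# 2. Read only: sometimes problematic, sometimes not
def read_only(row):
    audio = row["audio"]  # decode the audio, then discard it
    return row

# 3. Read + write: the combination that reliably causes the slowdown
def read_and_write(row):
    audio = row["audio"]
    row["transcribed"] = True
    return row

for fn in (write_only, read_only, read_and_write):
    ds.map(fn, writer_batch_size=10)
```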