huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0

How to work with a dataset, or a question about an example of working with a dataset #27

Open anioji opened 1 month ago

anioji commented 1 month ago

Hello. I wanted to ask something.

The recipe says a lot about the requirements for the dataset and, as I understand it, a fairly advanced technology stack is used to assemble the dataset and train the model.

However, I think it will generally be difficult for novice users (like me) to understand how to compose a dataset and how to pass it to the script in order to build their own model, or one based on yours.

There is no clear instruction or tool that would help people deal with their wav or mp3 files automatically, without unnecessary intervention.

Not everyone can use this technology stack, and I wish there were an easier step-by-step recipe, or example steps on Google Colab showing how you do it.

It's difficult for me to understand right away what needs to be done, because I personally, like many who have looked here, have not used Parquet tables or DataSpeech, nor much else that members of the Hugging Face community use.

ylacombe commented 4 weeks ago

Hey @anioji, thanks for opening this issue! Have you tried following the README from dataspeech? If not, please let me know where you're facing issues!

In any case, I'll soon write a simpler guide to allow easy fine-tuning in English, and I'll try to make training from scratch simpler!

anioji commented 3 weeks ago

I'm reading Data Speech. I understand that to create my own dataset I need to use Datasets from Hugging Face (this was not described in the recipe).

And I'm still confused about the parquet table structure.

What should the columns be called before being sent to data-speech, and are the methods I have chosen correct?

My post-annotation preview says I have an object in the audio column; the JSON view shows it is an object: `{"bytes": null, "path": "path/to/file/maybe"}`

```python
import os

import datasets as dt
from datasets import Audio

# Collect the (relative) paths of every file in ./audio
aulist = []
for audio_file in os.listdir("./audio"):
    aulist.append(f"./audio/{audio_file}")

# Build a dataset and tell `datasets` that the column holds audio
df = dt.Dataset.from_dict({'audio': aulist}).cast_column("audio", Audio(sampling_rate=16000))

# df.push_to_hub(repo_id="#####", token="##########")
df.to_parquet("./audio3.parquet")
```

Parquet in JSON view

{"audio":{"bytes":null,"path":"./audio/chunk24.wav"}}
{"audio":{"bytes":null,"path":"./audio/chunk9.wav"}}
{"audio":{"bytes":null,"path":"./audio/chunk7.wav"}}
{"audio":{"bytes":null,"path":"./audio/chunk0.wav"}}
{"audio":{"bytes":null,"path":"./audio/chunk25.wav"}}
{"audio":{"bytes":null,"path":"./audio/chunk2.wav"}}

This is the JSON view of the parquet file that I pushed to Hugging Face.

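As a side note on the snippet above: since it stores relative paths like `./audio/chunk24.wav`, the `path` entries break as soon as the parquet file is read from another directory. A minimal standard-library sketch (a hypothetical helper, not part of Parler-TTS or dataspeech) that collects absolute, sorted paths instead:

```python
import os

def collect_audio_paths(audio_dir: str) -> list[str]:
    """Return sorted absolute paths to the .wav files in audio_dir.

    Absolute paths keep the stored "path" entries valid wherever
    the resulting parquet file is later read from.
    """
    root = os.path.abspath(audio_dir)
    return sorted(
        os.path.join(root, name)
        for name in os.listdir(root)
        if name.endswith(".wav")
    )
```

The returned list can be passed to `Dataset.from_dict({"audio": ...})` exactly as in the snippet above.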

And this is just the beginning of my independent search. Another beginner might not attempt this at all.

For this reason, I will repeat:

Parler-TTS does not provide step-by-step instructions or tools, and it relies on users who have already worked with Hugging Face tooling.

To me this is a problem, both for using Parler-TTS and for creating your own models based on it.

ylacombe commented 3 weeks ago

Hey @anioji, thanks for providing more details! As I said, I'm working on making this a bit clearer. In the meantime, here are a few recommendations for your particular use case:

-> You don't need to worry about how the dataset is saved when using the `datasets` library. Instead, here is a quick recipe for working with your local files:

1. First, create a csv file that contains the full paths to the audio files. Be careful: in your example you only provide relative paths to the audio. The column with those audio paths could be named, for example, `audio`, but you can use whatever you want. You also need a column with the transcriptions of the audio; this column can be named `transcript`, but again you can use whatever you want.

2. Once you have this csv file, you can load it into a dataset like this:

```python
from datasets import DatasetDict

dataset = DatasetDict.from_csv({"train": PATH_TO_CSV_FILE})
```

3. You then need to cast the audio column to [`Audio`](https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/main_classes#datasets.Audio) so that `datasets` understands that it deals with audio files:

```python
from datasets import Audio

dataset = dataset.cast_column("audio", Audio())
```

4. You can then save the dataset locally or push it to the Hub:

```python
dataset.push_to_hub(REPO_ID)
```

Note that you can make the dataset private by passing `private=True` to the `push_to_hub` method. Find other possible arguments here.

When using the data-speech scripts, you can then pass `REPO_ID` (replace this with the name you chose above) as the dataset name.
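The steps above can be sketched end-to-end. The csv-building step (step 1) uses only the standard library; `write_manifest` is a hypothetical helper, and the `datasets` calls from steps 2 to 4 are shown as comments since `PATH_TO_CSV_FILE` and `REPO_ID` are placeholders you must fill in yourself:

```python
import csv
import os

def write_manifest(audio_dir: str, transcripts: dict, csv_path: str) -> None:
    """Write a csv with absolute audio paths and their transcripts.

    `transcripts` maps a file name (e.g. "chunk0.wav") to its text.
    The column names "audio" and "transcript" match the recipe above,
    but any names would work.
    """
    root = os.path.abspath(audio_dir)
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["audio", "transcript"])
        writer.writeheader()
        for name in sorted(transcripts):
            writer.writerow({
                "audio": os.path.join(root, name),  # full path, as step 1 advises
                "transcript": transcripts[name],
            })

# Steps 2-4 (require the `datasets` package):
# from datasets import DatasetDict, Audio
# dataset = DatasetDict.from_csv({"train": "manifest.csv"})
# dataset = dataset.cast_column("audio", Audio())
# dataset.push_to_hub(REPO_ID)  # REPO_ID is a placeholder
```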

ylacombe commented 3 weeks ago

Hey @anioji, I've also updated the README to make it easier to prepare your datasets for fine-tuning: https://github.com/huggingface/dataspeech?tab=readme-ov-file#annotating-datasets-to-fine-tune-parler-tts