flckv commented 1 year ago

System Info

transformers version: 4.30.0.dev0
Platform: Linux-5.4.204-ql-generic-12.0-19-x86_64-with-glibc2.17
Python version: 3.8.12
Huggingface_hub version: 0.15.1
Safetensors version: 0.3.1
PyTorch version (GPU?): 2.0.1+cu117 (True)

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2
[conda] numpy 1.24.3 pypi_0 pypi
[conda] torch 2.0.1 pypi_0 pypi
[conda] torchaudio 2.0.2 pypi_0 pypi

Who can help?

@sanchit-gandhi @sgugger @albertvillanova

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

I want to run the audio classification example, but not on the superb dataset: https://github.com/huggingface/transformers/blob/main/examples/pytorch/audio-classification/README.md
Instead, I want to load a dataset from local files.

here is the local data structure for splits:

here is the csv file structure containing the path to the audio file and the audio label:

Command (note that I don't specify the superb dataset):

python run_audio_classification.py \
    --model_name_or_path facebook/wav2vec2-base \
    --output_dir wav2vec2-base-s \
    --overwrite_output_dir \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --fp16 \
    --learning_rate 3e-5 \
    --max_length_seconds 1 \
    --attention_mask False \
    --warmup_ratio 0.1 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 4 \
    --per_device_eval_batch_size 32 \
    --dataloader_num_workers 4 \
    --logging_strategy steps \
    --logging_steps 10 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --load_best_model_at_end True \
    --metric_for_best_model accuracy \
    --save_total_limit 3 \
    --seed 0 \
    --push_to_hub \
    --use_auth_token True

Changes I made in the run_audio_classification.py script to load audio from csv file:

3.1 I specify the location of the csv files:

so I replace lines 249 - 261

with:

  data_files = {'train': 'train.csv', 'test': 'test.csv', 'valid': 'valid.csv'}

  raw_datasets["train"] = load_dataset('s/data/s/s/train', data_files=data_files["train"])
  raw_datasets["test"] = load_dataset('s/data/s/s/test', data_files=data_files["test"]) 
  raw_datasets["valid"] = load_dataset('s/data/s/s/valid', data_files=data_files["valid"])

It seems that loading the csv files is successful. I get the message: "Dataset csv downloaded and prepared".

But these are the errors:

I comment out lines 262 - 274

because no matter how I rename the audio path column in the csv files (audio, file_name, train/test/valid), it still gives me the error: ValueError: --audio_column_name audio not found in dataset 'None'. Make sure to set --audio_column_name to the correct audio column - one of train.

This happens even though I successfully load the csv file with 'audio' and 'label' headers (I also tried 'filename' instead of 'audio'). The csv files are "Dataset csv downloaded and prepared", yet the error says that --audio_column_name audio is not found.
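One quick way to rule out a header problem is to print the header row of each csv file and compare it with the value passed to --audio_column_name. The following is a minimal sketch using only the standard library; the helper names csv_columns and has_column and the in-memory sample row are hypothetical and are not part of run_audio_classification.py.

```python
import csv
import io


def csv_columns(csv_file):
    """Return the header row of an open CSV file as a list of column names."""
    return next(csv.reader(csv_file), [])


def has_column(csv_file, name="audio"):
    """Return True if `name` is one of the header columns."""
    return name in csv_columns(csv_file)


# Hypothetical in-memory stand-in for one of the csv files in the thread.
sample = io.StringIO("audio,label\nclips/example.wav,dog\n")
print(csv_columns(sample))
```

For a real file, the same helpers can be called on a handle opened with open(path, newline="").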
Then I receive this error:

on raw_datasets = raw_datasets.cast_column(
python3.8/site-packages/datasets/dataset_dict.py line 309, in cast_column
    self._check_values_type()
line 45, in _check_values_type
    raise TypeError(f"Values in DatasetDict should be of type Dataset but got type '{type(dataset)}'")

TypeError: Values in DatasetDict should be of type Dataset but got type '<class 'datasets.dataset_dict.DatasetDict'>'
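The TypeError is consistent with how load_dataset behaves when no split argument is given: it returns a DatasetDict (with a single "train" entry when data_files is one path), so assigning that result to raw_datasets["test"] nests one DatasetDict inside another. The sketch below illustrates the nesting and one way to unwrap it. It uses plain dicts and strings as stand-ins for DatasetDict and Dataset objects, and the helper name unwrap_nested_splits is hypothetical.

```python
from collections.abc import Mapping


def unwrap_nested_splits(splits, inner_key="train"):
    """Replace any value that is itself a mapping with its `inner_key` entry."""
    flat = {}
    for name, value in splits.items():
        # A DatasetDict is a dict subclass, so it is detected as a Mapping here.
        flat[name] = value[inner_key] if isinstance(value, Mapping) else value
    return flat


# Plain dicts and strings stand in for DatasetDict / Dataset objects.
nested = {"train": {"train": "train rows"}, "valid": {"train": "valid rows"}}
print(unwrap_nested_splits(nested))
```

With the real library, passing split="train" to load_dataset returns a Dataset directly, which avoids the nesting in the first place.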

(I am loading it locally because I have not received a reply on how to load private hub datasets when I raised the issue: https://github.com/huggingface/datasets/issues/5930 ) @albertvillanova

Expected behavior

I want to be able to run the official example script run_audio_classification.py on my own local dataset, instead of the predefined superb dataset, to train the model on my data.

amyeroberts commented 1 year ago

Hi @flckv, thanks for raising an issue!

The error messages are telling you what the issues are.

1. The feature audio isn't in the csv. The csv has two column names: train and label. You should either update the csv to have audio as a column name, or pass in --audio_column_name train when you run the script.
2. The dataset created is a DatasetDict with DatasetDict objects as its values rather than the expected Dataset instances. This should be resolved by doing:

data_files = {'train': 'train/train.csv', 'test': 'test/test.csv', 'valid': 'valid/valid.csv'}
raw_datasets = load_dataset("s/data/s/s", data_files=data_files)
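A slightly fuller sketch of the same idea follows, assuming the directory layout from the snippet above (one sub-folder per split, each holding a csv named after the split) and 'audio' / 'label' headers in every file. It is a variation on the two lines above rather than the script's own code: it names the "csv" builder explicitly and passes full paths. The helper names build_data_files and load_local_splits are hypothetical, and the 16 kHz sampling rate is an assumption chosen to match what facebook/wav2vec2-base expects.

```python
from pathlib import Path


def build_data_files(root, splits=("train", "test", "valid")):
    """Map each split name to '<root>/<split>/<split>.csv'."""
    return {
        split: (Path(root) / split / f"{split}.csv").as_posix()
        for split in splits
    }


def load_local_splits(root, audio_column="audio", sampling_rate=16_000):
    """Load every split in one load_dataset call, giving a flat DatasetDict."""
    # Imported here so build_data_files can be used without the library installed.
    from datasets import Audio, load_dataset

    raw_datasets = load_dataset("csv", data_files=build_data_files(root))
    # Each value is a Dataset, so cast_column passes the DatasetDict type check.
    return raw_datasets.cast_column(audio_column, Audio(sampling_rate=sampling_rate))


print(build_data_files("s/data/s/s"))
```

Calling load_local_splits("s/data/s/s") would then return a DatasetDict with "train", "test" and "valid" entries whose audio column is decoded on access.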

For further questions about how to customise a script, please ask in our forums. We try to reserve GitHub issues for feature requests and bug reports.

flckv commented 1 year ago

Thank you, @amyeroberts

sanchit-gandhi commented 1 year ago

huggingface / transformers

audio classification official script on local own dataset #24143