huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

audio classification official script on local own dataset #24143

Closed flckv closed 1 year ago

flckv commented 1 year ago

System Info

Versions of relevant libraries:

  [pip3] numpy==1.24.3
  [pip3] torch==2.0.1
  [pip3] torchaudio==2.0.2
  [conda] numpy 1.24.3 pypi_0 pypi
  [conda] torch 2.0.1 pypi_0 pypi
  [conda] torchaudio 2.0.2 pypi_0 pypi

Who can help?

@sanchit-gandhi @sgugger @albertvillanova

Reproduction

  1. I want to run this model, but not on the superb dataset: https://github.com/huggingface/transformers/blob/main/examples/pytorch/audio-classification/README.md

  2. I want to load a dataset from local:

  • here is the local data structure for splits: image

  • here is the csv file structure containing the path to the audio file and the audio label: image

I run the script with the following command, without specifying the superb dataset:

  python run_audio_classification.py \
    --model_name_or_path facebook/wav2vec2-base \
    --output_dir wav2vec2-base-s \
    --overwrite_output_dir \
    --remove_unused_columns False \
    --do_train \
    --do_eval \
    --fp16 \
    --learning_rate 3e-5 \
    --max_length_seconds 1 \
    --attention_mask False \
    --warmup_ratio 0.1 \
    --num_train_epochs 5 \
    --per_device_train_batch_size 32 \
    --gradient_accumulation_steps 4 \
    --per_device_eval_batch_size 32 \
    --dataloader_num_workers 4 \
    --logging_strategy steps \
    --logging_steps 10 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --load_best_model_at_end True \
    --metric_for_best_model accuracy \
    --save_total_limit 3 \
    --seed 0 \
    --push_to_hub \
    --use_auth_token True

  3. Changes I made in the run_audio_classification.py script to load audio from the csv files:

3.1 I specify the location of the csv files by replacing lines 249-261 with:

  data_files = {'train': 'train.csv', 'test': 'test.csv', 'valid': 'valid.csv'}

  raw_datasets["train"] = load_dataset('s/data/s/s/train', data_files=data_files["train"])
  raw_datasets["test"] = load_dataset('s/data/s/s/test', data_files=data_files["test"]) 
  raw_datasets["valid"] = load_dataset('s/data/s/s/valid', data_files=data_files["valid"])

Loading the csv files seems to succeed: I get the message "Dataset csv downloaded and prepared".

But these are the errors:

  1. I comment out lines 262-274

    because no matter how I rename the audio path column in the csv files (audio, file_name, train/test/valid), I still get the error: ValueError: --audio_column_name audio not found in dataset 'None'. Make sure to set --audio_column_name to the correct audio column - one of train.

    This happens even though I successfully load the csv files with 'audio' and 'label' headers (I also tried 'filename' instead of 'audio'). The csv files are reported as "Dataset csv downloaded and prepared", yet the error says that the --audio_column_name audio is not found.

  2. Then I receive this error on the cast_column call:

    raw_datasets = raw_datasets.cast_column(
    python3.8/site-packages/datasets/dataset_dict.py line 309, in cast_column
      self._check_values_type()
    line 45, in _check_values_type
      raise TypeError(f"Values in DatasetDict should be of type Dataset but got type '{type(dataset)}'")

TypeError: Values in DatasetDict should be of type Dataset but got type '<class 'datasets.dataset_dict.DatasetDict'>'
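The --audio_column_name error above can be checked independently of the training script by reading the header row of each csv file. The following is a minimal sketch using only the Python standard library; the helper names csv_columns and check_audio_column are hypothetical and are not part of run_audio_classification.py.

```python
import csv


def csv_columns(path):
    """Return the header row of a CSV file as a list of column names."""
    with open(path, newline="", encoding="utf-8") as f:
        # next(..., []) returns an empty list for an empty file
        return next(csv.reader(f), [])


def check_audio_column(path, audio_column="audio"):
    """Raise ValueError listing the columns actually present if audio_column is missing."""
    columns = csv_columns(path)
    if audio_column not in columns:
        raise ValueError(
            f"{audio_column!r} not found in {path}; columns are: {', '.join(columns)}"
        )
    return columns
```

Calling check_audio_column("train.csv") before launching the script shows which header names the csv file really contains, which is the same list the script prints after "one of".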

(I am loading it locally because I have not received a reply on how to load private hub datasets when I raised the issue: https://github.com/huggingface/datasets/issues/5930 ) @albertvillanova

Expected behavior

I want to run the official example script run_audio_classification.py on my own local dataset, instead of the predefined superb dataset, so that I can train the model on my own data.

amyeroberts commented 1 year ago

Hi @flckv, thanks for raising an issue!

The error messages are telling you what the issues are.

  1. The feature audio isn't in the csv. The csv has two column names: train and label. You should either update the csv to have audio as a column name, or pass in --audio_column_name train when you run the script.

  2. The dataset created is a DatasetDict with DatasetDict objects as its values rather than the expected Dataset instances. This should be resolved by doing:

data_files = {'train': 'train/train.csv', 'test': 'test/test.csv', 'valid': 'valid/valid.csv'}
raw_datasets = load_dataset("s/data/s/s", data_files=data_files)
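The nesting arises because load_dataset called without a split argument returns a DatasetDict, so assigning its result to raw_datasets["train"] places a DatasetDict inside another DatasetDict. A minimal sketch of loading all splits in a single call is shown below. It is a variation on the snippet above, not the script's own code: it assumes the generic "csv" builder with full paths, a train/test/valid folder layout, an audio column holding file paths that resolve from the working directory, and a 16 kHz sampling rate. The function names are hypothetical.

```python
import os


def build_data_files(root):
    """Map each split name to <root>/<split>/<split>.csv (assumed layout)."""
    return {
        split: os.path.join(root, split, f"{split}.csv")
        for split in ("train", "test", "valid")
    }


def load_local_splits(root, audio_column="audio", sampling_rate=16000):
    """Load all splits in one call so every value is a Dataset, then decode audio."""
    # Imported lazily so build_data_files works without `datasets` installed.
    from datasets import Audio, load_dataset

    # One call with a dict of data_files yields a flat DatasetDict:
    # {"train": Dataset, "test": Dataset, "valid": Dataset}
    raw_datasets = load_dataset("csv", data_files=build_data_files(root))
    # Cast the column of file paths to the Audio feature
    return raw_datasets.cast_column(audio_column, Audio(sampling_rate=sampling_rate))
```

An alternative that keeps the per-split assignments is to pass split="train" to each load_dataset call, which returns a Dataset rather than a DatasetDict.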

For further questions about how to customise a script, please ask in our forums. We try to reserve GitHub issues for feature requests and bug reports.

flckv commented 1 year ago

Thank you, @amyeroberts

sanchit-gandhi commented 1 year ago

See related: https://discuss.huggingface.co/t/custom-local-data-loading-generating-split-with-load-dataset-not-working-values-in-datasetdict-should-be-of-type-dataset-but-got-type-class-datasets-dataset-dict-datasetdict/42740/2?u=sanchit-gandhi