Epic: Training Data Quality Checking.

homebrewltd / llama3-s

Llama3.1 learns to Listen


Epic: Training Data Quality Checking. #5

Closed bachvudinh closed 3 weeks ago

bachvudinh commented 1 month ago

Identify and list all data items whose queries are not suitable as audio input in the training datasets jan-hq/instruction-speech-v1.5 and jan-hq/instruction-speech-v1.

bachvudinh commented 1 month ago

Here are some examples:

- queries containing a complicated HTTP link
- queries whose input is a complicated sentence or paragraph
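A rough way to surface both kinds of query is a regex check for links plus a word-count cap. The sketch below is only an illustration: `flag_query`, the URL pattern, and the 50-word limit are assumptions, not something taken from the datasets or from this thread.

```python
import re

# Assumption: any http(s):// or www. substring counts as a "complicated link".
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)

# Assumption: 50 words is an arbitrary illustrative cut-off, not a tuned value.
MAX_WORDS = 50


def flag_query(query: str, max_words: int = MAX_WORDS) -> list[str]:
    """Return the reasons a query looks unsuitable as spoken input (empty if none)."""
    reasons = []
    if URL_RE.search(query):
        reasons.append("contains_url")
    if len(query.split()) > max_words:
        reasons.append("too_long")
    return reasons
```

Returning the list of reasons rather than a bare boolean makes it easier to review flagged items by category before deciding what to drop.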

hahuyhoang411 commented 1 month ago

Sentences that contain "Q:" and "A:" markers.

hahuyhoang411 commented 1 month ago

Tasks such as adding punctuation, correcting case, and adding spaces make no sense as voice instructions.

hahuyhoang411 commented 1 month ago

This type of math task is quite unsuitable for voice instruction, and the same goes for coding tasks.

hahuyhoang411 commented 1 month ago

Weird symbols: these could be hard to turn into audio.

hahuyhoang411 commented 1 month ago

Also take a close look at samples that contain ":".

The audio file is unclear (check sample 2).

https://huggingface.co/datasets/jan-hq/instruction-speech-v1.5-conversation

hungphongtrn commented 1 month ago

In conversation, I think special tokens such as "(", ")", and "{" will never be spoken aloud. Maybe we should filter out these tokens before sending the text to TTS.
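One possible shape for that pre-TTS step is sketched below. The helper name `strip_unspoken` and the exact character set are assumptions for illustration; the comment above only names "(", ")" and "{".

```python
import re

# Assumption: this character set is illustrative and should be adjusted
# after inspecting what actually appears in the data.
UNSPOKEN_CHARS = re.compile(r"[(){}\[\]<>*_#|~^`\\]")


def strip_unspoken(text: str) -> str:
    """Replace characters a speaker would not read aloud, then tidy whitespace."""
    # Substitute a space (not an empty string) so adjacent words do not merge.
    cleaned = UNSPOKEN_CHARS.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()
```

For example, `strip_unspoken("Call me (maybe) at {noon}")` gives `"Call me maybe at noon"`.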

hahuyhoang411 commented 1 month ago

Notes after reviewing:

We might not need BERT for these.
We should have a filter to exclude:
- Specific tasks
- Special tokens
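A rule-based exclusion filter along these lines could be sketched without a model. Everything here is hypothetical: the category names, the regexes, and the helpers `excluded_by` and `keep` are placeholders loosely based on the problems raised earlier in this thread, and a real pattern list would come from reviewing the datasets.

```python
import re

# Assumption: illustrative patterns only; they will miss cases and
# produce false positives on real data.
TASK_PATTERNS = {
    "qa_markers": re.compile(r"\b[QA]:"),
    "text_fixing": re.compile(
        r"\b(add|fix|correct)\w*\s+(the\s+)?"
        r"(punctuation|case|capitalization|spaces?)\b",
        re.IGNORECASE,
    ),
    "code": re.compile(r"```|\bdef |\bclass |#include"),
    "math_notation": re.compile(r"[=^√∑∫]|\\frac|\d\s*[+*/]\s*\d"),
}


def excluded_by(query: str) -> list[str]:
    """Return the names of every exclusion rule the query matches."""
    return [name for name, pat in TASK_PATTERNS.items() if pat.search(query)]


def keep(query: str) -> bool:
    """True when no exclusion rule matches."""
    return not excluded_by(query)
```

With the Hugging Face `datasets` library, a predicate like this could be passed to `Dataset.filter`, e.g. `ds.filter(lambda row: keep(row["prompt"]))`, where the column name `prompt` is an assumption about the dataset schema.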

hungphongtrn commented 1 month ago

I came across this tool: https://github.com/modelscope/data-juicer?tab=readme-ov-file. Hopefully it's helpful.

bachvudinh commented 1 month ago

OMG, found a series of audio instruction datasets that can be used to benchmark our models: https://huggingface.co/AudioLLMs. They took a lot of data from open benchmarks and kept only the cleaned, audio-compatible data. @hahuyhoang411

tikikun commented 3 weeks ago

Done for the first release.