homebrewltd / llama3-s

Llama3.1 learns to Listen

Epic: Training Data Quality Checking. #5

Closed bachvudinh closed 3 weeks ago

bachvudinh commented 1 month ago

Identify and list all data items whose queries are not suitable for use as audio input in the training datasets jan-hq/instruction-speech-v1.5 and jan-hq/instruction-speech-v1.

bachvudinh commented 1 month ago

Here are some examples. A query containing a complicated HTTP link:

Screenshot_2024-07-12_at_15 12 17

A query whose input is a complicated sentence or paragraph:

Screenshot_2024-07-12_at_14 49 25

hahuyhoang411 commented 1 month ago

Sentences that contain "Q:" and "A:" markers:

image

hahuyhoang411 commented 1 month ago

Tasks such as adding punctuation, correcting case, and adding spaces make no sense as voice instructions:

image

image image image

hahuyhoang411 commented 1 month ago

This type of math, or even coding, is quite unsuitable for voice instructions:

image

image

image

hahuyhoang411 commented 1 month ago

Weird symbols: these could be hard to turn into audio.

image

hahuyhoang411 commented 1 month ago

Also take a close look at samples that contain ":".

image

The audio file is unclear (check sample 2).

https://huggingface.co/datasets/jan-hq/instruction-speech-v1.5-conversation

hungphongtrn commented 1 month ago

In conversation, I think special tokens such as "(", ")", and "{" will never be spoken aloud. Maybe we should filter out these tokens before sending the text to TTS.

image
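A minimal sketch of the pre-TTS cleanup suggested above. The character set `UNSPOKEN_CHARS` and the function name `strip_unspoken` are hypothetical, chosen for illustration; the thread only names "(", ")", and "{", so the other brackets are an assumption.

```python
import re

# Hypothetical set of characters assumed never to be spoken aloud.
# Only "(", ")" and "{" come from the thread; the rest are an assumption.
UNSPOKEN_CHARS = "(){}[]<>"

# Map each unspoken character to a space so neighbouring words stay separated.
_STRIP_TABLE = str.maketrans({ch: " " for ch in UNSPOKEN_CHARS})


def strip_unspoken(text: str) -> str:
    """Replace unspoken characters with spaces, then collapse whitespace."""
    cleaned = text.translate(_STRIP_TABLE)
    return re.sub(r"\s+", " ", cleaned).strip()


print(strip_unspoken("Call me (maybe) at {noon}"))
```

Replacing with a space rather than deleting keeps tokens such as `f(x)` from fusing into one word; whether that is the right behaviour for the TTS engine in use would need checking on real samples.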

hahuyhoang411 commented 1 month ago

Note after reviewing:

hungphongtrn commented 1 month ago

Note after reviewing:

  • We might not need BERT for these
  • We should have a filter to exclude:

    • Specific tasks
    • Special tokens
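A rule-based filter along these lines could look roughly like the sketch below. It is not the project's actual filter: the patterns, the keyword list, the `MAX_WORDS` threshold, and the function name `unsuitable_reasons` are all assumptions made for illustration, based on the problem cases listed earlier in this thread (links, "Q:"/"A:" markers, symbols, text-formatting tasks, long inputs).

```python
import re

# All patterns and thresholds below are illustrative assumptions.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
QA_RE = re.compile(r"\b[QA]:")                 # "Q:" / "A:" markers
SYMBOL_RE = re.compile(r"[(){}\[\]<>=^\\|~]")  # brackets and math-like symbols
TASK_KEYWORDS = ("punctuation", "capitalize", "capitalization", "add spaces")
MAX_WORDS = 50                                 # hypothetical length cutoff


def unsuitable_reasons(query: str) -> list[str]:
    """Return the reasons a query looks unsuitable as a spoken instruction."""
    reasons = []
    if URL_RE.search(query):
        reasons.append("url")
    if QA_RE.search(query):
        reasons.append("qa_markers")
    if SYMBOL_RE.search(query):
        reasons.append("special_tokens")
    lowered = query.lower()
    if any(keyword in lowered for keyword in TASK_KEYWORDS):
        reasons.append("text_formatting_task")
    if len(query.split()) > MAX_WORDS:
        reasons.append("too_long")
    return reasons


for q in ["What is the capital of France?", "Solve f(x) = x^2"]:
    print(q, "->", unsuitable_reasons(q))
```

An empty list means the query passed every rule. Returning reasons instead of a boolean makes it easier to review what each rule removes before dropping rows from the dataset.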

I came across this tool: https://github.com/modelscope/data-juicer?tab=readme-ov-file. Hopefully it's helpful.

bachvudinh commented 1 month ago

OMG, found a series of audio instruction datasets that can be used to benchmark our models: https://huggingface.co/AudioLLMs. They took a lot of data from open benchmarks and kept only the cleaned, audio-compatible data. @hahuyhoang411

tikikun commented 3 weeks ago

Done for the first release.