Closed by bachvudinh 3 weeks ago
Here are some examples. Queries containing a complicated HTTP link:
Queries whose input is a complicated sentence or paragraph:
Sentences containing "Q:" and "A:" markers:
Tasks such as adding punctuation, correcting case, and adding spaces make no sense as voice instructions.
This type of math is quite unsuitable for voice instructions, and the same goes for coding.
Weird symbols: these could be hard to turn into audio.
Also take a close look at the samples that contain ":".
The audio file is unclear (check sample 2).
https://huggingface.co/datasets/jan-hq/instruction-speech-v1.5-conversation
In conversation, I think special tokens such as "(", ")", and "{" will never be spoken aloud. Maybe we should filter out these tokens before sending the text to TTS.
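A minimal sketch of that pre-TTS cleanup, in case it helps. The symbol set and the function name `strip_unspoken_tokens` are assumptions made up for illustration, not something taken from the dataset pipeline, so adjust them to whatever the TTS engine actually mishandles.

```python
import re

# Hypothetical set of symbols that are unlikely to be read aloud.
# This list is an assumption; extend or shrink it per TTS engine.
UNSPOKEN_RE = re.compile(r"[(){}\[\]<>*_#|~^`\\]")


def strip_unspoken_tokens(text: str) -> str:
    """Replace unspoken symbols with spaces, then collapse whitespace."""
    cleaned = UNSPOKEN_RE.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()
```

For example, `strip_unspoken_tokens("Call me (maybe) at {noon}")` gives `"Call me maybe at noon"`. Replacing with a space instead of deleting keeps neighbouring words from being glued together.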
Note after reviewing:
- We might not need BERT for these
We should have a filter to exclude:
- Specific tasks
- Special tokens
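A rough rule-based sketch of such a filter, assuming plain-text queries. The regexes, the keyword list, the word limit, and the names `rejection_reasons` and `is_voice_suitable` are all hypothetical choices for illustration; the real rules would need tuning against the flagged samples.

```python
import re

# All patterns and thresholds below are assumptions, not project settings.
URL_RE = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)
QA_RE = re.compile(r"(^|\s)[QA]:")  # "Q:" / "A:" markers
SYMBOL_RE = re.compile(r"[(){}\[\]<>=*\\^_|#~`]")  # tokens never spoken aloud
TASK_KEYWORDS = ("punctuation", "capitaliz", "add space", "correct the case")
MAX_WORDS = 60  # arbitrary cutoff for "complicated paragraph" inputs


def rejection_reasons(query: str) -> list[str]:
    """Return the reasons a query is unsuitable for audio (empty if none)."""
    reasons = []
    lowered = query.lower()
    if URL_RE.search(query):
        reasons.append("url")
    if QA_RE.search(query):
        reasons.append("qa_marker")
    if SYMBOL_RE.search(query):
        reasons.append("special_token")
    if any(keyword in lowered for keyword in TASK_KEYWORDS):
        reasons.append("text_only_task")
    if len(query.split()) > MAX_WORDS:
        reasons.append("too_long")
    return reasons


def is_voice_suitable(query: str) -> bool:
    """True when no exclusion rule fires."""
    return not rejection_reasons(query)
```

Returning the list of reasons rather than a bare boolean makes it easier to review which rule excluded each sample. With Hugging Face `datasets`, a predicate like `is_voice_suitable` could be passed to `Dataset.filter` on whichever column holds the query text.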
I came across this tool. https://github.com/modelscope/data-juicer?tab=readme-ov-file. Hopefully it's helpful.
OMG, found a series of audio instruction datasets that can be used to benchmark our models: https://huggingface.co/AudioLLMs. They used a lot of data from open benchmarks and kept the cleaned, audio-compatible data. @hahuyhoang411
Done for the first release.
Identify and list all data items with queries that are not suitable for use as input sounds in training data: jan-hq/instruction-speech-v1.5 and jan-hq/instruction-speech-v1.