fix: [train] [dataset] fixed data out of order

gitmylo / bark-voice-cloning-HuBERT-quantizer

The code for the bark-voicecloning model. Training and inference.

MIT License

670 stars 111 forks

fix: [train] [dataset] fixed data out of order #25

Closed lowkeywx closed 1 year ago

gitmylo commented 1 year ago

This is mainly for sorting everything, right? And erroring in case there's a missing file in the dataset. Since you're using dicts, maybe you could get the max key, loop from 0 to max, and get those entries from the dict. That would possibly be faster.

Also, if you decide to implement it with that loop, you can also use continue instead of assert, since a single missing key wouldn't mean a misaligned dataset.

lowkeywx commented 1 year ago

I am very glad to receive your reply. After sorting, the dict becomes a list. I tested both the loop from 0 to max and zip, and found that zip is more efficient. I've changed the assert in the loop to continue.

gitmylo commented 1 year ago

This is not quite what I meant. When one item is missing here, every pair after it will be mismatched until an item is also missing from the other side. Using a for i in range(): loop lets you check whether both sides have the file corresponding to the number i, without desyncing when one is missing from either side, as happens with a list.
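
The desync described above can be sketched as follows. This is only an illustration: the dicts `data_x` and `data_y` keyed by integer index follow the thread, but the string values and the helper names `pair_by_zip` and `pair_by_index` are made up for the example and are not taken from the repository.

```python
def pair_by_zip(data_x, data_y):
    """Sort each dict by key, then pair the values positionally."""
    xs = [data_x[k] for k in sorted(data_x)]
    ys = [data_y[k] for k in sorted(data_y)]
    return list(zip(xs, ys))


def pair_by_index(data_x, data_y):
    """Pair values that share the same integer key, skipping incomplete pairs."""
    if not data_x or not data_y:
        return []
    pairs = []
    # Walk every index up to the highest key seen on either side.
    for i in range(max(max(data_x), max(data_y)) + 1):
        x = data_x.get(i)
        y = data_y.get(i)
        if x is None or y is None:
            continue  # one side is missing this index; skip only this entry
        pairs.append((x, y))
    return pairs


# Hypothetical dataset: entry 1 is missing from data_y.
data_x = {0: "x0", 1: "x1", 2: "x2", 3: "x3"}
data_y = {0: "y0", 2: "y2", 3: "y3"}

zipped = pair_by_zip(data_x, data_y)      # pairs x1 with y2, x2 with y3
indexed = pair_by_index(data_x, data_y)   # drops index 1, keeps the rest aligned
```

With positional pairing, a single missing file shifts every later pair by one. With index-based pairing, only the incomplete entry is lost.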

lowkeywx commented 1 year ago

In my opinion, any errors in the training data should be dealt with before training to prevent them from affecting the training. So I prefer to align the data before training and detect anomalies in the data.

        for i in range(len(data_x)):
            x = data_x.get(i)
            y = data_y.get(i)
            if x is None or y is None:
                print(f'The training data does not match. key={i}')
                continue

Is this what you expected? I'm not good at Python, so my code might not be good.

gitmylo commented 1 year ago

Aligning the data by file name ensures that no anomalies will affect the training, unless the user deliberately replaced a file.

This is what I expected, although I would use max(len(data_x), len(data_y)) instead of just len(data_x), in case data_y is longer than data_x.
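
A possible refinement, sketched under the assumption that the file numbering can have gaps: if keys are missing, even `max(len(data_x), len(data_y))` can be smaller than the highest key, so the loop may stop early. Bounding the loop by the largest key, as suggested earlier in the thread, avoids that. The function name `align_by_key`, the returned `missing` list, and the sample values are hypothetical and not part of the pull request.

```python
def align_by_key(data_x, data_y):
    """Return (pairs, missing): matched (x, y) values and keys lacking a partner."""
    keys = set(data_x) | set(data_y)
    if not keys:
        return [], []
    pairs, missing = [], []
    # Bound the loop by the highest key, not by the number of entries.
    for i in range(max(keys) + 1):
        x = data_x.get(i)
        y = data_y.get(i)
        if x is None and y is None:
            continue  # gap in the numbering on both sides; nothing to report
        if x is None or y is None:
            print(f'The training data does not match. key={i}')
            missing.append(i)
            continue
        pairs.append((x, y))
    return pairs, missing


# Hypothetical data: indices 1 and 3 are absent on both sides, 4 only exists in data_y.
data_x = {0: "x0", 2: "x2", 5: "x5"}
data_y = {0: "y0", 2: "y2", 4: "y4", 5: "y5"}

# max(len(data_x), len(data_y)) is 4 here, so range(4) would never reach key 5.
pairs, missing = align_by_key(data_x, data_y)
```

Collecting the unmatched keys in a list, rather than only printing them, makes it possible to review and fix the dataset before training starts.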

lowkeywx commented 1 year ago

I changed the code. Thank you for your help.

gitmylo commented 1 year ago

Thank you for your contribution