Closed lowkeywx closed 1 year ago
I am very glad to receive your reply. After sorting, the dict becomes a list. I tested both looping from 0 to max and using zip, and found that zip is more efficient. I've also changed the assert in the loop to continue.
This is not quite what I meant. When one item is missing here, there will be a large mismatch until one is missing from the other side as well. Using a for i in range(): loop allows you to check whether both sides have the file corresponding to the number i, without desyncing when one is missing from either side, as happens with a list.
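To make the desync concrete, here is a minimal sketch. The dict names follow the thread, but the file names and the fixed range of 4 are made up for illustration and are not taken from the actual dataset.

```python
# Hypothetical data: index -> file name. Index 1 is missing from data_x.
data_x = {0: "x_0.png", 2: "x_2.png", 3: "x_3.png"}
data_y = {0: "y_0.png", 1: "y_1.png", 2: "y_2.png", 3: "y_3.png"}

# Positional pairing of the sorted values: every pair after the gap is shifted.
zipped = list(zip(
    [data_x[k] for k in sorted(data_x)],
    [data_y[k] for k in sorted(data_y)],
))

# Keyed pairing: each index is looked up on both sides, so the gap stays local.
keyed = [
    (data_x[i], data_y[i])
    for i in range(4)  # illustrative bound covering indices 0..3
    if i in data_x and i in data_y
]

print(zipped)  # x_2 is paired with y_1, x_3 with y_2
print(keyed)   # only index 1 is dropped; the remaining pairs still match
```

With zip, a single missing file silently mispairs everything after it, whereas the keyed lookup only loses the one index.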
In my opinion, any errors in the training data should be dealt with before training to prevent them from affecting the training. So I prefer to align the data before training and detect anomalies in the data.
    for i in range(len(data_x)):
        x = data_x.get(i)
        y = data_y.get(i)
        if x is None or y is None:
            print(f'The training data does not match. key={i}')
            continue
Is this what you expected? I'm not good at Python, so my code might not be good.
Using alignment based on file name ensures no anomalies will affect the training, unless the user replaced a file specifically.
This is what I expected, although I would use max(len(data_x), len(data_y)) instead of just len(data_x), in case data_y is longer than data_x.
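A small sketch of why the bound matters, assuming both dicts are keyed by consecutive integers starting at 0. The helper name find_mismatches and the sample data are hypothetical.

```python
def find_mismatches(data_x, data_y):
    """Return the indices that are missing from either dict (hypothetical helper)."""
    mismatched = []
    # Use the longer side as the bound so extra entries in data_y are checked too.
    for i in range(max(len(data_x), len(data_y))):
        if data_x.get(i) is None or data_y.get(i) is None:
            mismatched.append(i)
    return mismatched


# Made-up example: data_y has one more entry than data_x.
data_x = {0: "x_0", 1: "x_1"}
data_y = {0: "y_0", 1: "y_1", 2: "y_2"}

# Bounding the loop by len(data_x) alone never reaches index 2.
short_bound = [
    i for i in range(len(data_x))
    if data_x.get(i) is None or data_y.get(i) is None
]

print(short_bound)                      # nothing reported
print(find_mismatches(data_x, data_y))  # index 2 is reported
```

Note that a length-based bound can still miss high indices if the keys have gaps, so the largest key may be a safer bound in that case.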
I changed the code. Thank you for your help.
Thank you for your contribution
This is mainly for sorting everything, right? And erroring in case there's a missing file in the dataset. Since you're using dicts, maybe you could get the max value, loop from 0 to max and get those entries from the dict, that would possibly be faster.
Also, if you decide to implement it with that loop, you can also use continue instead of assert, since a single missing key wouldn't mean a misaligned dataset.
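Roughly what I have in mind, as a sketch only. It assumes integer keys starting at 0; align_pairs and the sample data are hypothetical names, not code from this PR.

```python
def align_pairs(data_x, data_y):
    """Pair entries by index, skipping indices that are not present on both sides."""
    # Largest key on either side; default=-1 makes the loop empty for empty dicts.
    last = max(max(data_x, default=-1), max(data_y, default=-1))
    pairs, skipped = [], []
    for i in range(last + 1):
        if i not in data_x or i not in data_y:
            skipped.append(i)  # one missing key only drops this index
            continue
        pairs.append((data_x[i], data_y[i]))
    return pairs, skipped


# Made-up example: index 1 is missing from data_y, index 2 from data_x.
data_x = {0: "a", 1: "b", 3: "d"}
data_y = {0: "A", 2: "C", 3: "D"}

pairs, skipped = align_pairs(data_x, data_y)
print(pairs)    # indices 0 and 3 are paired
print(skipped)  # indices 1 and 2 are skipped
```

Returning the skipped indices lets the caller decide whether to warn or to abort before training, instead of failing on the first gap as an assert would.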