
krisy / kaggle


Changing from shuffle() to something faster #1

Open krisy opened 11 years ago

krisy commented 11 years ago

When selecting random rows from the dataset, the data is shuffled, then the first k rows are selected. This touches all n rows, so it costs at least O(n) (O(n*logn) if the shuffle is sort-based), even when k is much smaller than n. Instead we should use e.g. the following:

create a set
while the size of the set is < k:
    generate a random index in [0, n)
    store it in the set (this way, if the same number is generated twice, the size of the set doesn't grow!)
we have random indices now; select the corresponding rows from the dataset
complexity: expected O(k) when k is much smaller than n
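
A minimal Python sketch of the steps above, using only the standard library. The function name `sample_indices` and the optional `rng` parameter are hypothetical, chosen for illustration; they are not taken from the repository.

```python
import random


def sample_indices(n, k, rng=None):
    """Return k distinct random indices from range(n) via set-based rejection.

    Hypothetical helper: the name and signature are assumptions.
    """
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    if rng is None:
        rng = random.Random()

    chosen = set()
    # Duplicates are absorbed by the set, so the loop only ends
    # once k distinct indices have been collected.
    while len(chosen) < k:
        chosen.add(rng.randrange(n))  # uniform index in [0, n)
    return list(chosen)


# Usage: pick 10 distinct row indices out of 1000.
indices = sample_indices(1000, 10)
```

The expected number of draws stays close to k only while k is small relative to n. As k approaches n, collisions become frequent and the loop degrades toward O(n*logn) draws, at which point shuffling is the better choice.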

TODO: can numpy generate something like that?
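
One possible answer, as a sketch: NumPy's `Generator.choice` accepts `replace=False`, which draws distinct indices directly. This assumes NumPy 1.17 or later (for `np.random.default_rng`) and that the dataset is a 2-D array with one row per record; the helper name `sample_rows` is hypothetical. Whether it beats the shuffle for a given n and k would need benchmarking.

```python
import numpy as np


def sample_rows(data, k, seed=None):
    """Return k distinct random rows of a 2-D array.

    Hypothetical helper: the name and signature are assumptions.
    """
    rng = np.random.default_rng(seed)
    # Draw k distinct row indices from range(number of rows).
    idx = rng.choice(data.shape[0], size=k, replace=False)
    return data[idx]


# Usage: take 3 random rows from a 10 x 2 array.
data = np.arange(20).reshape(10, 2)
rows = sample_rows(data, 3, seed=0)
```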