jamessdixon / Kaggle.HomeDepot

Repo for Kaggle Competition
MIT License
11 stars, 10 forks

Fix train/test data ordering #15

Closed taylorwood closed 8 years ago

taylorwood commented 8 years ago

PSeq.map doesn't preserve the order of the input sequence, so the output was "scrambled". I noticed this after using PSeq.map to parallelize the output CSV; the IDs were unsorted. I think this might've significantly affected the score.

jamessdixon commented 8 years ago

I don't think so b/c they match by ID -> but it is worth checking. Easy enough to test.

taylorwood commented 8 years ago

Is it cool if I upload a submission? I think the score will change because I compared the before/after output and the per-ID relevancies were pretty different.

jamessdixon commented 8 years ago

Go ahead. We have 4 for the day.

jamessdixon commented 8 years ago

Also, you might want to leave the pseq in and add 1 more order clause on the id

taylorwood commented 8 years ago

You improved on your best score by 0.02779. You just moved up 67 positions on the leaderboard.

I'll look into using PSeq.map again while preserving the order of the data. I think the reason it matters is because the output is zipped with the CSV input rows, which are always in the "right" order, but the PSeq.map output isn't.
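For illustration, here is a minimal sketch of two ways to keep parallel results lined up with the input rows. The `Row` type, the `score` function, and the sample data are hypothetical stand-ins, not code from this repo. The first option swaps in `Array.Parallel.map` from FSharp.Core, which returns results in input order. The second carries the ID alongside each result and sorts on it, roughly in the spirit of the "order clause on the id" suggestion above, so nothing depends on positional order.

```fsharp
// Hypothetical row type and scoring function, for illustration only.
type Row = { Id: int; Text: string }

let score (row: Row) : float = float row.Text.Length

// Stand-in for the CSV input rows, already in the "right" order.
let rows =
    [| { Id = 2; Text = "ab" }
       { Id = 5; Text = "abcd" }
       { Id = 9; Text = "a" } |]

// Option 1: Array.Parallel.map runs in parallel but returns results
// at the same index as their inputs, so zipping with the rows is safe.
let scores = rows |> Array.Parallel.map score
let zipped = Array.zip rows scores

// Option 2: keep the ID next to each result, then sort by ID.
// The pairing no longer relies on the order the mapper returns.
let byId =
    rows
    |> Array.Parallel.map (fun r -> r.Id, score r)
    |> Array.sortBy fst
```

With an unordered parallel map, only the second pattern (pairing each result with its ID before any reordering can happen) stays correct; zipping afterwards is what breaks.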

jamessdixon commented 8 years ago

Sweet!
