twitter-archive / torch-dataset

An extensible and high performance method of reading, sampling and processing data for Torch
Apache License 2.0

IndexCSV Datasets are not partitioned correctly #16

Closed: willfrey closed this 8 years ago

willfrey commented 8 years ago

I'm trying to split a dataset with 12800 examples across four nodes. Instead of each node receiving 3200 examples, it appears that they receive 0, 12771, 27, and 2, respectively.

Can you help me understand this behavior and try to resolve it?

Thanks.
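As a quick sanity check on the numbers reported above, the sketch below compares the observed per-node counts with an even split. The helper name `even_split` is hypothetical and is not part of torch-dataset; only the counts themselves come from the report.

```python
# Counts taken from the report above.
total_examples = 12800
num_nodes = 4
observed = [0, 12771, 27, 2]


def even_split(total, parts):
    """Hypothetical helper: split `total` items as evenly as possible."""
    base, extra = divmod(total, parts)
    # The first `extra` parts each take one leftover item.
    return [base + (1 if i < extra else 0) for i in range(parts)]


expected = even_split(total_examples, num_nodes)

# No examples were lost: the observed counts still add up to the total.
all_accounted_for = sum(observed) == total_examples

# Positive means the node received more than its even share.
skew = [o - e for o, e in zip(observed, expected)]
```

The observed counts still sum to 12800, so nothing is dropped; the examples are only distributed unevenly, with one node holding almost all of them.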

willfrey commented 8 years ago

I've worked out the cause: I have a huge number of distinct labels. I'm using this for speech data, so I was storing each example's transcription text in the label column.

Do you have any suggestions on a better way to store the text?

Perhaps I could look the text up using the filenames returned from the sampler?

willfrey commented 8 years ago

Okay, that worked!

I created one big JSON file that maps each filename to its transcription. I can look the transcriptions up using batch.item[i].url.
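A minimal sketch of that lookup idea, written in Python rather than Lua purely for illustration. The JSON layout (a flat object keyed by filename), the example filenames, and the function name `lookup_transcriptions` are all assumptions; the actual file format used here is not shown in this thread.

```python
import json

# Assumed layout: one JSON object mapping each filename to its transcription.
# The filenames and texts below are made up for the example.
TRANSCRIPTIONS_JSON = """
{
  "audio/utt_0001.wav": "hello world",
  "audio/utt_0002.wav": "good morning"
}
"""


def lookup_transcriptions(urls, table):
    """Return the transcription for each filename, or None if it is missing."""
    return [table.get(url) for url in urls]


table = json.loads(TRANSCRIPTIONS_JSON)

# Stand-in for the filenames a sampled batch would expose
# (batch.item[i].url in the comment above).
batch_urls = ["audio/utt_0002.wav", "audio/utt_0001.wav", "audio/missing.wav"]
texts = lookup_transcriptions(batch_urls, table)
```

Keeping the transcriptions out of the label column means the label column no longer holds thousands of distinct values, and the text is recovered after sampling from the filename alone.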

Closing this. :)