lorenzoh / DataLoaders.jl

A parallel iterator for large machine learning datasets that don't fit into memory inspired by PyTorch's `DataLoader` class.
https://lorenzoh.github.io/DataLoaders.jl/docs/dev/interactive
MIT License

Reproducibility problem with multi-threading #32

Open terasakisatoshi opened 2 years ago

terasakisatoshi commented 2 years ago

When I used DataLoaders.jl in my project, specifically for deep learning, I encountered a reproducibility problem when multi-threading is enabled. Below is an MWE that demonstrates the issue. Here, MyDataset simply returns idx, the second argument of the getobs method.

# example.jl
module My

import DataLoaders.LearnBase: getobs, nobs
using Random

struct MyDataset
    ndata::Int
end

Base.getindex(dset::MyDataset, idx) = idx
getobs(dset::MyDataset, idx) = dset[idx]
nobs(dset::MyDataset) = dset.ndata

end # My

using DataLoaders
using Random

using .My

MyDataset = My.MyDataset

ntrial = 3

for t in 1:ntrial
    dset = MyDataset(10000) # create an instance of MyDataset
    loader = DataLoader(dset, 100) # setup loader
    for batch in loader
        @show batch # <------
        println()
        break
    end
end

From my understanding, for each t in 1:ntrial, @show batch should display the array from 1 to 100, namely:

batch = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]

However, running the example.jl script above actually outputs something like:

$ julia --threads=12 example.jl # num thread = 12
batch = [301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357, 358, 359, 360, 361, 362, 363, 364, 365, 366, 367, 368, 369, 370, 371, 372, 373, 374, 375, 376, 377, 378, 379, 380, 381, 382, 383, 384, 385, 386, 387, 388, 389, 390, 391, 392, 393, 394, 395, 396, 397, 398, 399, 400]

batch = [401, 402, 403, 404, 405, 406, 407, 408, 409, 410, 411, 412, 413, 414, 415, 416, 417, 418, 419, 420, 421, 422, 423, 424, 425, 426, 427, 428, 429, 430, 431, 432, 433, 434, 435, 436, 437, 438, 439, 440, 441, 442, 443, 444, 445, 446, 447, 448, 449, 450, 451, 452, 453, 454, 455, 456, 457, 458, 459, 460, 461, 462, 463, 464, 465, 466, 467, 468, 469, 470, 471, 472, 473, 474, 475, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 491, 492, 493, 494, 495, 496, 497, 498, 499, 500]

batch = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]

This phenomenon occurs whenever the number of threads is greater than 1.

terasakisatoshi commented 2 years ago

Below is my output of versioninfo()

                _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.7.2 (2022-02-06)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia>  versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.5.0)
  CPU: Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

(EDIT): I tested this with DataLoaders v0.1.3:


(@v1.7) pkg> st DataLoaders
      Status `~/.julia/environments/v1.7/Project.toml`
  [2e981812] DataLoaders v0.1.3

lorenzoh commented 2 years ago

DataLoader with multiple threads uses eachobsparallel, which does not guarantee a deterministic ordering.
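
To illustrate why the order can vary, here is a minimal sketch in plain Julia. It is not the DataLoaders.jl implementation, and `unordered_batches` is a hypothetical helper invented for this example. Each batch is loaded by its own task and pushed to a shared channel as soon as it finishes, so the consumer sees batches in completion order rather than index order.

```julia
using Base.Threads: @spawn

# Hypothetical helper, for illustration only: load every batch in its own task
# and hand results to the consumer in the order the tasks happen to finish.
function unordered_batches(nobs::Int, batchsize::Int)
    ranges = collect(Iterators.partition(1:nobs, batchsize))
    results = Channel{Vector{Int}}(length(ranges))
    @sync for r in ranges
        @spawn put!(results, collect(r))  # completion order depends on scheduling
    end
    close(results)
    return collect(results)
end

batches = unordered_batches(1000, 100)
# With more than one thread, first(batches) is not guaranteed to be 1:100.
# Every batch is still delivered exactly once; only the order may differ.
@show first(first(batches))
```

Run with `julia --threads=12`, the printed value may change between runs, which matches the behavior reported above.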

DataLoaders.jl functionality is currently being added to MLUtils.jl (see https://github.com/JuliaML/MLUtils.jl/pull/33) and I am thinking to add an optional wrapper that reorders the batches, at the cost of some performance likely.

I won't add this here, though, since MLUtils.jl will supersede DataLoaders.jl. I'll leave this open and update once the functionality exists there 👍
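
A reordering wrapper of the kind mentioned above could look roughly like the following sketch. This is an assumption about one possible design, not code from DataLoaders.jl or MLUtils.jl, and `ordered_batches` is a hypothetical name. Workers tag each batch with its index, and the consumer holds early arrivals in a buffer until the next expected index shows up.

```julia
using Base.Threads: @spawn

# Hypothetical wrapper, for illustration only: restore index order on the
# consumer side by buffering batches that arrive ahead of their turn.
function ordered_batches(nobs::Int, batchsize::Int)
    ranges = collect(Iterators.partition(1:nobs, batchsize))
    results = Channel{Tuple{Int,Vector{Int}}}(length(ranges))
    @sync for (i, r) in enumerate(ranges)
        @spawn put!(results, (i, collect(r)))  # tag each batch with its index
    end
    close(results)

    pending = Dict{Int,Vector{Int}}()  # batches that arrived too early
    ordered = Vector{Vector{Int}}()
    next = 1                           # index of the batch to emit next
    for (i, batch) in results
        pending[i] = batch
        # Emit every batch that is now contiguous with what was already emitted.
        while haskey(pending, next)
            push!(ordered, pop!(pending, next))
            next += 1
        end
    end
    return ordered
end

first_batch = first(ordered_batches(1000, 100))  # always the batch for 1:100
```

For simplicity this sketch waits for all workers before reordering. A streaming version would reorder while workers are still running, and the performance cost comes from the consumer having to wait for a slow batch even when later batches are already available.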

terasakisatoshi commented 2 years ago

Thank you for your quick reply!

> DataLoader with multiple threads uses eachobsparallel, which does not guarantee a deterministic ordering.

OK. For me, reproducibility of experiments is important when evaluating performance in terms of precision, accuracy, etc.

I will also check out MLUtils.jl.

> I am thinking to add an optional wrapper that reorders the batches, at the cost of some performance likely.

Great! Let me know when you are done.

lorenzoh commented 2 years ago

I made an issue that you can subscribe to :) https://github.com/JuliaML/MLUtils.jl/issues/68