scanner-research / scanner

Efficient video analysis at scale
https://scanner-research.github.io/
Apache License 2.0

Multithreaded column read from GCS bug #236

Open willcrichton opened 5 years ago

willcrichton commented 5 years ago

Working on @Haotianz94's transcript aligner. Empirically, when column reads within a file are multithreaded, I get nondeterministic failures where incorrect bytes are read. Sometimes the buffer isn't long enough, and sometimes the bytes are corrupted (pickle raises an error).

willcrichton commented 5 years ago

From the email thread:

Been debugging this for a few hours. My experiments suggest that there is a race condition in the S3 API. What's happening is that when reading multiple metadata files from S3 in parallel in separate threads/clients, an attempt to read a particular file will nondeterministically return the bytes of a different file being read at the same time. Like literally, executing read(LENGTH); seek(0); read(LENGTH) will return different bytes (also nondeterministically). I'm at a complete loss as to how this is possible. This seems to happen only with metadata files (small, 48 bytes), not with data files. My guess is that the bug is related to reading sufficiently small files. We've never seen this bug before because we've never run jobs with as few outputs per file as Haotian's (I/O packet size = 4).
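For anyone trying to reproduce this, the read(LENGTH); seek(0); read(LENGTH) check can be sketched as a small threaded harness. This is only an illustration, not Scanner's storage code: `read_twice`, `find_unstable_reads`, and the `openers` list are hypothetical names, and in-memory `io.BytesIO` buffers stand in for the remote metadata files. To test a real backend, each opener would be replaced with a callable that returns a fresh file handle (and client) for that backend.

```python
import io
from concurrent.futures import ThreadPoolExecutor


def read_twice(open_file, length):
    """Read `length` bytes, rewind, and read again; return both results."""
    with open_file() as f:
        first = f.read(length)
        f.seek(0)
        second = f.read(length)
    return first, second


def find_unstable_reads(openers, length, workers=8):
    """Return indices of files whose two reads disagree or come back short."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda o: read_twice(o, length), openers))
    return [i for i, (a, b) in enumerate(results)
            if a != b or len(a) != length]


# Stand-in for the remote metadata files: 16 distinct 48-byte blobs.
# Each opener returns a fresh handle, mimicking one client per thread.
blobs = [bytes([i]) * 48 for i in range(16)]
openers = [lambda b=b: io.BytesIO(b) for b in blobs]
unstable = find_unstable_reads(openers, 48)
```

With a correct backend, `unstable` stays empty no matter how many times the harness runs. Any index that shows up points at a file whose bytes changed between two reads or whose read came back short.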

Just pushed a workaround that uses multiprocessing instead of multithreading (c275d03cf42455390af7e190a63e16019701ed31). Still need to figure out what the core issue is.
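The general shape of a process-based workaround might look like the sketch below. It is not the code in the commit above: `read_file`, `read_all`, and the `pool_cls` parameter are hypothetical, and local temporary files stand in for the remote metadata files. Each worker opens, reads, and closes its own file, so no handle or client state is shared between concurrent reads.

```python
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor


def read_file(path):
    """Open, fully read, and close one file inside the worker."""
    with open(path, "rb") as f:
        return f.read()


def read_all(paths, pool_cls=ProcessPoolExecutor, workers=4):
    """Read every path in parallel; results follow the order of `paths`."""
    with pool_cls(max_workers=workers) as pool:
        return list(pool.map(read_file, paths))


if __name__ == "__main__":
    # Local temp files stand in for the remote 48-byte metadata files.
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(8):
            p = os.path.join(d, f"meta_{i}.bin")
            with open(p, "wb") as f:
                f.write(bytes([i]) * 48)
            paths.append(p)
        assert read_all(paths) == [bytes([i]) * 48 for i in range(8)]
```

Keeping the pool class as a parameter makes it easy to pass `ThreadPoolExecutor` instead and compare the two modes on the same inputs, which may help narrow down whether the corruption depends on threads sharing a process.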