Multithreaded column read from GCS bug

From the email thread:

Been debugging this for a few hours. My experiments suggest that there is a race condition in the S3 API. When reading multiple metadata files from S3 in parallel in separate threads/clients, a read of a particular file will nondeterministically return the bytes of a different file being read at the same time. Literally, executing read(LENGTH); seek(0); read(LENGTH) can return different bytes for the two reads (also nondeterministically). I'm at a complete loss as to how this is possible. This seems to happen only with metadata files (small, 48 bytes), not with data files. My guess is that the bug is related to reading sufficiently small files. We've never seen this bug before because we've never run jobs with as few outputs per file as Haotian's (I/O packet size = 4).
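
The read/seek/read check described above can be sketched in a few lines. This is a minimal illustration, not the code from the thread: `read_twice` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for the remote file handle, so it shows the shape of the check and does not reproduce the bug.

```python
import io


def read_twice(f, length):
    """Read `length` bytes, rewind, and read again (hypothetical helper).

    Assumes the handle is positioned at the start of the file. Returns
    both reads so the caller can compare them; on a well-behaved file
    object the two results are identical.
    """
    first = f.read(length)
    f.seek(0)
    second = f.read(length)
    return first, second


# Stand-in for a 48-byte metadata file. The real handle would come from
# whatever storage client is in use.
handle = io.BytesIO(bytes(range(48)))
first, second = read_twice(handle, 48)
consistent = first == second
```

Against the real store, running this check from several threads at once and logging any case where `consistent` is false would be one way to capture the mismatch when it occurs.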

Just pushed a workaround that uses multiprocessing instead of multithreading (c275d03cf42455390af7e190a63e16019701ed31). Still need to figure out what the core issue is.
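
As a rough illustration of that kind of workaround, the sketch below parameterizes a parallel read over the executor type, so threads can be swapped for processes without touching the read logic. It is a sketch under stated assumptions, not the change in the commit: `read_file` and `read_all` are hypothetical names, and local temporary files stand in for remote objects.

```python
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def read_file(path):
    """Read one file fully (hypothetical stand-in for a remote read)."""
    with open(path, "rb") as f:
        return f.read()


def read_all(paths, executor_cls=ThreadPoolExecutor, workers=4):
    """Read every path in parallel and return {path: bytes}.

    Passing ProcessPoolExecutor runs the reads in separate processes,
    so they do not share in-process state with each other.
    """
    with executor_cls(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(read_file, paths)))


# Demo with two small local files, using threads (the default).
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(2):
        p = os.path.join(tmp, f"meta_{i}.bin")
        with open(p, "wb") as f:
            f.write(bytes([i]) * 48)
        paths.append(p)
    result = read_all(paths)
    contents = [result[p] for p in paths]
```

Calling `read_all(paths, executor_cls=ProcessPoolExecutor)` switches to processes. On platforms that start workers by spawning a fresh interpreter, that call should sit under an `if __name__ == "__main__":` guard.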

scanner-research/scanner, issue #236