grailbio/reflow

A language and runtime for distributed, incremental data processing in the cloud
Apache License 2.0

interning a large directory is slow and single-threaded #64

Closed jamestwebber closed 5 years ago

jamestwebber commented 6 years ago

We're running reflow on local hardware to demultiplex sequencing runs and push the results to S3. In testing it has been working great for MiSeq and NextSeq runs, but runs very slowly on NovaSeq runs. Looking into the problem, I believe this is just because the directory of NovaSeq data is 20x the size of a NextSeq run and reflow calculates sha256 hashes of all of the files in the directory.

One solution would be to tell it not to hash these files (or do a less reliable hash of filenames/sizes or something), but that's not ideal. I think we would be fine if we could just multithread the hash process (the local machine has plenty of capacity to do so).

If you could point us to where in the code this happens, and it isn't too complex to do, we will try to put together a PR for this. I think it would be very useful for reflow on local hardware.

mariusae commented 6 years ago

Are these localfile:// interns? We already parallelize checksumming on S3.

mariusae commented 6 years ago

The localfile intern implementation is here (https://github.com/grailbio/reflow/blob/62cdc6040623556d22e69961f730318550042d73/local/localfile.go#L80), which calls Executor.install (https://github.com/grailbio/reflow/blob/62cdc6040623556d22e69961f730318550042d73/local/executor.go#L514), which does parallelize checksumming.

The default limit is 60 (https://github.com/grailbio/reflow/blob/62cdc6040623556d22e69961f730318550042d73/local/executor.go#L41).

Probably we should make this relate to GOMAXPROCS and also make it configurable.

jamestwebber commented 6 years ago

Huh. I did see a lot of processes running, but CPU usage was low. Maybe there is a different bottleneck here.


jamestwebber commented 6 years ago

Looking at htop while this is running (on a NextSeq run), it looks like an IO issue. It's checksumming in many threads, but none of the CPUs are anywhere near 100%, probably because the directory contains many tiny files and so we're unable to keep the CPUs busy. I'm not sure if there's any optimization in reflow that could make this more efficient.

mariusae commented 6 years ago

Are the data stored on SSDs or magnetic disks?

We could add a mode to turn off caching for an exec in exchange for not needing to checksum input...

jamestwebber commented 6 years ago

It's on spinning disks. Turning off the checksum/cache seems like kind of a bummer, but I guess it would make this work quickly (well, except for actually running the demux) and allow us to keep things in reflow. The other option is to not use it for this purpose. I'm curious if you are using reflow for the demultiplexing step?

mariusae commented 6 years ago

For most of our samples, no: we have a different system that is responsible for data movement between the lab and our cloud environment: demultiplexed fastqs are delivered from this system. That being said:

jamestwebber commented 6 years ago

Hm okay. We have some SSD storage but up until now it didn't seem worth storing sequencing data on it. But it seems likely that reading every file individually and checksumming it would be really slow coming off the disk.

One useful optimization would be if we masked out some of the subdirectories and didn't include them as part of the input. bcl2fastq requires the folder structure but not all of the contents, and some of those subdirectories are full of a huge number of tiny files.

In particular I'm thinking of the Thumbnail_Images directory which is ~175,000 files but only 12G. The vast majority of the run is in the Data folder, which is huge but only ~8200 files that should stream from the disk efficiently. We can't delete the image folder but if we could pass the run path into reflow while omitting this one directory that might speed things up a ton.

jamestwebber commented 6 years ago

One possible solution that I'm trying is to just symlink the relevant parts of the run to a tmp location and pass that into reflow, so it doesn't see the massive irrelevant directories. Seeing how it runs now; hopefully it doesn't take an unreasonable amount of time (but it still needs to checksum ~800 GB of data).

jamestwebber commented 6 years ago

Hm, unfortunately this is still taking too long to be usable (due to the experimental setup we'll need to demux a single run multiple times).

It's puzzling to me why this is taking so long, though. We might have some kind of hardware configuration problem.

Anyway, it seems like this issue isn't a problem with reflow so you can close this.

mariusae commented 5 years ago

FWIW, there have recently been a number of changes and optimizations to reflow's S3 support.

Most likely the issue here is that, due to low computational requirements, a small instance type was chosen, which also has poor networking throughput.

We've recently made some changes in the cluster scheduler so that instances that support greater EBS throughput should be chosen by default.