jamestwebber closed this issue 5 years ago.
Are these localfile:// interns? We already parallelize checksumming on S3.
The localfile intern implementation is here (https://github.com/grailbio/reflow/blob/62cdc6040623556d22e69961f730318550042d73/local/localfile.go#L80), which calls Executor.install (https://github.com/grailbio/reflow/blob/62cdc6040623556d22e69961f730318550042d73/local/executor.go#L514), which does parallelize checksumming.
The default limit is 60 (https://github.com/grailbio/reflow/blob/62cdc6040623556d22e69961f730318550042d73/local/executor.go#L41).
Probably we should make this relate to GOMAXPROCS and also make it configurable.
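For illustration, here is a minimal sketch of what a GOMAXPROCS-derived, configurable checksum limit could look like; the flag name and helper are hypothetical and not reflow's actual code:

```go
package main

import (
	"crypto/sha256"
	"flag"
	"fmt"
	"io"
	"os"
	"runtime"

	"golang.org/x/sync/errgroup"
)

// Hypothetical flag: default the concurrency limit to GOMAXPROCS instead of
// a fixed constant like 60, and let the user override it.
var maxChecksumProcs = flag.Int("checksumprocs", runtime.GOMAXPROCS(0),
	"maximum number of concurrent checksum workers")

// checksumAll hashes each file with at most *maxChecksumProcs workers in flight.
func checksumAll(paths []string) (map[string][sha256.Size]byte, error) {
	results := make([][sha256.Size]byte, len(paths))
	var g errgroup.Group
	g.SetLimit(*maxChecksumProcs)
	for i, path := range paths {
		i, path := i, path // capture loop variables (pre-Go 1.22)
		g.Go(func() error {
			f, err := os.Open(path)
			if err != nil {
				return err
			}
			defer f.Close()
			h := sha256.New()
			if _, err := io.Copy(h, f); err != nil {
				return err
			}
			copy(results[i][:], h.Sum(nil))
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	sums := make(map[string][sha256.Size]byte, len(paths))
	for i, path := range paths {
		sums[path] = results[i]
	}
	return sums, nil
}

func main() {
	flag.Parse()
	sums, err := checksumAll(flag.Args())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for path, sum := range sums {
		fmt.Printf("%x  %s\n", sum, path)
	}
}
```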
Huh. I did see a lot of processes running but CPU usage was low. Maybe there is a different bottleneck here.
Looking at htop while this is running (on a NextSeq run), it looks like an IO issue. It's checksumming in many threads but none of the CPUs are anywhere near 100%, probably because the directory contains many tiny files and so we're unable to keep them busy. I'm not sure if there's any optimization in reflow that could make this more efficient.
Are the data stored on SSDs or magnetic disks?
We could add a mode to turn off caching for an exec in exchange for not needing to checksum input...
It's on spinning disks. Turning off the checksum/cache seems like kind of a bummer, but I guess it would make this work quickly (well, except for actually running the demux) and allow us to keep things in reflow. The other option is to not use it for this purpose. I'm curious if you are using reflow for the demultiplexing step?
For most of our samples, no: we have a different system that is responsible for data movement between the lab and our cloud environment; demultiplexed fastqs are delivered from this system. That being said:
Hm okay. We have some SSD storage but up until now it didn't seem worth storing sequencing data on it. But it seems likely that reading every file individually and checksumming it would be really slow coming off the disk.
One useful optimization would be if we masked out some of the subdirectories and didn't include them as part of the input. bcl2fastq requires the folder structure but not all of the contents, and some of those subdirectories are full of a huge number of tiny files.
In particular I'm thinking of the Thumbnail_Images directory, which is ~175,000 files but only 12G. The vast majority of the run is in the Data folder, which is huge but only ~8200 files that should stream from the disk efficiently. We can't delete the image folder, but if we could pass the run path into reflow while omitting this one directory, that might speed things up a ton.
One possible solution that I'm trying is to just symlink the relevant parts of the run to a tmp location and pass that into reflow, so it doesn't see the massive irrelevant directories. I'm seeing how it runs now; hopefully it doesn't take an unreasonable amount of time (but it still needs to checksum ~800 GB of data).
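For reference, a rough sketch of that symlink staging approach, assuming the layout described above (everything at the top level symlinked except Thumbnail_Images). The paths here are hypothetical, and whether the localfile intern follows symlinks is worth verifying:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	runDir := "/seq/runs/novaseq-run"     // hypothetical source run folder
	stageDir := "/tmp/novaseq-run-staged" // hypothetical staging dir passed to reflow
	// Skip the huge, unneeded directories; symlink everything else.
	skip := map[string]bool{"Thumbnail_Images": true}

	if err := os.MkdirAll(stageDir, 0o755); err != nil {
		log.Fatal(err)
	}
	entries, err := os.ReadDir(runDir)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		if skip[e.Name()] {
			continue
		}
		src := filepath.Join(runDir, e.Name())
		dst := filepath.Join(stageDir, e.Name())
		if err := os.Symlink(src, dst); err != nil {
			log.Fatal(err)
		}
	}
	fmt.Println("staged run at", stageDir)
}
```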
Hm, unfortunately this is still taking too long to be usable (due to the experimental setup we'll need to demux a single run multiple times).
It's puzzling to me why this is taking so long, though. We might have some kind of hardware configuration problem.
Anyway, it seems like this issue isn't a problem with reflow so you can close this.
FWIW, there have recently been a number of changes and optimizations to reflow's S3 support.
Most likely the issue here is that, due to low computational requirements, a small instance type was chosen, which also has poor networking throughput.
We've recently made some changes in the cluster scheduler so that instances that support greater EBS throughput should be chosen by default.
We're running reflow on local hardware to demultiplex sequencing runs and push the results to S3. In testing it has been working great for MiSeq and NextSeq runs, but runs very slowly on NovaSeq runs. Looking into the problem, I believe this is just because the directory of NovaSeq data is 20x the size of a NextSeq run and reflow calculates sha256 hashes of all of the files in the directory.
One solution would be to tell it not to hash these files (or do a less reliable hash of filenames/sizes or something), but that's not ideal. I think we would be fine if we could just multithread the hash process (the local machine has plenty of capacity to do so).
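As an aside, a minimal sketch of the "less reliable hash" idea mentioned above (digesting path, size, and modification time instead of file contents); this is purely illustrative, not something reflow offers:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
)

// cheapDigest identifies a file by its path, size, and modification time
// rather than its contents, so nothing is read from disk. The trade-off is
// that in-place content changes preserving size and mtime go undetected.
func cheapDigest(path string) ([sha256.Size]byte, error) {
	var d [sha256.Size]byte
	info, err := os.Stat(path)
	if err != nil {
		return d, err
	}
	h := sha256.New()
	fmt.Fprintf(h, "%s\x00%d\x00%d", path, info.Size(), info.ModTime().UnixNano())
	copy(d[:], h.Sum(nil))
	return d, nil
}

func main() {
	for _, path := range os.Args[1:] {
		d, err := cheapDigest(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("%x  %s\n", d, path)
	}
}
```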
If you could point us to where in the code this happens, and it isn't too complex to do, we will try to put together a PR for this. I think it would be very useful for reflow on local hardware.