ibest / ARC

Assembly by Reduced Complexity (ARC)
Apache License 2.0

ARC is much too slow for very large sets of reads #14

Closed samhunter closed 11 years ago

samhunter commented 11 years ago

When very large datasets are used (such as a full HiSeq lane), ARC is incredibly slow at splitting reads, e.g.:

[2013-06-20 11:35:09,603 INFO 21595] Split 3 reads for sample Sample1 target HWI-ST522_0060:7:2108:11503:138410#0/1_Cluster-3254_M072 in 3139.90650702 seconds

It might be necessary to rethink the current indexing scheme, perhaps going back to a simpler approach where the splitter runs through the whole file once, pulling out every read that was hit and writing it either to memory or to a temporary folder on disk. The drawback is that no assemblies could be kicked off until all reads had been processed.
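A minimal sketch of that single-pass idea, not ARC's actual splitter. The function name `split_reads_single_pass` and the `read_to_targets` mapping are hypothetical, and the sketch assumes plain 4-line FASTQ records with read IDs taken from the header up to the first whitespace:

```python
from collections import defaultdict


def split_reads_single_pass(fastq_path, read_to_targets):
    """Scan a FASTQ file once and group the reads that were hit by target.

    read_to_targets (hypothetical structure): dict mapping a read ID to an
    iterable of target names that the read mapped to.
    Returns a dict of target name -> list of 4-line FASTQ records (strings).
    """
    per_target = defaultdict(list)
    with open(fastq_path) as handle:
        while True:
            header = handle.readline()
            if not header:          # end of file
                break
            if not header.strip():  # tolerate stray blank lines between records
                continue
            seq = handle.readline()
            plus = handle.readline()
            qual = handle.readline()
            # Read ID: header without the leading '@', up to the first whitespace.
            read_id = header[1:].split()[0]
            for target in read_to_targets.get(read_id, ()):
                per_target[target].append(header + seq + plus + qual)
    return dict(per_target)
```

Everything is kept in memory here for brevity. For a full lane it would probably make more sense to append each record to a per-target temporary file instead. Either way, no target is complete until the scan reaches the end of the file.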

Alternatively, we could drop support for BLAT and pull the reads directly from the SAM file, making it unnecessary to go back to the original read files at all. This would also require that the entire SAM file be parsed before any assemblies could be started.
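A rough sketch of what pulling reads straight from a SAM file could look like, not ARC's implementation. The function name `reads_by_target_from_sam` is hypothetical. It relies only on the standard SAM columns (QNAME, FLAG, RNAME, SEQ, QUAL) and on the flag bits 0x4 (unmapped) and 0x10 (reverse strand), and it ignores pairing information, so unmapped mates of mapped reads are not recovered:

```python
from collections import defaultdict

# Translation table used to reverse-complement reads aligned to the reverse strand.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reads_by_target_from_sam(sam_path):
    """Parse a SAM file once and return target name -> list of FASTQ records."""
    per_target = defaultdict(list)
    with open(sam_path) as handle:
        for line in handle:
            if line.startswith("@"):   # SAM header lines
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 11:       # not a complete alignment line
                continue
            qname, flag, rname = fields[0], int(fields[1]), fields[2]
            seq, qual = fields[9], fields[10]
            # Skip unmapped reads and records without stored sequence/quality.
            if flag & 0x4 or rname == "*" or seq == "*" or qual == "*":
                continue
            if flag & 0x10:
                # SEQ/QUAL are stored relative to the reference strand;
                # restore the original read orientation.
                seq = seq.translate(_COMPLEMENT)[::-1]
                qual = qual[::-1]
            per_target[rname].append("@%s\n%s\n+\n%s\n" % (qname, seq, qual))
    return dict(per_target)
```

As with the single-pass splitter, the whole SAM file has to be read before any one target's read set is known to be complete.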

samhunter commented 11 years ago

After extensive testing, it doesn't appear possible to do better than the original approach (at least not without some serious investment, and even then indications are that the improvements would be minimal). This has to do with the random-access nature of ARC's implementation. One thing that does appear to speed ARC up tremendously (by around a factor of 10) is keeping the reads and/or working directory on an SSD. Even better, if you have the RAM, is to run everything in /dev/shm or some other memory-backed RAM drive. This increases performance by around another 10x over putting the files on an SSD.
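One way to stage the read files in a RAM-backed directory before a run could be sketched as follows. The helper `stage_in_ram` is hypothetical and not part of ARC. It assumes a Linux system where /dev/shm is a tmpfs mount, and that the input files have distinct base names:

```python
import os
import shutil
import tempfile


def stage_in_ram(paths, ram_root="/dev/shm"):
    """Copy input files into a fresh directory under a RAM-backed mount.

    Returns (workdir, staged_paths). Raises OSError if the files would not
    fit in the space currently free under ram_root.
    """
    needed = sum(os.path.getsize(p) for p in paths)
    free = shutil.disk_usage(ram_root).free
    if needed > free:
        raise OSError("need %d bytes but only %d free in %s" % (needed, free, ram_root))
    # A unique directory avoids collisions with other runs sharing the mount.
    workdir = tempfile.mkdtemp(prefix="arc_", dir=ram_root)
    staged = [shutil.copy(p, workdir) for p in paths]
    return workdir, staged
```

The ARC configuration would then need to point at the staged copies, and the directory should be removed afterwards (for example with `shutil.rmtree(workdir)`), since anything left in /dev/shm keeps occupying memory.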