ARC is much too slow for very large sets of reads

ibest / ARC

Assembly by Reduced Complexity (ARC)

Apache License 2.0


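When very large datasets are used (a full HiSeq lane, for example), ARC is extremely slow at splitting reads. For example, the log entry below shows 3 reads for a single target taking about 3140 seconds (roughly 52 minutes) to split: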

[2013-06-20 11:35:09,603 INFO 21595] Split 3 reads for sample Sample1 target HWI-ST522_0060:7:2108:11503:138410#0/1_Cluster-3254_M072 in 3139.90650702 seconds

It might be necessary to rethink the current indexing scheme, perhaps going back to a simpler approach where the splitter runs through the whole reads file once, pulling out every read that was hit and writing it either to memory or to a temporary folder on disk. The drawback is that no assembly could be kicked off until all reads had been processed.
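A minimal sketch of the single-pass idea, not ARC's actual splitter: it assumes plain 4-line FASTQ records and a pre-built mapping from read ID to the targets that read hit. The names `split_reads_single_pass` and `read_to_targets` are hypothetical and only used for illustration.

```python
import io
from collections import defaultdict


def split_reads_single_pass(fastq_handle, read_to_targets):
    """Stream a FASTQ file once and group every hit read by target.

    fastq_handle    -- open text handle with 4-line FASTQ records (assumption)
    read_to_targets -- dict: read ID -> iterable of target names (hypothetical)
    """
    reads_by_target = defaultdict(list)
    while True:
        header = fastq_handle.readline()
        if not header.strip():
            break  # end of file (or trailing blank line)
        seq = fastq_handle.readline()
        plus = fastq_handle.readline()
        qual = fastq_handle.readline()
        # The read ID is the first whitespace-delimited token after '@'.
        read_id = header[1:].split()[0]
        # Reads that hit nothing are skipped without any index lookup on disk.
        for target in read_to_targets.get(read_id, ()):
            reads_by_target[target].append(header + seq + plus + qual)
    return dict(reads_by_target)


# Usage: three reads, two of which were hit by the mapper.
fastq = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nGGGG\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
)
hits = {"r1": ["t1"], "r3": ["t1", "t2"]}
per_target = split_reads_single_pass(fastq, hits)
```

Each read costs one dictionary lookup, so the total work scales with the size of the reads file rather than with the number of targets. For a full lane, the per-target lists would probably need to be written to temporary files instead of being held in memory.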

Alternatively, we could drop support for BLAT and pull the reads directly from the SAM file, making it unnecessary to go back to the original reads files at all. This would also require the entire SAM file to be parsed before any assemblies could be started.
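A rough sketch of pulling reads straight from SAM, under several assumptions: the SAM file carries full `SEQ` and `QUAL` fields for every mapped record, the reference name (`RNAME`) is what ARC treats as the target, and mate information (flags 0x40/0x80) is ignored. The function name `reads_from_sam` is hypothetical.

```python
from collections import defaultdict

# Translation table for reverse-complementing reads that mapped to the reverse strand.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reads_from_sam(sam_lines):
    """Group mapped reads by reference name, returning FASTQ-formatted records."""
    reads_by_target = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        seq, qual = fields[9], fields[10]
        # Skip unmapped reads (flag 0x4) and records without a stored sequence.
        if flag & 0x4 or rname == "*" or seq == "*":
            continue
        # Flag 0x10: SEQ is stored reverse-complemented, so restore the
        # original read orientation before handing it to an assembler.
        if flag & 0x10:
            seq = seq.translate(_COMPLEMENT)[::-1]
            qual = qual[::-1]
        reads_by_target[rname].append("@%s\n%s\n+\n%s\n" % (qname, seq, qual))
    return dict(reads_by_target)


# Usage: one forward hit, one reverse-strand hit, one unmapped read.
sam = [
    "@HD\tVN:1.0\n",
    "r1\t0\ttA\t1\t42\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t16\ttA\t5\t42\t4M\t*\t0\t0\tAACC\tABCD\n",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n",
]
per_target = reads_from_sam(sam)
```

Unmapped mates are dropped here, so paired-end data would need extra handling if both reads of a pair should go to the assembler whenever either one maps.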


ARC is much too slow for very large sets of reads #14