ksahlin / IsoCon

Derives consensus sequences from a set of long noisy reads by clustering and error correction.
GNU General Public License v3.0

Runtime on my dataset #3

Open ksahlin opened 6 years ago

ksahlin commented 6 years ago

IsoCon with default parameters is fast if

  1. The CCS flnc reads are cut at relatively precise positions (i.e., at the start and stop primer sites).
  2. The CCS flnc reads, in general, do not have many errors in them. (The number of errors depends on both the number of passes and the transcript length. As a rough guide, "a lot of errors" means the majority of CCS reads have over 50 errors.)

If 1 is not satisfied, the runtime suffers and, in addition, we cannot guarantee the quality of the algorithm's output. If 1 is satisfied but 2 is not, the algorithm is designed to give good-quality output; it might just take a long time. The algorithm can be sped up by setting --neighbor_search_depth to a low value, e.g., 500 or 1000 (the default is no limit); see below for why. Regarding "relatively precise positions" in 1: this means that most reads are cut within, say, 5-15 bases of the primer cut point. It does not matter if some of the reads are inaccurately cut.

What runtime can I expect?

For targeted datasets such as the ones on FMR1, IsoCon's runtime on a Mac laptop using 3 cores was about 2 hours for the 3-cell patient datasets and less than 30 minutes for the 1-cell control datasets. In our simulated data we generated 12,500 reads from transcripts (from DAZ) of length 6000bp with a median of 276 errors per read (corresponding to most reads having 1-3 passes; see Table 1, simulated error profile). The algorithm takes ~15-20h to run on 4 cores. Compare this with the other two datasets in that table: HSFY takes ~2h and TSPY takes ~5 minutes on 4 cores. So the error rate plays a key role.

Why does IsoCon scale badly when 1-2 are not met?

We use an exact approach to build the nearest neighbor graph (defined in the manuscript). We sort the sequences by length and place them in an array L indexed by i. For a given sequence L[i], we calculate the edit distance between L[i] and L[i +/- j], for j = 1, 2, 3, … . We keep track of the best edit distance encountered so far for L[i] (call this value a_i), and we stop computing edit distances once the sequence length difference between L[i] and L[i +/- j] is larger than a_i (the length difference is a lower bound on the edit distance). This guarantees that the nearest neighbor is found, i.e., that the exact NN-graph is built. If 1-2 are in general met, we do not have to explore many alignments: we find the best edit distance with small j, since nearest neighbours are located close to each other in L. The algorithm can be sped up by setting a limit on j with --neighbor_search_depth [low number, e.g., 500 or 1000], but then the guarantee that the nearest neighbours are found is lost, which may affect the quality of the output. We are exploring the possibility of using minimap2 for the alignments.
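A minimal sketch of this search strategy (this is not IsoCon's actual code; the function names and the simple DP edit distance are illustrative only, and `search_depth` stands in for the role of --neighbor_search_depth):

```python
def edit_distance(s, t):
    # Standard dynamic-programming Levenshtein distance (row-by-row).
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # (mis)match
        prev = cur
    return prev[-1]

def nearest_neighbors(seqs, search_depth=None):
    """Map each index in the length-sorted array L to
    (nearest-neighbor index, best edit distance).

    search_depth caps j, mimicking --neighbor_search_depth;
    None means an unbounded, exact search."""
    L = sorted(seqs, key=len)
    n = len(L)
    nn = {}
    for i in range(n):
        best, best_idx = float("inf"), None  # best = a_i in the text
        j = 1
        while True:
            if search_depth is not None and j > search_depth:
                break  # capped search: the exactness guarantee is lost
            candidates = [k for k in (i - j, i + j) if 0 <= k < n]
            if not candidates:
                break  # ran off both ends of L
            progressed = False
            for k in candidates:
                # Prune: length difference lower-bounds the edit distance,
                # so a candidate at least a_i longer/shorter cannot improve.
                if abs(len(L[k]) - len(L[i])) >= best:
                    continue
                progressed = True
                d = edit_distance(L[i], L[k])
                if d < best:
                    best, best_idx = d, k
            if not progressed:
                break  # lengths are monotone in L, so larger j cannot help
            j += 1
        nn[i] = (best_idx, best)
    return nn
```

With reads cut near the primer sites and low error rates, the nearest neighbour sits at small j and the length-difference bound triggers quickly; ragged cut points or high error rates inflate a_i, so the pruning condition fires late and many more alignments are computed per read.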