Open nextgenusfs opened 4 years ago
Hi @nextgenusfs, thank you, I appreciate it!
Regarding the runtime bug: I actually fixed it today (see #6). So you can either remove IsoCon and reinstall it from scratch (version 0.3.3), or simply downgrade networkx to 2.3 as follows:
pip uninstall networkx
pip install networkx==2.3
As for the strategy, I will get back to you tomorrow (it's pretty late in my timezone). Here are some general comments on IsoCon for ONT sequencing:
--max_phred_q_trusted 20 (default is 43 for higher quality CCS reads) and --p_value_threshold 0.00001 (instead of the default 0.01). The latter could, however, also be applied as a post-filter by simply removing consensus sequences with a p-value larger than, e.g., 0.00001 (the p-value is printed to the accession of the consensus sequence).
Great, thanks. I downgraded networkx and I'll give it a retry right now with your suggested ONT parameters.
The data I'm using is published, but I've already oriented and trimmed primers so I'm only trying to feed IsoCon the "cleaned up" data in hopes of being able to pick cluster centroids.
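The post-filtering idea mentioned above (dropping consensus sequences whose p-value exceeds a cutoff) could be sketched roughly as follows. This is only an illustration, not part of IsoCon: the function names are hypothetical, and the accession layout used in the example (p-value as the last underscore-separated field) is an assumption, so check your own output headers and swap in a matching parser.

```python
from typing import Callable, Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (accession without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (accession, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def filter_by_p_value(
    records: Iterable[Record],
    get_p_value: Callable[[str], float],
    max_p: float = 1e-5,
) -> Iterator[Record]:
    """Keep only records whose p-value is at most max_p."""
    for header, seq in records:
        if get_p_value(header) <= max_p:
            yield header, seq


def p_value_from_last_field(header: str) -> float:
    # ASSUMPTION: the p-value is the last "_"-separated field of the
    # accession. This layout is hypothetical; adapt it to the real headers.
    return float(header.rsplit("_", 1)[-1])


# Toy input with made-up accessions and sequences.
fasta = [">cons_a_1e-12", "ACGT", ">cons_b_0.004", "GGCC", "TTAA"]
kept = list(filter_by_p_value(read_fasta(fasta), p_value_from_last_field))
```

With a real file, `read_fasta(open(path))` can be passed in place of the toy list.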
Also, regarding isONclust: you could increase the cluster thresholds, e.g., --mapped_threshold 0.9 --aligned_threshold 0.7 (and perhaps -k 12 -w 15 if runtime allows it). This will be more stringent (more clusters).
Also note that isONclust has an experimental --consensus parameter that performs what you described: spoa, then medaka, on each cluster. It may be convenient.
Hi @ksahlin. Thanks a ton for all of your tools for noisy reads. I'm looking for a solution for de novo clustering of ONT amplicon reads from environmental sequencing, i.e., fungal rRNA amplicons. The data I'm trying this on is from a mock community of mixed species. The region is the ITS-LSU region of rRNA in fungi; we typically define species with a 97% pident cutoff for this region. The data has been pre-processed by re-orienting reads into the same direction and finding/trimming both forward and reverse primer sequences.
I've tried isONclust and at first it seemed like it might be working great (and quite fast), but on further inspection it was a little too liberal in clustering the data that I have access to at the moment, effectively combining too many reads into the same "gene family". I ran a parameter search by varying k and w to see if I could get it to give me the proper results, but never found a set of parameters that could delineate the clusters properly. My goal is to find a method to identify the "centroid", as then it is relatively straightforward to use spoa and racon/medaka for error correction. I tried to clean up the clustering a little bit by invoking a "sub clustering": plotting read lengths (as fungal ITS-LSU sequences are variable in length) and then pulling out "peaks" from the lengths of reads. This seemed to work okay, but it is still not quite what I'm looking for.
Based on some of the other issues in your tool repositories, I then tried IsoCon, which you had indicated seemed to be a more general approach. IsoCon has a much, much longer runtime and then eventually crashed with the error below (note I ran it initially without --prefilter_candidates --min_candidate_support 2 and it crashed with the same error).
If you have any other suggestions on an appropriate workflow I'd be grateful to hear your opinions.
Thanks, Jon
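For readers who want to try the read-length "sub clustering" described above, here is a minimal sketch of one way to do it: bin the read lengths, take local maxima of the histogram as peaks, and assign each read to the nearest peak. It is not the code used in this thread; the function names, bin width, minimum count, and distance cutoff are all hypothetical choices for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence


def length_histogram(lengths: Sequence[int], bin_width: int = 10) -> Counter:
    """Count reads per length bin; each bin is keyed by its lower edge."""
    return Counter((n // bin_width) * bin_width for n in lengths)


def find_peaks(hist: Counter, bin_width: int = 10, min_count: int = 5) -> List[int]:
    """Return bins that are local maxima holding at least min_count reads."""
    peaks = []
    for b, count in sorted(hist.items()):
        left = hist.get(b - bin_width, 0)
        right = hist.get(b + bin_width, 0)
        # Strictly above the left neighbour so a plateau yields one peak.
        if count >= min_count and count > left and count >= right:
            peaks.append(b)
    return peaks


def assign_to_peaks(
    lengths: Sequence[int],
    peaks: Sequence[int],
    bin_width: int = 10,
    max_dist: int = 30,
) -> Dict[int, List[int]]:
    """Map each peak to the indices of reads within max_dist of its centre."""
    groups: Dict[int, List[int]] = {p: [] for p in peaks}
    if not peaks:
        return groups
    for i, n in enumerate(lengths):
        nearest = min(peaks, key=lambda p: abs(n - (p + bin_width / 2)))
        if abs(n - (nearest + bin_width / 2)) <= max_dist:
            groups[nearest].append(i)
    return groups


# Toy read lengths (made up): two length modes plus one outlier.
lengths = [1500] * 6 + [1503] * 4 + [1490] * 2 + [1800] * 8 + [1812] * 3 + [2500]
hist = length_histogram(lengths)
peaks = find_peaks(hist)
groups = assign_to_peaks(lengths, peaks)
```

Reads that fall far from every peak (the 2500 bp read in the toy data) stay unassigned, which avoids forcing outliers into a length group before running spoa and racon/medaka on each one.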