Open gabyrech opened 2 years ago
Hi @gabyrech,
Looks like you get several large clusters from the clustering step, so the --mapped_threshold and --aligned_threshold are not the problem (--min_shared 40 is pretty high though, but I trust that you know what you're doing :).
It is strange to me that, after the clustering, you get the outcome:
0 centers formed
0 consensus formed.
The --abundance_ratio (which decides the minimum cluster size to form a consensus from) is by default 0.1 of the total number of reads. Since the total number of reads seems to be 11,761 in your case, it means that only clusters larger than 0.1 * 11,761 ≈ 1,176 reads will be considered. You can try lowering this parameter. However, since you already have 5 clusters over this threshold, I'm not sure it solves the problem.
Also, your output seems to be missing this print statement:
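To make the arithmetic above concrete, here is a minimal sketch of how such a cutoff could be computed and applied. The function names and the example cluster sizes are hypothetical, and the use of integer truncation and `>=` is an assumption based on the print statement quoted in this reply, not a copy of the NGSpeciesID source.

```python
def abundance_cutoff(total_reads: int, abundance_ratio: float = 0.1) -> int:
    """Minimum cluster size (in reads) required to form a consensus.

    Assumption: the product is truncated to an integer.
    """
    return int(total_reads * abundance_ratio)


def clusters_passing(cluster_sizes, total_reads, abundance_ratio=0.1):
    """Return the cluster sizes that reach the cutoff (assumed to be >=)."""
    cutoff = abundance_cutoff(total_reads, abundance_ratio)
    return [size for size in cluster_sizes if size >= cutoff]


if __name__ == "__main__":
    total = 11761  # total number of reads reported in this thread
    # Made-up cluster sizes, for illustration only.
    sizes = [3000, 2500, 2000, 1500, 1200, 600, 300]
    for ratio in (0.1, 0.05):
        cutoff = abundance_cutoff(total, ratio)
        kept = clusters_passing(sizes, total, ratio)
        print(f"ratio={ratio}: cutoff={cutoff}, clusters kept={len(kept)}")
```

Lowering the ratio lowers the cutoff proportionally, so smaller clusters become eligible for consensus forming.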
print(
f"Forming draft consensus with abundance_cutoff >= {abundance_cutoff} "
f"({args.abundance_ratio * 100}% of {len(read_array)} reads)"
)
Which suggests you have an older version. While there have not been any major updates lately, it would perhaps be good to use the latest version (v0.1.3) and remove the output directory for a fresh rerun.
You could also try running with --t 1, as the overhead of multiple passes in multiprocessing mode possibly outweighs the speedup from parallelization for such a small dataset. I would say 100k reads would be the rough cutoff for considering multiprocessing (just guesstimating here).
Let me know how it goes, as my answer might not tackle the cause of the problem.
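The suggestions in this reply (fresh output directory, lower --abundance_ratio, --t 1) could be combined into a rerun along these lines. This is only a sketch: it reuses the flags and file names that appear in this thread, the helper name `build_rerun_command` and the value 0.02 are illustrative assumptions, and the command is only printed, not executed.

```python
import shlex


def build_rerun_command(fastq, outfolder, abundance_ratio=0.02, threads=1):
    """Assemble an NGSpeciesID command line from flags mentioned in this thread.

    The default abundance_ratio of 0.02 is an arbitrary example value.
    """
    return [
        "NGSpeciesID",
        "--consensus",
        "--t", str(threads),
        "--fastq", fastq,
        "--outfolder", outfolder,
        "--abundance_ratio", str(abundance_ratio),
    ]


if __name__ == "__main__":
    cmd = build_rerun_command("fastq_pass.fastq.fq", "out")
    # Remove the old output folder first (e.g. with shutil.rmtree) so that
    # the rerun starts from a clean state, then run the printed command.
    print(shlex.join(cmd))
```

Clustering parameters such as --min_shared or --mapped_threshold can be appended to the list in the same way if they should stay at non-default values.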
Hi @ksahlin ! Thanks for your quick and comprehensive response!
First let me explain a little bit more so maybe I can even ask you for advice :-).
About the data: These are targeted ONT sequences with A LOT of repetitive sequences (simple repeats, most of them). What I want to do is to obtain as many consensus sequences as possible, while avoiding clustering reads that actually don't come from the same genomic region (which is very hard because they share the repetitive sequence with other genomic regions). This is why I thought that using high --min_shared, --mapped_threshold and --aligned_threshold values would allow me to cluster ONLY those reads that are very, very similar, and therefore come from the same region. Do you think I am right?
About your suggestions: I will try changing --abundance_ratio and see what happens. Any suggestion is very welcome! Thanks! Gabriel
Okay, I see! What's the rough error rate of your reads?
We do have IsoCon for a similar purpose if the reads have a relatively low error rate (say <5%). IsoCon assumes that reads are not too different at the ends (i.e. roughly full-length over the targeted region).
We have also developed isONcorrect, which could correct the reads before clustering (e.g. with IsoCon). isONcorrect is sort of allele (SNP/indel) aware and is in general robust, but its main purpose is not to preserve SNPs (especially low-abundance mutations) but to reduce errors in reads.
We had one analysis with targeted gene families (but PacBio IsoSeq data) where we ran isONclust first and then ran IsoCon on each cluster individually. It worked for our data because isONclust first separated reads into different gene families, then IsoCon did a more fine-tuned separation of several alleles/transcripts from each gene.
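The two-stage idea (coarse clustering first, then a finer tool per cluster) needs the reads grouped by cluster between the two steps. A rough sketch of that grouping is below. It assumes the clustering result is available as a two-column, tab-separated table of cluster id and read id; that layout, the function names and the `min_reads` filter are assumptions for illustration, and the actual IsoCon invocation per cluster is left out.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def group_reads_by_cluster(tsv_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group read ids by cluster id from 'cluster_id<TAB>read_id' lines."""
    clusters = defaultdict(list)
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        cluster_id, read_id = row[0], row[1]
        clusters[cluster_id].append(read_id)
    return dict(clusters)


def clusters_worth_refining(clusters, min_reads=2):
    """Keep only clusters with enough reads for a second-stage tool."""
    return {cid: reads for cid, reads in clusters.items() if len(reads) >= min_reads}


if __name__ == "__main__":
    # Toy input; real input would be read from the clustering output file.
    lines = ["0\tread_a", "0\tread_b", "1\tread_c"]
    grouped = group_reads_by_cluster(lines)
    for cid, reads in clusters_worth_refining(grouped).items():
        # Here one would write the reads of this cluster to their own FASTQ
        # file and run the second-stage tool on that file.
        print(cid, reads)
```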
Oh! That sounds even better! My data consists of ONT reads generated with the Flongle and basecalled with Guppy5+, so the error rate should be <5%. I also have a corrected version of these reads (corrected with Canu).
I think your approach (isONclust + IsoCon) using corrected reads might also work in our case, since our data is something like sequences from different gene families, but with the added complexity that they are full of simple tandem repeats.
I will give it a try and let you know how it goes. Thanks so much for your advice! Gabriel
Hi there! I am trying to use NGSpeciesID with some custom parameters, but I am not sure if maybe I am doing something wrong... Here is my command:
NGSpeciesID --consensus --t 24 --fastq fastq_pass.fastq.fq --outfolder out --k 15 --w 50 --min_shared 40 --mapped_threshold 0.99 --aligned_threshold 0.80
This is what I get:
I suspected that maybe I am too strict with the alignment thresholds, so I tried changing them a little bit (not too much), but I keep getting the same error. I would really appreciate any clue on what is going on... Thanks! Gabriel