Closed edgardomortiz closed 4 years ago
Hi Edgardo,
Some explanations as to why so many (possibly redundant) candidates are generated:
To alleviate the issue of 1, you may set --max_phred_q_trusted 20 --p_value_threshold 0.00001
that is more suitable for ONT data (as the cost of detecting more subtle differences like amplicons differing with only 1-2bp -- depending on their depth of course). To alleviate the issue of 2, you could try to increase --ignore_ends_len
to say 30 (default is 15). this will merge anything identical buth with ends differing with 30bp. There is no parameter to "fix" 1, but maybe this is easy to filter after IsoCon is run? Another strategy is to identify the primers beforehand and re-orient the reverse complements (this will also speed runtime in case this is an issue for your dataset).
To summarize, running IsoCon with --max_phred_q_trusted 20 --p_value_threshold 0.00001 --ignore_ends_len 30
for this data might be worth a try.
However, let me point you to a possible alternative solution: running isONclust then isONcorrect with --racon --T 0.05
options, this should do a decent job to correct your reads as isONcorrect
is designed for highly variable abundance transcript data. A script showing how to run isONclust + isONcorrect
is found here, where you can ignore step 1 (pychopper). We furthermore tested that isONcorrect
it infers and preserves variations relatively well. However, the corrected reads (output by isONcorrect
) are not merged into consensus, but you can use them as input to IsoCon (with default parameters). This could be worth a try.
Thanks a lot for the advice, I will try both. Just another question about about isONclust, what does the parameter --rc_identity_threshold
do?
I tried the isONcorrect
alternative but it produced errors: https://github.com/ksahlin/isONcorrect/issues/4
I will try changing the parameters of IsoCon
now.
Ok I replied to that report.
As for the --rc_identity_threshold
you can ignore it. It will not be invoked without the --consensus
(experimental parameter). This is for a different purpose and will not be good for your data. The purpose of that parameter is to merge the cluster into one consensus and since your amplicons are similar(?) they will be in the same cluster.
The number of candidates increased to ~670 (from 600) after adding --max_phred_q_trusted 20 --p_value_threshold 0.00001 --ignore_ends_len 30
I think I will take the clusters with more than 30 reads from the isONclust
results (they correspond better to the number of markers), and attempt separate de novo assemblies for each one.
Thanks for your help and for developing these tools!
Oh okay! What is the rough percentage identity between the amplicons?
If it's low enough the isONclust
approach may work. But keep in mind that high identity ones will be included in the same cluster.
If you also expect one consensus per cluster, you can run isONclust
with --consensus --abundance_ratio X --rc_identity_threshold Y --medaka
to do exactly cluster consensus and polishing (you will need to install medaka independently though). X
would here be some ratio (default 0.1) of how many reads you require a cluster (30/total_reads in your case) to have in order to do a consensus and Y
is the lower identity to merge reverse complement consensus.
I ran:
for file in *.fastq; do
isONclust --t 6 --ont --fastq $file \
--consensus --medaka \
--abundance_ratio 0.002 \
--rc_identity_threshold 0.9 \
--outfolder ../04_isonclust/${file//.fastq/_clusters} \
"&>"../04_isonclust/${file//.fastq/.isonclust.log.txt} \
done
And I got this error at the end of the log:
STARTING TO CREATE CLUSTER CONSENSUS
Temporary workdirektory for polishing: /var/folders/6y/qm6480lx37q3qq3yn2gdhgx80000gn/T/tmp3fj8wdm3
creating center of 11258 sequences.
creating center of 5547 sequences.
creating center of 5212 sequences.
creating center of 124 sequences.
Iteration NrClusters MinDbSize CurrReadId ClusterSizes
4 centers formed
0.33048884554431307
0.6250864652985935
0.4712543554006969
0.2525223816967458
0.4477147304372435
0.2631099781500364
4 consensus formed.
Currently not implemented
removing temporary workdir
Traceback (most recent call last):
File "/Users/pbiodiv/miniconda3/envs/isontools/bin/isONclust", line 263, in <module>
main(args)
File "/Users/pbiodiv/miniconda3/envs/isontools/bin/isONclust", line 139, in main
print("Saving references in:", f.name)
UnboundLocalError: local variable 'f' referenced before assignment
I should remove --medaka
for now, right?
Ok sure! Then your consensus will be the ones created by spoa
which may be okay enough.
I also want to point you to a tool NGSpeciesID which is a spin-off of isONclust that we developed and does exactly what you are running now (it wraps isONclust
, spoa
, medaka/racon
), but it is more mature software regarding the consensus part and installing it automatically installs medaka/racon. Sorry for not mentioning it earlier, slipped my mind.
I saw the repository, I will try that as well, thanks!
Also, notice that even if the previous run failed, you can still view all the spoa
consensus created in the output folder. This may give you a hint whether it worked okay. So in that sense you already got the output even though the program failed.
Unfortunately the option --medaka
skips the else statement where the consensus_references.fasta
file is created:
https://github.com/ksahlin/isONclust/blob/5b969b6d8cd42a8c4827942a50ef3f3feefec956/isONclust#L131-L141
Ah shoot, sorry for the misinformation! In this case, it is better and easier simply to proceed with NGSpeciesID as it is identical to what you currently run.
Since NGSpecies is an extension of isONclust it shares all parameters, and the argument call should be roughly equivalent to the script you pasted above.
Thanks a lot Kristoffer, the NGSpeciesID
approach worked very well! It should work even better when we sequence the amplicons without fragmenting them.
Excellent, thanks for confirming.
Dear Kristoffer,
We amplified two markers (the nuclear ribosomal cistron ~6000 bp and a low copy nuclear gene ~2500 bp) for 12 samples. Then we combined both markers for each sample and barcoded them for sequencing using Nanopore's RPB004 kit, which fragments the amplicons to attach the adapters (we will switch eventually to a kit that sequence the full length of the amplicons).
After demultiplexing and removing chimeric reads, I was testing
IsoCon
on one of the samples but it recovered 600 candidates (for two markers), and I think is because our data violates the assumption of "reads are cut at relatively precise positions (i.e., at the start and stop primer sites)" because the fragmentation breaks down the amplicon. However, there are reads with lengths that are close to the full length of the amplicon and the shorter ones are included within that length, like in this picture where I mapped the nanopore reads to one of the longest reads:I wonder if there is any parameter I can alter in
IsoCon
to use sub-length reads in the consensus, or any software that can accomplish what I need. For example, I triedisONclust
but I read it doesn't reverse-complement the reads during clustering, so It wouldn't be completely ideal for this application. I also tried several de novo assemblers, but the coverage is to high and uneven and the targets are to short for these to work correctly.Thanks in advance,
Edgardo