ksahlin / IsoCon

Derives consensus sequences from a set of long noisy reads by clustering and error correction.
GNU General Public License v3.0

Assembling fragmented amplicons #8

Closed: edgardomortiz closed this issue 4 years ago

edgardomortiz commented 4 years ago

Dear Kristoffer,

We amplified two markers (the nuclear ribosomal cistron, ~6000 bp, and a low-copy nuclear gene, ~2500 bp) for 12 samples. Then we combined both markers for each sample and barcoded them for sequencing using Nanopore's RPB004 kit, which fragments the amplicons to attach the adapters (we will eventually switch to a kit that sequences the full length of the amplicons).

After demultiplexing and removing chimeric reads, I tested IsoCon on one of the samples, but it recovered 600 candidates (for two markers). I think this is because our data violates the assumption that "reads are cut at relatively precise positions (i.e., at the start and stop primer sites)", since the fragmentation breaks the amplicon apart. However, there are reads with lengths close to the full length of the amplicon, and the shorter ones fall within that length, as in the attached screenshot where I mapped the Nanopore reads to one of the longest reads.

I wonder if there is any parameter I can alter in IsoCon to use sub-length reads in the consensus, or any software that can accomplish what I need. For example, I tried isONclust, but I read that it doesn't reverse-complement the reads during clustering, so it wouldn't be completely ideal for this application. I also tried several de novo assemblers, but the coverage is too high and uneven, and the targets are too short, for these to work correctly.

Thanks in advance,

Edgardo

ksahlin commented 4 years ago

Hi Edgardo,

Some explanations as to why so many (possibly redundant) candidates are generated:

  1. IsoCon is currently only tuned for Iso-Seq error profiles.
  2. The imprecise ends (as you mentioned) cause partial sequences to be considered separate candidates.
  3. IsoCon does not handle reverse complements (you may verify that many of the candidates are reverse complements of each other).

To alleviate issue 1, you may set --max_phred_q_trusted 20 --p_value_threshold 0.00001, which is more suitable for ONT data (at the cost of being less able to detect subtle differences, such as amplicons differing by only 1-2 bp, depending on their depth of course). To alleviate issue 2, you could try increasing --ignore_ends_len to, say, 30 (the default is 15); this will merge candidates that are identical but whose ends differ by up to 30 bp. There is no parameter to "fix" issue 3, but maybe this is easy to filter after IsoCon is run? Another strategy is to identify the primers beforehand and re-orient the reverse complements (this will also speed up runtime, in case that is an issue for your dataset).

To summarize, running IsoCon with --max_phred_q_trusted 20 --p_value_threshold 0.00001 --ignore_ends_len 30 for this data might be worth a try.
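A minimal sketch of such a run (the file names are placeholders, and the pipeline subcommand with -fl_reads/-outfolder follows IsoCon's README as far as I recall, so adapt to your setup):

# Sketch only: sample.fastq and isocon_out/ are placeholder paths.
# --max_phred_q_trusted 20 / --p_value_threshold 0.00001 relax the Iso-Seq-tuned
# error model for ONT reads; --ignore_ends_len 30 merges candidates that are
# identical apart from up to 30 bp at the ends.
IsoCon pipeline -fl_reads sample.fastq -outfolder isocon_out/ \
    --max_phred_q_trusted 20 \
    --p_value_threshold 0.00001 \
    --ignore_ends_len 30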

However, let me point you to a possible alternative solution: running isONclust and then isONcorrect with the --racon --T 0.05 options. This should do a decent job of correcting your reads, since isONcorrect is designed for transcript data with highly variable abundance. A script showing how to run isONclust + isONcorrect is found here; you can ignore step 1 (pychopper). We have furthermore verified that isONcorrect infers and preserves variation relatively well. However, the corrected reads (output by isONcorrect) are not merged into a consensus, but you can use them as input to IsoCon (with default parameters). This could be worth a try.
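A rough sketch of that pipeline with placeholder paths (the write_fastq step and the run_isoncorrect wrapper are as I recall them from the isONcorrect README, and whether --racon --T 0.05 go to the wrapper or to isONcorrect directly may differ by version, so check against the linked script):

# 1) cluster the reads
isONclust --ont --t 6 --fastq sample.fastq --outfolder clustering/
# 2) write one fastq file per cluster
isONclust write_fastq --N 1 --clusters clustering/final_clusters.tsv \
    --fastq sample.fastq --outfolder clustering/fastq_files/
# 3) error-correct each cluster; the corrected reads in correction/
#    can then be given to IsoCon with default parameters
run_isoncorrect --t 6 --fastq_folder clustering/fastq_files/ \
    --outfolder correction/ --racon --T 0.05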

edgardomortiz commented 4 years ago

Thanks a lot for the advice; I will try both. Just another question about isONclust: what does the parameter --rc_identity_threshold do?

edgardomortiz commented 4 years ago

I tried the isONcorrect alternative, but it produced errors (see https://github.com/ksahlin/isONcorrect/issues/4). I will try changing the IsoCon parameters now.

ksahlin commented 4 years ago

Ok I replied to that report.

As for --rc_identity_threshold, you can ignore it. It is not invoked without --consensus (an experimental parameter). It serves a different purpose and will not be good for your data: its purpose is to merge a cluster into one consensus, and since your amplicons are similar(?), they would end up in the same cluster.

edgardomortiz commented 4 years ago

The number of candidates increased to ~670 (from 600) after adding --max_phred_q_trusted 20 --p_value_threshold 0.00001 --ignore_ends_len 30

I think I will take the clusters with more than 30 reads from the isONclust results (they correspond better to the number of markers), and attempt separate de novo assemblies for each one.

Thanks for your help and for developing these tools!

ksahlin commented 4 years ago

Oh okay! What is the rough percentage identity between the amplicons?

If it's low enough, the isONclust approach may work. But keep in mind that high-identity amplicons will end up in the same cluster.

If you also expect one consensus per cluster, you can run isONclust with --consensus --abundance_ratio X --rc_identity_threshold Y --medaka to do exactly this cluster consensus and polishing (you will need to install medaka separately, though). Here, X is the fraction of reads (default 0.1) a cluster must contain for a consensus to be produced (30/total_reads in your case), and Y is the lowest identity at which reverse-complement consensuses are merged.
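For example (purely illustrative numbers): if a sample contains about 20,000 reads and you only want a consensus for clusters with at least 30 reads, X would be roughly 30 / 20000 = 0.0015.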

edgardomortiz commented 4 years ago

I ran:

for file in *.fastq; do
    isONclust --t 6 --ont --fastq "$file" \
        --consensus --medaka \
        --abundance_ratio 0.002 \
        --rc_identity_threshold 0.9 \
        --outfolder ../04_isonclust/"${file//.fastq/_clusters}" \
        &> ../04_isonclust/"${file//.fastq/.isonclust.log.txt}"
done

And I got this error at the end of the log:

STARTING TO CREATE CLUSTER CONSENSUS

Temporary workdirektory for polishing: /var/folders/6y/qm6480lx37q3qq3yn2gdhgx80000gn/T/tmp3fj8wdm3
creating center of 11258 sequences.
creating center of 5547 sequences.
creating center of 5212 sequences.
creating center of 124 sequences.
Iteration   NrClusters  MinDbSize   CurrReadId  ClusterSizes
4 centers formed
0.33048884554431307
0.6250864652985935
0.4712543554006969
0.2525223816967458
0.4477147304372435
0.2631099781500364
4 consensus formed.
Currently not implemented
removing temporary workdir
Traceback (most recent call last):
  File "/Users/pbiodiv/miniconda3/envs/isontools/bin/isONclust", line 263, in <module>
    main(args)
  File "/Users/pbiodiv/miniconda3/envs/isontools/bin/isONclust", line 139, in main
    print("Saving references in:", f.name)
UnboundLocalError: local variable 'f' referenced before assignment

I should remove --medaka for now, right?

ksahlin commented 4 years ago

OK, sure! Then your consensuses will be the ones created by spoa, which may be good enough.

I also want to point you to NGSpeciesID, a spin-off of isONclust that we developed and that does exactly what you are running now (it wraps isONclust, spoa, and medaka/racon), but it is more mature software regarding the consensus part, and installing it automatically installs medaka/racon. Sorry for not mentioning it earlier; it slipped my mind.

edgardomortiz commented 4 years ago

I saw the repository; I will try that as well, thanks!

ksahlin commented 4 years ago

Also, note that even though the previous run failed, you can still view all the spoa consensuses created in the output folder. This may give you a hint as to whether it worked okay, so in that sense you already have the output even though the program crashed.

edgardomortiz commented 4 years ago

Unfortunately, the --medaka option skips the else branch where the consensus_references.fasta file is created: https://github.com/ksahlin/isONclust/blob/5b969b6d8cd42a8c4827942a50ef3f3feefec956/isONclust#L131-L141

ksahlin commented 4 years ago

Ah shoot, sorry for the misinformation! In that case it is better and easier to simply proceed with NGSpeciesID, as it is identical to what you are currently running.

Since NGSpeciesID is an extension of isONclust, it shares all of its parameters, and the invocation should be roughly equivalent to the script you pasted above (see the sketch below).
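A sketch of such a call, mirroring your earlier loop (placeholder output paths; the flag names assume NGSpeciesID accepts the same clustering/consensus parameters as isONclust, so verify them with NGSpeciesID --help):

# Sketch only: ../05_ngspeciesid/ is a placeholder output location.
for file in *.fastq; do
    NGSpeciesID --ont --t 6 --fastq "$file" \
        --consensus --medaka \
        --abundance_ratio 0.002 \
        --rc_identity_threshold 0.9 \
        --outfolder ../05_ngspeciesid/"${file//.fastq/_clusters}" \
        &> ../05_ngspeciesid/"${file//.fastq/.ngspeciesid.log.txt}"
done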

edgardomortiz commented 4 years ago

Thanks a lot Kristoffer, the NGSpeciesID approach worked very well! It should work even better when we sequence the amplicons without fragmenting them.

ksahlin commented 4 years ago

Excellent, thanks for confirming.