ksahlin / IsoCon

Derives consensus sequences from a set of long noisy reads by clustering and error correction.
GNU General Public License v3.0

Can IsoCon be used on nontargeted Iso-Seq data sets? #2

Open ksahlin opened 6 years ago

ksahlin commented 6 years ago

In general: No. IsoCon is designed for targeted sequencing where the CCS flnc reads are cut at relatively precise positions (i.e., at the start and stop primer sites). If this is not the case, it may affect both the runtime and the quality of the output.

However, if a nontargeted Iso-Seq dataset is processed so that the flnc reads from a particular gene are extracted (e.g., by using the pre-cluster module from ToFU, or by aligning CCS reads to a genome/transcriptome and separating them by region) and these reads are cut at the same start and end positions, IsoCon should work well. Keep in mind, though, that if reads are "cut", the quality values associated with the CCS reads must be cut in the same way, so that each base quality value stays with its base. This could be done relatively easily from the bam file.
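To illustrate the point about cutting quality values together with the bases, here is a minimal Python sketch. The function name `trim_record` and the half-open coordinate convention are assumptions made for this example; they are not part of IsoCon.

```python
def trim_record(seq, qual, start, end):
    """Trim a read and its quality string to the half-open interval [start, end).

    Both strings are sliced with the same coordinates, so every remaining
    base keeps the quality value it had before trimming.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality string differ in length")
    if not 0 <= start <= end <= len(seq):
        raise ValueError("trim coordinates fall outside the read")
    return seq[start:end], qual[start:end]


# Toy read: cut two bases from each end.
seq, qual = trim_record("ACGTACGT", "!!5555!!", 2, 6)
print(seq, qual)  # GTAC 5555
```

The same idea applies to records read from a bam file: whatever slice is taken from the sequence must also be taken from the quality array.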

wyim-pgl commented 5 years ago

Do you have an example of how to do it? Does blasting mean NCBI BLAST or read alignment? Thanks!

ksahlin commented 5 years ago

Aligning is the better expression; any aligner that aligns CCS reads to a genome or to transcripts should work. Thanks!

As for an implemented example I don't have any. But this simple procedure should work:

  1. Align CCS reads to a reference of choice (genomic or transcripts) using minimap2 with -a set to produce a SAM file. Minimap2 should have a parameter combination customized for aligning Iso-Seq reads.
  2. Use samtools to extract the reads aligning to the region of interest.
  3. Either run IsoCon directly on this subset of reads, or try to trim these reads based on the start and stop coordinates of their alignments, and run IsoCon on the trimmed reads.

The "trimming" part is the only step that doesn't have a standard tool. But it's possible it could work without this step, especially if the resulting dataset is small (say, fewer than 10,000 reads).
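Since there is no standard tool for the trimming, steps 2 and 3 can be sketched in plain Python on the SAM text produced in step 1. This is only a rough sketch under stated assumptions: the helper names (`parse_cigar`, `trim_soft_clips`, `reads_in_region`) are made up for this example, "trimming" is interpreted as removing the soft-clipped ends that fall outside the alignment, and only primary alignments are kept. Note that SAM stores reverse-strand reads reverse-complemented, so the output is in reference orientation.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def parse_cigar(cigar):
    """Split a CIGAR string such as '2S6M1S' into [(2, 'S'), (6, 'M'), (1, 'S')]."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def trim_soft_clips(seq, qual, cigar):
    """Drop soft-clipped bases from both ends of seq and qual alike."""
    # Hard-clipped bases are not present in SEQ, so ignore 'H' operations.
    ops = [(n, op) for n, op in parse_cigar(cigar) if op != "H"]
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    end = len(seq) - right
    # SAM uses '*' when no qualities are stored; leave that marker untouched.
    trimmed_qual = qual if qual == "*" else qual[left:end]
    return seq[left:end], trimmed_qual


def reads_in_region(sam_lines, chrom, region_start, region_end):
    """Yield (name, seq, qual) for primary alignments overlapping a region.

    region_start and region_end are 1-based and inclusive, as in SAM.
    """
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        f = line.rstrip("\n").split("\t")
        name, flag, rname, pos = f[0], int(f[1]), f[2], int(f[3])
        cigar, seq, qual = f[5], f[9], f[10]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & 0x904 or rname != chrom:
            continue
        ref_len = sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)
        aln_end = pos + ref_len - 1
        if pos <= region_end and aln_end >= region_start:
            yield (name,) + trim_soft_clips(seq, qual, cigar)


# Toy SAM records (hypothetical data): only r1 overlaps chr1:90-110.
sam = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "r1\t0\tchr1\t100\t60\t2S6M1S\t*\t0\t0\tAACGTACGT\t!!555555!\n",
    "r2\t0\tchr1\t500\t60\t4M\t*\t0\t0\tACGT\t5555\n",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t5555\n",
]
for record in reads_in_region(sam, "chr1", 90, 110):
    print(record)
```

Trimming all reads to one common reference start and end, rather than to each read's own alignment ends, would additionally require walking the CIGAR to map reference positions to read positions; that is left out here to keep the sketch short.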

wyim-pgl commented 5 years ago

The problem with my CCS fastq is that the quality score is 5.

The subreads fastq file has "!".

It looks like a placeholder from the SMRT analysis. Do you have any opinions regarding this?

Thanks!




ksahlin commented 5 years ago

Running PacBio's CCS caller ccs with the parameter --polish on the subreads.bam files produces a ccs.bam file with base qualities. This ccs.bam file can be supplied together with a fasta file that contains only the flnc reads to IsoCon as

IsoCon pipeline -fl_reads <flnc.fasta> -outfolder </path/to/output> --ccs </path/to/filename.ccs.bam>

The flnc file can be obtained, e.g., from lima and isoseq3 cluster in the new Iso-Seq pipeline.

However, IsoCon can also be run with only a fasta file (meaning that you would only have to convert the fastq to a fasta):

IsoCon pipeline -fl_reads <flnc.fasta> -outfolder </path/to/output>

However, since individual base qualities play a key role in the algorithm, IsoCon will likely give more accurate results when quality values are supplied.
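For the fasta-only route, the fastq-to-fasta conversion mentioned above can be sketched in a few lines of Python. This assumes the common four-line FASTQ layout (sequence and qualities not wrapped over several lines); the function name `fastq_to_fasta` and the file names in the usage comment are placeholders for this example.

```python
def fastq_to_fasta(fastq_lines):
    """Yield FASTA lines (without newlines) from four-line FASTQ records.

    Assumes each record is exactly: @header, sequence, +, qualities.
    All lines are read into memory first, which is fine for modest files.
    """
    lines = [line.rstrip("\r\n") for line in fastq_lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(lines), 4):
        header, seq, plus, _qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record number {i // 4 + 1}")
        yield ">" + header[1:]
        yield seq


# Usage with files (hypothetical file names):
# with open("flnc.fastq") as fin, open("flnc.fasta", "w") as fout:
#     fout.writelines(line + "\n" for line in fastq_to_fasta(fin))
```

The quality line is simply dropped, which is why this route loses the information the algorithm would otherwise use.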