ksahlin / IsoCon

Derives consensus sequences from a set of long noisy reads by clustering and error correction.
GNU General Public License v3.0

Multiple CCS.bam file #1

Closed wyim-pgl closed 6 years ago

wyim-pgl commented 6 years ago

Dear Kristoffer,

Hello, I am trying to use IsoCon for my transcriptome. We have 30 cells to analyze, and my flnc file was generated from all 30 cells. Is it okay to use a merged bam file made with Samtools, or with bax2bam?

Thank you.

Won

ksahlin commented 6 years ago

Hi Won,

If you have the ccs reads in separate “*.ccs.bam” files per cell, it should be okay to simply merge them with samtools. The important thing is that all the reads in the flnc fasta file are also found in the bam file.

However, is your Iso-Seq dataset targeted or not? IsoCon is designed for use with targeted Iso-Seq sequencing. If you have a non-targeted dataset, the algorithm will likely not scale (in runtime), since IsoCon uses an alignment strategy that is optimized for highly similar sequences. There is a way to control this (set a low value for --neighbor_search_depth, e.g. --neighbor_search_depth 1000 or lower), but it will likely affect the quality of the output. We are currently working on an approach for non-targeted data that uses many of IsoCon's ideas, and I hope to release this repository soon.

Best, Kristoffer
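The requirement above, that every read in the flnc fasta is also found in the merged bam, can be checked before starting a long run. Below is a minimal sketch, not part of IsoCon. It assumes read names follow the PacBio `movie/zmw/...` pattern, so that a flnc read and its CCS read share the first two slash-separated fields, and it assumes the bam read names were dumped to text beforehand (for example, the first column of `samtools view` output). The function names are hypothetical.

```python
def read_key(name):
    """Return the 'movie/zmw' prefix of a PacBio-style read name.

    Assumption: names look like 'movie/zmw/...' in both the flnc fasta
    and the ccs bam, so this prefix identifies the same ZMW in both files.
    """
    fields = name.split()
    if not fields:
        return ""
    return "/".join(fields[0].split("/")[:2])


def fasta_keys(fasta_lines):
    """Collect the movie/zmw keys of all header lines in a fasta file."""
    return {read_key(line[1:]) for line in fasta_lines if line.startswith(">")}


def missing_from_bam(fasta_lines, bam_read_names):
    """Return the sorted keys present in the fasta but absent from the bam."""
    bam_keys = {read_key(name) for name in bam_read_names}
    return sorted(fasta_keys(fasta_lines) - bam_keys)


# Toy data standing in for a flnc fasta and the read names of a merged bam.
flnc = [">m1/10/30_1450_CCS strand=+", "ACGT", ">m1/11/0_900_CCS", "GGCC"]
bam_names = ["m1/10/ccs", "m2/5/ccs"]
print(missing_from_bam(flnc, bam_names))
```

An empty list means every flnc read has a matching CCS read. Extra reads in the bam are harmless for this check.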

wyim-pgl commented 6 years ago

Dear Kristoffer,

Thank you for your comment. This dataset is NOT targeted. Our species is just polyploid. I will use the --neighbor_search_depth option to reduce the runtime. Does the BAM file need special pulse features, such as DeletionQV, DeletionTag, InsertionQV, IPD, MergeQV?

Cheers,

Won

ksahlin commented 6 years ago

Hi,

No, IsoCon does not need the pulse features, it only needs the quality values that were generated for the CCS reads, i.e., the ccs bamfile should be the output generated by the tool ccs.

Ok, good to know about the nontargeted data. I will definitely let you know when we have the nontargeted approach ready. Non-targeted data has more variable cut points at the ends of transcripts, and this can cause some redundancy in IsoCon. There is a parameter for that as well, --ignore_ends_len, which we set to a default value of 15 for targeted data. It is possible that ends have higher variability in non-targeted data and that the value should therefore be increased (with the obvious downside that two reads differing only at the ends may actually be two different isoforms). I don't have any data on this variability for a good estimate though, maybe 30-50 or so.

wyim-pgl commented 6 years ago

Thanks!

Is it okay to convert h5 to bam with bax2bam?

Does it need a .pbi file as well as a .bai?

Also, does the bam file need to be sorted?

I will use the --ignore_ends_len option.

It looks like it runs faster with --neighbor_search_depth.

Regards, Won

ksahlin commented 6 years ago

Yes, that is what I've been using, namely: bax2bam {hdf5_path}/*bax.h5 -o {out}. Then, for the ccs tool, we have been using the commands (based on recommended settings):

ccs --numThreads=64 --polish --minLength=10 --minPasses=1 --minZScore=-999 --maxDropFraction=0.8 --minPredictedAccuracy=0.8 --minSnr=4 {input.bam_subreads} {output.ccs_bam}

The commands were taken from the snakemake file in our evaluation repository, lines 180 and 196.

No, it does not need to be sorted. The default output from ccs works.

wyim-pgl commented 6 years ago

Thank you so much. I am running it and will let you know. Cheers,

Won

ksahlin commented 6 years ago

Hi again Won,

Just wanted to let you know that while working on extending the IsoCon algorithm for nontargeted data (repository not available yet), I’ve discovered additional parts in the original IsoCon code that would not scale to a nontargeted dataset (especially one of 30 cells). So I wouldn’t wait for IsoCon to try to finish. While I’m incorporating some of the changes in the IsoCon code (e.g., this commit), I still believe that IsoCon is not suitable for a nontargeted dataset (runtime-wise), unless the reads are first broken into rough batches, based on e.g. some sequence similarity and length.

Best, K
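The rough batching mentioned above could look something like the sketch below. It is only an illustration of the length part: reads are binned by sequence length, and no sequence similarity is taken into account. It is not how IsoCon works internally, and the function name and the bin width of 500 bases are hypothetical choices.

```python
from collections import defaultdict


def batch_by_length(records, bin_width=500):
    """Group (read_id, sequence) pairs into bins of similar length.

    Assumption: length alone is used here as a crude proxy; a real
    batching step would also consider sequence similarity.
    """
    batches = defaultdict(list)
    for read_id, seq in records:
        # Integer division puts lengths 0-499 in bin 0, 500-999 in bin 1, ...
        batches[len(seq) // bin_width].append((read_id, seq))
    return dict(batches)


# Toy reads of length 100, 450 and 700.
reads = [("r1", "A" * 100), ("r2", "C" * 450), ("r3", "G" * 700)]
for bin_index, batch in sorted(batch_by_length(reads).items()):
    print(bin_index, [read_id for read_id, _ in batch])
```

Each batch could then be written to its own fasta file and run separately. Note that reads close to a bin boundary end up in different batches even if they come from the same transcript.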

wyim-pgl commented 6 years ago

Kristoffer,

Thank you for letting me know.

I will think about other ways to do this.

Regards,

wyim-pgl commented 6 years ago

Hi Kristoffer,

Is it possible to run with a subset? For example, we are targeting some specific genes. I can blast or map the CCS reads to them and then run IsoCon.

ksahlin commented 6 years ago

That will probably work. Just make sure that all fasta sequences are also in the ccs.bam. Let me know how large your dataset is after blasting as well. It is possible that you want to set --ignore_ends_len to higher than 15 (default) if your reads are not cut at relatively precise breakpoints.

Let me know how it goes and I'm happy to help you get the most of this analysis.
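Subsetting the reads to those that hit the genes of interest, as discussed above, could be sketched as follows. This is a hedged illustration, not something shipped with IsoCon. It assumes the BLAST results are in tabular format (`-outfmt 6`), where the first column is the query id, and that the query id equals the first word of the fasta header. The function names are hypothetical.

```python
def hit_ids(blast_lines):
    """Collect query ids (column 1) from BLAST tabular (-outfmt 6) lines."""
    ids = set()
    for line in blast_lines:
        if line.strip() and not line.startswith("#"):
            ids.add(line.split("\t")[0])
    return ids


def filter_fasta(fasta_lines, keep_ids):
    """Keep only fasta records whose id (first header word) is in keep_ids."""
    kept = []
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and fields[0] in keep_ids
        if keep:
            kept.append(line)
    return kept


# Toy BLAST hits and fasta records.
blast = ["r1\tgeneA\t99.0\t900\t5\t1\t1\t900\t10\t909\t0.0\t1500"]
fasta = [">r1 strand=+", "ACGT", ">r2 strand=-", "GGCC"]
print(filter_fasta(fasta, hit_ids(blast)))
```

The filtered fasta is a subset of the original, so the original ccs.bam still contains every remaining read.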

ksahlin commented 6 years ago

Hi @ascendo, just wanted to let you know how you can possibly make your analysis faster for a nontargeted dataset. Take-home message: cut transcripts at precise ends after blasting. See issue 2 and issue 3.