jfass / apc

(a) (p)erfect (c)ircle? ... tests DNA sequences for overlapping ends, then trims and rejoins, and aligns reads to test the join
11 stars 4 forks source link

Irregular last results #2

Closed stephenturner closed 8 years ago

stephenturner commented 9 years ago

Hi Joseph. I ran across this tool and wanted to give it a spin. I have a plasmid assembly and the dot plot tells me it's circular, but the ends are messy.

I'm trying to run apc on a single 90kb contig and I get the note that I have irregular LAST results. I tried running the commands that appear to be in the perl script, and I look at my alignment file, and this is what I see:

$ lastdb tmp ../hyb-plas120-nanopgm.fasta

$ lastal -s1 -x300 -f0 -T1 tmp ../hyb-plas120-nanopgm.fasta
# LAST version 548
#
# a=7 b=1 A=7 B=1 c=100000 e=40 d=15 x=300 y=9 z=300
# R=10 u=0 s=1 T=1 m=10 l=1 n=10 k=1 i=8192 w=1000 t=0.910239 g=1 j=3 Q=0
# tmp
# Reference sequences=1 normal letters=94654
#
#    A  C  G  T
# A  1 -1 -1 -1
# C -1  1 -1 -1
# G -1 -1  1 -1
# T -1 -1 -1  1
#
# Coordinates are 0-based.  For - strand matches, coordinates
# in the reverse complement of the 2nd sequence are used.
#
# score name1   start1  alnSize1    strand1 seqSize1    name2   start2  alnSize2    strand2 seqSize2    blocks
#
# batch 0
94654   plasmid_120_nanopore_pgm_corrected_pilon    0   94654   +   94654   plasmid_120_nanopore_pgm_corrected_pilon    0   94654   +   94654   94654
# Query sequences=1

That is, a single line, not 3, which is what the script was expecting. Any idea what's going on here? Thanks.

jfass commented 9 years ago

Hey Stephen,

Thanks for trying it! This would suggest to me that there is no overlap at the ends (which really deserves a better error message!). Have you tried aligning your plasmid sequence to itself using Gepard http://www.helmholtz-muenchen.de/en/icb/software/gepard/index.html? It'll create a dotplot, which should pretty clearly show any self-overlap (diagonals in the upper right and lower left corners) ...

~Joe

On Wed, Jul 8, 2015 at 10:59 AM, Stephen Turner notifications@github.com wrote:

Hi Joseph. I ran across this tool and wanted to give it a spin. I have a plasmid assembly and the dot plot tells me it's circular, but the ends are messy.

I'm trying to run apc on a single 90kb contig and I get the note that I have irregular LAST results. I tried running the commands that appear to be in the perl script, and I look at my alignment file, and this is what I see:

$ lastdb tmp ../hyb-plas120-nanopgm.fasta

$ lastal -s1 -x300 -f0 -T1 tmp ../hyb-plas120-nanopgm.fasta

LAST version 548

#

a=7 b=1 A=7 B=1 c=100000 e=40 d=15 x=300 y=9 z=300

R=10 u=0 s=1 T=1 m=10 l=1 n=10 k=1 i=8192 w=1000 t=0.910239 g=1 j=3 Q=0

tmp

Reference sequences=1 normal letters=94654

#

A C G T

A 1 -1 -1 -1

C -1 1 -1 -1

G -1 -1 1 -1

T -1 -1 -1 1

#

Coordinates are 0-based. For - strand matches, coordinates

in the reverse complement of the 2nd sequence are used.

#

score name1 start1 alnSize1 strand1 seqSize1 name2 start2 alnSize2 strand2 seqSize2 blocks

#

batch 0

94654 plasmid_120_nanopore_pgm_corrected_pilon 0 94654 + 94654 plasmid_120_nanopore_pgm_corrected_pilon 0 94654 + 94654 94654

Query sequences=1

That is, a single line, not 3, which is what the script was expecting. Any idea what's going on here? Thanks.

— Reply to this email directly or view it on GitHub https://github.com/jfass/apc/issues/2.

Joseph Fass Lead Data Analyst UC Davis Genome Center - Bioinformatics Core http://bioinformatics.ucdavis.edu/ jnfass@ucdavis.edu phone ~ 530.752.2698

stephenturner commented 9 years ago

Joe,

I did actually. I'm still puzzling over how to interpret (and deal with) the very large overlap, and the excess diagonals I see toward the end of the sequence (repetitive sequence?). But if I'm interpreting this correctly, the top-right and bottom-left diagonals suggest circularity, albeit with messy ends.

image

jfass commented 8 years ago

Now that's bizarre. I'm having trouble wrapping my head around that graph. It seems like there may be a (long) tandem duplication that's been collapsed into one copy toward the end of the contig. Does that make sense? So the diagonal that starts next to the "021_dim" on the y-axis indicates that the penultimate sequence (let's say, "tuvw" in the alphabet that covers the whole contig) overlaps the first sequence in the contig ("abcd"), but the following sequence "xyz" doesn't match "efg" ... instead it matches "uvw" again. So either some sequence has been incorrectly duplicated at the end of the contig, or a tandem duplication (of 10-20kbp) has been collapsed at the start of the contig. If I'm seeing it correctly.

How'd your plasmid name get flipped on the y-axis, btw?

What do you think? If there's a way to give a more meaningful error message, to help figure out what's going on here, I'd be happy to add it in. Actually, Keith Bradnam rewrote my awful Perl, and I have yet to apply his suggestions. He also found another tool or two ... something from AMOS, I believe, and a more recent tool called "Circulator" (which I can't find at the moment. I just stumbled across this https://github.com/sheikki/identifyCircularContigs/blob/master/identifyCircularContigs. But I have no idea if any of them will be able to figure out your situation.

~Joe

On Thu, Jul 9, 2015 at 2:19 AM, Stephen Turner notifications@github.com wrote:

Joe,

I did actually. I'm still puzzling over how to interpret (and deal with) the very large overlap, and the excess diagonals I see toward the end of the sequence (repetitive sequence?). But if I'm interpreting this correctly, the top-right and bottom-left diagonals suggest circularity, albeit with messy ends.

[image: image] https://cloud.githubusercontent.com/assets/460076/8591991/fa56f698-25f9-11e5-84bb-631ca1531c98.png

— Reply to this email directly or view it on GitHub https://github.com/jfass/apc/issues/2#issuecomment-119884111.

Joseph Fass Lead Data Analyst UC Davis Genome Center - Bioinformatics Core http://bioinformatics.ucdavis.edu/ jnfass@ucdavis.edu phone ~ 530.752.2698

stephenturner commented 8 years ago

Thanks for those suggestions. I actually ran into circlator (no "u") prior to apc, but with outputs undocumented it's not immediately useful. I took identifyCircularContigs for a spin, but need to dust off my perl to get this working.

But you're right, I've got some odd things happening at the end of my assembly. This was a single-contig nanopore-only assembly from celera polished with ion torrent reads using pilon. May warrant another minION run with newer flow cells.

RE: axis, just used the most recent version of gepard and it was like this out of the box.

Closing this issue because I believe apc did just what it should have when it reported irregular last results. Only suggestion: echo to screen the command used to run lastdb/lastal. Or add a -verbose option to echo to screen all the commands issued.