mourisl / Rascaf

Scaffolding with RNA-seq read alignment
20 stars 6 forks source link

No improvements to assembly seen #21

Open chocotwig opened 4 years ago

chocotwig commented 4 years ago

Hi, I used TopHat to align my RNAseq PE reads. I'm getting absolutely no improvements on the final assembly, and I'm suspicious that I'm doing something wrong, because I did get decent improvements with L_RNA_scaffolder. rascaf -b $workDir/tophat_out/accepted_hits.bam -f $sourceDir/$sourceName -o "$descriptor"_rascafOut -v -ms 1 Output: Found 355447 exon blocks. Found 212574 gene blocks. Found 0 non-trivial gene block components.

rascaf-join -ignoreGap -r "$descriptor"_rascafOut.out -o "$descriptor"_rascafJoinOut.ignoreGap -ms 1 Output Found raw assembly file: (...assembly.fasta) Finish reading the rascaf output files. Finish removing cycles in the graph. Finish ordering the scaffolds. There is no change to the number of contigs, scaffolds, N50, etc. Could you please help me out?

Thanks!

mourisl commented 4 years ago

What is the command and version of tophat2 are you using? Rascaf needs mate pairs that aligned to two different chromosomes/scaffolds to scaffold the contigs, can you see such alignments in the bam file? Thanks.

chocotwig commented 4 years ago

I use bowtie2 to index, bowtie2-build $sourceDir/$sourceName $descriptor Then I use TopHat v2.0.13 tophat --read-mismatches 4 --read-gap-length 5 --segment-mismatches 3 --read-edit-dist 5 $workDir/$descriptor read_pair1 read_pair2

And yes, there seem to be read pairs aligning to different contigs/scaffolds. Example (SAM format)

SNPSTER3_0685:8:17:10090:20024#0    321 scaffold_8645   18788   0   40M contig_24906    72621   0   GAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGAGA    fefdhdfdfdfgcehghfeeddgdeedbdabb]aa_aa_W    AS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:40 YT:Z:UU NH:i:20 CC:Z:contig_69  CP:i:121336 HI:i:1

Thanks!

mourisl commented 4 years ago

These two entries are the secondary alignment, and Rascaf uses the primary alignments. The primary alignments might be on different chromosomes. You may try commands like: samtools view -F 0x90C sample.bam | awk '$7!="="' | wc -l

I think Tophat2 works better on reads with length more than 75bp. Can you try alingers like Hisat?

Thanks

chocotwig commented 4 years ago

Hi,

Thanks, the output for the command you suggested is 2157287 What are your thoughts on this? I will also try Hisat in the meanwhile.

Thanks!