sreeramkannan / Shannon

RNA-Seq

unexpected output in rc_s.py #24

Closed derekmorr closed 6 years ago

derekmorr commented 6 years ago

I'm seeing unexpected output from rc_s.py. It processes the input one line at a time instead of treating the sequence as a whole, so the output keeps the lines in their original order, with the contents of each line reverse-complemented. I think the program should reverse-complement the entire sequence.

For example, given this input:

GTTTATTCAAACTTAGATACAGATGAAGAATAAGAATGTTCATATGCAAGAAATATTTTTTTCTTATTAT
TATATGTACAATACATTCTTTATGCTCTTACTTTCAAATGTTTCTAATTAAATGAGAGAAAGCCTATATT
CGATCTCGTTTCTAATGTATATTGTTTGTATTTCCTTTTACATATCAGATCTCTTGAATC

I would expect this output:

GATTCAAGAGATCTGATATGTAAAAGGAAATACAAACAATATACATTAGAAACGAGATCGAATATAGGCT
TTCTCTCATTTAATTAGAAACATTTGAAAGTAAGAGCATAAAGAATGTATTGTACATATAATAATAAGAA
AAAAATATTTCTTGCATATGAACATTCTTATTCTTCATCTGTATCTAAGTTTGAATAAAC

But I'm actually getting this:

ATAATAAGAAAAAAATATTTCTTGCATATGAACATTCTTATTCTTCATCTGTATCTAAGTTTGAATAAAC
AATATAGGCTTTCTCTCATTTAATTAGAAACATTTGAAAGTAAGAGCATAAAGAATGTATTGTACATATA
GATTCAAGAGATCTGATATGTAAAAGGAAATACAAACAATATACATTAGAAACGAGATCG

bx3 commented 6 years ago

Hello,

I believe most of Shannon's code expects single-line FASTA input. So the simplest fix is to convert the multi-line input to single-line with this command:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);} END {printf("\n");}' < input | tail -n +2 > output

where you specify input and output.
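For anyone who would rather do this step in Python, here is a minimal sketch of the same idea: join every sequence line of a record onto one line and leave header lines alone. The function name `linearize_fasta` is made up for this example and is not part of Shannon. Unlike the awk pipeline, it also accepts input with no `>` header line and skips blank lines.

```python
def linearize_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto a single line.

    Hypothetical helper for illustration; not part of Shannon.
    """
    chunks = []  # sequence pieces of the record currently being read
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header ends the previous record, so emit its sequence first.
            if chunks:
                yield "".join(chunks)
                chunks = []
            yield line
        elif line:
            chunks.append(line)
    # Emit the sequence of the last record.
    if chunks:
        yield "".join(chunks)
```

To use it as a filter, iterate over `linearize_fasta(sys.stdin)` and print each item.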

derekmorr commented 6 years ago

That awk command just removes the newlines. It doesn't address the issue that the sequence isn't being reversed properly. I think rc_s.py needs to buffer the whole sequence and then compute the reverse complement (rc).
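A rough sketch of the buffering I have in mind, assuming plain FASTA with optional `>` header lines. This is not the actual rc_s.py code; the names `reverse_complement` and `rc_records` and the `width` parameter are invented for illustration, and headers are passed through unchanged, which may differ from what rc_s.py does.

```python
# Complement table for DNA bases; any other character passes through unchanged.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def rc_records(lines, width=None):
    """Yield output lines with each record reverse-complemented as a whole.

    Sequence lines are buffered until the next header (or end of input), so
    line breaks in the input do not affect the result. If width is given,
    the output sequence is re-wrapped to that many bases per line.
    """
    chunks = []  # sequence pieces of the record currently being read

    def flush():
        seq = reverse_complement("".join(chunks))
        chunks.clear()
        if not seq:
            return
        if width:
            for i in range(0, len(seq), width):
                yield seq[i:i + width]
        else:
            yield seq

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            yield from flush()  # finish the previous record first
            yield line
        elif line:
            chunks.append(line)
    yield from flush()  # finish the last record
```

For example, `list(rc_records(["AAC\n", "GT\n"], width=3))` gives `["ACG", "TT"]`, whereas reverse-complementing line by line would give `GTT` and `AC`.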

As an aside, I've been looking into rewriting parts of Shannon in Rust. I rewrote rc_s.py and got an 8x speedup on several sample files. Is this something the maintainers would be interested in exploring?

derekmorr commented 6 years ago

Oh, my apologies. You meant to run awk before running rc_s.py. Yes, that would fix the issue.

bx3 commented 6 years ago

Glad that it helps. Actually, we have been converting everything to C++ since last year, and the work is nearly complete. Both memory usage and speed improve a lot, and we will release the code once it is ready.

derekmorr commented 6 years ago

Is the C++ code available on a branch in a repo somewhere?

bx3 commented 6 years ago

Yes, it is under my repo at https://github.com/bx3/Shannon_Cpp_RNA_seq, and I am reviewing it now.

macmanes commented 6 years ago

That link is broken.

bx3 commented 5 years ago

The C++ code is complete and is now released at https://github.com/bx3/Shannon_Cpp_RNA_seq/wiki