tlawrence3 / FAST

FAST: Fast Analysis of Sequences Toolbox

`faswc` gives incorrect counts #64

Closed: arendsee closed this issue 7 years ago

arendsee commented 7 years ago

Given the FASTA file:

$ cat test.faa
>a
M
>b
MM
>c
MMMM
>d
MMMMMMMM

I get the following result:

$ faswc test.faa
     2              5 test.faa
     2              5 total

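faswc reports 2 sequences and 5 characters, where I would expect 4 sequences and 15 characters (1 + 2 + 4 + 8). The reported total of 5 matches entries a and c alone (1 + 4), so faswc appears to be considering only every other entry. Perhaps it is getting mixed up between FASTA and FASTQ format?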
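For reference, the expected counts can be checked independently of FAST. The following is a minimal sketch, not FAST's implementation: `count_fasta` and `TEST_FAA` are hypothetical names, and it assumes that the two columns of `faswc` output are the number of records and the number of sequence characters.

```python
import io


def count_fasta(handle):
    """Return (number of records, total sequence characters) for a FASTA stream."""
    n_seqs = 0
    n_chars = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            n_seqs += 1  # every header line starts a new record
        else:
            n_chars += len(line)  # sequence lines contribute their length
    return n_seqs, n_chars


# Same content as test.faa in the report above (hypothetical in-memory copy).
TEST_FAA = ">a\nM\n>b\nMM\n>c\nMMMM\n>d\nMMMMMMMM\n"

print(count_fasta(io.StringIO(TEST_FAA)))  # (4, 15)
```

Counting only entries a and c in the same way gives 2 records and 5 characters, which is what `faswc` printed.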

tlawrence3 commented 7 years ago

Thank you @arendsee for catching this. Are you using the most recent commit? I am currently TAing a lab, but I will look into this when it is over and should have a fix pushed by the end of the day.

arendsee commented 7 years ago

I am using the most recent commit from the master branch (I installed from source).

tlawrence3 commented 7 years ago

I tested this commit on fasta and fastq files and it seems to be working correctly. You no longer need to indicate if you are using a fastq file and parsing should be ~10-20X faster for fastq files. We will be implementing a test suite soon to hopefully catch these kinds of errors before pushing them. Thanks again @arendsee for catching this error and taking the time to report it.
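One common way to avoid asking the user for the format is to look at the first non-blank line of the input. The sketch below illustrates that idea only; it is an assumption about how such detection could work, not a description of the code in FAST, and `detect_format` is a hypothetical name.

```python
import io


def detect_format(handle):
    """Guess 'fasta' or 'fastq' from the first non-blank line of a text stream."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"  # FASTA headers begin with '>'
        if line.startswith("@"):
            return "fastq"  # FASTQ headers begin with '@'
        raise ValueError("unrecognized first line: %r" % line)
    raise ValueError("empty input")


print(detect_format(io.StringIO(">a\nM\n")))           # fasta
print(detect_format(io.StringIO("@r1\nACGT\n+\nIIII\n")))  # fastq
```

Only the first record header is inspected, so a quality line that happens to start with `@` later in a FASTQ file does not affect the result.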

arendsee commented 7 years ago

No problem. I have a little FASTA program of my own, smof, that is pretty similar to FAST, if you are interested in taking a look. I've just been drafting a comparison of the two (in the README).

tlawrence3 commented 7 years ago

smof looks like a nice set of utilities. We do have significant overlap in functionality, along with useful unique utilities. It is great that our tools are interoperable, so we can take advantage of the unique features in both.

Currently, our goal is to speed up the FAST utilities within the limits of Perl. We are replacing most of the BioPerl code, which has provided a 2X speed increase on fasta files and a ~10-20X speed increase on fastq files.

arendsee commented 7 years ago

That should make the speed comparable to smof. You could also add support for indexed FASTA files.
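To illustrate what an indexed FASTA file buys you, here is a minimal sketch: record the byte offset of each header once, then seek straight to a record instead of scanning the whole file. The names `build_index` and `fetch` are hypothetical, the index is a plain in-memory dictionary rather than the samtools `.fai` format, and nothing here reflects how smof or FAST would implement it.

```python
import io


def build_index(handle):
    """Map each record ID to the byte offset of its header line (binary stream)."""
    index = {}
    while True:
        offset = handle.tell()  # position before reading the next line
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(b">"):
            fields = line[1:].split()
            if fields:
                index[fields[0].decode()] = offset
    return index


def fetch(handle, index, seq_id):
    """Return the sequence for seq_id by seeking to its recorded offset."""
    handle.seek(index[seq_id])
    handle.readline()  # skip the header line
    parts = []
    for line in handle:
        if line.startswith(b">"):
            break  # next record begins
        parts.append(line.strip())
    return b"".join(parts).decode()


# In-memory stand-in for a FASTA file opened in binary mode.
data = io.BytesIO(b">a\nM\n>b\nMM\n>c\nMMMM\n>d\nMMMMMMMM\n")
index = build_index(data)
print(index)                    # {'a': 0, 'b': 5, 'c': 11, 'd': 19}
print(fetch(data, index, "c"))  # MMMM
```

With a real file the index would typically be written to disk next to the FASTA file, so that later lookups skip the scan entirely.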