Open lucblassel opened 3 years ago
hey @lucblassel , I'd like to work on this if it's not taken
Go ahead! Additionally, I think this command should output what type of biological sequence is in the file.
i.e the possible outputs would be:
hey @lucblassel , I don't think I understand it correctly. Could you please elaborate a little more with some sample input/output?
Thanks!
Hi @M-krishna, I will try to elaborate a little bit.
In biology there are 3 main types of biological sequences:
DNA and RNA are nucleic acids, meaning they are a succession of nucleotides, and proteins are a succession of single amino acids. nucleotides and amino acids can be represented by letters, so in effect a biological sequence can be represented by a string of letters. For example a DNA sequence could be as follows:
ATGGGTGTCAGTGGGTCA
There are 5 types of nucleotides A,C,G,T and U. DNA can only contain As,Cs,Gs and Ts while RNA can only contain As,Cs,Gs and Ts. There are other characters that indicate an ambiguity, for example a W means either a T or an A. The list of all the possible characters encoding nucleotides can be found in the IUPAC Nucelotide table at this page.
Proteins are comprised of amino acids, and there are 20 of them, the characters used to encode protein sequences can be found at this page, sometimes you can also find a "*" character in protein sequences.
I would like the fastago stats alphabet
command to guess what type of biological sequence is in the input.
if we have a fasta file seqs_1.fa
:
>Seq1
CAGTGGTCAGTCGATTTACGTWWGTCAGT
>Seq2
CAGTCAGTTTAGTCWWGTCAGTGTCGATW
>Seq3
GTCAGGCATCGATCGTAGTATTAGTCAGGTA
We can see that the sequences only have ATGC and W in them, this is compatible with both the amino acid and nucleic acid alphabets, however since there are almost exclusively ATGC we can guess that this is DNA. So we would like our command to output:
> cat seqs_1.fa | fastago stats alphabet
DNA
if instead we have the following file seqs_2.fa
:
>Seq1
CAGUGGUCAGUCGAUUUACGUWWGUCAGU
>Seq2
CAGUCAGUUUAGUCWWGUCAGUGUCGAUW
>Seq3
GUCAGGCAUCGAUCGUAGUAUUAGUCAGGUA
It's the same file but with Us instead of Ts, so we are going to guess that is it RNA and output:
> cat seqs_2.fa | fastago stats alphabet
RNA
If instead we have a file with many more characters seqs_3.fa
>Seq1
PIESPITNHBGAGAPPEISEEST
>Seq2
PIESPITNHBGAGAPPEISMMLP
>Seq3
VMLAPPAIESPIES---PIESPSA
Then we are going to guess that these are protein sequences, because all characters are compatible with the IUPAC amino acid alphabet and the diversity of characters is quite high.
> cat seqs_3.fa | fastago stats alphabet
Protein
And finally if we have a file seqs_4.fa
with characters that do not appear in any alphabet then we cannot guess what type of sequence this is, :
>Seq_1
ˆ&@&@HYCHAYHCA&*(#*&$*#&8
>Seq_2
CAˆˆ$J#KLFEAI(#@!KLLKRQ*(*((#@
>Seq_3
VAC&*&(#@JICA)*#@*$#((*CAJICA(
So we want to ouptput the fact that we don't know:
> cat seqs_3.fa | fastago stats alphabet
Unknown
The
fastago stats alphabet
subcommand would detect the alphabet used in the sequences or alternatively indicate a problem if no alphabet fits.