lucblassel commented 3 years ago

The fastago stats alphabet subcommand would detect the alphabet used in the sequences or alternatively indicate a problem if no alphabet fits.

M-krishna commented 3 years ago

hey @lucblassel , I'd like to work on this if it's not taken

lucblassel commented 3 years ago

Go ahead! Additionally, I think this command should output what type of biological sequence is in the file.

i.e the possible outputs would be:

IUPAC amino acid
IUPAC nucleic acid DNA (if there are Ts)
IUPAC nucleic acid RNA (if there are Us)
Other / unknown (if you cannot detect what alphabet it is)

M-krishna commented 3 years ago

hey @lucblassel , I don't think I understand it correctly. Could you please elaborate a little more with some sample input/output?

Thanks!

lucblassel commented 3 years ago

Hi @M-krishna, I will try to elaborate a little bit.

Biological context

In biology there are 3 main types of biological sequences:

DNA
RNA
Proteins

DNA and RNA are nucleic acids, meaning they are a succession of nucleotides, and proteins are a succession of single amino acids. nucleotides and amino acids can be represented by letters, so in effect a biological sequence can be represented by a string of letters. For example a DNA sequence could be as follows:

ATGGGTGTCAGTGGGTCA

There are 5 types of nucleotides A,C,G,T and U. DNA can only contain As,Cs,Gs and Ts while RNA can only contain As,Cs,Gs and Ts. There are other characters that indicate an ambiguity, for example a W means either a T or an A. The list of all the possible characters encoding nucleotides can be found in the IUPAC Nucelotide table at this page.

Proteins are comprised of amino acids, and there are 20 of them, the characters used to encode protein sequences can be found at this page, sometimes you can also find a "*" character in protein sequences.

What the command should do

I would like the fastago stats alphabet command to guess what type of biological sequence is in the input.

Example 1

if we have a fasta file seqs_1.fa:

>Seq1
CAGTGGTCAGTCGATTTACGTWWGTCAGT

>Seq2
CAGTCAGTTTAGTCWWGTCAGTGTCGATW

>Seq3
GTCAGGCATCGATCGTAGTATTAGTCAGGTA

We can see that the sequences only have ATGC and W in them, this is compatible with both the amino acid and nucleic acid alphabets, however since there are almost exclusively ATGC we can guess that this is DNA. So we would like our command to output:

> cat seqs_1.fa | fastago  stats alphabet
DNA

Example 2

if instead we have the following file seqs_2.fa:

>Seq1
CAGUGGUCAGUCGAUUUACGUWWGUCAGU

>Seq2
CAGUCAGUUUAGUCWWGUCAGUGUCGAUW

>Seq3
GUCAGGCAUCGAUCGUAGUAUUAGUCAGGUA

It's the same file but with Us instead of Ts, so we are going to guess that is it RNA and output:

> cat seqs_2.fa | fastago  stats alphabet
RNA

Example 3

If instead we have a file with many more characters seqs_3.fa

>Seq1
PIESPITNHBGAGAPPEISEEST
>Seq2
PIESPITNHBGAGAPPEISMMLP
>Seq3
VMLAPPAIESPIES---PIESPSA

Then we are going to guess that these are protein sequences, because all characters are compatible with the IUPAC amino acid alphabet and the diversity of characters is quite high.

> cat seqs_3.fa | fastago  stats alphabet
Protein

Example 4

And finally if we have a file seqs_4.fa with characters that do not appear in any alphabet then we cannot guess what type of sequence this is, :

>Seq_1
ˆ&@&@HYCHAYHCA&*(#*&$*#&8

>Seq_2
CAˆˆ$J#KLFEAI(#@!KLLKRQ*(*((#@

>Seq_3
VAC&*&(#@JICA)*#@*$#((*CAJICA(

So we want to ouptput the fact that we don't know:

> cat seqs_3.fa | fastago  stats alphabet
Unknown

lucblassel / fastago

Add `fastago stats alphabet` subcommand #4

Biological context

What the command should do

Example 1

Example 2

Example 3

Example 4