seguid / seguid-tests

Unit tests for any SEGUID implementations
https://www.seguid.org

need for a reference of all ALPHABETS #14

Open louisabraham opened 3 months ago

louisabraham commented 3 months ago

The Rust version currently passes with:

        ("{DNA}", "GC,AT"),
        ("{RNA}", "GC,AU"),
        ("{DNA-extended}", "GC,AT,BV,DH,KM,SS,RY,WW,NN"),
        ("{RNA-extended}", "GC,AU,BV,DH,KM,SS,RY,WW,NN"),
        ("{protein}", "A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,O,U"),
        (
            "{protein-extended}",
            "A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,O,U,B,J,X,Z"
        ),

looks like the TCL version has some more:

  regsub {\{DNA-IUPAC\}} $alphabet "CG,AT,WW,SS,MK,RY,BV,DH,VB,NN" alphabet
  regsub {\{RNA-IUPAC\}} $alphabet "CG,AU,WW,SS,MK,RY,BV,DH,VB,NN" alphabet
  regsub {\{protein-IUPAC\}} $alphabet "A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,B,O,U,J,Z,X" alphabet

1. we need to test ALL alphabets in the test suite
2. if some alphabets are removed, we should check them in the meantime
3. maybe some CLI option to display the alphabets in a standardized format like seguid --show-alphabets will help to achieve consistency between implementations?

HenrikBengtsson commented 3 months ago

  1. we need to test ALL alphabets in the test suite

We test them in https://github.com/seguid/seguid-tests/blob/main/tests-cli/70.alphabets-built-in.bats.

  2. if some alphabets are removed, we should check them in the meantime

I don't understand what this means?

  3. maybe some CLI option to display the alphabets in a standardized format like seguid --show-alphabets will help to achieve consistency between implementations?

That could be useful. What format do you suggest? Something like {DNA}=AT,CG or {DNA}: AT,CG?

HenrikBengtsson commented 3 months ago

looks like the TCL version has some more:

The seguid-tests should be our gold standard. If an implementation provides more alphabets than the test suite covers, then I think they need to be removed, unless they can be argued for, in which case they should be implemented everywhere else.

FWIW, given that {XY} = {YX}, it looks like {DNA-IUPAC} defines the same set as {DNA-extended}, {RNA-IUPAC} the same set as {RNA-extended}, and {protein-IUPAC} the same set as {protein-extended}, so the {*-IUPAC} ones should be removed.
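
The equivalence can be checked mechanically. Below is a minimal Python sketch, not code from any SEGUID implementation: the helper name `normalize` is hypothetical, and the strings are copied from the Rust and Tcl snippets quoted earlier in this thread. It assumes that the order of symbols inside a pair and the order of pairs in the list are both irrelevant.

```python
def normalize(spec):
    """Return an order-independent form of a comma-separated alphabet spec.

    Hypothetical helper for illustration only.
    """
    # Sorting the symbols inside each token makes "CG" and "GC" compare equal;
    # the frozenset then ignores list order and duplicates such as "BV"/"VB".
    return frozenset("".join(sorted(token)) for token in spec.split(","))


# Strings copied from the Rust and Tcl snippets quoted in this thread.
dna_extended = "GC,AT,BV,DH,KM,SS,RY,WW,NN"
dna_iupac = "CG,AT,WW,SS,MK,RY,BV,DH,VB,NN"
protein_extended = "A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,O,U,B,J,X,Z"
protein_iupac = "A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,B,O,U,J,Z,X"

print(normalize(dna_extended) == normalize(dna_iupac))
print(normalize(protein_extended) == normalize(protein_iupac))
```

Under those assumptions both comparisons print True. Note that the Tcl {DNA-IUPAC} string lists the same pair twice, as BV and as VB, which the set representation collapses.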

The Tcl implementation is still not officially released anywhere. (I might have been too quick to list it on https://www.seguid.org/)

HenrikBengtsson commented 3 months ago

  3. maybe some CLI option to display the alphabets in a standardized format like seguid --show-alphabets will help to achieve consistency between implementations?

That could be useful. What format do you suggest? Something like {DNA}=AT,CG or {DNA}: AT,CG?

FWIW, both the R and the Python CLI explain them in --help, e.g.

$ python -m seguid --help
usage: python -m seguid [-h] [--version] [--type [TYPE]] [--alphabet [ALPHABET]] [--form [FORM]]

...

Predefined alphabets:
 '{DNA}'              Complementary DNA symbols (= 'AT,CG')
 '{DNA-extended}'     Extended DNA (= '{DNA},BV,DH,KM,SS,RY,WW,NN')
 '{RNA}'              Complementary RNA symbols (= 'AU,CG')
 '{RNA-extended}'     Extended DNA (= '{RNA},BV,DH,KM,SS,RY,WW,NN')
 '{protein}'          Amino-acid symbols (= 'A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y')
 '{protein-extended}' Amino-acid symbols (= '{protein},O,U,B,J,Z,X')

and

$ Rscript -e seguid::seguid --help

...

Predefined alphabets:
 '{DNA}'              Complementary DNA symbols (= 'AT,CG')
 '{DNA-extended}'     Extended DNA (= '{DNA},BV,DH,KM,SS,RY,WW,NN')
 '{RNA}'              Complementary RNA symbols (= 'AU,CG')
 '{RNA-extended}'     Extended DNA (= '{RNA},BV,DH,KM,SS,RY,WW,NN')
 '{protein}'          Amino-acid symbols (= 'A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y')
 '{protein-extended}' Amino-acid symbols (= '{protein},O,U,B,J,Z,X')

It's the goal to eventually harmonize the --help output across implementations.
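
The --help text above defines some alphabets in terms of others, e.g. '{DNA-extended}' as '{DNA},BV,...'. As an illustration only, here is a minimal Python sketch of how such nested references could be expanded into a flat list. The table `DEFS` and the function `expand` are hypothetical names, the definitions are copied from the help output above, and this is not claimed to be how the R or Python packages do it.

```python
import re

# Definitions copied from the --help output quoted above (hypothetical table name).
DEFS = {
    "{DNA}": "AT,CG",
    "{DNA-extended}": "{DNA},BV,DH,KM,SS,RY,WW,NN",
    "{RNA}": "AU,CG",
    "{RNA-extended}": "{RNA},BV,DH,KM,SS,RY,WW,NN",
    "{protein}": "A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y",
    "{protein-extended}": "{protein},O,U,B,J,Z,X",
}

REFERENCE = re.compile(r"\{[^{}]+\}")


def expand(spec, defs):
    """Replace every {name} reference in spec until none remain.

    Raises KeyError for an unknown name. Assumes the definitions are
    acyclic; a cyclic table would make this loop forever.
    """
    while True:
        match = REFERENCE.search(spec)
        if match is None:
            return spec
        spec = spec[:match.start()] + defs[match.group(0)] + spec[match.end():]


print(expand("{DNA-extended}", DEFS))
```

Comparing the expanded strings, rather than the raw definitions, would let each implementation keep its own shorthand while still agreeing on the final alphabet.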

louisabraham commented 3 months ago

if some alphabets are removed, we should check them in the meantime

Did we at some point change or remove some alphabets? If yes, we should manually check them in the source code.

That could be useful. What format do you suggest? Something like {DNA}=AT,CG or {DNA}: AT,CG?

Yes, some --alphabet command that displays all available alphabets, one on each line and sorted by name. For each alphabet, we display all the valid pairs sorted lexicographically. (this sounds like we are reimplementing seguid on the alphabets haha)
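
A rough Python sketch of what that listing could look like, under several assumptions: the {NAME}=PAIRS format is one of the two suggested above, each pair is written with its symbols in sorted order, and duplicates are dropped. The names `ALPHABETS`, `canonical` and `show_alphabets` are hypothetical, and the table holds only a few of the strings quoted in this thread.

```python
# Hypothetical table; the values are copied from strings quoted in this thread.
ALPHABETS = {
    "{DNA}": "GC,AT",
    "{RNA}": "GC,AU",
    "{DNA-extended}": "GC,AT,BV,DH,KM,SS,RY,WW,NN",
}


def canonical(spec):
    """Sort the symbols inside each pair, drop duplicates, then sort the pairs."""
    pairs = {"".join(sorted(token)) for token in spec.split(",")}
    return ",".join(sorted(pairs))


def show_alphabets(alphabets):
    """Return one NAME=PAIRS line per alphabet, sorted by name."""
    lines = [f"{name}={canonical(spec)}" for name, spec in sorted(alphabets.items())]
    return "\n".join(lines)


print(show_alphabets(ALPHABETS))
```

With plain string sorting, {DNA-extended} is listed before {DNA} because '-' sorts before '}'. If every implementation printed this canonical form, the outputs could be compared byte for byte in the test suite.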