tseemann / snp-dists

Pairwise SNP distance matrix from a FASTA sequence alignment
GNU General Public License v3.0

ambiguous nucleotides? #31

Open trommleralex opened 5 years ago

trommleralex commented 5 years ago

Dear Torsten,

I want to calculate genetic distances between sequences that contain ambiguous bases, i.e. W, S, Y and so on. If I am not mistaken, snp-dists can either ignore these positions or count them as a SNP. However, I would like to use the ambiguity information, e.g.:

W vs. A or T -> print distance 0
W vs. G or C -> print distance 1

I would also love to stick to the Unix command line, because I have thousands of sequences and could easily loop the command there.

Would you consider implementing support for ambiguous bases in snp-dists, or could you recommend another program that can deal with them?

Thanks a lot and best wishes! Alex
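
The rule requested above (W vs. A or T gives 0, W vs. G or C gives 1) amounts to asking whether the two base sets share at least one base. The following is a minimal sketch of that idea, not code from snp-dists: the names `iupac_mask` and `compat_distance` are hypothetical, and skipping non-IUPAC characters such as gaps is an assumption made only for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Map an IUPAC nucleotide code to a 4-bit set: A=1, C=2, G=4, T=8.
   Anything else (including gap characters) maps to 0. */
static unsigned iupac_mask(char c)
{
    switch (toupper((unsigned char)c)) {
    case 'A': return 1;
    case 'C': return 2;
    case 'G': return 4;
    case 'T': case 'U': return 8;
    case 'R': return 1 | 4;       /* A or G */
    case 'Y': return 2 | 8;       /* C or T */
    case 'S': return 2 | 4;       /* G or C */
    case 'W': return 1 | 8;       /* A or T */
    case 'K': return 4 | 8;       /* G or T */
    case 'M': return 1 | 2;       /* A or C */
    case 'B': return 2 | 4 | 8;   /* C or G or T */
    case 'D': return 1 | 4 | 8;   /* A or G or T */
    case 'H': return 1 | 2 | 8;   /* A or C or T */
    case 'V': return 1 | 2 | 4;   /* A or C or G */
    case 'N': return 15;          /* any base */
    default:  return 0;
    }
}

/* Count aligned positions whose base sets are disjoint.
   Assumption for this sketch: positions where either character
   is not an IUPAC code are skipped rather than counted. */
static size_t compat_distance(const char *a, const char *b, size_t len)
{
    assert(a && b);
    size_t d = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned ma = iupac_mask(a[i]);
        unsigned mb = iupac_mask(b[i]);
        if (ma && mb && !(ma & mb))
            d++;
    }
    return d;
}
```

With this rule, `compat_distance("W", "A", 1)` is 0 and `compat_distance("W", "G", 1)` is 1.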

kloetzl commented 5 years ago

Hi there,

Implementing ambiguous nucleotides is possible. However, weighting the comparisons is not trivial. You suggest that d(W,A) = 0 and d(W,T) = 0, but d(A,T) = 1. The distance is thus no longer a true metric. So I am unsure how the weighting should be implemented to satisfy the needs of all users of snp-dists.

Best, Fabian
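
Spelled out, the objection is that the proposed values break the triangle inequality, which any metric must satisfy for all x, y, z:

```latex
d(x,z) \le d(x,y) + d(y,z)
\qquad\text{but}\qquad
d(A,T) = 1 \;>\; 0 = d(A,W) + d(W,T).
```

In addition, d(W,A) = 0 although W and A are different symbols, so the distance can be zero between non-identical codes.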

trommleralex commented 5 years ago

Hei Fabian,

thanks for the quick reply. For my application it does not matter if distances are truly metric but I can see what the problem is... this is indeed non-trivial...

Anyway, thanks very much for your help! :-)

Best wishes!

Alex

Alexander Brandt, MSc Georg-August-University Göttingen J.-F.-Blumenbach-Institute of Zoology and Anthropology Dept. of Animal Ecology Berliner Straße 28 D-37073 Göttingen



tseemann commented 5 years ago

I think supporting IUPAC codes in some manner would be a good option to include, but it is complicated. What about d(W,W) and d(W, B/D/G/V) etc?

Nucleotide Code:  Base:
----------------  -----
A.................Adenine
C.................Cytosine
G.................Guanine
T (or U)..........Thymine (or Uracil)
R.................A or G
Y.................C or T
S.................G or C
W.................A or T
K.................G or T
M.................A or C
B.................C or G or T
D.................A or G or T
H.................A or C or T
V.................A or C or G
N.................any base
. or -............gap
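
One possible answer to the d(W,W) and d(W,B) question, sketched here only to show why the choice is not obvious: treat each code as a uniform, independent draw from its base set and score the probability that the two draws differ. This convention is an assumption of the example, not something the thread settles on, and the names `iupac_mask`, `popcount4` and `expected_mismatch` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* IUPAC code -> 4-bit base set (A=1, C=2, G=4, T=8); 0 if unknown. */
static unsigned iupac_mask(char c)
{
    static const char codes[] = "ACGTRYSWKMBDHVN";
    static const unsigned masks[] =
        {1, 2, 4, 8, 5, 10, 6, 9, 12, 3, 14, 13, 11, 7, 15};
    int u = toupper((unsigned char)c);
    if (u == 'U')
        u = 'T';
    /* strchr would match the terminating NUL, so exclude it explicitly. */
    const char *p = u ? strchr(codes, u) : NULL;
    return p ? masks[p - codes] : 0;
}

/* Number of bases in a 4-bit set. */
static int popcount4(unsigned m)
{
    assert(m <= 15u);
    return (int)((m & 1) + ((m >> 1) & 1) + ((m >> 2) & 1) + ((m >> 3) & 1));
}

/* Probability that two independent uniform draws from the base sets
   of x and y differ: 1 - |X n Y| / (|X| * |Y|).
   Returns -1.0 if either character is not an IUPAC code. */
static double expected_mismatch(char x, char y)
{
    unsigned mx = iupac_mask(x), my = iupac_mask(y);
    if (!mx || !my)
        return -1.0;
    return 1.0 - (double)popcount4(mx & my) / (popcount4(mx) * popcount4(my));
}
```

Under this convention d(A,A) = 0 and d(W,G) = 1 as expected, but d(W,W) = 0.5 and d(W,B) = 5/6, so a code is no longer at distance zero from itself. A pure "share at least one base" rule would instead give 0 for both pairs. Neither choice is obviously right, which is the complication raised above.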

kloetzl commented 5 years ago

I came up with an implementation (https://github.com/kloetzl/snp-dists/tree/ambiguous) that works with ambiguous nucleotides. @trommleralex Note that you have to fill in the table in main.c, and you have to compile using make.
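
The thread does not show what the table in main.c looks like, so the following is only a guess at the general idea of a user-filled cost table. Every name here (`cost`, `init_cost`, `set_cost`, `table_distance`) is hypothetical, and the layout may differ from the actual branch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cost table indexed by the two characters compared. */
static unsigned char cost[256][256];

/* Default fill: identical characters cost 0, all other pairs cost 1. */
static void init_cost(void)
{
    for (int i = 0; i < 256; i++)
        for (int j = 0; j < 256; j++)
            cost[i][j] = (unsigned char)(i != j);
}

/* Override one pair symmetrically, e.g. set_cost('W', 'A', 0). */
static void set_cost(char x, char y, unsigned char c)
{
    cost[(unsigned char)x][(unsigned char)y] = c;
    cost[(unsigned char)y][(unsigned char)x] = c;
}

/* Sum the per-position costs of two aligned sequences. */
static size_t table_distance(const char *a, const char *b, size_t len)
{
    assert(a && b);
    size_t d = 0;
    for (size_t i = 0; i < len; i++)
        d += cost[(unsigned char)a[i]][(unsigned char)b[i]];
    return d;
}
```

A table of this kind leaves the weighting decision to the user: after `init_cost()`, calling `set_cost('W', 'A', 0)` and `set_cost('W', 'T', 0)` would reproduce the rule asked for at the top of the thread, while other users could enter different values.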

trommleralex commented 5 years ago

Hi Fabian,

this is awesome! Thanks so much for your effort, it will make things a lot easier now!

Made my day :-)

All the best!

Alex


kullrich commented 3 years ago

Dear @tseemann,

we once met in Hinxton, UK, during an ENA meeting. I took the opportunity and cloned your nice snp-dists repo. I have added a so-called basic literal distance, which deals with IUPAC distances. The original code was not touched, and one can still calculate the snp-dists distance.

https://github.com/kullrich/literal-dists

Best regards
Kristian