iqbal-lab-org / make_prg

Code to create a PRG from a Multiple Sequence Alignment file

kmer ID attribution #2

Closed leoisl closed 1 year ago

leoisl commented 5 years ago

Hello there!

I am reading this code and trying to understand it a little! I have a question about the kmer ID attribution in this code. I'd propose checking for the presence of the kmer, instead of the full seq, in the if condition.

In this test, these are the values of self.kmer_dict and seq_kmer_counts before:

15/03/2019 05:13:47 self.kmer_dict = {'TTT': 62, 'TTA': 59, 'TAT': 60, 'ATT': 61, 'TTC': 34, 'TCT': 35, 'CTT': 36, 'TTG': 55, 'TGT': 56, 'GTT': 57}
15/03/2019 05:13:47 These vectors have length 63
15/03/2019 05:13:47 seq_kmer_counts = [[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 2. 2. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 2. 2. 2. 6.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 3. 3. 3. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1. 6.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 3. 3. 3. 0. 2. 2. 2. 6.]]

and after a small fix:

15/03/2019 05:18:27 self.kmer_dict = {'TTT': 0, 'TTA': 1, 'TAT': 2, 'ATT': 3, 'TTC': 4, 'TCT': 5, 'CTT': 6, 'TTG': 7, 'TGT': 8, 'GTT': 9}
15/03/2019 05:18:27 These vectors have length 10
15/03/2019 05:18:27 seq_kmer_counts = [[6. 2. 2. 2. 2. 2. 2. 1. 1. 1.]
 [6. 1. 1. 1. 3. 3. 3. 1. 1. 1.]
 [6. 2. 2. 2. 0. 0. 0. 3. 3. 3.]]

There is no real difference regarding self.kmer_dict: each kmer still has a unique ID. However, seq_kmer_counts, which is fed into the KMeans algorithm, might contain a lot of meaningless 0s (whenever a kmer is repeated in the considered subalignment, we get an additional 0).
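To make the effect concrete, here is a minimal sketch of the behaviour described above. It is not the make_prg code: the names `assign_kmer_ids`, `count_vectors` and the `check_kmer` switch are hypothetical, and the loop structure is only an assumption about how the IDs are handed out.

```python
def assign_kmer_ids(seqs, k, check_kmer=True):
    """Give each k-mer an integer ID (hypothetical helper, not make_prg's)."""
    kmer_ids = {}
    next_id = 0
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # check_kmer=False mimics the suspected bug: the full sequence is
            # tested, and since it is never a key when len(seq) > k, every
            # k-mer occurrence is given a fresh ID.
            key = kmer if check_kmer else seq
            if key not in kmer_ids:
                kmer_ids[kmer] = next_id
                next_id += 1
    return kmer_ids, next_id


def count_vectors(seqs, k, kmer_ids, length):
    """Build one count vector per sequence, indexed by k-mer ID."""
    counts = [[0] * length for _ in seqs]
    for row, seq in zip(counts, seqs):
        for i in range(len(seq) - k + 1):
            row[kmer_ids[seq[i:i + k]]] += 1
    return counts


seqs = ["TTTT", "TTTA"]

buggy_ids, buggy_len = assign_kmer_ids(seqs, 3, check_kmer=False)
fixed_ids, fixed_len = assign_kmer_ids(seqs, 3, check_kmer=True)

print(buggy_ids, buggy_len)  # {'TTT': 2, 'TTA': 3} 4
print(fixed_ids, fixed_len)  # {'TTT': 0, 'TTA': 1} 2

print(count_vectors(seqs, 3, buggy_ids, buggy_len))  # [[0, 0, 2, 0], [0, 0, 1, 1]]
print(count_vectors(seqs, 3, fixed_ids, fixed_len))  # [[2, 0], [1, 1]]
```

In both variants every kmer ends up with a unique ID, but in the first one the IDs that were overwritten leave columns that are always 0.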

I am wondering whether this is really a (minor) bug or whether I am just misunderstanding...

Thanks!

rmcolq commented 5 years ago

Looks to me like you found a bug!

leoisl commented 5 years ago

It seems pretty minor though, no changes to the KMeans clustering in my tests! Maybe KMeans clustering runs a little bit faster since the array you give to it might be smaller, but that is about it.
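One way to see why the clustering would be unaffected, assuming KMeans compares the vectors by Euclidean distance: all-zero columns and a reordering of columns leave every pairwise distance unchanged. The sketch below rebuilds the length-63 vectors from the length-10 ones using the two ID maps logged in the first comment; the helper names `remap` and `sq_dist` are hypothetical.

```python
# ID maps and count vectors copied from the log output in the first comment.
before_ids = {'TTT': 62, 'TTA': 59, 'TAT': 60, 'ATT': 61, 'TTC': 34,
              'TCT': 35, 'CTT': 36, 'TTG': 55, 'TGT': 56, 'GTT': 57}
after_ids = {'TTT': 0, 'TTA': 1, 'TAT': 2, 'ATT': 3, 'TTC': 4,
             'TCT': 5, 'CTT': 6, 'TTG': 7, 'TGT': 8, 'GTT': 9}
after_counts = [
    [6, 2, 2, 2, 2, 2, 2, 1, 1, 1],
    [6, 1, 1, 1, 3, 3, 3, 1, 1, 1],
    [6, 2, 2, 2, 0, 0, 0, 3, 3, 3],
]


def remap(row, src_ids, dst_ids, length):
    """Move each k-mer count from its source column to its destination column."""
    out = [0] * length
    for kmer, src in src_ids.items():
        out[dst_ids[kmer]] = row[src]
    return out


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Sparse, length-63 vectors as produced before the fix.
before_counts = [remap(r, after_ids, before_ids, 63) for r in after_counts]

pairs = [(0, 1), (0, 2), (1, 2)]
before_d = [sq_dist(before_counts[i], before_counts[j]) for i, j in pairs]
after_d = [sq_dist(after_counts[i], after_counts[j]) for i, j in pairs]

print(before_d)  # [6, 24, 42]
print(after_d)   # [6, 24, 42]
```

Since the distances are identical, only the size of the input array differs between the two versions.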

I can make a pull request, if you wish.

rmcolq commented 5 years ago

Happy for a pull request now, or later (looks like you've got a few tests over on your fork, and you might find a few more bugs before you are done :) )

leoisl commented 5 years ago

Hello again!

Very very sorry for the huge delay in answering you! Yeah, I am writing a few additional tests so I can understand the code better, and also to do some bug hunting... But I think I might have found one or two minor bugs for now (i.e. they do not seem to change the output). It is a nice script BTW, I've learned from it, thanks!

I shall try to fully understand the code in the next few weeks and make a PR later with what I've found, if you also agree with the possible fixes. Not sure if I'll be able to do so though (last month of PhD, I hope you understand :p).

Hopefully you will still be around when I arrive, so that we can chat a little bit about the code!

rmcolq commented 5 years ago

Sounds good!

leoisl commented 1 year ago

Closing, outdated