pinellolab / DNA-Diffusion

🧬 Generative modeling of regulatory DNA sequences with diffusion probabilistic models 💨
https://pinellolab.github.io/DNA-Diffusion/
Other
356 stars 49 forks source link

Implement a sequence quality metric based on k-mer composition #42

Closed lucapinello closed 1 year ago

lucapinello commented 1 year ago

Roughly this is the idea: using a set of sequences (e.g. training set or generated set) we write first them in a temporary fasta file, using the obtained file (e.g. sequences.fa) call the kat hist function to get an histogram of kmers count. Here we can use k=7. Then we rescale the number count obtained of each k-mer based on the total sum to get probabilities of k-mers. Now assuming you have two probabilities vectors for k-mer corresponding to train sequences and generated sequences you can calculate the KL-diverge using the pytorch implementation that Cesar is also using. See: https://github.com/pinellolab/DNA-Diffusion/issues/14

lucapinello commented 1 year ago

This is a potential interface:

def KL_divergence_kmers(original, generated)

'original': list or pandas Series with original sequences used to train the model 'generated':list or pandas Series with generated sequences used to train the model

return a 'score' (float) calculating the Kullback-Leibler divergence

codeghees commented 1 year ago

@lucapinello - can you explain then we rescale the number count obtained of each k-mer based on the total sum to get probabilities of k-mers. Is there a dataset I can use to validate my functions?

codeghees commented 1 year ago

Do you want to rescale based on total number of k-mers?

lucapinello commented 1 year ago

Sure

1) can you explain then we rescale the number count obtained of each k-mer based on the total sum to get probabilities of k-mers.

The output of kat hist is a list of numbers, one number per k-mer (the count of that particular k-mer). You need to divide each k-mer count by the sum of all the count numbers so you get k-mer probabilities

E.g.

AAT 3 ATT 10 TTT 1

After the rescaling you get:

AAT 3/14 ATT 10/14 TTT 1/14

2) Is there a dataset I can use to validate my functions?

Yes you can use the same dataset used by @LucasSilvaFerreira and @zanussbaum in their notebooks. E.g. DNA sequences for a given component e.g. #15 corresponding to blood cells (https://www.meuleman.org/media/DHS_Vocabulary_components_WM20221007_hu3e206490677f480f3a01757800bdbab1_217268_1200x1200_fit_q75_h2_lanczos_3.webp) So we can see if the most frequent k-mers that you find are meaningful and mappable to things we expect/recognize.

codeghees commented 1 year ago

@lucapinello https://colab.research.google.com/drive/1JMCOmCSol-5rHOFIyweep_ph8qsrsB1Z?usp=sharing

I ran kat hist on ten samples from training. The output is:

# Title:7-mer spectra for: my_fasta.fa
# XLabel:7-mer frequency
# YLabel:# distinct 7-mers
# Kmer value:7
# Input 1:my_fasta.fa
###
1 1420
2 265
3 44
4 5
5 2
6 1
7 0
8 2
9 0
.
.
.

There are 1420 unique 7-mers. But this doesn't tell me which k-mer is it talking about, is there a tool that does it?

hssn-20 commented 1 year ago

I'm not sure if you're still trying to resolve this issue but I have managed to implement a simple similarity function based on the sourmash library. Here's a notebook implementation you can try: https://colab.research.google.com/drive/1SXwkN9WI-lNKINiGV9k7ZWnEHE9Hcv1G?usp=sharing

lucapinello commented 1 year ago

Thanks, this seems to call this function:

https://sourmash.readthedocs.io/en/latest/_modules/sourmash/minhash.html#MinHash.similarity

Do you know what is the range for this and/or more details on how to interpret the scale for this score?

Thanks,

Luca

On Mon, Nov 21, 2022 at 3:13 PM hssn-20 @.***> wrote:

I'm not sure if you're still trying to resolve this issue but I have managed to implement a simple similarity function based on the sourmash library. Here's a notebook implementation you can try: https://colab.research.google.com/drive/1SXwkN9WI-lNKINiGV9k7ZWnEHE9Hcv1G?usp=sharing

— Reply to this email directly, view it on GitHub https://github.com/pinellolab/DNA-Diffusion/issues/42#issuecomment-1322588930, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIH72W3XRO3OG6T77SJZ5DWJPJVZANCNFSM6AAAAAARSSAHBA . You are receiving this because you were mentioned.Message ID: @.***>

hssn-20 commented 1 year ago

The specific output (Jaccard similarity, cosine similarity, angular similarity) depends on the type of hashes we're using. But in all cases, the range is between 0 and 1. The closer to 1 the score is, the more its similar.

lucapinello commented 1 year ago

In your case, which one is called?

I see you are calling: similarity = round(mh1.similarity(mh2), 5)

But it is not clear if this calculates the Jaccard similarity, cosine similarity or angular similarity.

Also why do we need to round the value?

Thanks,

Luca

On Mon, Nov 21, 2022 at 4:00 PM hssn-20 @.***> wrote:

The specific output (Jaccard similarity, cosine similarity, angular similarity) depends on the type of hashes we're using. But in all cases, the range is between 0 and 1. The closer to 1 the score is, the more its similar.

— Reply to this email directly, view it on GitHub https://github.com/pinellolab/DNA-Diffusion/issues/42#issuecomment-1322633954, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIH72SYE43QLKWNM5JJ5Z3WJPPGVANCNFSM6AAAAAARSSAHBA . You are receiving this because you were mentioned.Message ID: @.***>

hssn-20 commented 1 year ago

In this instance it calculates the Jaccard similarity. If I was to create a scaled minihash like so: MinHash(n=0, ksize=3, scaled=1), it would give us cosine similarity instead. We don't need to round the values, that was just an arbitrary choice by me :) .

lucapinello commented 1 year ago

Thanks for the explanation! This is for sure calculating something related to kmer distributions. Do you know the length of each hash (not the number but the context they capture e.g. approximate words of length 7) or is this something we can impose?

On Mon, 21 Nov 2022, 4:16 pm hssn-20, @.***> wrote:

In this instance it calculates the Jaccard similarity. If I was to create a scaled minihash like so: MinHash(n=0, ksize=3, scaled=1), it would give us cosine similarity instead. We don't need to round the values, that was just an arbitrary choice by me :) .

— Reply to this email directly, view it on GitHub https://github.com/pinellolab/DNA-Diffusion/issues/42#issuecomment-1322654135, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIH72TGWXWMCINA6IE4JSLWJPRDTANCNFSM6AAAAAARSSAHBA . You are receiving this because you were mentioned.Message ID: @.***>

hssn-20 commented 1 year ago

This is something we impose in the k_sizes parameter (3,7,20 in this case). This is what they say in their documentation:

sourmash uses a hash function (by default MurmurHash) that converts each k-mer into 64-bit numbers. These numbers form the basis of everything else sourmash does; the k-mer strings are not used internally at all.

lucapinello commented 1 year ago

Yes, this makes sense. I think using k between 7 to 20 for our case makes sense (typical TF binding length) however I am not sure how many hashes functions we should use but this is something we can quickly test given that this procedure is probably very efficient!

On Mon, Nov 21, 2022 at 4:27 PM hssn-20 @.***> wrote:

This is something we impose in the k_sizes parameter (3,7,20 in this case)

— Reply to this email directly, view it on GitHub https://github.com/pinellolab/DNA-Diffusion/issues/42#issuecomment-1322668472, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIH72SWVVDTF4HHYY3H5QLWJPSKXANCNFSM6AAAAAARSSAHBA . You are receiving this because you were mentioned.Message ID: @.***>

hssn-20 commented 1 year ago

Yep. Also, another issue we need to keep in mind is their hash sampling method. Say you have a sample of 1million sequences, and we decide to use 30 thousand hash functions, does it only keep the hashes of the last 30k sequences, or does it do something akin to reservoir sampling? I'm not sure what the answer is yet but to be safe side, I'm adding a shuffling function.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

LucasSilvaFerreira commented 1 year ago

Hi @hssn-20 ? Did you finish implementing the Kmers metrics? Do you think it is feasible to use it during training (to compare our 1000k synthetic sequences with train, test, shuffle)?

hssn-20 commented 1 year ago

https://github.com/pinellolab/DNA-Diffusion/pull/57 we can, I just need to change the name of the function as per the suggestion

LucasSilvaFerreira commented 1 year ago

57 we can, I just need to change the name of the function as per the suggestion

It looks great!

I have some code, and I think we can adapt both solutions: https://colab.research.google.com/drive/1X7_9-L9RUNN8g7tjki0XQNJrR4AbFrA3#scrollTo=JscBrTK-KCUs

This will be the format we expected to input (FASTA files) and compute the KMER frequency of both datasets.

Just a few questions

Are you comparing sequence by sequence (seq_a1 x seq_b1. | seq_a2 x seq_b2) or counting the kmer frequency from the whole dataset at once (Why do we have an average?)?

What kind of metrics is it using?

I like the idea of computing the metrics for 3 different kmers lengths.

Thanks

hssn-20 commented 1 year ago

https://github.com/pinellolab/DNA-Diffusion/pull/57 I've updated the pr to include a way to ingest fasta files.

To answer your questions:

p.s. I don't know how the fasta files will be structured so assumed sequences will be saved under the "seq" title like they are for NCBI data

hssn-20 commented 1 year ago

100 hopefully closes this issue. However, a choice still needs to be made of what kmer sizes (currently 3, 7, 20) should be used and if the output should be individual similarity scores for those sizes or an average.