thegenemyers / FASTK

A fast K-mer counter for high-fidelity shotgun datasets
Other
112 stars 15 forks source link

Unexpected relative profile results (possible user error) #3

Closed rsharris closed 3 years ago

rsharris commented 3 years ago

Synopsis: I have a contrived example with a short genome and some perfect reads sampled from that genome. I'm using the relative profile feature to get counts of read kmers across the genome. The results are mostly 1s with a few short stretches of higher counts. The 1s conflict with the result I expect.

Details:

The example can be found at https://github.com/rsharris/fake_reads_2021. apple.fa is a random genome of length 3K. orange.fa is 80 fake reads, each 300 bp, sampled from apple with no sequencing error, at random positions and with random orientation. Average depth ≈8X.

Running this FastK -k40 -t1 apple.fa FastK -k40 -t1 orange.fa FastK -k40 -t1 -p:orange.ktab apple.fa Profex apple.prof 1 > orange_on_apple.profile I expected to get a count profile that roughly matched the coverage profile I would get from piling up orange to apple alignments (realizing that the 40-mers absent at the end of the reads would reduce avg count to ≈7X).

(orange_on_apple.profile is in that same repo) Instead, I get mostly 1s with occasional short bursts of higher numbers. I'm at a loss to understand this. Am I missing something in how I am using the tools? I know the motivation for profiles was to profile along reads, not along genomes, so maybe there's a difference I'm not understanding.

I wonder if the fact that the kmer counts in apple itself are all 1 is relevant.

FWIW I'm using a fetch of the repo from about an hour ago.

thegenemyers commented 3 years ago

Are you sure about orange.fa, it does not look like an 8x of apple.
When I look at the histogram orange.hist produced in the second step with Histex it is:

Histogram of unique 40-mers of orange

Input: 17,475 unique 40-mers

  Freq:        Count   Cum. %
    17:           12     0.1%
    16:           35     0.3%
    15:            1     0.3%
    12:           49     0.6%
    11:           51     0.8%
    10:            3     0.9%
     9:           54     1.2%
     7:           18     1.3%
     6:          156     2.2%
     5:           20     2.3%
     4:            2     2.3%
     3:           96     2.8%
     1:        16978   100.0%

In otherwords most k-mers occuring once would be consistent with the output your are getting when you build the relative profile. So I looked at the k-mer table for orange.fa with Tabex and its the strangest thing, identical k-mers are not collapsed, e.g.:

Opening 40-mer table with 17,475 entries 0: aaaaaaatggaatctcgcagcgcctatagtagcaaggata = 1 1: aaaaaaatggaatctcgcagcgcctatagtagcaaggata = 1 2: aaaaaaatggaatctcgcagcgcctatagtagcaaggata = 1 3: aaaaaatcgcaaactcacgaggttgaccacttaagtacgg = 1 4: aaaaaatcgcaaactcacgaggttgaccacttaagtacgg = 1 5: aaaaaatcgcaaactcacgaggttgaccacttaagtacgg = 1 6: aaaaaatcgcaaactcacgaggttgaccacttaagtacgg = 1 7: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1 8: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1 9: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1 10: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1 11: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1 12: aaaaaatgctttttaggcaataaaaaatcgcaaactcacg = 1 13: aaaaaatggaatctcgcagcgcctatagtagcaaggatac = 1 14: aaaaaatggaatctcgcagcgcctatagtagcaaggatac = 1 15: aaaaaatggaatctcgcagcgcctatagtagcaaggatac = 1 16: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 17: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 18: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 19: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 20: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 21: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 22: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 23: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 24: aaaaacctcttgcatacccacatagcttcgaagacgactc = 1 25: aaaaagcacatttccgcacagaccggactcaacactgcgc = 3 26: aaaaagcattttttcctcacaagctcacaggttcgaagct = 1 27: aaaaagcattttttcctcacaagctcacaggttcgaagct = 1 28: aaaaagcattttttcctcacaagctcacaggttcgaagct = 1

Hmm, its midnight here in Europe, I'll look at it tomorrow. Could again be some facet of a super small data set, albeit I did test a couple small ones previously.

I appreciate your kicking the tires in ways I obviously haven't, and am committed to solving every problem you report, albeit I hope (for your sake and my pride) there aren't too many more.

Best, Gene

On 4/1/21, 8:57 PM, Bob Harris wrote:

Synopsis: I have a contrived example with a short genome and some perfect reads sampled from that genome. I'm using the relative profile feature to get counts of read kmers across the genome. The results are mostly 1s with a few short stretches of higher counts. The 1s conflict with the result I expect.

Details:

The example can be found at https://github.com/rsharris/fake_reads_2021. apple.fa is a random genome of length 3K. orange.fa is 80 fake reads, each 300 bp, sampled from apple with no sequencing error, at random positions and with random orientation. Average depth ≈8X.

Running this |FastK -k40 -t1 apple.fa| |FastK -k40 -t1 orange.fa| |FastK -k40 -t1 -p:orange.ktab apple.fa| |Profex apple.prof 1 > orange_on_apple.profile| I expected to get a count profile that roughly matched the coverage profile I would get from piling up orange to apple alignments (realizing that the 40-mers absent at the end of the reads would reduce avg count to ≈7X).

(orange_on_apple.profile is in that same repo) Instead, I get mostly 1s with occasional short bursts of higher numbers. I'm at a loss to understand this. Am I missing something in how I am using the tools? I know the motivation for profiles was to profile along reads, not along genomes, so maybe there's a difference I'm not understanding.

I wonder if the fact that the kmer counts in apple itself are all 1 is relevant.

FWIW I'm using a fetch of the repo from about an hour ago.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/FASTK/issues/3, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUSINXDZAHQWSZZRD2AJZDTGS62VANCNFSM42HT5OVA.

rsharris commented 3 years ago

Oh yeah, I see what you mean about duplicates in the kmer table for orange. Very strange.

I'm going to try the same test on a couple different architectures, on the outside change that it's platform-dependent.

rsharris commented 3 years ago

I've tried it on three machines, all of which have duplicates in the orange kmer table.

One machine on campus running some version of linux. And two macs here at home. I can send you architecture details, but my guess about platform differences looks to be irrelevant.

rsharris commented 3 years ago

I tried a few similar datasets with various sizes. All with same result UNTIL I happened to try FastK -k40 -t2 orange.fa instead of FastK -k40 -t1 orange.fa. The table for -t2 looks more reasonable — by eyeball I didn't notice any duplicated kmers. But the histogram still seems to have too many for freq=1 (even though it shows a distribution that otherwise makes more sense than t=1 did, with a nice peak at freq=7).

I'm probably chasing after cats that are of no use to you, so I'll stop with that one.

thegenemyers commented 3 years ago

Its early morning here in Germany. It is a bug I unfortunately introduced last Saturday into the core sorting routine. I've found it (a one character mistake (again) :-)) and am doing a little more testing before I check the fix in.

richarddurbin commented 3 years ago

Let me know when you have checked in the fix. I am planning to do a whole set of FastK runs today and I want them to be correct.

On 2 Apr 2021, at 07:11, Eugene W Myers Jr @.***> wrote:

Its early morning here in Germany. It is a bug I unfortunately introduced last Saturday into the core sorting routine. I've found it (a one character mistake (again) :-)) and am doing a little more testing before I check the fix in.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/FASTK/issues/3#issuecomment-812341263, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA2FXZUICZ4QNGAXJTI7JSTTGVNXZANCNFSM42HT5OVA.

thegenemyers commented 3 years ago

@richarddurbin @rsharris

I just checked in the fixed version of FastK. I'm so sorry, I inadvertently screwed up the core MSD sort in the trunk version while working on trying to get more speed. And I did this on Saturday the day after my talk to the VGP gang. So I'm going to also send an e-mail to the group to let them know as there has been a bump in downloads since the talk. Sheisse.

Anyhow its fixed now (and a little faster than it was before :-) ).

Thanks Bob for catching this.