thegenemyers / FASTK

A fast K-mer counter for high-fidelity shotgun datasets
Other
112 stars 15 forks source link

Logex and Fastmerge bugs #13

Closed c-zhou closed 2 years ago

c-zhou commented 3 years ago

Hello Gene,

Logex and Fastmerge assume the ibyte field is always 3. I ran into a very small genome (~40 Mb), which gave me 2. The histograms from Logex and Fastmerge are correct. The Kmer tables, however, are junks.

(1) Line 1026 and 1044 should be something like bst+ibyte here instead of bst+3? https://github.com/thegenemyers/FASTK/blob/d69435929733dddb600b019fde1ab7f438c05e3c/Logex.c#L1026 https://github.com/thegenemyers/FASTK/blob/d69435929733dddb600b019fde1ab7f438c05e3c/Logex.c#L1044

(2) Line L1028 and L1050 should be something like this?

x = 0;
if (ibyte==3)
    x = ((int) bst[0] << 16) | (bst[1] << 8) | bst[2];
else if(ibyte==2)
    x = ((int) bst[0] << 8) | bst[2];
else if(ibyte==1)
    x = bst[0];

https://github.com/thegenemyers/FASTK/blob/d69435929733dddb600b019fde1ab7f438c05e3c/Logex.c#L1028 https://github.com/thegenemyers/FASTK/blob/d69435929733dddb600b019fde1ab7f438c05e3c/Logex.c#L1050

(3) You write the minval as 1. Do you think if it is worth updating it after Logex? https://github.com/thegenemyers/FASTK/blob/d69435929733dddb600b019fde1ab7f438c05e3c/Logex.c#L1391

(4) another minor point is do you think if it makes sense to grant the group read permission to files generated by Logex?

Fastmerge is pretty much the same as Logex.

Best, Chenxi

thegenemyers commented 2 years ago

Hi Chenxi,

I'm very impressed at how quickly you understand my code albeit your 

understanding is not completely correct.

First, Fastmerge is not yet complete and is under development, so 

please ignore it for now. Note that it is not yet documented in the README. Apologies.

The value of ibyte, the prefix table width, depends on k and the 

size of the table (see FastK/count.c lines 1614-1619). For K = 31 or 21, ibyte will be 3 if the table has more than 256K entries, 2 if it has more than 1024, and 1 otherwise. All the table input is through the library, libfastk.c, which handles the prefix correctly so that if input tables have different prefix widths it should all be fine. Only for the table output do I fix at 3 and this should cause a problem only if K < 12. I was lazy and should probably arrange it so that Logex works for K < 12, but it doesn't seem this is true of any of your test case, so for the suggestions you give: did they really fix problem? I just did a run with k=21 a few days ago and Logex was fine with the new patch of 4 days ago. Please advise.

In thinking about your comments and reviewing the code, I did spot 

something that could cause a problem when the input tables have different prefix widths, namely the division of the tables into parts for threads around line 1344. I will fix this but the circumstances should be very rare.

If you could send me inputs that give incorrect results that would 

be great. Its critical Logex be correct as much of my Merqury.FK implementation depends on it.

Thanks so much,  Gene

On 9/18/21, 1:10 AM, c-zhou wrote:

Hello Gene,

Logex and Fastmerge assume the |ibyte| field is always 3, which causes problems for smaller Kmer sizes such as 31 and 21 where the |ibyte| field is 2. The histograms from Logex and Fastmerge are correct. The Kmer tables, however, are junks. I know you usually use K=41.

(1) Line 1026 and 1044 should be something like |bst+ibyte| here instead of |bst+3|? https://github.com/thegenemyers/FASTK/blob/d69435929733dddb600b019fde1ab7f438c05e3c/Logex.c#L1026 https://github.com/thegenemyers/FASTK/blob/d69435929733dddb600b019fde1ab7f438c05e3c/Logex.c#L1044

(2) Line L1028 and L1050 should be something like this?

x = 0; if (ibyte==3) x = ((int) bst[0] << 16) | (bst[1] << 8) | bst[2]; else if(ibyte==2) x = ((int) bst[0] << 8) | bst[2]; else if(ibyte==1) x = bst[0];

https://github.com/thegenemyers/FASTK/blob/d69435929733dddb600b019fde1ab7f438c05e3c/Logex.c#L1028 https://github.com/thegenemyers/FASTK/blob/d69435929733dddb600b019fde1ab7f438c05e3c/Logex.c#L1050

(3) Line 1392 need to be fixed as well.

if (ibyte == 3) fwrite(&three,sizeof(int),1,f); else if(ibyte == 2) fwrite(&two,sizeof(int),1,f); else if(ibyte == 1) fwrite(&one,sizeof(int),1,f);

https://github.com/thegenemyers/FASTK/blob/d69435929733dddb600b019fde1ab7f438c05e3c/Logex.c#L1392

(4) You write the |minval| as 1. Do you think if it is worth updating it after Logex? https://github.com/thegenemyers/FASTK/blob/d69435929733dddb600b019fde1ab7f438c05e3c/Logex.c#L1391

(5) another minor point is do you think if it makes sense to grant the group read permission to files generated by Logex?

Fastmerge is pretty much the same as Logex.

Best, Chenxi

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/FASTK/issues/13, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABUSINQNBWKMDS5H3MDOLA3UCPDGXANCNFSM5EIPVHLA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

c-zhou commented 2 years ago

Hi Gene,

I sent you an email with the genome assembly which produces 'ibyte=2'.

Thanks very much for your detailed explaination about how 'ibyte' was determined. I think the problem with the data I have is it contains only 33M mers and FastK requires at least 64M mers (instead of 256K - am I right? FastK/count.c lines 1614-1619) to have 'ibyte=3'.

Best, Chenxi

c-zhou commented 2 years ago

problem solved