uci-cbcl / genomix

Parallel genome assembly using Hyracks

Handling `N`'s in the input files #69

Open jakebiesinger opened 10 years ago

jakebiesinger commented 10 years ago

There are going to be N characters in the input file. I think the proper way to handle these would be NOT to include any kmers containing N. As it is now, we throw away the entire read if any of the characters are N.

When we store the read, we could store the entire sequence, N and all, but that would mess up our 4-letter, 2-bit representation. For simplicity, I guess we could throw those reads away from the ReadHead. But I still think the other non-N kmers should be included in the graph.
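To illustrate why N does not fit the 2-bit representation, here is a minimal sketch of 2-bit packing. The names `CODES` and `pack_2bit` and the A/C/G/T to 0..3 mapping are assumptions made for illustration, not the actual genomix encoding.

```python
# Hypothetical 2-bit packing, for illustration only (not the genomix code).
# Two bits give exactly four codes, so a fifth symbol such as N has no slot.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping


def pack_2bit(seq):
    """Pack an A/C/G/T string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        if base not in CODES:
            # N (or any other symbol) cannot be represented in 2 bits.
            raise ValueError(f"cannot encode {base!r} in 2 bits")
        value = (value << 2) | CODES[base]
    return value
```

With this mapping, `pack_2bit("ACGT")` yields the bit pattern `00 01 10 11`, while `pack_2bit("ACNT")` raises `ValueError`. Supporting N in a stored read would need either a wider per-base code or a separate mask of N positions.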

anbangx commented 10 years ago

What do you mean by N characters? Could you give us an example?

Best Regards,

Anbang Xu

jakebiesinger commented 10 years ago

Sure. ATAGCTGACTGNNNACTGATCG could be a valid input. We should include all kmers from this sequence that don't include the N's.
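The proposed behavior could be sketched roughly as follows. The function name `kmers_without_n` and the `(offset, kmer)` return shape are hypothetical, chosen for illustration; this is not how genomix currently splits reads.

```python
def kmers_without_n(read, k):
    """Return (offset, kmer) pairs for every length-k window with no 'N'."""
    read = read.upper()
    result = []
    run_start = 0  # index where the current N-free run begins
    for i, base in enumerate(read):
        if base == "N":
            # An N ends the current run; the next run starts after it.
            run_start = i + 1
        elif i - run_start + 1 >= k:
            # The window ending at i lies entirely inside the N-free run.
            start = i - k + 1
            result.append((start, read[start:i + 1]))
    return result


if __name__ == "__main__":
    for offset, kmer in kmers_without_n("ATAGCTGACTGNNNACTGATCG", 5):
        print(offset, kmer)
```

For the example read with k = 5, the N-free runs are `ATAGCTGACTG` (11 bases) and `ACTGATCG` (8 bases), which contribute 7 and 4 kmers respectively. Keeping the offset alongside each kmer is one way to remember where it came from in the original read.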