makovalab-psu / DiscoverY

K-mer based classifier for Y-contig identification from Whole Genome Assemblies
MIT License

Proportion reported by classify_ctgs.py described incorrectly in README? #7

Open rsharris opened 5 years ago

rsharris commented 5 years ago

The README says "the proportion shared between each contig with a female reference is computed."

Maybe I am wrong about the rest of this, but it seems like that contradicts what the code does.

In both classify_fm_male_mode() and classify_fm_mode(), it looks like what is reported is (C-F) / C, where C is the number of kmers in the contig (duplicates counted as often as they appear, all-N kmers not counted) and F is the number of those kmers that are also in the female reference.

So a proportion reported as 1.0 would mean none of the contig's kmers were found in the female reference. That would be evidence that the contig comes from something not found in female, presumably male specific.

A proportion reported as 0.0 would mean all of the contig's kmers were found in the female reference, which is evidence that the contig is not male specific.
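
To make the (C-F) / C reading concrete, here is a minimal sketch of the calculation as I understand it. This is not the code from classify_ctgs.py: the function name `unshared_proportion`, its parameters, and the use of a plain Python set for the female kmers are all assumptions for illustration, and it ignores any reverse-complement handling the real tool may do.

```python
def unshared_proportion(contig, female_kmers, k):
    """Return (C - F) / C for one contig, or None if no kmers are countable.

    Hypothetical helper, not part of DiscoverY.
    C = kmers in the contig (duplicates counted, all-N kmers skipped)
    F = those kmers that are also present in the female reference set
    """
    all_n = "N" * k
    total = 0   # C
    shared = 0  # F
    for i in range(len(contig) - k + 1):
        kmer = contig[i:i + k].upper()
        if kmer == all_n:
            continue  # all-N kmers are not counted
        total += 1
        if kmer in female_kmers:
            shared += 1
    if total == 0:
        return None  # contig shorter than k, or made only of N
    return (total - shared) / total


# Toy contig with four 3-mers: ACG, CGT, GTA, TAC
contig = "ACGTAC"
print(unshared_proportion(contig, set(), 3))                         # nothing shared
print(unshared_proportion(contig, {"ACG", "CGT", "GTA", "TAC"}, 3))  # everything shared
```

Under this reading, the first call gives 1.0 (no kmers found in female) and the second gives 0.0 (all kmers found in female), which is the opposite of a "proportion shared".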