Subsampling to correct for lineage size

Hi,

Love the paper, and this script is amazing. It's a really fascinating (and truthful) way of thinking about pan-genomes!

I wanted to dig a bit deeper into the frequency table to reduce the impact of lineage size. You mention this in your paper, but would the following workflow make sense to you?

Subsample within each group in groups_to_keep, repeat this n times, and calculate gene frequencies for each subsample
Average these frequencies across the n replicates
Output the averages as the frequencies.csv table
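
To make the idea concrete, here is a rough sketch of the three steps above. It is written in Python and does not use any of twilight's own code; the function name `subsampled_frequencies`, its parameters, and the toy data are all hypothetical. I am also assuming that every group is subsampled without replacement to the same size, and that groups smaller than that size are used whole.

```python
import random


def subsampled_frequencies(presence, groups, sample_size, n_reps, seed=0):
    """Average within-group gene frequencies over repeated subsamples.

    presence: dict mapping genome name -> set of gene names it carries
    groups:   dict mapping group name -> list of genome names
    (Both structures are assumptions made for this sketch.)
    """
    rng = random.Random(seed)
    genes = sorted(set().union(*presence.values()))
    result = {}
    for group, members in groups.items():
        # Groups smaller than sample_size are used in full.
        k = min(sample_size, len(members))
        totals = dict.fromkeys(genes, 0.0)
        for _ in range(n_reps):
            # Step 1: subsample the group without replacement.
            sample = rng.sample(members, k)
            for gene in genes:
                count = sum(gene in presence[genome] for genome in sample)
                totals[gene] += count / k
        # Step 2: average the per-replicate frequencies.
        result[group] = {gene: totals[gene] / n_reps for gene in genes}
    return result


# Toy data: group A has four genomes, group B has two.
presence = {
    "a1": {"g1", "g2"}, "a2": {"g1"}, "a3": {"g1", "g2"}, "a4": {"g1"},
    "b1": {"g1", "g3"}, "b2": {"g1"},
}
groups = {"A": ["a1", "a2", "a3", "a4"], "B": ["b1", "b2"]}

# Step 3: this nested dict is what would be written out as frequencies.csv.
freqs = subsampled_frequencies(presence, groups, sample_size=2, n_reps=100, seed=1)
```

Fixing the seed keeps the averaged table reproducible between runs.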

Does this fit with how your method calculates within-lineage frequency? If so, I think it could be implemented either inside the "## create a vector of frequencies for each group" loop or just before it, as an input.

Thoughts?
