iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

How to treat genomes that come from the same source, for instance three samples that are biological replicates? #338

Open Isoris opened 1 year ago

Isoris commented 1 year ago

Hello, and thank you for making this useful tool. My question is the following.

I'm making a pangenome from diverse strains of Streptococcus iniae. Some of the strains in the NCBI database in fact come from the same sampling: they come from the same fish farm, and were processed, sequenced and uploaded by the same person. So they basically cluster together into one clade on a phylogenetic tree.

From my understanding, pandora takes the mean of all nucleotides at the same position at each locus across the input sequences, and so builds an average of the variation in the population. So if 3 isolates come from the same farm and have an A, but another isolate comes from a different farm and has a T, pandora will consider that the A represents the population best, and the T will be treated as a SNP relative to pandora's reference.

So it makes no sense to include biological replicates, because they will artificially skew the reference. Say I'm comparing S. iniae isolates from around the world: if 70 come from Australia, always sampled 10 per farm, and I want to compare them to 10 other countries with 1 isolate per country, the reference will over-represent the isolates from Australia.

So my idea was to first make a whole-genome alignment using progressiveMauve, build the phylogenetic tree with PhyML, and then select 1 isolate per clade, so that the set of isolates used to build the pandora reference is more balanced.

I would be grateful for any advice, comments or suggestions. Thank you.

iqbal-lab commented 1 year ago

Hi there, thanks for the interesting question. I think there are two issues here:

  1. I think there is a misunderstanding: "from my understanding, pandora will make the mean of all nucleotides of the same position at each loci from the input sequences and create a mean of those variations in the population. So if 3 isolates come from the same farm and have a A letter but another isolate come from a different farm and have a T, pandora will consider that the A represents the population the best, hence T is considered SNPs relative to pandora's reference". Pandora doesn't weight the graph by population frequency. It takes the genes/MSAs and makes a graph, with equal weighting for all input samples. So if you pass 3 near-identical genomes in, it should not cause a problem. If they were sampled from the same colony, I would say don't bother. If they are from the same fish farm, they could be genetically different, so it's up to you, but you could leave them. The alleles won't be weighted by population frequency.

  2. You are right to worry about sampling bias. It is possible that the pan-genome in one location is different from that in another, so ideally you want to combine geographical and genetic sampling. We have been using PopPUNK to get clusters, and then sampling e.g. 1 or 2 genomes from each PopPUNK cluster. Your phylogeny approach is an alternative too.
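For what it's worth, the "1 or 2 genomes per cluster" step can be sketched in a few lines of Python. This is only a minimal illustration, not part of pandora or PopPUNK. It assumes the cluster assignments are in a CSV with `Taxon` and `Cluster` columns; check the header of your own clustering output. The function name `pick_representatives` and the example sample names are made up.

```python
import csv
import io
import random
from collections import defaultdict


def pick_representatives(rows, per_cluster=1, seed=0):
    """Return up to `per_cluster` sample names from each cluster.

    `rows` is an iterable of dicts with "Taxon" and "Cluster" keys
    (assumed column names, adjust to match your clustering output).
    """
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row["Cluster"]].append(row["Taxon"])

    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    chosen = {}
    for cluster, taxa in sorted(by_cluster.items()):
        taxa = sorted(taxa)  # stable order before sampling
        k = min(per_cluster, len(taxa))  # small clusters contribute what they have
        chosen[cluster] = sorted(rng.sample(taxa, k))
    return chosen


# Made-up example: three isolates from one farm fall into the same cluster.
EXAMPLE_CSV = """Taxon,Cluster
farmA_1,1
farmA_2,1
farmA_3,1
farmB_1,2
japan_1,3
"""

if __name__ == "__main__":
    rows = csv.DictReader(io.StringIO(EXAMPLE_CSV))
    for cluster, taxa in pick_representatives(rows, per_cluster=1).items():
        print(cluster, taxa)
```

The same function would work for the phylogeny approach if you write out one row per isolate with its clade label in place of the cluster ID. Random choice within a cluster is just one option; you might prefer to pick the best-quality assembly in each cluster instead.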

Isoris commented 1 year ago

I understand better now, thank you.