Suggestion - allow a reference to be set

tseemann / ekidna

Assembly based core genome SNP alignments for bacteria

GNU General Public License v3.0


Suggestion - allow a reference to be set #17

Open BiologicalScientist opened 4 years ago

BiologicalScientist commented 4 years ago

Hi - really liking the tool! One suggested improvement would be an option to control which sequence is set as the reference, rather than always defaulting to the largest sequence. If there is a reason this is a bad idea, please let me know. I had a look at the code, and from what I could tell the reference is determined by the snippet below. The $biggest_size variable appears to be used only to decide $ref_idx, so it should be possible to set $ref_idx from a user-supplied reference instead. I wasn't sure whether any of the other tools require the reference to be the largest sequence, though.

  if ($size > $biggest_size) {
    $ref_idx = $N;
    $biggest_size = $size;
  }

The main reason for this suggestion is that there is sometimes a reference sequence with better annotations or phenotypic data, so understanding what has changed relative to that sequence is useful.
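A minimal sketch of how such an option might look, assuming the inputs are available as name/size pairs. The `pick_ref_idx` helper, the `$wanted` argument, and the example file names are hypothetical and not part of ekidna; only the largest-sequence fallback mirrors the snippet quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: choose the reference index from a list of
# [name, size] pairs. $wanted is an optional user-supplied name.
sub pick_ref_idx {
  my ($seqs, $wanted) = @_;
  if (defined $wanted) {
    for my $N (0 .. $#$seqs) {
      return $N if $seqs->[$N][0] eq $wanted;
    }
    die "Requested reference '$wanted' not found among inputs\n";
  }
  # Default: same logic as the quoted snippet, largest sequence wins.
  my ($ref_idx, $biggest_size) = (0, -1);
  for my $N (0 .. $#$seqs) {
    my $size = $seqs->[$N][1];
    if ($size > $biggest_size) {
      $ref_idx = $N;
      $biggest_size = $size;
    }
  }
  return $ref_idx;
}

# Made-up inputs for illustration only.
my @seqs = (['A.fa', 4_800_000], ['B.fa', 5_100_000], ['C.fa', 4_950_000]);
print pick_ref_idx(\@seqs), "\n";          # default: largest, prints 1
print pick_ref_idx(\@seqs, 'A.fa'), "\n";  # explicit override, prints 0
```

Failing loudly when the requested name is missing seems safer than silently falling back to the largest sequence, since the whole point of the option is to know which reference was used.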

tseemann commented 4 years ago

Ultimately it should not matter what reference you use, because the SNPs it generates are "core" only. But that said, it could still be useful.

This project is very early stages, but I hope to work on it this month.

BiologicalScientist commented 4 years ago

Thanks for letting me know. For now I've made a quick workaround by reversing the logic (making the reference the smaller of the two sequences), and it seems to run fine.

The other advantage of running the analysis with different references is that uncov.bed gives you the regions present only in the reference (if I am understanding the output correctly), which can be useful for finding where phage etc. might be integrated.
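As a rough illustration of that use, the sketch below filters a BED file for long intervals. It assumes uncov.bed is ordinary tab-separated BED (chrom, 0-based start, exclusive end); the `long_regions` helper, the 10 kb threshold, and the example intervals are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return BED intervals at least $min_len bp long.
sub long_regions {
  my ($fh, $min_len) = @_;
  my @hits;
  while (my $line = <$fh>) {
    chomp $line;
    # Skip comments, track/browser header lines and blank lines.
    next if $line =~ /^(?:#|track|browser)/ or $line !~ /\S/;
    my ($chrom, $start, $end) = split /\t/, $line;
    my $len = $end - $start;   # BED is 0-based, half-open
    push @hits, [$chrom, $start, $end, $len] if $len >= $min_len;
  }
  return @hits;
}

# Made-up data standing in for an uncov.bed file.
my $bed = "contig1\t1000\t1200\ncontig1\t50000\t92000\ncontig2\t300\t15300\n";
open my $fh, '<', \$bed or die "Cannot open in-memory BED: $!";
for my $r (long_regions($fh, 10_000)) {
  print join("\t", @$r), "\n";   # chrom, start, end, length
}
```

To run it on a real file, open the path instead of the in-memory string. Long uncovered stretches are only candidates; whether one is an integrated phage would still need checking against annotation.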

Looking forward to the later stages when they come.

tseemann commented 4 years ago

It's important to realise ekidna is not a variant calling pipeline. It is an experiment to see how fast a SNP/alignment-based phylogeny can be, sitting between sketch-based methods (mashtree) and read/SNP methods (snippy).

I recently came across and packaged this new tool, MapCaller: https://github.com/hsinnan75/MapCaller
It is very fast because it works directly from reads and doesn't go via BAM files.

As for uncov.bed I can see how that would be useful with the correct reference. I will add the feature.

chrisgulvik commented 4 years ago

> Ultimately it should not matter what reference you use, because the SNPs it generates are "core" only.

One reason I'd like to specify a ref is faster processing of subclustering. A sample of interest specified as ref against a large panel enables the reuse of alignments with ekidna -k. Once ekidna shows the primary cluster the unknown isolate is in, a subset of the samples could be analyzed with the same alignments for a refined (larger) core genome.
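A toy sketch of why a subset can yield a larger core, assuming "core" means reference positions aligned in every sample under consideration. The `core_positions` helper and the coverage data are invented for illustration and do not reflect ekidna's internals.

```perl
use strict;
use warnings;

# Hypothetical helper: reference positions covered in every listed sample.
# Assumes each sample lists a position at most once.
sub core_positions {
  my ($covered, @samples) = @_;
  my %count;
  for my $s (@samples) {
    $count{$_}++ for @{ $covered->{$s} };
  }
  return sort { $a <=> $b } grep { $count{$_} == @samples } keys %count;
}

# Made-up reference positions aligned by each sample.
my %covered = (
  s1 => [1 .. 10],
  s2 => [1 .. 8],
  s3 => [3 .. 10],
  s4 => [5 .. 6],   # a more distant isolate
);
my @all = core_positions(\%covered, qw(s1 s2 s3 s4));
my @sub = core_positions(\%covered, qw(s1 s2 s3));
printf "core with all samples: %d positions\n", scalar @all;   # 2
printf "core with subset: %d positions\n", scalar @sub;        # 6
```

Because the per-sample coverage against the fixed reference does not change, dropping the distant sample only requires recomputing the intersection, not the alignments.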

tseemann commented 4 years ago

Ok, that's a good use case. Thank you!