Support for max-dist in 'closest' command

virus-evolution / gofasta

Support for max-dist in 'closest' command #31

Open tseemann opened 2 years ago

tseemann commented 2 years ago

Thank you for writing gofasta - it has some of the things I wanted to implement, plus more.

The closest command seems to be able to find the single closest per query, or the N closest.

Could there be an option to give all sequences within distance D?

Also, if there are C equally good matches, it tiebreaks by completeness. Could it optionally return all of the matches? In COVID we often have many identical sequences spread geographically, so we want all of those matches.

Or should updown be used for this?

benjamincjackson commented 2 years ago

updown will do what you want, and for SARS-CoV-2 it will be faster at the moment (especially if you generate the CSV-format input with updown list first). D (--dist-all) for updown is an int, which is the number of SNPs.

But I can't see any reason not to extend this functionality to closest too. As it stands, I think D for closest would be a float, measured in differences per site.
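
To make the two units concrete, here is a minimal Python sketch of a max-distance filter. It is not gofasta's code: the function names, the `per_site` switch, and the rule of skipping any site where either sequence is not A/C/G/T are all assumptions made for illustration, and gofasta's own distance calculation may differ.

```python
def snp_distance(a, b):
    """Return (differences, compared_sites) for two aligned sequences.

    Assumption: a site is only compared when both bases are A, C, G or T,
    so N, gaps and ambiguity codes never count as differences.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = 0
    compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            if x != y:
                diffs += 1
    return diffs, compared


def within_distance(query, targets, max_dist, per_site=True):
    """Return the names of all targets no further than max_dist from query.

    per_site=True  -> max_dist is a float (differences per compared site)
    per_site=False -> max_dist is an int (raw SNP count)
    """
    hits = []
    for name, seq in targets.items():
        diffs, compared = snp_distance(query, seq)
        if compared == 0:
            continue  # nothing comparable, so no distance can be computed
        dist = diffs / compared if per_site else diffs
        if dist <= max_dist:
            hits.append(name)
    return hits
```

With a 10-site alignment, a target that differs at one site is kept by either `max_dist=0.1` (per site) or `max_dist=1` with `per_site=False` (SNP count), which is the difference between the float and int forms of D described above.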

tseemann commented 2 years ago

Adding it to closest would be great, as it was not clear at first what updown did, and the interface for closest is much easier to get started on. And as you said, the logic is already there. Thank you!

benjamincjackson commented 2 years ago

Still to do: optionally don't truncate the output (currently ties are first broken by genome completeness, so only one of the equally close matches is reported).
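
The remaining behaviour could look roughly like the Python sketch below: find the minimum distance, then either return every target tied at that distance or keep only the most complete one. This is an illustration only, not gofasta's implementation; the `(name, distance, completeness)` tuples and the `keep_ties` parameter are hypothetical.

```python
def closest(candidates, keep_ties=False):
    """Pick the closest target(s) from (name, distance, completeness) tuples.

    keep_ties=False -> one name, ties broken by highest completeness
    keep_ties=True  -> every name tied at the minimum distance, in input order
    """
    if not candidates:
        return []
    best = min(dist for _, dist, _ in candidates)
    tied = [c for c in candidates if c[1] == best]
    if keep_ties:
        return [name for name, _, _ in tied]
    # Tie-break: prefer the most complete genome among the equally close ones.
    winner = max(tied, key=lambda c: c[2])
    return [winner[0]]
```

For many identical sequences, `keep_ties=True` would report all of them instead of just the most complete one.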