jeromekelleher / sc2ts

Infer a succinct tree sequence from SARS-COV-2 variation data
MIT License

Do we really want the reference in as a Sample #413

Open jeromekelleher opened 1 day ago

jeromekelleher commented 1 day ago

I'm finding myself doing quite a lot of special-casing to deal with the reference being marked as a sample, in the same way as all the actual Viridian samples. I really don't see what the advantage of marking it like this is; getting the reference sequence is trivial.

You initially flagged this in #152 @hyanwong - do you have any objections to switching this back?

The other alternative is to add the reference sequence to the Viridian dataset, but I feel this would require more explanation all round.

hyanwong commented 1 day ago

I mainly flagged it because it is actually a sampled genome, albeit from a different dataset, but I don't feel particularly strongly either way.

Maybe a more general solution would be to add strains from different datasets as separate "populations", so that we can do e.g. ts.samples(population=1) to get the Viridian samples? That would also allow people to add other non-Viridian samples at a later date, should they want to add to the ARG?
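
To make the populations idea concrete, here is a minimal sketch in plain Python rather than tskit itself. The `Node` class, the `samples` helper, the node names, and the choice of population 0 for the reference dataset and population 1 for Viridian are all hypothetical, chosen only for illustration; only the `NODE_IS_SAMPLE` bit value and the `population` filter are modelled on tskit.

```python
from dataclasses import dataclass

# Same bit value that tskit uses for its sample flag.
NODE_IS_SAMPLE = 1

# Hypothetical population IDs, one per source dataset.
REFERENCE_POP = 0
VIRIDIAN_POP = 1


@dataclass
class Node:
    # Hypothetical stand-in for a row of a node table.
    name: str
    flags: int
    population: int


def samples(nodes, population=None):
    """Return the IDs of sample nodes, optionally restricted to one population."""
    return [
        node_id
        for node_id, node in enumerate(nodes)
        if node.flags & NODE_IS_SAMPLE
        and (population is None or node.population == population)
    ]


nodes = [
    Node("reference", NODE_IS_SAMPLE, REFERENCE_POP),
    Node("viridian_A", NODE_IS_SAMPLE, VIRIDIAN_POP),
    Node("viridian_B", NODE_IS_SAMPLE, VIRIDIAN_POP),
    Node("internal", 0, VIRIDIAN_POP),  # not flagged as a sample
]

all_samples = samples(nodes)
viridian_samples = samples(nodes, population=VIRIDIAN_POP)
```

Under this layout the reference stays flagged as a sample, but code that only cares about Viridian data filters by population instead of special-casing node 0.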

jeromekelleher commented 1 day ago

I'm going to remove it, as from a practical perspective it's a pain and the semantic purity of what we mean by a sample isn't that important.