ancestral reconstruction for `N` nts should be opt in

jameshadfield commented 5 years ago

Currently for augur ancestral we have:

  --keep-ambiguous      do not infer nucleotides at ambiguous (N) sites on tip
                        sequences (leave as N). Always true for VCF input.
                        (default: False)
  --keep-overhangs      do not infer nucleotides for gaps (-) on either side
                        of the alignment (default: False)

I think these should be true by default for all input in augur v6 (therefore requiring new option names I guess). Combined with consistent colouring of Ns as grey (see https://github.com/nextstrain/auspice/pull/739) this won't mess up the default auspice display. I think this will be both more scientifically accurate, and prevent things such as this screenshot, which I think is pretty bad (Note that MAFFT inserts Ns at the start/end of alignments, not -s)

P.S. @emmahodcroft out of curiosity, why can't we infer the position if we want to from VCF input? or am I misreading that help message?

emmahodcroft commented 5 years ago

I personally am a big advocate of being careful about inferring bases from Ns (I put this in originally), but if we want to change the defaults we should be careful - it will mess with people's existing builds, possibly in confusing ways.

I've chatted about it with @rneher, and at risk of putting words in his mouth, as I understood, the thinking with most Fasta (viral) sequences (in particular flu) is that Ns are usually a minority and probably due to small areas of low coverage, and filling in a few bases here and there makes the tree as a whole easier to interpret.

Of course, as we expand to other viruses, and particularly in the case of overhangs, as you noted, this is not always the case.

For VCF, I'd certainly be open to discussing changing this, but the big difference with VCF is that at least in my experience you pretty much always have large (hundreds or thousands of bases) areas where sequencing has failed in some sequences. I think it's just too dangerous to reconstruct these ever - who knows how much information is missing, and it would be so easy for people to really misinterpret. (And unlike Fasta, with VCF it's really hard to see at-a-glance whether you have huge chunks missing or not.)

I suppose my take is that if you have really complete sequences in VCF then having a few N bases is really the least of your worries - but it's more likely you have at least a few huge chunks missing, and that there's really no sensible argument for inferring bases over such a huge space - most people wouldn't take the time to understand what's happening, and end up happily over-interpreting inferred data. But I accept this is a bit paternalistic, perhaps.

rneher commented 5 years ago

Note that MAFFT inserts Ns at the start/end of alignments, not -s

I don't think that is correct. my 7.271 certainly puts - at binning and end.

But I agree that reconstructing ambiguous states of terminal nodes is potentially confusing. Historically, this was a choice we made when only dealing with flu where you typically have a mix of HA1 only and full HA sequences. Filling in this missing data seemed like a good idea at the time, mainly to avoid the problems with N we had in the viz. If we have a satisfactory way of dealing with Ns at the viz level, I wouldn't object to changing the defaults.

emmahodcroft commented 5 years ago

I would need to go back and check more closely but I also have a memory of there being N's inserted at beginning and end... But possibly this is a combo of what MAFFT does and something else in the pipeline - this also vaguely rings a bell in my memory. (Example but possibly not true: MAFFT inserts gaps but later for some reason these end up as Ns, and then can end up inferred)

jameshadfield commented 5 years ago

It may not be MAFFT, but augur align (e.g. this example, which is the zika-tutorial) pads with Ns

rneher commented 5 years ago

yes. This is this flag:

https://github.com/nextstrain/augur/blob/master/augur/align.py#L39

I think we added this when dealing with sequences that had lots of indel error or drop-out of amplicons.

emmahodcroft commented 5 years ago

I'd be generally in favour of changing the defaults on all of this, and Richard is making some treetime changes that will also change the defaults there, I'd just say we should have good messages alerting users to the changed behaviour.

(This is one downside to using snakemake, though - a warning or message popping up in refine will be long gone scrolled off the screen if there are 4 more steps following it...)

It would certainly panic me if my run suddenly seemed to be missing loads of data after an update - I'd be worried something was seriously wrong. If we can find a good way to say "your run may look different now - here's why, and here's the flags to make it 'look the same'" I think that would be very considerate :)

jameshadfield commented 9 months ago

This issue hasn't seen a comment in ~5 years so I'm going to close it. We now have good documentation explaining the behaviour across the various commands.

nextstrain / augur

ancestral reconstruction for `N` nts should be opt in #352