neherlab / treetime

Maximum likelihood inference of time stamped phylogenies and ancestral reconstruction
MIT License

Make fasta reader alphabet-aware (filter version) #282

Closed ivan-aksamentov closed 3 weeks ago

ivan-aksamentov commented 3 weeks ago

When reading FASTA, check whether each character is in the alphabet and filter out the ones that are not.

This has a problem: the filtered sequences come out with different lengths, so we can no longer deduce the alignment length. I think this is wrong, because the input is supposed to be a set of aligned sequences.
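
A minimal sketch of why filtering breaks the alignment, written in Python for illustration only. The alphabet contents, the function name `filter_to_alphabet`, and the toy sequences are assumptions and are not taken from the treetime code.

```python
# Hypothetical nucleotide alphabet; the real one depends on the chosen model.
NUC_ALPHABET = frozenset("ACGT-N")


def filter_to_alphabet(seq, alphabet=NUC_ALPHABET):
    """Drop every character that is not in the alphabet."""
    return "".join(c for c in seq if c in alphabet)


# Two toy sequences of equal length; the first contains a 'U'.
aligned = {"seq_a": "ACGU-ACGT", "seq_b": "ACGT-ACGT"}

filtered = {name: filter_to_alphabet(seq) for name, seq in aligned.items()}
lengths = {name: len(seq) for name, seq in filtered.items()}
```

After filtering, `seq_a` is one character shorter than `seq_b`, and every column to the right of the dropped character has shifted, so the sequences are no longer aligned.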

Currently fails:

+/workdir/.build/docker/release/treetime ancestral --method-anc=parsimony  --tree=data/lassa/L/50/tree.nwk --outdir=tmp/smoke-tests/ancestral/parsimony/lassa/L/50 data/lassa/L/50/aln.fasta.xz
Error: 
   0: When calculating length of sequences
   1: Sequences are expected to all have the same length, but found the following lengths:

      Length 845:
          "MK107855"

      Length 871:
          "MK107845"

      Length 873:
          "MH887995"

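The error above groups sequence names by length. A rough sketch of such a check, in Python for illustration; the function name `alignment_length` and the exact report layout are assumptions, not the treetime implementation.

```python
from collections import defaultdict


def alignment_length(seqs):
    """Return the common sequence length, or raise if lengths differ."""
    # Group sequence names by the length of their sequence.
    by_length = defaultdict(list)
    for name, seq in seqs.items():
        by_length[len(seq)].append(name)

    if len(by_length) > 1:
        report = "\n".join(
            f"Length {n}: {names}" for n, names in sorted(by_length.items())
        )
        raise ValueError(
            "Sequences are expected to all have the same length, "
            "but found the following lengths:\n" + report
        )

    # Exactly one length (or none, for empty input).
    return next(iter(by_length), 0)
```

With consistently aligned input the function returns the alignment length. With filtered input such as the lassa case it raises, listing which names fall under which length.
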
Now only ebola fails (Makona-UK3 contains the nucleotide U; none of the other sequences do):

+/workdir/.build/docker/release/treetime ancestral --method-anc=marginal --dense=true --model=jc69 --tree=data/ebola/tree.nwk --outdir=tmp/smoke-tests/ancestral/marginal/ebola data/ebola/aln.fasta.xz
Error: 
   0: When calculating length of sequences
   1: Sequences are expected to all have the same length, but found the following lengths:

      Length 13915:
          "Makona-UK3"

      Length 19006:
          "EM_COY_2015_015982"
          "G3676"
          "EM_COY_2015_015980"
          "G3670"
          "CON-10590"
          "NM042"
          "EM_079497"
          <remaining sequence names here>

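We could replace out-of-alphabet characters instead of filtering them. Whether to replace with the gap character or the unknown character is an open question, and the choice still depends on the alphabet.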
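
A sketch of the replacement idea, in Python for illustration. The function name `replace_outside_alphabet`, the alphabet, and the use of `N` as the unknown character are assumptions; which replacement character is appropriate is exactly the open question.

```python
def replace_outside_alphabet(seq, alphabet, replacement):
    """Replace every character not in the alphabet, preserving length."""
    # The replacement itself has to be a valid alphabet character.
    if replacement not in alphabet:
        raise ValueError(f"replacement {replacement!r} is not in the alphabet")
    return "".join(c if c in alphabet else replacement for c in seq)


# Hypothetical nucleotide alphabet with 'N' as the unknown character.
alphabet = frozenset("ACGT-N")
fixed = replace_outside_alphabet("ACGU-ACGT", alphabet, "N")
```

Because each character maps to exactly one character, the output has the same length as the input and the columns stay aligned, so the length check keeps working.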

A fallible alternative is here: