Add n.seq.miss to multiphyDat

thibautjombart commented 9 years ago

Currently we have:

> getClassDef("multidna")
Class "multidna" [package "apex"]

Slots:

Name:               dna           labels            n.ind            n.seq
Class:       listOrNULL        character          integer          integer

Name:        n.seq.miss         ind.info        gene.info
Class:          integer data.frameOrNULL data.frameOrNULL
> getClassDef("multiphyDat")
Class "multiphyDat" [package "apex"]

Slots:

Name:               dna           labels            n.ind            n.seq
Class:       listOrNULL        character          integer          integer

Name:          ind.info        gene.info
Class: data.frameOrNULL data.frameOrNULL
>

So the two data structures are essentially identical, except for n.seq.miss, which could easily be added to multiphyDat (and probably should be?)
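For illustration only, here is a minimal sketch of what a class with the extra slot could look like. It is not the apex source: the class and union names carry a "Sketch" suffix to mark them as hypothetical, and the prototype defaults are assumptions. Only the slot names and slot classes are taken from the getClassDef() output above.

```r
library(methods)

# Class unions mirroring the slot types printed by getClassDef().
# Names are hypothetical so they cannot clash with the real apex classes.
setClassUnion("listOrNULLSketch", c("list", "NULL"))
setClassUnion("dataframeOrNULLSketch", c("data.frame", "NULL"))

setClass(
  "multiphyDatSketch",
  representation(
    dna        = "listOrNULLSketch",
    labels     = "character",
    n.ind      = "integer",
    n.seq      = "integer",
    n.seq.miss = "integer",              # the slot proposed in this issue
    ind.info   = "dataframeOrNULLSketch",
    gene.info  = "dataframeOrNULLSketch"
  ),
  # Default values are assumptions made for this sketch
  prototype(
    dna = NULL, labels = character(0), n.ind = 0L, n.seq = 0L,
    n.seq.miss = 0L, ind.info = NULL, gene.info = NULL
  )
)

x <- new("multiphyDatSketch")
slotNames(x)
```

With this in place both classes would expose the same set of slots, so accessors written for one could in principle be shared with the other.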

thibautjombart commented 9 years ago

Added at 32979e802e9d699e487e7c. Still need to actually identify the missing sequences in phyDat objects.
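A rough sketch of the counting step follows, under two loudly stated assumptions: each gene is represented by a plain character matrix (one row per individual) as a stand-in for a phyDat object, and a "missing" sequence is one made up entirely of the gap character "-". The helper name count_missing_seqs is hypothetical and does not exist in apex or phangorn.

```r
# Hypothetical helper: count sequences that consist only of `gap`.
# `genes` is a list of character matrices, one row per individual.
count_missing_seqs <- function(genes, gap = "-") {
  per_gene <- vapply(
    genes,
    function(m) sum(rowSums(m != gap) == 0L),  # rows with no non-gap site
    integer(1)
  )
  list(per.gene = per_gene, total = sum(per_gene))
}

# Toy data: ind2 is missing in gene1, ind1 is missing in gene2
ind <- c("ind1", "ind2", "ind3")
gene1 <- matrix(c("a", "c", "g", "t",
                  "-", "-", "-", "-",
                  "a", "-", "g", "t"),
                nrow = 3, byrow = TRUE, dimnames = list(ind, NULL))
gene2 <- matrix(c("-", "-", "-",
                  "t", "t", "a",
                  "c", "c", "a"),
                nrow = 3, byrow = TRUE, dimnames = list(ind, NULL))

res <- count_missing_seqs(list(gene1 = gene1, gene2 = gene2))
res$total
```

Note that ind3 in gene1 has a single gap but still counts as present, since only all-gap rows are treated as missing. For real phyDat objects the comparison would have to be done on the stored state codes instead of raw characters.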

KlausVigo commented 9 years ago

Hi Thibaut,

Some random thoughts on the data structures, totally biased. There are situations where you do not want your missing data filled in with gaps. For optimizing gene trees individually it is better to leave them out, and you may just want to build a supertree from your gene trees afterwards. So the question is whether you add the gaps first and remove them afterwards, or do not include them to begin with and add them only when needed, with concatenate for example. The latter has some potential for saving memory. I am also not a fan of having missing genes and missing nucleotides both coded "-". This comes up because I am working on some code right now to get rid of duplicated sequences, as it is a good way to speed up the optimization of phylogenies, and one can add them back with zero branch length later on (with correct multifurcations, a horror for many programs if you define trees differently from ape!).
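To make the "add gaps only when needed" option concrete, here is a small sketch. It assumes each gene is stored as a plain character matrix holding only the individuals actually sequenced, and that padding with "-" happens on demand, for example just before concatenation. The function pad_gene is hypothetical and not part of apex or phangorn.

```r
# Hypothetical helper: expand a gene matrix to the full set of individuals,
# filling absent individuals with all-gap rows.
# Assumes every rowname of `m` appears in `labels`.
pad_gene <- function(m, labels, gap = "-") {
  out <- matrix(gap, nrow = length(labels), ncol = ncol(m),
                dimnames = list(labels, NULL))
  out[rownames(m), ] <- m   # copy the sequences that are present
  out
}

# Stored compactly: ind2 has no sequence for this gene
gene <- matrix(c("a", "c", "g",
                 "a", "t", "g"),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("ind1", "ind3"), NULL))

padded <- pad_gene(gene, labels = c("ind1", "ind2", "ind3"))
padded
```

Storing the compact form keeps memory use proportional to the sequences actually present, and it sidesteps the ambiguity between a missing gene and missing nucleotides until padding is explicitly requested.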

There should probably be an additional slot in multiphyDat defining the data type: DNA, AA, CODON, USER. I am not sure whether we should rename dna to data, to allow for AA and codons later on. I should write some translation functions via seqinr between these (DNA, AA, CODON) for phyDat objects if I find some spare time.
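As a sketch of how such a type slot might be constrained, the snippet below defines a toy S4 class with a validity check. The class name typedDataSketch, the slot name type, and the default "DNA" are all assumptions for illustration. Only the four allowed values come from the comment above.

```r
library(methods)

# Toy class with a single `type` slot restricted to four values.
setClass(
  "typedDataSketch",
  representation(type = "character"),
  prototype(type = "DNA"),   # default is an assumption of this sketch
  validity = function(object) {
    allowed <- c("DNA", "AA", "CODON", "USER")
    if (length(object@type) != 1L || !(object@type %in% allowed)) {
      return(paste("type must be one of:", paste(allowed, collapse = ", ")))
    }
    TRUE
  }
)

ok  <- new("typedDataSketch", type = "AA")
bad <- tryCatch(new("typedDataSketch", type = "RNA"),
                error = function(e) "rejected")
```

A single character slot keeps the class definition unchanged if further data types are added later, since only the allowed vector needs updating.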

An even more general model could even allow different gene copies per individual. Models with gene duplication and loss (Bastien Boussau, Jean-Philippe Doyon, there are also non-French people working on it) should even gain information from this, and you would not have to worry about homologs / paralogs.

Cheers, Klaus

thibautjombart commented 9 years ago

Hi Klaus, very interesting, but that's a separate issue. I'm moving it here: https://github.com/thibautjombart/apex/issues/21

thibautjombart / apex

Add n.seq.miss to multiphyDat #17