thibautjombart / apex

Phylogenetic Methods for Multiple Gene Data

Data structure - missing sequences #21

Closed thibautjombart closed 9 years ago

thibautjombart commented 9 years ago

Hi Thibaut,

some random thoughts on the data structures, totally biased. There are situations where you do not want your missing data filled up with gaps. For optimizing gene trees individually it is better without them, and you may just want to build a supertree from your gene trees afterwards. So it is a question of whether you add the gaps first and remove them afterwards, or leave them out to begin with and add them when needed, with concatenate for example. The latter has some potential for saving memory. I am also not a fan of having missing genes and missing nucleotides both coded as "-". This comes up because I am working on some code right now to get rid of duplicated sequences, as it is a good way to speed up optimization of phylogenies, and one can add them back with zero branch length later on (with correct multifurcations, the horror of many programs if you define trees differently from ape!).
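To make the duplicate-removal idea concrete, here is a minimal, language-agnostic sketch in Python. It is not apex or phangorn code: the function names `collapse_duplicates` and `zero_length_clade` are made up for illustration, and sequences are assumed to be plain strings keyed by taxon label.

```python
def collapse_duplicates(seqs):
    """Group taxa that share an identical sequence.

    seqs: dict mapping taxon label -> sequence string.
    Returns (unique, groups): `unique` maps one representative label per
    distinct sequence to that sequence; `groups` maps each representative
    to the list of all labels carrying the same sequence.
    """
    by_seq = {}
    for label, seq in seqs.items():
        by_seq.setdefault(seq, []).append(label)
    unique = {labels[0]: seq for seq, labels in by_seq.items()}
    groups = {labels[0]: labels for labels in by_seq.values()}
    return unique, groups


def zero_length_clade(representative, groups):
    """Newick fragment that re-attaches duplicates as one multifurcation
    whose tips all have branch length zero."""
    labels = groups[representative]
    if len(labels) == 1:
        return representative
    return "(" + ",".join(f"{label}:0" for label in labels) + ")"


alignment = {"t1": "ACGT", "t2": "ACGT", "t3": "ACGA", "t4": "ACGT"}
unique, groups = collapse_duplicates(alignment)
print(unique)                           # {'t1': 'ACGT', 't3': 'ACGA'}
print(zero_length_clade("t1", groups))  # (t1:0,t2:0,t4:0)
```

The tree would be optimized on `unique` only, and each representative tip replaced by its fragment afterwards. Note that in this sketch any all-gap placeholder rows compare equal and end up in a single group, which is one place where the two meanings of "-" run into each other.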

There should probably be an additional slot in multiphyDat defining the data type: DNA, AA, CODON, USER. I am not sure whether we should rename dna to data, to allow for AA and codons later on. I should write some translation functions via seqinr between these (DNA, AA, CODON) for phyDat objects if I find some spare time.

An even more general model might allow different gene copies per individual. Models with gene duplication and loss (Bastien Boussau, Jean-Philippe Doyon; there are also non-French people working on it) should even gain information from this, and you would not have to worry about homologs / paralogs.

Cheers, Klaus

thibautjombart commented 9 years ago

There are situations where you do not want your missing data filled up with gaps. For optimizing gene trees individually it is better without them, and you may just want to build a supertree from your gene trees afterwards.

Makes sense. I also don't like the fact that, when building trees, gap sequences all cluster together.

So it is a question of whether you add the gaps first and remove them afterwards, or leave them out to begin with and add them when needed, with concatenate for example. The latter has some potential for saving memory.

I made the same observation myself, writing the apex MS, earlier this afternoon ;)
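As an illustration of the "leave them out and add them when needed" option, here is a small Python sketch. It is only a sketch under stated assumptions, not how apex's concatenate is implemented: `concatenate_with_gaps` is a hypothetical name, each gene is assumed to be a dict holding only the taxa that were actually sequenced, and all sequences within a gene are assumed to be aligned to the same length.

```python
def concatenate_with_gaps(genes, gap="-"):
    """Concatenate per-gene alignments, padding absent sequences with gaps.

    genes: dict mapping gene name -> dict(taxon label -> aligned sequence).
           A taxon with no sequence for a gene is simply absent from that dict.
    Returns a dict mapping every taxon seen in any gene to its concatenated
    sequence, with a run of `gap` characters wherever a gene is missing.
    """
    taxa = sorted({taxon for aln in genes.values() for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for name, aln in genes.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"gene {name!r} is empty or not aligned")
        width = lengths.pop()
        for taxon in taxa:
            # Gaps are only materialised here, at concatenation time.
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


genes = {
    "gene1": {"a": "ACGT", "b": "ACGA"},
    "gene2": {"a": "TT", "c": "TA"},
}
print(concatenate_with_gaps(genes))
# {'a': 'ACGTTT', 'b': 'ACGA--', 'c': '----TA'}
```

With this layout the per-gene objects stay free of placeholder rows, so individual gene trees can be fitted on exactly the taxa that have data, and the padded matrix exists only when a concatenated analysis asks for it.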

I am also not a fan of having missing genes and missing nucleotides both coded as "-". This comes up because I am working on some code right now to get rid of duplicated sequences, as it is a good way to speed up optimization of phylogenies, and one can add them back with zero branch length later on (with correct multifurcations, the horror of many programs if you define trees differently from ape!).

I'm not sure I see a problem there. A missing sequence is effectively a collection of missing nucleotides, isn't it?

There should probably be an additional slot in multiphyDat defining the data type: DNA, AA, CODON, USER.

Yes! I just realised that this afternoon again - didn't know you could store AA etc. in phyDat. That's definitely something we want to keep track of for multiphyDat2multidna, which should work just with DNA sequences.
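A rough Python sketch of what such a type tag buys us. Everything here is hypothetical (the class `MultiGeneData`, its field names, and `to_dna_only` are stand-ins, not the multiphyDat class or multiphyDat2multidna): the point is only that a stored type lets a DNA-only conversion refuse other data up front.

```python
from dataclasses import dataclass

ALLOWED_TYPES = ("DNA", "AA", "CODON", "USER")


@dataclass
class MultiGeneData:
    """Toy container: per-gene sequences plus a tag saying what they encode."""
    sequences: dict          # gene name -> {taxon label: sequence string}
    data_type: str = "DNA"

    def __post_init__(self):
        if self.data_type not in ALLOWED_TYPES:
            raise ValueError(f"unknown data type: {self.data_type!r}")


def to_dna_only(obj):
    """Stand-in for a conversion that is only meaningful for DNA data."""
    if obj.data_type != "DNA":
        raise TypeError(f"expected DNA data, got {obj.data_type}")
    return dict(obj.sequences)


dna = MultiGeneData({"gene1": {"a": "ACGT"}}, data_type="DNA")
print(to_dna_only(dna))      # {'gene1': {'a': 'ACGT'}}

protein = MultiGeneData({"gene1": {"a": "MK"}}, data_type="AA")
try:
    to_dna_only(protein)
except TypeError as err:
    print(err)               # expected DNA data, got AA
```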

I am not sure whether we should rename dna to data, to allow for AA and codons later on. I should write some translation functions via seqinr between these (DNA, AA, CODON) for phyDat objects if I find some spare time.

I agree about the translation via seqinr; I think that is a plus. As we don't use accessors, in principle we should not change the slot names, but it does make sense here and the package is young. We definitely need to sort this out before the paper is published, though. 'data' is vague... How about '@dna -> @seq' for multiphyDat, and leaving @dna in multidna?

An even more general model might allow different gene copies per individual. Models with gene duplication and loss (Bastien Boussau, Jean-Philippe Doyon; there are also non-French people working on it) should even gain information from this, and you would not have to worry about homologs / paralogs.

I know some of these guys ;) Sounds interesting, but I'd keep that for later?

Hey, that's the perfect time to have this discussion, thanks for starting it! =D

thibautjombart commented 9 years ago

I need to dash. @KlausVigo do you mind creating the relevant issues for all this so that we can keep track and sort them out fast?

thibautjombart commented 9 years ago

OK, I have created new issues for all this:
https://github.com/thibautjombart/apex/issues/22
https://github.com/thibautjombart/apex/issues/23
https://github.com/thibautjombart/apex/issues/24
https://github.com/thibautjombart/apex/issues/25
https://github.com/thibautjombart/apex/issues/26
Please add more if needed, and comment on the issues to agree/disagree/change stuff. I'm getting started now - want to wrap it all up ASAP.

thibautjombart commented 9 years ago

Done as of 0c020090c3124c95d50185d51eb83d8b4c9a4d23