Reading fasta files containing amino acid sequences

mprous1 commented 7 years ago

I'm having trouble reading multiple FASTA files of amino acid sequences. An example script and the FASTA files rpl1 and rpl2 are attached as 'aa_files.zip'.

I tried two ways:

1)
rpl1=read.phyDat("rpl1.fasta",format="fas",type = "AA")
rpl2=read.phyDat("rpl2.fasta",format="fas",type = "AA")
seqs=new("multiphyDat",list(rpl1,rpl2),type="AA")

2)
seqs1=read.multiphyDat(dir(pattern=".fasta"),format="fas",type="AA")

In both cases the same warning messages appear:

1: In phyDat.DNA(data, return.index = return.index, ...) :
  Found unknown characters. Deleted sites with with unknown states.
2: In phyDat.DNA(data, return.index = return.index, ...) :
  Found unknown characters. Deleted sites with with unknown states.

Apparently DNA sequences are still expected despite specifying type="AA": the warning comes from phyDat.DNA. All columns containing characters other than a, c, g, t (and those that can be interpreted as ambiguity codes, such as r and y) are deleted, and the result is a much smaller alignment.

I'm not sure how to solve this problem.
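To illustrate why the alignment shrinks, here is a minimal base R sketch. It assumes a made-up three-sequence toy alignment (not the rpl1/rpl2 files) and uses the IUPAC nucleotide codes as the set of accepted characters. It only mimics the behaviour described above and is not the code phangorn actually runs.

```r
# Toy amino acid alignment (hypothetical data, for illustration only)
aln <- rbind(
  s1 = strsplit("MEFLPQAG", "")[[1]],
  s2 = strsplit("MEFIPQAG", "")[[1]],
  s3 = strsplit("MDFLPQAG", "")[[1]]
)

# Characters that are valid when the data are treated as DNA:
# the four bases, u, the IUPAC ambiguity codes, gaps and missing data
dna_codes <- c("a", "c", "g", "t", "u", "m", "r", "w", "s", "y",
               "k", "v", "h", "d", "b", "n", "-", "?")

# A column survives only if every character in it is a valid DNA code
keep <- apply(aln, 2, function(col) all(tolower(col) %in% dna_codes))

n_total <- ncol(aln)   # number of columns in the original alignment
n_kept  <- sum(keep)   # number of columns left after DNA-style filtering
cat("columns kept:", n_kept, "of", n_total, "\n")
```

Residues such as E, F, I, L, P and Q have no meaning as nucleotide codes, so every column containing one of them is dropped. Columns made only of letters that happen to be valid DNA codes (A, G, M, and so on) are kept.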

Best, Marko

thibautjombart commented 7 years ago

Should be fixed by the PR by @KlausVigo

@mprous1 can you confirm it now works?
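One way to check this is sketched below. It assumes an installed version of apex that includes the fix, and it writes a small made-up FASTA file to a temporary location rather than using the attached files. The `seq` slot of the multiphyDat object and the "type" and "weight" attributes of phyDat objects are used to inspect the result.

```r
library(apex)

# Write a tiny amino acid alignment to a temporary file (hypothetical data)
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">s1", "MEFLPQAG",
             ">s2", "MEFIPQAG",
             ">s3", "MDFLPQAG"), tmp)

# Read it as amino acids; extra arguments are passed on to read.phyDat
x <- read.multiphyDat(tmp, format = "fasta", type = "AA")

# Each locus is stored as a phyDat object in the 'seq' slot
types   <- sapply(x@seq, attr, "type")
n_sites <- sapply(x@seq, function(p) sum(attr(p, "weight")))

print(types)     # should report "AA" for every locus if the fix is in place
print(n_sites)   # should match the alignment length, with no columns dropped
```

If the reported number of sites is smaller than the length of the input alignment, the sequences are probably still being parsed as DNA.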

mprous1 commented 7 years ago

Yes, thanks! I was able to get it to work by loading the two modified script files, plus one additional one, in R:

source("add.gaps.R")
source("internal.R")
source("multiphyDat.constructor.R")

Will the new version also be available in the CRAN repository, so that one can run install.packages('apex') in R to get it?

thibautjombart commented 7 years ago

Eventually, yes ;) I will try to make a new release at the same time as adegenet in the month to come.
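Until the CRAN release is out, one option is to install the development version straight from the GitHub repository. This is a sketch only: it assumes the remotes package is acceptable and that the fix has been merged into the repository's default branch, neither of which is stated in this thread.

```r
# Assumption: the fix is available on the default branch of thibautjombart/apex
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}
remotes::install_github("thibautjombart/apex")

# Confirm which version is now installed
packageVersion("apex")
```

This replaces the need to source() the individual modified script files by hand.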

thibautjombart / apex

Reading fasta files containing amino acid sequences #49