pagnani / ArDCA.jl

Autoregressive networks for protein
MIT License

Convert sampled sequences back to alignment file #19

Closed benjamin-lieser closed 2 years ago

benjamin-lieser commented 2 years ago

I have read the tutorial and have successfully computed the sample matrix with the sample method. Now I want these sequences in a FASTA file (or similar). Is there an easy way to do this here, or at least some documentation on the encoding from amino acids to the numbers in this matrix?

pagnani commented 2 years ago

Hi @Unlikus

Indeed, we did not provide this utility. However, you can define this function:

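function my_write_fasta(filedest::String, Z)
    # Entry i in Z is written as num2let[i]; index 21 is the gap symbol '-'
    num2let = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y', '-']
    N, M = size(Z)  # N positions per sequence, M sequences (one per column)
    open(filedest, "w") do fp
        for s in 1:M
            println(fp, "> Seq $s")  # FASTA header line for sequence s
            for a in 1:N
                print(fp, num2let[Int(Z[a, s])])
            end
            println(fp)  # end the sequence line
        end
    end
end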

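where filedest is the path of the output file and Z is the N × M integer matrix to write, with one sequence per column.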

So, as an example, you could run the following pipeline:

julia> arnet,arvar=ardca("data/PF14/PF00014_mgap6.fasta.gz");
julia> Zgen=sample(arnet,10000)
julia> my_write_fasta("foo.fasta", Zgen)

One last thing: to define the function, it is enough to copy and paste the definition above, either into the REPL or into a Jupyter cell.
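As for the encoding itself, the alphabet order in num2let is all there is to it. Here is a minimal sketch that decodes a toy matrix into strings and builds the inverse lookup, assuming the same alphabet order as the function above. The names NUM2LET, LET2NUM and decode_column are made up for this illustration and are not part of ArDCA.jl, and the small matrix Z only stands in for the output of sample.

```julia
# Alphabet order copied from num2let above; index 21 is the gap symbol '-'.
const NUM2LET = collect("ACDEFGHIKLMNPQRSTVWY-")

# Inverse lookup (letter => integer code), built from the same ordering.
const LET2NUM = Dict(c => i for (i, c) in enumerate(NUM2LET))

# Hypothetical helper: decode one column of integer codes into a String.
decode_column(col) = String([NUM2LET[Int(c)] for c in col])

# Toy 3×2 matrix standing in for the output of `sample`:
# rows are positions, columns are sequences.
Z = [1 21;
     2 10;
     20 11]

# One decoded string per column of Z.
seqs = [decode_column(Z[:, s]) for s in 1:size(Z, 2)]
```

With this toy input the two columns decode to "ACY" and "-LM", which is a quick way to check the mapping before writing a large sample to disk.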

Let me know if you have any other problems.

benjamin-lieser commented 2 years ago

Thanks a lot :)