pangenome / maffer

extract MSAs from genome variation graphs
Other
33 stars 1 forks source link

Malformed MAF output? #1

Open subwaystation opened 4 years ago

subwaystation commented 4 years ago

I tried the following:

Download the current public sequence resource GFA from https://workbench.lugli.arvadosapi.com/collections/lugli-4zz18-z513nlpqm03hpca. Then I did:

~/software/maffer/git/master/bin/maffer -g public_sequence_resource_22042020.gfa -m > public_sequence_resource_22042020.maf

I installed http://compgen.cshl.edu/phast/ which provides an Ubuntu binary. Then I ran into:

msa_view public_sequence_resource_22042020.maf > public_sequence_resource_22042020.fasta
warning: maf_read: MAF file must be sorted with respect to reference sequence if store_order=TRUE.  Ignoring out-of-order blocks
ERROR maf_read_subset: invalid idx_offset -2

The last post in https://github.com/glennhickey/progressiveCactus/issues/67 might explain our problem.

Currently also taking a look at https://biopython.org/wiki/Multiple_Alignment_Format, but they do not state, if a conversion to FASTA is possible.

subwaystation commented 4 years ago

So I could only get a FASTA output for each alignment:

from Bio import AlignIO

for multiple_alignment in AlignIO.parse("public_sequence_resource_22042020.maf", "maf"):
    print("printing a new multiple alignment")

    for seqrec in multiple_alignment:
    #    print("starts at %s on the %s strand of a sequence %s in length, and runs for %s bp" % \
    #          (seqrec.annotations["start"],
    #           seqrec.annotations["strand"],
    #           seqrec.annotations["srcSize"],
    #           seqrec.annotations["size"]))
        splitty = seqrec.id.split("/")
        splittty = splitty[3].split(":")
        AlignIO.write(multiple_alignment,
                  "%s.fa" % splittty[1],
                  "fasta")

Which is painfully slow of course......

subwaystation commented 4 years ago
from Bio import AlignIO

abc = AlignIO.parse("public_sequence_resource_22042020.maf", "maf")
AlignIO.write(abc,"public_sequence_resource_22042020.maf.fasta", "fasta")

Blindly does the job we want. I guess.

@ekg We should contact people who actually have to work with the data. You got anyone in mind?

AndreaGuarracino commented 4 years ago

Using

H   VN:Z:1.0
S   1   CAAATAAG
S   2   A
S   3   GCT
S   4   T
S   5   C
S   6   TTG
S   7   A
S   8   G
S   9   AAATTTTCTGGA
S   10  GATTACA
S   11  ATATGCATC
S   12  A
S   13  TCA
S   14  CCAACTCTCTG
P   x   1+,3+,5+,6+,8+,9+,10+,11+,13+,14+   *
P   y   1+,5+,6+,8+,9-,10+,11+,13+,11+,14+  *
L   1   +   3   +   0M
L   3   +   5   +   0M
L   5   +   6   +   0M
L   6   +   8   +   0M
L   8   +   9   +   0M
L   8   +   9   -   0M
L   9   +   10  +   0M
L   9   -   10  +   0M
L   10  +   11  +   0M
L   11  +   13  +   0M
L   11  +   14  +   0M
L   13  +   14  +   0M
L   13  +   11  +   0M

and doing

maffer -g t.gfa -m > t.maf
msa_view t.maf

I have a similar problem

ERROR: bad integers or strand in MAF (strand must be + for reference sequence) --
    "s y 13 12 - 64 AAATTTTCTGGA"

that suggests a malformed MAF output.