isovic / graphmap

GraphMap - A highly sensitive and accurate mapper for long, error-prone reads http://www.nature.com/ncomms/2016/160415/ncomms11307/full/ncomms11307.html Note: This was the original repository which will no longer be officially maintained. Please use the new official repository here:
https://github.com/lbcb-sci/graphmap2
MIT License
178 stars 44 forks source link

graphmap spliced alpha: no spliced alignments reported #76

Open aechchiki opened 6 years ago

aechchiki commented 6 years ago

Hi Ivan,

we tested Graphmap to align RNA-seq reads to the reference genome. For benchmark purposes, we tried to align the reference transcripts in fasta format on the reference genome, both from Ensembl.

I tried to install and run Graphmap as described in the docs for spliced alignments. I compiled the rna-alpha branch, as suggested, and run this command: /home/aechchik/graphmap/bin/graphmap-not_release align -x rnaseq -t 16 -r reference-genome.fa -d reference-transcripts.fasta -o graphmap_ref.sam

The main problems is that the local alignments for the transcripts sub-features are not reconciled: if a transcript has two exons, then both are reported but their local alignments are not merged to a spliced alignment, reported separately as multiple alignments instead. For example - transcript FBtr0344900 has two annotated exons:

cat Drosophila_melanogaster.BDGP6.84.gtf | grep FBtr0344900 | grep exon
4    FlyBase    exon    42774    43157    .    +    .    gene_id "FBgn0266617"; transcript_id "FBtr0344900"; exon_number "1"; gene_name "CR45124"; gene_source "FlyBase"; gene_biotype "lincRNA"; transcript_name "CR45124-RA"; transcript_source "FlyBase"; transcript_biotype "lincRNA"; exon_id "FBtr0344900-E1";
4    FlyBase    exon    43231    43374    .    +    .    gene_id "FBgn0266617"; transcript_id "FBtr0344900"; exon_number "2"; gene_name "CR45124"; gene_source "FlyBase"; gene_biotype "lincRNA"; transcript_name "CR45124-RA"; transcript_source "FlyBase"; transcript_biotype "lincRNA"; exon_id "FBtr0344900-E2";

In the alignment file by graphmap, it figures twice as local alignment - once as mapping on the forward strand, once as secondary mapping on the forward strand ($2):

cat graphmap_ref.sam | grep FBtr0344900
FBtr0344900 0   4   42774   40  384M144S    *   0   0TGCGACATTGTTCTACGATGACTACAAAAAATGACCAATAACTTCTATAAACCAATACGATATGTCAGGAGTTTCGGTCCCATACGAAGTCGCCGACTTAAGTATTTTATTTTTATTTTGATATGTGTTTGCTATTTTACCTTGTCGAATGCTTCCACACGCTATGAGAATACCATCGTGAGCGTAGCTTACTACTAGAATTTTGTTGAAGTTATTGACAAGCGATGTCTCAATATCTTCCGGACAGCCTCCAGCGTGACATTGCGGGGAATCATGTAACGGCCCAGTAACAGCCTCGGCCAGCACTCGAAGGTTTTCGTTAAGTTTAAGTATTTTATTTGTAGCACCCGCAAACAAAACATTGTGCATAAAGTCGAAGCTCATCTGGAAGCTGTTGATTGAACTGGTATTGATGGCAAGTTAAACTGGGCGACTATGTCATTTAAGGGAGATAACGCCTGAGCCGGCAGTTCTTCAATGCAGTTAACGCAATAATGCTGAGAACCGAGTATGATAATAATACACAGT   *   MD:Z:384    NM:i:0  AS:i:1920   H0:i:1  ZE:f:4.54542e-127   ZF:f:0.299417   ZQ:i:528ZR:i:1348131
FBtr0344900 256 4   43231   40  384S143M1S  *   0   0TGCGACATTGTTCTACGATGACTACAAAAAATGACCAATAACTTCTATAAACCAATACGATATGTCAGGAGTTTCGGTCCCATACGAAGTCGCCGACTTAAGTATTTTATTTTTATTTTGATATGTGTTTGCTATTTTACCTTGTCGAATGCTTCCACACGCTATGAGAATACCATCGTGAGCGTAGCTTACTACTAGAATTTTGTTGAAGTTATTGACAAGCGATGTCTCAATATCTTCCGGACAGCCTCCAGCGTGACATTGCGGGGAATCATGTAACGGCCCAGTAACAGCCTCGGCCAGCACTCGAAGGTTTTCGTTAAGTTTAAGTATTTTATTTGTAGCACCCGCAAACAAAACATTGTGCATAAAGTCGAAGCTCATCTGGAAGCTGTTGATTGAACTGGTATTGATGGCAAGTTAAACTGGGCGACTATGTCATTTAAGGGAGATAACGCCTGAGCCGGCAGTTCTTCAATGCAGTTAACGCAATAATGCTGAGAACCGAGTATGATAATAATACACAGT   *   MD:Z:143    NM:i:0  AS:i:715    H0:i:1  ZE:f:3.39258e-42    ZF:f:0.299417   ZQ:i:528ZR:i:1348131

What we would rather see instead would be something like the output of STAR:

cat star_ref.sam | grep FBtr0344900
FBtr0344900    0    4    42774    255    384M73N144M    *    0    0TGCGACATTGTTCTACGATGACTACAAAAAATGACCAATAACTTCTATAAACCAATACGATATGTCAGGAGTTTCGGTCCCATACGAAGTCGCCGACTTAAGTATTTTATTTTTATTTTGATATGTGTTTGCTATTTTACCTTGTCGAATGCTTCCACACGCTATGAGAATACCATCGTGAGCGTAGCTTACTACTAGAATTTTGTTGAAGTTATTGACAAGCGATGTCTCAATATCTTCCGGACAGCCTCCAGCGTGACATTGCGGGGAATCATGTAACGGCCCAGTAACAGCCTCGGCCAGCACTCGAAGGTTTTCGTTAAGTTTAAGTATTTTATTTGTAGCACCCGCAAACAAAACATTGTGCATAAAGTCGAAGCTCATCTGGAAGCTGTTGATTGAACTGGTATTGATGGCAAGTTAAACTGGGCGACTATGTCATTTAAGGGAGATAACGCCTGAGCCGGCAGTTCTTCAATGCAGTTAACGCAATAATGCTGAGAACCGAGTATGATAATAATACACAGT    *    NH:i:1    HI:i:1    NM:i:0    MD:Z:528

Am I doing something wrong or simply this feature is not implemented yet?

Another strange behaviour as seen in the output alignment file is that some of the reads ID appear once in the alignment file, but there is no other alignment of the same read ID reporting the primary alignment flag. For example: cat graphmap_ref.sam | grep -v '^@' | awk '$2==256 {print $0}' | head -1 | cut -f1 outputs: FBtr0300689. If we grep this ID out of the mapping file, then we are returned one hit only:

cat graphmap_ref.sam | grep -v '^@' | grep FBtr0300689
FBtr0300689 256 2L  8191    40  586S1293M1S *   0   0CTACTCGCATGTAGAGATTTCCACTTATGTTTTCTCTACTTTCAGCAACCGAGAAGAGAACCCACGTTTGAACAAGTATCGGCGTGTGGACAACAGCTATCCCCGCTTCATAACGAATGAGGCTGCCGAGGACCTGATTTACAAGAAGTCCATGGGCGAGCGGGATCAGCCACAGAGCTCAGAGCGGATCTCAATATTTAATCCGCCAGTATACACGCAGCACCAGGTGCGCAATGAAGCCCCCTACATACCCACCACATTTGACCTCCTCTCAGACGATGAGGAGTCGTCACAGAGAGTTGCCAACGCCGGGCCATCTTTCAGGCCCTTGACTTACTCGGATGCTGTGCGTCTAAGCCAGAATGGCTTCGCCAACTCCCGCGTAAGTGGGCACTCCAGCTATACGGTGCGCAGACCACCGGCACTAGTTGACAGAAGCATTCTATCCCAGGAAATGGAGCGCATGGACCAAGAGCAGTATATCTACCTTATCCGTACCGCAGCCCAAAGTAATTCCGTGGGCAGTCACTACGCCGAACCGGTTACTGATAACTCGGAGGTCAAGAAAGTCAGTGAAACCAACAAAAGTGACGCACCACAGCCGTTAACCCCTCAACCTACCAGACTCACCAGAACAGAATCCTTGCACCGTCGTTTTGCCAGCTGCGTCAACTTAAATGATGACTTCGCCGAGCAATTTAAAGCAAGAGCGGCGGACTGTGAAGAGAAATCCAAACATCGTCTTAGATTAGCTGAAGAGCAGAGGCTTTTTTCGAATTTCAGTGCTATAAAGAACATAGATGAACTCCGTGCCTATGAACGAAAAGTAGTGGAAAACATATTCCAGTCTTGTATCGCCCACAAGCCCATTTTTGTACTCGGGCCCTTGGACAAGCCAAATGTGAAGAAAGTGACCAAGCTCATTCCGTTAACAGAGGAGCACCACGATCGCTTTAACGAAATTACACAGGATGATAAATCGACGGTATGGCAACGAATATATTGATGTCTTTCGTACCCATTGAAAACGTTGTGGTGCTTGCGCTTTAAAATCTTATATTAGGAAATTATTTTTAAATTTAACCTACACATAACTACCGAAGACATATGCACGTTTATTAATGGGAAATGGCTTAACGACGAGGTCATTAACTTTTACATGTCCTTGCTGACAGAACGGTCGGAGAAGAGATCTGGCGTACTTCCCGCCACTTACGCCATAAACACATTCTTCGTGCCCCGCCTCCTGCAAGCTGGGCATGCAGGCATTAAGCGCTGGACTCGCAAAGTGGACTTGTTCAGCAAGGACATAATCCCGGTACCAGTGCACTGCAACGGCGTCCACTGGTGCATGGCCATCATACACTTGCGGAACAAGACAATCCGGTATTATGACTCAAAGGGAAAGCCAAACCGACCAGTGCTGGACGCTCTAGAGAAATATCTACGCGAAGAGTCAATATTCAAGCCCAAAAAGCAGTTTGATACCAGCGATTTTGTTATTGAGAGCGTGCAGAATATACCACGACAGTTAGATGGCAGCGATTGCGGTATCTTCAGCTGCATGTTCGCCGAGTATATAACGTGTGATGTGCCAATTACCTTTACCCAGTCGGAAATGTTGTACTTCCGCAAGAAGATGGCTCTAGAAATCGTCGACGGAGAGTTGTGACAGTAGAATCACACAGCTACGCAAGAATGTGGAGAATCCAGTTTAGTTATTTTTACAAATCTTACGTAAACACTCCAAGCATGAATTCGCAACAAGTGCTTAGCTATTTAATTGAATTGAGCTGGCCGAGAGATGTGCTGGTGCAATAACTTGTTCTCATATCTGATTGTAACAGAGAATCTAGTTTTTCAATAAAATTTCCCCAAGT   *   MD:Z:1293   NM:i:0  AS:i:6465   H0:i:1  ZE:f:0  ZF:f:0.276046   ZQ:i:1880   ZR:i:23513712

In my opinion, this read ID should have alignment flag equal to 0 instead.

I compiled the branch alpha on CentOS 6.9:

aechchik@frt:~/software/graphmap$ git branch
  master
* rna-alpha

And graphmap-not_release is the only executable in the execution folder:

aechchik@frt:~/software/graphmap/bin$ ls
graphmap-not_release

Cheers, Amina

isovic commented 6 years ago

Dear Amina,

Thank you for testing and reporting back! You are right about exons being reported as separate SAM alignments. This was an intentional approach for the proof of concept. However, we are working on implementing a solution which would produce alignments similar to STAR, but it might take some time to finalize this.

Regarding the missing primary alignments - that should definitely not happen. Could you by any chance share a sample of your data which reproduces this issue, so I can take a look?

Best regards, Ivan.

aechchiki commented 6 years ago

Hi Ivan,

I ran the following command: /home/aechchik/graphmap/bin/graphmap-not_release align -x rnaseq -t 16 -r reference-genome.fa -d reference-transcripts.fasta -o graphmap_ref.sam where:

(I edited the comment, for some reason the links to the Ensembl ftp were not displayed )