ncbi / magicblast

34 stars 16 forks source link

using magicblast to map transcriptome to genome #19

Closed 000generic closed 3 years ago

000generic commented 4 years ago

I have a series of related questions:

Is it reasonable to use magicblast to map a transcriptome to a genome to identify where it lies on the genome - and to identify intron-exon boundaries? Or are there better tools to do this? Or is it simply better to map the reads themselves to identify the boundaries on the genome?

Thank-you :)

boratyng commented 4 years ago

Is it reasonable to use magicblast to map a transcriptome to a genome to identify where it lies on the genome - and to identify intron-exon boundaries?

Yes, Magic-BLAST is very good at mapping transcriptome to a genome and very accurately identifies intron-exon boundaries. In this paper: https://www.ncbi.nlm.nih.gov/pubmed/31345161 we aligned human transcripts to the genome. Magic-BLAST was very accurate and outperformed other programs.

Or is it simply better to map the reads themselves to identify the boundaries on the genome?

It is difficult to answer this question without more details on what you are trying to do.

000generic commented 3 years ago

I have a question regarding output after using magicblast to map a transcript from one species to the genome of a related species. The ASN and SAM files seem to indicate spliced alignment was successful, if I'm reading them correctly. However, I would like to work with the tabular output, as its similar to blast report parsing that I am used to. When I look at the tabular file, it's not clear to me that splicing information is included. It seems like it only includes the start and stop of a 58 kb region in the genome - while the transcript is less than 4 kb.

Here is everything in the tabular output file:

# MAGICBLAST 1.5.0 # magicblast -db blue-whale-genome-FA -query 02-minke-whale-trpa1 -outfmt tabular -out 2-minke-whale-TRPA1_X_blue-whale-genome # Fields: query acc. reference acc. % identity not used not used not used query start query end reference start reference end not used not used score query strand reference strand query length BTOP num placements not used compartment left overhang right overhang mate reference mate ref. start composite score Minke-whale-Balaenoptera-acutorostrata-scammoni-XM_007185124.1-cds CM020957.1 94.9088 0 0 0 1 3945 64114112 64172440 0 99 3591 plus plus 3945 58CT83^2339^65-C265-C-G36TC12%6708%_170_10TC11GA139^2506^51AG81CA13TC12TC3AG11^3979^108^2906^109^187^10CT135^461^32TC56TC30GA16^9787^49^1955^11TC65CT22^652^101^1009^25TC144^87^3CT122CA38^1023^115^943^71CT81GA13^2626^94^3014^60^526^44TC14TA39^1486^68GA2^558^163^779^19CT66AG3^265^170^1604^60AG69^430^183^2975^58CT10^3594^114^1510^101^642^73GA131 1 - 1:0 - - - - 3591

My goal is to extract out of the genome the sequence that has aligned to the transcript. Is this not possible via tabular output in magicblast. If I were to use sam or asn outputs - what specific tools do you recommend (I'm unfamiliar with them and didn't see any tools for working with asn directly - and for samtools it seems to focus on the read/transcript side of things for generating fastas - but I may being missing something obvious)?

boratyng commented 3 years ago

Yes, the tabular format has splicing information. It is in the extended BTOP string (17-th column). The extended BTOP string is like BTOP string in BLAST tabular, but also includes introns. Introns are shown as a number between '^' charachters. This is how you interpret it:

So for example 10AG5^250^30 means that the alignment starts with 10 matches, then there is a single mismatch, then 5 matches, then intron of length 250, and then 30 matches.

Unfortunately I do not know any tools that would extract genome sequence automatically, but you can write a simple script that uses alignment start position on the genome and the BTOP string to extract the sequence.