ncbi / magicblast

fasta header handling, non-canonical splice site detection, strand information coding #16

Closed HegedusB closed 3 years ago

HegedusB commented 4 years ago

While using Magic-BLAST I have found some additional issues I would like to ask about. First, I observed that the aligner has some problems with FASTA headers: if the reference sequence header contains an “_” character, it will not appear in the output SAM file. Another problem I have observed is with the detection of non-canonical splice sites: it often misses GC-AG and CT-AC splice sites. Finally, I have a feature request regarding strand information encoding. Unfortunately, the current encoding used by Magic-BLAST is not recognized by downstream programs, which follow the conventions of minimap2. Would it be possible to include a formatting option to “mimic” minimap2 output from Magic-BLAST?

boratyng commented 4 years ago

Could you post an example of a FASTA header with "_" that does not appear in the SAM output? It would help us track down the issue.

Non-canonical splice sites are very rare, so when implementing Magic-BLAST we decided to err on the side of caution. Magic-BLAST requires much better quality alignments to call a non-canonical splice site; otherwise it detects a lot of false positive splice sites. In future versions we plan to use genome annotation so that the alignment quality restriction can be lifted for known non-canonical splice sites. We also hope to improve our aligner to give better alignments so that we can call non-canonical splice sites with more confidence. Please let me know if these solutions would not work for you. Would a command-line option lifting the alignment quality restriction work for you?
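To make the terminology concrete, here is a minimal sketch (not Magic-BLAST code) that labels an intron by its donor and acceptor dinucleotides. The function names and the motif table are assumptions made for illustration: GT-AG is treated as canonical, and GC-AG and AT-AC as the commonly recognized non-canonical pairs. Note that an intron on the minus strand appears reverse-complemented on the forward strand, so a forward-strand CT...AC corresponds to GT...AG on the minus strand.

```python
# Hypothetical helper for illustration only; not part of Magic-BLAST.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Assumed motif table: (donor, acceptor) -> label.
MOTIFS = {
    ("GT", "AG"): "canonical",
    ("GC", "AG"): "non-canonical",
    ("AT", "AC"): "non-canonical",
}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_intron(intron, strand="+"):
    """Label an intron given its forward-strand sequence and its strand.

    The first two and last two bases (after orienting the sequence to the
    transcript strand) are looked up in MOTIFS; anything else is "other".
    """
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron must be at least 4 bases long")
    if strand == "-":
        seq = reverse_complement(seq)
    return MOTIFS.get((seq[:2], seq[-2:]), "other")
```

For example, `classify_intron("CTGAAAACTTAC", "-")` is labeled canonical, while the same sequence read as a plus-strand intron falls into "other".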

About the encoding of strand information, do you mean reporting it as "+" or "-" in the 5th column (PAF format), instead of as bits in the SAM flag? We will look into reporting PAF format or something similar in future releases.
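As a rough illustration of the difference, the sketch below derives a PAF-style "+"/"-" character from a SAM record. It assumes only the SAM convention that bit 0x10 of the FLAG field marks a reverse-complemented alignment; the function names are hypothetical, and it covers just the flag bit mentioned here, not any strand tags a downstream tool might expect.

```python
# Hypothetical helpers for illustration; not part of Magic-BLAST or minimap2.
SAM_FLAG_REVERSE = 0x10  # SAM FLAG bit: read aligned as reverse complement


def strand_from_flag(flag):
    """Return "-" if the reverse-complement bit is set in the SAM FLAG, else "+"."""
    return "-" if flag & SAM_FLAG_REVERSE else "+"


def strand_from_sam_line(line):
    """Return the strand character for one tab-separated SAM alignment line.

    The FLAG is the second column of an alignment record; header lines
    (starting with "@") carry no FLAG and are rejected.
    """
    if line.startswith("@"):
        raise ValueError("header lines carry no strand information")
    fields = line.rstrip("\n").split("\t")
    return strand_from_flag(int(fields[1]))
```

A wrapper script could use `strand_from_sam_line` to add a strand column while converting records for a tool that does not read the flag bits.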

boratyng commented 4 years ago

Could you let me know what program requires minimap2 strand encoding? Thanks.

HegedusB commented 4 years ago

I am sorry for the late answer! I am using a fungal genome from the JGI. The FASTA headers of the assembly look like this: >scaffold_9, >scaffold_90, etc.
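One way to pin down which reference names go missing is to compare the FASTA IDs with the names in the SAM header. The sketch below is only a diagnostic idea: it assumes the SAM output contains `@SQ` header lines with `SN:` fields, and all function names are made up for illustration.

```python
# Hypothetical diagnostic helpers; both functions take iterables of text lines.
def fasta_ids(lines):
    """Return sequence IDs: the first whitespace-delimited token after '>'."""
    ids = []
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            ids.append(line[1:].split()[0])
    return ids


def sam_reference_names(lines):
    """Return the SN: values from @SQ header lines of a SAM file."""
    names = []
    for line in lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def missing_from_sam(fasta_lines, sam_lines):
    """List FASTA IDs, in file order, that have no matching @SQ name."""
    in_sam = set(sam_reference_names(sam_lines))
    return [name for name in fasta_ids(fasta_lines) if name not in in_sam]
```

Calling `missing_from_sam(open("genome.fa"), open("out.sam"))` (both file names are placeholders) would list the scaffolds whose names did not survive into the SAM header.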

I am using Illumina-corrected nanopore reads, so the read quality is good. An option that allows the use of non-canonical splice sites would be great.

I am using the ONT pinfish pipeline. This pipeline works perfectly with minimap2 but cannot recognize the strand information when I use Magic-BLAST.

Thank you very much for dealing with my problem!