hsinnan75 / GSAlign

GSAlign: an ultra-fast sequence alignment algorithm for intra-species genome comparison
MIT License
51 stars 16 forks source link

maf file output correct? #1

Closed kingralph80 closed 4 years ago

kingralph80 commented 4 years ago

Hi,

I am using GSAlign for the alignment of two diverse maize genomes with parameters: -dp -ind 100 -clr 50 -alen 200 -t 48 \ -gp /usr/bin/gnuplot \

GSAlign runs through without errors, however, when I try to convert the MAF file to PSL by kent tool mafToPsl I always get: Coordinates out of range line 311372 of B73v5_vs_Oh43_ch1_10.maf

Cheers

hsinnan75 commented 4 years ago

Thanks for reporting this issue. I'll check the Maf output.

kingralph80 commented 4 years ago

Big thanks,

Also, I really like GSAlign as it's so fast. Is there a more detailed explanation for some of the parameters? I compared the -sen mode with default but got a shorter alignment and lower ASI with -sen.

hsinnan75 commented 4 years ago

Thank you for saying that. I will keep improving GSAlign to make it more useful. I'll also put more explanation of the parameters in the help message. Thank you again.

kingralph80 commented 4 years ago

Hi I think I figured it out. It's the ---- that are added in the alignment.

When I checked the maf I found for the first line:

maf version=1

a score=1654529 s chr1 102260581 1654865 + 308452471 aaggaaggagcatgcggaagagaaaaat...

When I then checked the length of the alignment, it was 1654992 characters long. Of those 127 were ----. If you subtract that -> 1654865. So this would explain why mafToPsl says out of range and why when I make a chain file from the maf downstream tools also give errors with that chain file.

I assume this is because of gapped alignments? Do you have to count the length of the alignment with --?

hsinnan75 commented 4 years ago

It is wired. Gaps are not counted when I output the alignment length. I'll check the codes again. Thank you!

hsinnan75 commented 4 years ago

I did some tests and confirmed that the gaps are not counted. Could you please give me your sequences? I'd like to use your sequences to check the output. Thank you.

kingralph80 commented 4 years ago

Correct, they are not counted. I thought that would be the issue given the error with mafToPsl. Okay, I will share the sequences through google drive.

kingralph80 commented 4 years ago

Question: If you use any of your maf files that you generate with GSAlign you can convert it to PSL with mafToPsl from uscs tools? I am trying the convert the maf to a chain file and add GenomeAlignmentTools to improve the chain file. Since mafToPsl failed I used maf convert which did convert the GSAlign maf file to chain but if I use that chain with any tool from GenomeAlignmentTools I get errors. Sorry I get to the upload today but could you also test your maf files and if you can convert them to chain format?

hsinnan75 commented 4 years ago

I'll see if I could provide output with PSL format directly in GSAlign. Before that, I'll check my maf output using your dataset. Thank you!

hsinnan75 commented 4 years ago

@kingralph80 I run some tests on your sequences with default parameters and your parameters, and then I use mafToPsl (from kentUtils) to convert maf into psl format, there is no any warnings or errors. The ucsc-maftopsl could not be run in my server, could you please download kentUtils and run that mafToPsl with the maf file. Thank you!

kingralph80 commented 4 years ago

You are right, no error message but the file size is 0 bytes. Could you tell me which kentUtils are you referring too? https://github.com/ENCODE-DCC/kentUtils is what I used I am assuming the psl file you get from our test chr. is not 0 bytes?

Also changed to default parameters and my maf file is 200218210 with Alignment#=25114 (total alignment length=98983799) ANI=98.41%

hsinnan75 commented 4 years ago

It's wired. The PSL file I got was exactly the same number of alignments as GSAlign outputed. We downloaded the same kentUtils. And my command was mafToPsl CML332.chr10 B73.chr10 output.maf output.psl. I found the order of two reference names does not matter. They both produced the same result. I was wonder how is the coverage rate you expected?

kingralph80 commented 4 years ago

Hi. so thanks for resolving this. In the end, the error originally was that the genomes we used have the same chromosome names. All the maize once we compare use chr1 to chr10. The tools we used to compare GSAlign to like cactus, renamed them in the maf using our Input in the command line. GSAlign does not seem to not have such an input parameter. For the test, we then renamed the chroms ourselves a typo in the chromosome name resulted in the 0 byte psl output.

kingralph80 commented 4 years ago

Would it be possible to add an optional input parameter for a suffix for target and query so that if needed the same chr labels can be changed in the output maf? This way the user does not need to rename them all? In our case, its ~30 genomes but others may have many more.

hsinnan75 commented 4 years ago

Hi, I've updated GSAlign to 1.0.14. It adds ref and qry as prefix for reference and query chromosomes, respectively in the MSA/ALN file. I'll add two more parameters to let user input their own prefix/suffix for the two genomes in the future version.

kingralph80 commented 4 years ago

Argh, when I use the full genome not just chr10 we are back to the original error. MafToPsl says: "Coordinates out of range line 103592 of B73v5.CML322.50.200.maf"

I will post the line that causes the error (103592) plus the line prior in the next message but I think mafToPSL is correct. the line that causes the issue starts with: s qry.chr1 304926865 658 + 304927522

If you add 304926865 (Start) + 658 (Alignment) = 304927523, this cannot be if the whole source length is 304927522. Its off by one bp.

kingralph80 commented 4 years ago

The full alignment that causes the error in the maf file:

a score=645 s ref.chr1 308444775 654 + 308452471 gtgttcatggtttcggttttgggtctcgggATTTAGGATTTAAGGGTTTGGGGTTTAGTGTGATGTAGCTCAAGGGATGATAAAAACAACCTTACTTATTTCAGTAGCCAAAACATGGGCGGTGG----GTGGGAGGCATGAAATTGTTGTTCAATGACGAAAACATGTGTCACGGGAGGCCAAAAATAGACTTGCAGGCCTCCGGGAACCAAAACGTGTACTATGGTTTGAACACTTATGTACTATGGTTGGACACTAAATTGGTTGTTCTACCATGAAAACATGTGTTGTAGCTCACGCGAGGCCAAAAACAACCTTACCAGTCTTAGTGGCCAAAACATGGGTGGCGGACGGGAGGCATGAAATCGTTGTTCAACGACGAAAACATGTGTCGCGGGAGGCCAAAAATAGTCTTACCGACCTCCGGGGACCAAAACGTGTACTATGGTTTGGAGACTCAAGTAATCTTCTATTATGGTTTGGTCATTAAAATCTtttagtgttggtgtttagggttatggtttcaggtgtaggTttattgattagggtttggggttcggggcttagggtttagggtttggggtttggggttaagggttcgggtgtggggttcatggttttGggtttggggctcatggtttagggtttaggattTTG s qry.chr1 304926865 658 + 304927522 gtgttcatggtttcggttttgggtctcggggtttaggatttaagggtttggggtttAGTGTGATGTAGCTCAAGGGATGATAAAAACAACCTTACTTATTTCAGTAGCCAAAACATGGGCGGTGGGCGCGCGGGAGGCATGAAATTGTTGTTCAATGACGAAAACATGTGTCACGGGAGGCCAAAAATAGACTTGCAGGCCTCCGGGAACCAAAACATGTACTATGGTTTGAACACTTATGTACTATGGTTGGACACTAAATTGGTTGTTCTACCATGAAAACGTGTGTTGTAGCTCACGCGAGGCCAAAAACAACCTTACCAGTCTTAGTGGCCAAAACATGGGTGGGGGACGGGAGGCATGAAATCGTTGTTCGACGACGAAAACATGTGTAGCGGGAGGCCAAAAATAGTCTTACCGACCTCCGGGGACCAAAACGTGTACTATGGTTTGGAGACTCAAGTAATCTTCTATTATGGTTTGGTCATTAAAATCTtttagtgttggtgtttagggttatggtttcaggtgtaggattattgattagggtttggggttcggggcttagggtttagggtttggggtttggggttaagggttcgggtgtggggttcatggtttttggtttggggctcatggtttagggtttaggattTTG

hsinnan75 commented 4 years ago

Hi,

MAF is 1-base annotation, therefore the last position should be start_pos

12345678 agctagct

The above is a sequence of length 8, and the start is 1, length is 8, so the last position is 1 + 8 - 1 = 8.

I think MafToPsl is wrong about the positioning.

Hsin-Nan

On Wed, Apr 22, 2020 at 3:44 AM kingralph80 notifications@github.com wrote:

Argh, when I use the full genome not just chr10 we are back to the original error. MafToPsl says: "Coordinates out of range line 103592 of B73v5.CML322.50.200.maf"

I will post the line that causes the error (103592) plus the line prior in the next message but I think mafToPSL is correct. the line that causes the issue starts with: s qry.chr1 304926865 658 + 304927522

If you add 304926865 (Start) + 658 (Alignment) = 304927523, this cannot be if the whole source length is 304927522. Its off by one bp.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/hsinnan75/GSAlign/issues/1#issuecomment-617375282, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACOKJKORQQC3JCTWELH2MGLRNXZTVANCNFSM4MC54KEA .

kingralph80 commented 4 years ago

Ok I will post this at the kent utilities.

kingralph80 commented 4 years ago

Hi, I used the GSAlign output maf, converted it to chain (maf-convert) and wanted to use it with picard LiftOverVcf. However, the tool throws out an error that again sounds very similar: "Terminal block seen before end of chain". So I went to the UCSC definition for the maf file format:

under Lines starting with "s" -- a sequence within an alignment block start -- The start of the aligning region in the source sequence. This is a zero-based number. If the strand field is "-" then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).

and under Lines starting with "e" -- information about empty parts of the alignment block start -- The start of the non-aligning region in the source sequence. This is a zero-based number. If the strand field is "-" then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).

From these lines, I would assume maf is 0-based not 1-based,or ? I double-checked and a 1- based coordinate is not mentioned in the definition of maf ( https://genome.ucsc.edu/FAQ/FAQformat.html#format5 ).

HOWEVER, there is a second maf format (Mutation Annotation Format instead of Multiple Alignment format). Are there different definitions of the MAF format?

kingralph80 commented 4 years ago

Sorry for the back and forth. I really just want to be able to use the GSAlign output further.

hsinnan75 commented 4 years ago

Hi, thanks for the information. I've updated GSAlign. Now, the MAF is zero-based. Thank you!

kingralph80 commented 4 years ago

Big thanks for supporting the tool. I will try the new version right away.

kingralph80 commented 4 years ago

I check. Same error although the output looks different:

The line that caused the error: s qry_chr7 180451555 45989 - 180455515

Not sure I see the error. Does - strand means the start is the stop position? I doubled checked. The length of the chrom. 7 is 180455515.

kingralph80 commented 4 years ago

The authors of mafToPSL at UCSC confirmed that maf MAF is a 0-based. In case of this error the MAF definition is:

start -- The start of the aligning region in the source sequence. This is a zero-based number. If the strand field is "-" then this is the start relative to the reverse-complemented source sequence (see Coordinate Transforms).

So the way I understand this should the alignment not read: s qry_chr7 180405566 45989 - 180455515

180451555 - 45989 = Start position and - strand indicates its reverse complement?

hsinnan75 commented 4 years ago

Thanks for the information. And yes, the "-" indicates the reverse-complemented strand.

kingralph80 commented 4 years ago

Dear Hsinnan,

your latest commit solved this reported issue. Thanks a lot for fixing and support your great tool.