mcfrith / last-genome-alignments

48 stars 5 forks source link

Is it correct that there are overlapping blocks of query in one-to-one Maf file? #17

Closed liu-wq closed 1 year ago

liu-wq commented 1 year ago

Hi, I have a question of one to one block.
For example, use this script to align hg38 to mm10: ` lastal -P8 -D1e9 -m100 -p my.train myDB genome2.fa |last-split -f=MAF+> many-to-one.maf

last-split -r many-to-one.maf | last-postmask > out.maf ` (The h38 was reference genome. ) In out.maf, h38 does not have overlapping blocks, but mm10 still has overlapping blocks aligned to different areas of h38.

like this: chr start end len strand chr start end len strand h38_chr1 207431560 207431874 314 + NC_000067.7 269948 269612 336 - h38_chr1 207431938 207432653 715 + NC_000067.7 270514 269810 704 -

liu-wq commented 1 year ago

Do I need to do a few more steps here: 1.maf-swap 2.maf-sort 3.last-split 4.maf-swap 5.maf-sort 6.last-postmask

mcfrith commented 1 year ago

I think your script is guaranteed to have no overlapping blocks at all, in either genome. I guess your blocks actually don't overlap in mm10. The meaning of the - strand coordinates is a bit tricky, could that be it?

liu-wq commented 1 year ago

Just now I tried out what I was thinking. mm10 still has overlapping areas. So strand "-" might be the influencing factor. I'm not sure how the last-split algorithm determines unique hits. Does the script not consider the direction of the chain? So how do we solve this?

liu-wq commented 1 year ago

Sorry for the inaccurate result of the translation software, "the direction of the chain" is corrected to "the direction of the strand".

liu-wq commented 1 year ago

What are the criteria for removing overlapping areas? For example, what block should I keep, the longest? The biggest Score? I could also write a script to remove it specifically.

mcfrith commented 1 year ago

In MAF coordinates, this is always the case: end = start + len Even for - strands. (For - strands, the coordinates are in the reverse-complemented sequence.) I'm just guessing, but maybe you subtracted instead of adding?

liu-wq commented 1 year ago

Yes, I used subtracted. when I downloaded h38-mm10-2.maf.gz from github. And I have corrected the calculation to convert all - strand positions to + strand positions, so that your one to one file reference and query do not overlap anymore.

if strand + : start = start end = start+length

if strand - : start = chr_length - start - length end = chr_length - start

But my file still have overlap.

liu-wq commented 1 year ago

Maybe there's something wrong with my process. I not only split the hg38 chromosome but I also split the mm10 chromosome and ran parallel alignments.

step1: lastal -P8 -D1e9 -m100 -p hg38.chr$i-mm10.chr$j.mat hg38.chr$i mm10.chr$j.fa |last-split -f=MAF+> hg38.chr$i-mm10.chr$j.maf step 2: cat hg38.chr*-mm10.chr*.maf >many2one.maf step 3: last-split -r many2one.maf | last-postmask > one2one.maf

Since I also split mm10, do I need to add a split step between step2 and step3. last-split many2one.maf > many2one_2.maf

liu-wq commented 1 year ago

I try this step but still have overlap. : (

like this: block 1 : s h38_chr6 133726878 415 + 170805979 s NC_000067.7 192076210 479 - 195154279 the start convert to 3,077,590 , end convert to 3,078,069

block 2: s h38_chr16 20838404 565 + 90338345 s NC_000067.7 3077504 607 + 195154279 the end is 3,078,111

block2_start —— block1_start —— block1_end —— block2_end

Why? I differ from the sample files in that I split the query files, and I last-train each of the split query files.

mcfrith commented 1 year ago

Your adding / subtracting looks perfect now.

The problem is splitting the reference genome (hg38 in your case). You need to give the whole un-split genome to lastdb. Otherwise it's impossible for the 1st alignment step to find globally many-to-one alignments.

liu-wq commented 1 year ago

Thank you! I will change the script and try it.

And I want to get collinearity across 30 mammals. To determine orthologous genes and homologous non-coding regions, the block obtained should be as long as possible. How should I balance parameters between speed and accuracy?

Now I used MAM4 and -m 100 for all mammals.

mcfrith commented 1 year ago

To determine orthologous genes and homologous non-coding regions, the block obtained should be as long as possible.

I would be careful about that. Some genome alignment methods try too hard to get long blocks: https://doi.org/10.1186/s13059-015-0670-9.

The maf-convert -J option may be useful: https://gitlab.com/mcfrith/last/-/blob/main/doc/maf-convert.rst

Now I used MAM4 and -m 100 for all mammals.

I think that's extremely slow-and-sensitive. Great if it doesn't take too much time (or memory). I would probably omit -m 100 to reduce the compute burden.

liu-wq commented 1 year ago

Hi, Thank you for your answer, which is very helpful to me.

The maf-convert -J option may be useful: https://gitlab.com/mcfrith/last/-/blob/main/doc/maf-convert.rst

last is a really comprehensive software! I would have written a script to handle it, but now I don't.

I think that's extremely slow-and-sensitive. Great if it doesn't take too much time (or memory). I would probably omit -m 100 to reduce the compute burden.

I have tried the alignment without splitting the ref. It's faster than splitting. Two 112-CPU servers, 14 mammals, a total of 355 tasks were run in parallel for one night. But species that are closely related to humans, such as mice and rats, have total one2one block length (~620MB) that is less than that of species such as Carnivores or ungulates (>1GB). In the last, I removed the non-chromosome parts (unplaced scaffold, mitochondria, etc.) of ref. and query. Is this the normal length? What parameters do I need to adjust to increase the total effective alignment length of rodents?

mcfrith commented 1 year ago

The mouse and rat genomes have evolved at a high rate (maybe because of short generation time), so they are more diverged from human than are some other mammals (e.g. dog: https://doi.org/10.1093/nar/gku104).

During evolution, many DNA blocks have been deleted, and there have been many transposon insertions, so 620MB might be what's left after that.