Closed liu-wq closed 1 year ago
Do I need to do a few more steps here: 1.maf-swap 2.maf-sort 3.last-split 4.maf-swap 5.maf-sort 6.last-postmask
I think your script is guaranteed to have no overlapping blocks at all, in either genome. I guess your blocks actually don't overlap in mm10. The meaning of the - strand coordinates is a bit tricky, could that be it?
Just now I tried out what I was thinking. mm10 still has overlapping areas. So strand "-" might be the influencing factor. I'm not sure how the last-split algorithm determines unique hits. Does the script not consider the direction of the chain? So how do we solve this?
Sorry for the inaccurate result of the translation software, "the direction of the chain" is corrected to "the direction of the strand".
What are the criteria for removing overlapping areas? For example, what block should I keep, the longest? The biggest Score? I could also write a script to remove it specifically.
In MAF coordinates, this is always the case: end = start + len Even for - strands. (For - strands, the coordinates are in the reverse-complemented sequence.) I'm just guessing, but maybe you subtracted instead of adding?
Yes, I used subtracted. when I downloaded h38-mm10-2.maf.gz from github. And I have corrected the calculation to convert all - strand positions to + strand positions, so that your one to one file reference and query do not overlap anymore.
if strand + : start = start end = start+length
if strand - : start = chr_length - start - length end = chr_length - start
But my file still have overlap.
Maybe there's something wrong with my process. I not only split the hg38 chromosome but I also split the mm10 chromosome and ran parallel alignments.
step1:
lastal -P8 -D1e9 -m100 -p hg38.chr$i-mm10.chr$j.mat hg38.chr$i mm10.chr$j.fa |last-split -f=MAF+> hg38.chr$i-mm10.chr$j.maf
step 2:
cat hg38.chr*-mm10.chr*.maf >many2one.maf
step 3:
last-split -r many2one.maf | last-postmask > one2one.maf
Since I also split mm10, do I need to add a split step between step2 and step3.
last-split many2one.maf > many2one_2.maf
I try this step but still have overlap. : (
like this: block 1 : s h38_chr6 133726878 415 + 170805979 s NC_000067.7 192076210 479 - 195154279 the start convert to 3,077,590 , end convert to 3,078,069
block 2: s h38_chr16 20838404 565 + 90338345 s NC_000067.7 3077504 607 + 195154279 the end is 3,078,111
block2_start —— block1_start —— block1_end —— block2_end
Why? I differ from the sample files in that I split the query files, and I last-train each of the split query files.
Your adding / subtracting looks perfect now.
The problem is splitting the reference genome (hg38 in your case). You need to give the whole un-split genome to lastdb
. Otherwise it's impossible for the 1st alignment step to find globally many-to-one alignments.
Thank you! I will change the script and try it.
And I want to get collinearity across 30 mammals. To determine orthologous genes and homologous non-coding regions, the block obtained should be as long as possible. How should I balance parameters between speed and accuracy?
Now I used MAM4 and -m 100 for all mammals.
To determine orthologous genes and homologous non-coding regions, the block obtained should be as long as possible.
I would be careful about that. Some genome alignment methods try too hard to get long blocks: https://doi.org/10.1186/s13059-015-0670-9.
The maf-convert
-J
option may be useful: https://gitlab.com/mcfrith/last/-/blob/main/doc/maf-convert.rst
Now I used MAM4 and -m 100 for all mammals.
I think that's extremely slow-and-sensitive. Great if it doesn't take too much time (or memory). I would probably omit -m 100
to reduce the compute burden.
Hi, Thank you for your answer, which is very helpful to me.
The maf-convert -J option may be useful: https://gitlab.com/mcfrith/last/-/blob/main/doc/maf-convert.rst
last is a really comprehensive software! I would have written a script to handle it, but now I don't.
I think that's extremely slow-and-sensitive. Great if it doesn't take too much time (or memory). I would probably omit -m 100 to reduce the compute burden.
I have tried the alignment without splitting the ref. It's faster than splitting. Two 112-CPU servers, 14 mammals, a total of 355 tasks were run in parallel for one night. But species that are closely related to humans, such as mice and rats, have total one2one block length (~620MB) that is less than that of species such as Carnivores or ungulates (>1GB). In the last, I removed the non-chromosome parts (unplaced scaffold, mitochondria, etc.) of ref. and query. Is this the normal length? What parameters do I need to adjust to increase the total effective alignment length of rodents?
The mouse and rat genomes have evolved at a high rate (maybe because of short generation time), so they are more diverged from human than are some other mammals (e.g. dog: https://doi.org/10.1093/nar/gku104).
During evolution, many DNA blocks have been deleted, and there have been many transposon insertions, so 620MB might be what's left after that.
Hi, I have a question of one to one block.
For example, use this script to align hg38 to mm10: ` lastal -P8 -D1e9 -m100 -p my.train myDB genome2.fa |last-split -f=MAF+> many-to-one.maf
last-split -r many-to-one.maf | last-postmask > out.maf ` (The h38 was reference genome. ) In out.maf, h38 does not have overlapping blocks, but mm10 still has overlapping blocks aligned to different areas of h38.
like this: chr start end len strand chr start end len strand h38_chr1 207431560 207431874 314 + NC_000067.7 269948 269612 336 - h38_chr1 207431938 207432653 715 + NC_000067.7 270514 269810 704 -