Closed MichelMoser closed 7 years ago
Hi Michel,
this is a problem with large genomes: my script concatenates all contigs, and minimap has problems with Gb-sized sequences. But you should have a look at the minidot script that comes with https://github.com/lh3/miniasm. This should also work with minimap2.
Cheers Thomas
Thanks for the quick answer! Could you point me to the minidot script you mean? I could not find one in the https://github.com/lh3/miniasm repository. On Twitter, I found some comments about aligning human assemblies to each other. It would be nice if you could share the commands used to do this. Best, Michel
Ah, you are right, it is somewhat hidden. You need to download the miniasm repository and run make. minidot there is a binary that needs compiling, not a script.
I haven't really used that minidot tool, so I can't help you out with commands, sorry. But you might wanna check out this repo for some more information: https://github.com/zeeev/minimap#running-example-gorilla-vs-grch38
Thanks for the link and information!
Sorry to bother you one last time. I am trying to recreate the Arabidopsis example by running minimap and then feeding the PAF into minidot.R. Could you tell me the expected format of the length file that is passed with -l?
-l LEN per set sequence lengths
Does each line need the name of the assembly and the length of each sequence, or the name of each sequence and its length?
ath 19698289
ath 23459830
ath 18585056
ath 26975502
ath 366924
ath 154478
aly 33132539
aly 19320864
aly 24464547
aly 23328337
aly 21221946
aly 25113588
aly 24649197
aly 22951293
Thanks, Michel
Ehm, so for the comparison of two or more genomes, I concatenate all contigs of each genome into one continuous whole-genome-pseudo-contig. Otherwise minimap would also run comparisons of different chromosomes/contigs within the same genomes. During concatenation, I use the name of the assembly file as new assembly/sequence ID. The length file contains the new assembly ID and length of each original contig, so that I can add the respective break lines to the plot. Does that make sense?
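To illustrate how such a length file can be turned into break positions, here is a minimal sketch. It is not taken from minidot itself: the function name `contig_breaks` is hypothetical, and it assumes the length file has two whitespace-separated columns (assembly ID, contig length) as in the example above. The running sum of contig lengths per assembly gives the coordinate in the pseudo-contig where each original contig ends.

```shell
# Hypothetical helper: print the cumulative end position of every contig
# within its assembly's concatenated pseudo-contig.
# Input:  length file with "<assembly ID> <contig length>" per line (assumed format).
# Output: "<assembly ID><TAB><end position>" per line, in input order.
contig_breaks() {
    awk '{
        offset[$1] += $2                       # running total per assembly ID
        printf "%s\t%.0f\n", $1, offset[$1]    # %.0f keeps large totals out of exponent notation
    }' "$1"
}
```

Called as `contig_breaks lengths.txt` (with `lengths.txt` standing in for your own length file), the first two ath lines above would give end positions 19698289 and 43158119.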
Hi @thackl, I am hoping to do the same thing, as I have 4 whole genomes that each contain multiple contigs. I was wondering if you could share how you concatenate those sequences, so that I don't have to go into each fasta file and make the changes manually to end up with a file of 4 whole-genome-pseudo-contigs.
I'm actually kind of doing that, i.e. I am writing a temporary fasta file with merged sequences. You can find the exact code in the main script around line 167 https://github.com/thackl/minidot/blob/master/bin/minidot. It's not very sophisticated, but should work quite robustly. What happens more or less is this:
We start with two files with multiple contigs:
# A.fa
>a1
agtgc
>a2
aggggata
>a3
catcat
# B.fa
>b1
ttgga
>b2
tgtaagattccatg
Then run this snippet ...
(for fasta in *.fa; do
    # skip the output file, which also matches *.fa
    [ "$fasta" = merged.fa ] && continue
    # write merged seq header
    echo ">${fasta%.fa}"
    # write merged seq
    grep -v '^>' "$fasta" |  # read only seqs, no headers
        tr -d '\n' |         # remove newlines
        fold -w 5            # rewrap to 5 bases per line (toy width for this example)
    echo                     # add newline at end
done) > merged.fa
... to get one fasta file with each genome merged:
>A
agtgc
agggg
ataca
tcat
>B
ttgga
tgtaa
gattc
catg
On that file, I run minimap. Hope that helps
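The matching length file (assembly ID plus the length of each original contig) could be produced along these lines. This is only a sketch, not the code from the minidot script: the function name `contig_lengths` is made up for illustration, and it assumes that the assembly ID is the file name without its `.fa` suffix and that a tab-separated two-column layout is acceptable.

```shell
# Hypothetical helper: print "<assembly ID><TAB><contig length>" for every
# contig in each fasta file given as an argument.
# Assumption: the assembly ID is the file name without the .fa suffix.
contig_lengths() {
    for fasta in "$@"; do
        id=$(basename "$fasta" .fa)
        awk -v id="$id" '
            /^>/ {                                   # header line starts a new contig
                if (seen) printf "%s\t%.0f\n", id, len
                len = 0; seen = 1; next
            }
            { len += length($0) }                    # sequence line: add its bases
            END { if (seen) printf "%s\t%.0f\n", id, len }
        ' "$fasta"
    done
}
```

For the two example files, `contig_lengths A.fa B.fa > lengths.txt` would write the lengths 5, 8 and 6 for A, followed by 5 and 14 for B.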
Hello, I am trying to run minidot on two 1.5 Gb assemblies but get a segmentation fault while mapping. Could you help me with this?
The output:
Thank you, Michel