thackl / minidot

Fast and pretty dotplots for whole genomes assemblies using minimap and R/ggplot2
MIT License
74 stars 10 forks source link

minimap2 integration #9

Closed MichelMoser closed 7 years ago

MichelMoser commented 7 years ago

Hello, I try to run minidot on two 1.5Gb assemblies but get a segmentation fault while mapping. Could you help me with this?

The output:

Loading required package: proto
[15:08:42] /home/mmoser/minimap2/minimap2 .. ok
[15:08:42] samtools .. ok
[15:08:43] prepping .. done
[15:11:08] mode: -x asm5 -t 8 ( bp) "-x asm5 -t 8"
[15:11:08] mapping ../home/mmoser/minidot/bin/minidot: line 186: 92534 Segmentation fault      $MBIN $MOPT $TFA $TFA >> $PAF 2>> $TML
 failed
[M::mm_idx_gen::208.491*1.09] collected minimizers
[M::mm_idx_gen::212.217*1.20] sorted minimizers
[M::main::212.217*1.20] loaded/built the index for 2 target sequence(s)
[M::mm_mapopt_update::215.412*1.20] mid_occ = 753; max_occ = 3868
[M::mm_idx_stat] kmer size: 19; skip: 19; is_HPC: 0; #seq: 2
[M::mm_idx_stat::216.554*1.20] distinct minimizers: 82627568 (41.52% are singletons); average occurrences: 3.351; average spacing: 10.127

thank you, michel

thackl commented 7 years ago

Hi Michel,

this is a problem with large genomes - my scripts concatenates all contigs, and minimap has problems with GB sized sequences... But you should have a look at the minidot script that comes with https://github.com/lh3/miniasm. This should also work with minimap2.

Cheers Thomas

MichelMoser commented 7 years ago

Thanks for the quick answer! could you point me to the minidot script you mean, i could not find any in the https://github.com/lh3/miniasm repository. On twitter, i found some comments about aligning human assemblies to each other. Would be nice if you could share the commands used to perform this. best, michel

thackl commented 7 years ago

Ah, you are right, it is somewhat hidden. You need to download and make the miniasm repository. The minidot script is a binary that needs compiling.

I haven't really used that minidot tool, so I can't help you out with commands, sorry. But you might wanna check out this repo for some more information: https://github.com/zeeev/minimap#running-example-gorilla-vs-grch38

MichelMoser commented 7 years ago

Thanks for the link and information!

MichelMoser commented 7 years ago

Sorry to bother you a last time, i try to recreate the arabidopsis example running minimap and then feeding paf into minidot.R. Could you tell me what the specifics are of the length-file which is used under -l?

-l LEN per set sequence lengths

is the name of assembly and length of each sequence needed or name of each sequence and its length?

ath     19698289
ath     23459830
ath     18585056
ath     26975502
ath     366924
ath     154478
aly     33132539
aly     19320864
aly     24464547
aly     23328337
aly     21221946
aly     25113588
aly     24649197
aly     22951293
~                      

thanks , michel

thackl commented 7 years ago

Ehm, so for the comparison of two or more genomes, I concatenate all contigs of each genome into one continuous whole-genome-pseudo-contig. Otherwise minimap would also run comparisons of different chromosomes/contigs within the same genomes. During concatenation, I use the name of the assembly file as new assembly/sequence ID. The length file contains the new assembly ID and length of each original contig, so that I can add the respective break lines to the plot. Does that make sense?

cssulliv commented 4 years ago

Hi @thackl I am hoping to do the same thing as I have 4 whole genomes that contain multiple contigs. I was wondering if you could share how you concatenate them those sequences without having to go into each fasta file and manually making those changes to have file with 4 whole-genome-pseudo-contigs

thackl commented 4 years ago

I'm actually kind of doing that, i.e. I am writing a temporary fasta file with merged sequences. You can find the exact code in the main script around line 167 https://github.com/thackl/minidot/blob/master/bin/minidot. It's not very sophisticated, but should work quite robustly. What happens more or less is this:

We start with two files with multiple contigs:

# A.fa
>a1
agtgc
>a2
aggggata
>a3
catcat

# B.fa
>b1
ttgga
>b2
tgtaagattccatg

then run this snipped ...

(for fasta in `ls *.fa`; do
  # write merged seq header
  echo ">${fasta%.fa}";
  # write merged seq
  grep -v '^>' $fasta |    # read only seqs, no headers
    tr -d '\n' |           # remove newlines
    fold -w 5              # add newline at right place
  echo                     # add newline at end
done;) > merged.fa

to get one fasta file with each genome merged

>A
agtgc
agggg
ataca
tcat
>B
ttgga
tgtaa
gattc
catg

On that file, I run minimap. Hope that helps