lh3 / minimap

This repo is DEPRECATED. Please use minimap2, the successor of minimap.
https://github.com/lh3/minimap2
MIT License

output format #5

Closed zeeev closed 8 years ago

zeeev commented 8 years ago

Dear Heng,

I'm not sure the output matches the README. I'm guessing the output format is:

  1. query name
  2. q start
  3. q end
  4. q strand
  5. target name
  6. t start
  7. t end
  8. number of matching bases?
  9. 255? I have no idea what this is.
  10. number of co-linear minimizers?

000000F_quiver_patched 36230911 1450029 1466108 - chr19 58617616 192435 209606 8619 17171 255 cm:i:1248

Thanks.

--Zev

lh3 commented 8 years ago

See manpage: minimap.1.

255 is the mapping quality, but minimap does not compute it.
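For readers trying to line the example up with the man page, here is a minimal parsing sketch. It assumes the line follows the 12-column PAF layout plus optional tags. The field names below are illustrative labels of my own, not identifiers from minimap, so please verify each column against minimap.1. The real output is TAB-delimited; the sketch splits on any whitespace so that the pasted example also parses.

```python
# Assumed column layout (PAF-style). The names are illustrative labels,
# not taken from minimap itself; check minimap.1 for the authoritative list.
PAF_FIELDS = [
    ("query_name", str), ("query_length", int),
    ("query_start", int), ("query_end", int),
    ("strand", str),
    ("target_name", str), ("target_length", int),
    ("target_start", int), ("target_end", int),
    ("num_matching", int),   # assumed: number of matching bases
    ("block_length", int),   # assumed: bases in the mapping, including gaps
    ("mapq", int),           # 255 means "not computed" per the reply above
]


def parse_paf_line(line):
    """Split one output line into a dict of typed fields plus a tag dict."""
    cols = line.split()
    if len(cols) < len(PAF_FIELDS):
        raise ValueError(f"expected at least {len(PAF_FIELDS)} columns, got {len(cols)}")
    record = {name: cast(value) for (name, cast), value in zip(PAF_FIELDS, cols)}
    tags = {}
    for tag in cols[len(PAF_FIELDS):]:
        # Optional fields look like KEY:TYPE:VALUE, e.g. cm:i:1248
        key, type_code, value = tag.split(":", 2)
        tags[key] = int(value) if type_code == "i" else value
    record["tags"] = tags
    return record


example = ("000000F_quiver_patched 36230911 1450029 1466108 - chr19 "
           "58617616 192435 209606 8619 17171 255 cm:i:1248")
hit = parse_paf_line(example)
print(hit["query_start"], hit["query_end"], hit["strand"], hit["mapq"], hit["tags"]["cm"])
```

Under these assumptions the example line has 13 columns rather than 10: the second and seventh columns would be the query and target sequence lengths, which the guessed list in the question leaves out.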

zeeev commented 8 years ago

More Qs:

Are you chaining in both directions? i.e. no overlapping query?

-O depends on -r right? Is -r only in query space?

Thanks.

lh3 commented 8 years ago

> Are you chaining in both directions?

Yes. Requiring co-linearity.

> i.e. no overlapping query?

In one hit, the minimizers are strictly co-linear. However, different hits may have overlaps on query – if that is what you mean by overlapping.

> -O depends on -r right?

Yes.

> Is -r only in query space?

-r is in the "diagonal space". Say query and database sequences are on the same strand (for simplicity). Hits are clustered based on "x-y", where x is the coordinate on the query and y the coordinate on the database sequence.
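To make the "x-y" idea concrete, here is a simplified sketch of clustering minimizer matches by diagonal. It is not minimap's actual implementation: it assumes same-strand matches only, uses plain single-linkage on sorted diagonals, and the function name, the `r` parameter, and the coordinates are all hypothetical.

```python
def cluster_by_diagonal(matches, r):
    """Group (x, y) matches whose diagonal x - y is within r of the
    previous match's diagonal, after sorting by diagonal.

    x is the query coordinate, y the database coordinate.
    Simplified illustration only: same strand, single-linkage.
    """
    ordered = sorted(matches, key=lambda m: (m[0] - m[1], m[0]))
    clusters = []
    last_diag = None
    for x, y in ordered:
        diag = x - y
        if clusters and diag - last_diag <= r:
            clusters[-1].append((x, y))   # close enough: same cluster
        else:
            clusters.append([(x, y)])     # gap in diagonal space: new cluster
        last_diag = diag
    return clusters


# Hypothetical matches: diagonals are -1000, -1002, -995, -4000, -3999.
matches = [(100, 1100), (150, 1152), (400, 1395), (5000, 9000), (5050, 9049)]

print(len(cluster_by_diagonal(matches, 10)))  # wide band
print(len(cluster_by_diagonal(matches, 3)))   # narrow band
```

With these made-up coordinates the wide band gives two clusters and the narrow band gives three, because the match at diagonal -995 sits 5 away from its neighbour and only joins the cluster when the band is wide enough.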

lh3 commented 8 years ago

> different hits may have overlaps on query

By this I mean minimap is a multi-mapper. BWA-MEM is a best mapper by default. BWA-MEM may work as a multi-mapper, but it often misses hits when there are more than several similar hits.
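If it helps to see which hits overlap on the query, a small check like the following could be run over the parsed output. This is only a sketch: it assumes query intervals are 0-based and half-open, and the helper names and coordinates are hypothetical rather than anything minimap provides.

```python
from itertools import combinations


def query_overlap(a, b):
    """Overlap length of two half-open (start, end) query intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def overlapping_hit_pairs(hits):
    """Return (i, j, overlap_length) for every pair of hits that share query bases."""
    pairs = []
    for i, j in combinations(range(len(hits)), 2):
        length = query_overlap(hits[i], hits[j])
        if length > 0:
            pairs.append((i, j, length))
    return pairs


# Hypothetical query intervals for three hits from the same contig.
hits = [(0, 5000), (4000, 9000), (9000, 12000)]
print(overlapping_hit_pairs(hits))
```

In this made-up example only the first two hits share query bases (positions 4000 to 5000); the third starts exactly where the second ends, so under the half-open assumption it does not count as overlapping.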

zeeev commented 8 years ago

I'm mapping human contigs of Mb+ size to GRCh38. I'm using minimap to identify large stretches of contiguity, then using BWA-MEM to align through those regions. I'm concerned about losing large inversions and deletions if I only use BWA-MEM. Is that reasonable? I'd like to avoid overlapping contigs relative to the reference genome.

lh3 commented 8 years ago

I don't have enough experience to give a good recommendation. My gut feeling is that for large events, minimap will do better, as it gives you most of the long hits. However, if you use the default -r, you may lose smaller events, as those minimizer matches will be grouped with the larger chain. Also, BWA-MEM does not work well for Mb+ contigs: it is too slow. I believe BLASR will be better.

zeeev commented 8 years ago

Thanks for all your help.

minimap followed by BLASR or BWA-MEM should recover all of the small events.