lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.76k stars 405 forks source link

can I get PAF output converted to blast6 format #504

Closed splaisan closed 4 years ago

splaisan commented 4 years ago

In the QIIME workflow, I am trying to replace the very slow and not appropriate vsearch classification of long-read metagenomics sequencing (ONT) by minimap2 which is very fast for this kind of reads. QIIME expects blast6 output to proceed with the results and I do not find an easy way to convert the minimap2 PAF to proper blastn output 6 (some fields are not shared by both tabular reports and the column order is different) Has anyone done this already and if yes with which code and or tools. Thanks for sharing

lh3 commented 4 years ago

Not possible unfortunately. Minimap2 doesn't compute E-value.

splaisan commented 4 years ago

Thanks for your quick answer Heng, would it be possible for the other columns with an arbitrary Evalue of 0? (or produce a pseudo Evalue and pseuso bit score from the mapping quality score I am not sure the Evalue is used downstream by Qiime anyway (not sure at all so I will ask now)

lh3 commented 4 years ago

I don't know if Qiime is using E-value. Bit score is close to E-value. Minimap2 doesn't output that either. You should be able to get the rest of information if you run minimap2 with -c --cs.

tseemann commented 4 years ago

Would be a useful addition to have a paftools paf2blast6tool. I may consider implementing it.

BLAST6 format:

1. | qseqid | query (e.g., gene) sequence id
-- | -- | --
2. | sseqid | subject (e.g., reference genome) sequence id
3. | pident | percentage of identical matches
4. | length | alignment length
5. | mismatch | number of mismatches
6. | gapopen | number of gap openings
7. | qstart | start of alignment in query
8. | qend | end of alignment in query
9. | sstart | start of alignment in subject
10. | send | end of alignment in subject
11. | evalue | expect value
12. | bitscore | bit score

PAF:


Col | Type | Description
-- | -- | --
1 | string | Query sequence name
2 | int | Query sequence length
3 | int | Query start (0-based; BED-like; closed)
4 | int | Query end (0-based; BED-like; open)
5 | char | Relative strand: "+" or "-"
6 | string | Target sequence name
7 | int | Target sequence length
8 | int | Target start on original strand (0-based)
9 | int | Target end on original strand (0-based)
10 | int | Number of residue matches
11 | int | Alignment block length
12 | int | Mapping quality (0-255; 255 for missing)
Tag | Type | Description
-- | -- | --
tp | A | Type of aln: P/primary, S/secondary and I,i/inversion
cm | i | Number of minimizers on the chain
s1 | i | Chaining score
s2 | i | Chaining score of the best secondary chain
NM | i | Total number of mismatches and gaps in the alignment
MD | Z | To generate the ref sequence in the alignment
AS | i | DP alignment score
ms | i | DP score of the max scoring segment in the alignment
nn | i | Number of ambiguous bases in the alignment
ts | A | Transcript strand (splice mode only)
cg | Z | CIGAR string (only in PAF)
cs | Z | Difference string
dv | f | Approximate per-base sequence divergence
de | f | Gap-compressed per-base sequence divergence
rl | i | Length of query regions harboring repetitive seeds
andrem01 commented 4 years ago

Thanks for your quick answer Heng, would it be possible for the other columns with an arbitrary Evalue of 0? (or produce a pseudo Evalue and pseuso bit score from the mapping quality score I am not sure the Evalue is used downstream by Qiime anyway (not sure at all so I will ask now)

the e-value scores are used by deblur (https://github.com/biocore/deblur) during the positive filtering mode.

splaisan commented 4 years ago

Thanks to all who answered here.

@tseemann , if you could make time for this and there would be a way to mimic Evalue and bitscore from the mapping quality and the content of the cigar or other accessory files, this would definitely be great to have.

I believe that more and more users of long reads (ONT and PB) would benefit from aligning them to references with a rapid software like MM2 rather than blast their reads (QIIME) or use other still slower resources (Vsearch remains very slow for me with my used non-optimal settings - >3 days for 57k 4.4kb qry against 22k ref).

splaisan commented 4 years ago

Thanks for your quick answer Heng, would it be possible for the other columns with an arbitrary Evalue of 0? (or produce a pseudo Evalue and pseuso bit score from the mapping quality score I am not sure the Evalue is used downstream by Qiime anyway (not sure at all so I will ask now)

the e-value scores are used by deblur (https://github.com/biocore/deblur) during the positive filtering mode.

Thanks for this info, I did not know of deblur (I am new to QIIME too) and will look at it closer.

splaisan commented 4 years ago

Some data to fuel the discussion. I hope that this is the place to put this kind of content.

I aligned all deduplicated reads to the rrnDB database with the following minimap2 command:

minimap2 -c --cs -N 5 --secondary=no -x map-ont -t 4 reference-sequences.fasta dna-sequences.fasta > minimap2_out.paf

It completed on 4cpu in less than 9min (I did not even pre-build the index). By comparison, the vsearch command on 84 cpu is still running after 2 days now and at ~70% of inputs (I know that I asked it a lot with --maxrejects 4096 but I wanted to cover a large portion of the database)

vsearch --usearch_global dna-sequences.fasta \
  --id 0.77 \
  --query_cov 0.8 \
  --strand both \
  --maxaccepts 1 \
  --maxrejects 4096 \
  --output_no_hits \
  --db reference-sequences.fasta \
  --threads 84 \
  --blast6out fasta_out

Rapid inspection of the first 100 query results with vsearch and minimap2

results shows a relative global match for the qry_hit pairs and first 100 queries. Divergent mappings may still map to the same bacterial species though.

venn counts based on the merge strings read-id_hit-id:

When this approach can be improved with for instance extra minimap2 stringency (or less!), and when the recall is acceptable, the difference in runtime is spectacular and would really boost metagenomics analysis of long reads.

lh3 commented 4 years ago

Minimap2 is not optimized for sensitivity. BLAST with 11-mer seeds is definitely more sensitive than the map-ont preset. BLASTZ and GraphMap are even more sensitive.

lh3 commented 4 years ago

Minimap2 may be tuned to be a little more sensitive, but by design, it won't be as sensitive as BLASTZ and GraphMap.

tseemann commented 4 years ago

this is totally expected - vsearch is not designed for noisy long reads.

splaisan commented 4 years ago

@tseemann thanks This is exactly why I am teasing people here and there since a month now to get more appropriate tools in qiime ( which is what I am learning now to analyse rDna classification on ONT from 1.4 to 4.4kb 16s its 23s amplicons). Minimap is so efficient that it was my first candidate but the qiime team does not have resources on short term to look at it and I am not enough of a pythonner to do it all by myself from the qiime source so I turned to the community. I am sure it will become the standard very soon :-) there too.

tseemann commented 4 years ago

@splaisan have you used ONT's WIMP/EPI2ME pipeline? also i saw this: https://f1000research.com/articles/7-1755/v2

andrem01 commented 4 years ago

@splaisan I agree with Torsten's suggestion of following Cusco et al., as the best way forward for analysing nanopore-derived 16S rRNA gene data. If you would like to use downstream statistical analyses within q2, you might be able to convert your feature table (e.g., BIOM file produced through pre-processing using minimap2) and import as a qza.