igvteam / igv

Integrative Genomics Viewer. Fast, efficient, scalable visualization tool for genomics data and annotations
https://igv.org
MIT License
646 stars 387 forks source link

PAF file format [feature request] #478

Open sjackman opened 7 years ago

sjackman commented 7 years ago

Please consider supporting the pairwise mapping format of minimap2: https://github.com/lh3/miniasm/blob/master/PAF.md#paf-a-pairwise-mapping-format

Note that minimap2 also supports outputting SAM. It's nice not to have to do the alignments twice though (once for PAF and once for SAM).

jrobinso commented 7 years ago

As a bed or psl like feature? It would speed things along if you attached a sample file and a screenshot of what the alignments should look like.

And WHY to do tool creators have to create new formats?

sjackman commented 7 years ago

As a bed or psl like feature?

The CIGAR string may be optionally encoded using the cg:Z: tag, in which case it can be displayed like a SAM file. If there's no CIGAR string, then like a BED would be fine. No CIGAR is probably the typical case. The PAF is convertible to BED with…

gawk -F'\t' -vOFS='\t' '{ print $6, $8, $9, $1 ":" $3 "-" $4, $12, $5 }'

Percent identity $10 / $11 and query coverage ($4 - $3) / $2 are really useful to display in a tool tip.

Here's a small sample gist: https://gist.github.com/sjackman/5da1f062bfdd7526d47b4e9b088b142f I can give you a bigger one if it's helpful.

And WHY to do tool creators have to create new formats?

🙄 xkcd: Standards

jrobinso commented 7 years ago

OK. The conversion code is helpful.

jrobinso commented 7 years ago

The gist has some strange chromosome names, like "42", is that correct?

I am not going to try to parse the cg:Z tag at this time,as you indicate that no cigar is the typical case. However I point out that your gist has cigar tags.

sjackman commented 7 years ago

Thanks, Jim! Yes, the reference is a de novo assembly, so 42 is contig ID.

noncodo commented 6 years ago

@sjackman FYI you can convert SAM to PAF, check out htsbox samview in https://github.com/lh3/htsbox

jrobinso commented 6 years ago

Can you convert PAF to SAM? Apologies this fell off the radar, but to be honest its going to struggle to get priority for a while. I would, of course, accept a pull request.

On Wed, Apr 18, 2018 at 9:28 PM, Martino notifications@github.com wrote:

@sjackman https://github.com/sjackman FYI you can convert SAM to PAF, check out htsbox samview in https://github.com/lh3/htsbox

— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/igvteam/igv/issues/478#issuecomment-382606936, or mute the thread https://github.com/notifications/unsubscribe-auth/AA49HGRpMGgkX4ma_VX9kRwV8hqiKuhIks5tqBKKgaJpZM4QGwtQ .

rlorigro commented 2 years ago

Hi, I just wanted to throw in my opinion, which is that minimap2 has become extremely popular with the advent of long reads and T2T mappings. I know the full alignment PAF is not the dominant form of PAF out there but it comes up a lot in my work.

To answer your question, there is no paf-to-sam, but there is a sam-to-paf, and i could provide some simple test files containing inter-converted F and R alignments if that would help.

jrobinso commented 2 years ago

PAF looks a lot like PSL with fewer columns, and some of the column rearranged, to me. A simple script to cut/paste columns should be all that's required to confit a PAF file to PSL, which is already supported. Has anyone tried that?

http://uswest.ensembl.org/info/website/upload/psl.html

rlorigro commented 2 years ago

PAF is supported, but the cigar is not. I would just like to be able to view the indels/SNPs as they appear in a normal BAM. At the moment, when I load any PAF, whether it has a cigar tag or not, it comes up as a solid blue block, and contains no information about the alignment.

jrobinso commented 2 years ago

The following would be helpful, a small PAF file with a cigar string and a corresponding SAM file, so I can compare results, aligned to either one of our standard genomes or a fasta that can be downloaded (preferably small). If someone can do this I will try to find a few hours to work on it.

rlorigro commented 2 years ago

OK here is an archive: https://rlorigro-public-files.s3.us-west-1.amazonaws.com/reads/test/igv/test_alignments.zip

It has a reference file:

HG002.mat.cur.20211005.S7.fasta

And two alignments that are in SAM, BAM, and PAF (with cg:Z: tag).

Forward alignment:

0_1871_m_vs_S7.bam
0_1871_m_vs_S7.paf
0_1871_m_vs_S7.sam

Reverse alignment:

0_999_m_vs_S7.bam
0_999_m_vs_S7.paf
0_999_m_vs_S7.sam

And all the corresponding index files.

jrobinso commented 2 years ago

How large are these files? Are they typically (or ever) tabix indexed?

rlorigro commented 2 years ago

Do you mean PAFs in general?

PAFs can be as large as SAMs. For our purposes we have been doing assembly-to-ref alignment which only amounts to a few GB of query sequence, but it is possible that people use them for aligning high coverage (30-60x) nanopore, which would be hundreds of GB. I haven't seen anyone index them yet but I'm sure someone has.

jrobinso commented 2 years ago

Well you won't be able to load large PAFs in IGV if they are not indexed, obviously, although the best solution in that case would be to use SAM output and convert to BAM. There doesn't seem to be any BAM equivalent for PAF.

I will build in support for tabix indexed PAFs in case anyone is doing that.

rlorigro commented 2 years ago

ok that sounds great. Yes I don't know of any binarized alternative either

jrobinso commented 2 years ago

OK, I naively approached this as a parsing problem and treated the "paf" records as alignments, in the IGV sense, however the issue is larger than that. The alignment in 0_999 is ~17 MB long. The IGV alignment code is not designed currently to display alignments this long, if you try by setting the "visibility window" to a large enough value and load the BAM file it will freeze, on just a single alignment. Furthermore, zooming out to see this entire alignment would not be useful as it would appear as a solid block of indels. IGV is probably not the right tool for this.

rlorigro commented 2 years ago

That's fair, but I don't think I would ever look at the entire 17Mbp alignment at once. Regarding the PAF cigar support issue, I don't think the contents of this particular PAF should be the deciding factor, because the information content of any PAF is essentially the same as a SAM when you have the cigar tag.

jrobinso commented 2 years ago

Yes, but if I implemented support for loading PAF as an alignment, i.e. PAF == SAM, it would be a regression because PAF files that previously worked to some degree would now hang IGV. Parsing the cigar to create thick and thin blocks as in an annotation track would not work either as there is no representation of insertions in that type of track. So the real issue, that should be solved first, is IGV cannot load an alignment > a mb or so. I will create another git issue for that, when its solved I'll return to treating PAF files as alignment files.

avilella commented 2 years ago

Hi, I'd like to chip-in here, for a new user case which stems from the newly released miniprot (https://github.com/lh3/miniprot), that makes it easy to have PAF/GFF files of cross-species annotations. It would be great for IGV to display the protein sequence mismatches as detailed in the commented out PAF lines for each line in the GFF format. E.g. see cs tag and example below:

https://github.com/lh3/minimap2#the-cs-optional-tag

##PAF   ENSP00000479511.2       114     0       114     +       A2      171471747       158478209       158478678       240     342     0       AS:i:395        ms:i:430        np:i:91 da:i:0  do:i:33 cg:Z:16M130U97M cs:Z::1*gctT*gtcI:2*ttcL*tatC:2*gccG*cttF:1*cttF:3*gG~gt127ag-gt:1*gtgM*gatE:1*agtD*gtcI*tccY:4*cacY*tacL:1*tcaI:16*gagD*aacK:4*cgtQ:4*aagM*gcaE:1*cagH:7*attV:1*accS:2*gaaK*gtaG*gagD*gccL:2*ggaE:7*aagT:9*accA*agcR:2*cagH:2*ctgQ:1*ttcL:5
A2      miniprot        mRNA    158478210       158478678       430     +       .       ID=MP000001;Rank=1;Identity=0.7018;Positive=0.7982;Target=ENSP00000479511.2 1 114
A2      miniprot        CDS     158478210       158478258       54      +       0       Parent=MP000001;Rank=1;Identity=0.5882;Target=ENSP00000479511.2 1 16
A2      miniprot        CDS     158478386       158478678       376     +       2       Parent=MP000001;Rank=1;Identity=0.7216;Target=ENSP00000479511.2 17 114
##PAF   ENSP00000488775.1       114     0       0       *       *       0       0       0       0       0       0
##PAF   ENSP00000488267.1       114     0       114     +       A2      171471747       158375153       158387722       210     342     0       AS:i:339        ms:i:374        np:i:84 da:i:0  do:i:54 cg:Z:16M12230U97M       cs:Z::2*tccP*aggG:4*gtgE*atgL:1*tgtY*gtgL:3*gG~gt12227ag-gt:1*gcgV*ggtE:7*agaT:2*gtcI:1*ggcT*agcR:1*gggQ*aagQ*gccV*gtcT:1*aaaR:1*cgtS:1*atgI:3*agaS:15*ctcI*attF:1*ctcY*tttA*ggcN*cagE*gagL*acgR:1*gagS*aaaE:2*ctcF:1*agtN*cacR:2*gttG*gaaR:2*agtH:1*gacC*agcC:1*cagE*ctgM:1*atgV:1*tccA:7*tcgL:4*agcR:2
A2      miniprot        mRNA    158375154       158387722       374     +       .       ID=MP000002;Rank=1;Identity=0.6140;Positive=0.7368;Target=ENSP00000488267.1 1 114
A2      miniprot        CDS     158375154       158375202       59      +       0       Parent=MP000002;Rank=1;Identity=0.6471;Target=ENSP00000488267.1 1 16
A2      miniprot        CDS     158387430       158387722       315     +       2       Parent=MP000002;Rank=1;Identity=0.6082;Target=ENSP00000488267.1 17 114