Open sjackman opened 7 years ago
As a bed or psl like feature? It would speed things along if you attached a sample file and a screenshot of what the alignments should look like.
And WHY to do tool creators have to create new formats?
As a bed or psl like feature?
The CIGAR string may be optionally encoded using the cg:Z:
tag, in which case it can be displayed like a SAM file. If there's no CIGAR string, then like a BED would be fine. No CIGAR is probably the typical case. The PAF is convertible to BED with…
gawk -F'\t' -vOFS='\t' '{ print $6, $8, $9, $1 ":" $3 "-" $4, $12, $5 }'
Percent identity $10 / $11
and
query coverage ($4 - $3) / $2
are really useful to display in a tool tip.
Here's a small sample gist: https://gist.github.com/sjackman/5da1f062bfdd7526d47b4e9b088b142f I can give you a bigger one if it's helpful.
And WHY to do tool creators have to create new formats?
OK. The conversion code is helpful.
The gist has some strange chromosome names, like "42", is that correct?
I am not going to try to parse the cg:Z tag at this time,as you indicate that no cigar is the typical case. However I point out that your gist has cigar tags.
Thanks, Jim! Yes, the reference is a de novo assembly, so 42
is contig ID.
@sjackman FYI you can convert SAM to PAF, check out htsbox samview in https://github.com/lh3/htsbox
Can you convert PAF to SAM? Apologies this fell off the radar, but to be honest its going to struggle to get priority for a while. I would, of course, accept a pull request.
On Wed, Apr 18, 2018 at 9:28 PM, Martino notifications@github.com wrote:
@sjackman https://github.com/sjackman FYI you can convert SAM to PAF, check out htsbox samview in https://github.com/lh3/htsbox
— You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub https://github.com/igvteam/igv/issues/478#issuecomment-382606936, or mute the thread https://github.com/notifications/unsubscribe-auth/AA49HGRpMGgkX4ma_VX9kRwV8hqiKuhIks5tqBKKgaJpZM4QGwtQ .
Hi, I just wanted to throw in my opinion, which is that minimap2 has become extremely popular with the advent of long reads and T2T mappings. I know the full alignment PAF is not the dominant form of PAF out there but it comes up a lot in my work.
To answer your question, there is no paf-to-sam, but there is a sam-to-paf, and i could provide some simple test files containing inter-converted F and R alignments if that would help.
PAF looks a lot like PSL with fewer columns, and some of the column rearranged, to me. A simple script to cut/paste columns should be all that's required to confit a PAF file to PSL, which is already supported. Has anyone tried that?
PAF is supported, but the cigar is not. I would just like to be able to view the indels/SNPs as they appear in a normal BAM. At the moment, when I load any PAF, whether it has a cigar tag or not, it comes up as a solid blue block, and contains no information about the alignment.
The following would be helpful, a small PAF file with a cigar string and a corresponding SAM file, so I can compare results, aligned to either one of our standard genomes or a fasta that can be downloaded (preferably small). If someone can do this I will try to find a few hours to work on it.
OK here is an archive: https://rlorigro-public-files.s3.us-west-1.amazonaws.com/reads/test/igv/test_alignments.zip
It has a reference file:
HG002.mat.cur.20211005.S7.fasta
And two alignments that are in SAM, BAM, and PAF (with cg:Z:
tag).
Forward alignment:
0_1871_m_vs_S7.bam
0_1871_m_vs_S7.paf
0_1871_m_vs_S7.sam
Reverse alignment:
0_999_m_vs_S7.bam
0_999_m_vs_S7.paf
0_999_m_vs_S7.sam
And all the corresponding index files.
How large are these files? Are they typically (or ever) tabix indexed?
Do you mean PAFs in general?
PAFs can be as large as SAMs. For our purposes we have been doing assembly-to-ref alignment which only amounts to a few GB of query sequence, but it is possible that people use them for aligning high coverage (30-60x) nanopore, which would be hundreds of GB. I haven't seen anyone index them yet but I'm sure someone has.
Well you won't be able to load large PAFs in IGV if they are not indexed, obviously, although the best solution in that case would be to use SAM output and convert to BAM. There doesn't seem to be any BAM equivalent for PAF.
I will build in support for tabix indexed PAFs in case anyone is doing that.
ok that sounds great. Yes I don't know of any binarized alternative either
OK, I naively approached this as a parsing problem and treated the "paf" records as alignments, in the IGV sense, however the issue is larger than that. The alignment in 0_999 is ~17 MB long. The IGV alignment code is not designed currently to display alignments this long, if you try by setting the "visibility window" to a large enough value and load the BAM file it will freeze, on just a single alignment. Furthermore, zooming out to see this entire alignment would not be useful as it would appear as a solid block of indels. IGV is probably not the right tool for this.
That's fair, but I don't think I would ever look at the entire 17Mbp alignment at once. Regarding the PAF cigar support issue, I don't think the contents of this particular PAF should be the deciding factor, because the information content of any PAF is essentially the same as a SAM when you have the cigar tag.
Yes, but if I implemented support for loading PAF as an alignment, i.e. PAF == SAM, it would be a regression because PAF files that previously worked to some degree would now hang IGV. Parsing the cigar to create thick and thin blocks as in an annotation track would not work either as there is no representation of insertions in that type of track. So the real issue, that should be solved first, is IGV cannot load an alignment > a mb or so. I will create another git issue for that, when its solved I'll return to treating PAF files as alignment files.
Hi, I'd like to chip-in here, for a new user case which stems from the newly released miniprot (https://github.com/lh3/miniprot), that makes it easy to have PAF/GFF files of cross-species annotations. It would be great for IGV to display the protein sequence mismatches as detailed in the commented out PAF lines for each line in the GFF format. E.g. see cs tag and example below:
https://github.com/lh3/minimap2#the-cs-optional-tag
##PAF ENSP00000479511.2 114 0 114 + A2 171471747 158478209 158478678 240 342 0 AS:i:395 ms:i:430 np:i:91 da:i:0 do:i:33 cg:Z:16M130U97M cs:Z::1*gctT*gtcI:2*ttcL*tatC:2*gccG*cttF:1*cttF:3*gG~gt127ag-gt:1*gtgM*gatE:1*agtD*gtcI*tccY:4*cacY*tacL:1*tcaI:16*gagD*aacK:4*cgtQ:4*aagM*gcaE:1*cagH:7*attV:1*accS:2*gaaK*gtaG*gagD*gccL:2*ggaE:7*aagT:9*accA*agcR:2*cagH:2*ctgQ:1*ttcL:5
A2 miniprot mRNA 158478210 158478678 430 + . ID=MP000001;Rank=1;Identity=0.7018;Positive=0.7982;Target=ENSP00000479511.2 1 114
A2 miniprot CDS 158478210 158478258 54 + 0 Parent=MP000001;Rank=1;Identity=0.5882;Target=ENSP00000479511.2 1 16
A2 miniprot CDS 158478386 158478678 376 + 2 Parent=MP000001;Rank=1;Identity=0.7216;Target=ENSP00000479511.2 17 114
##PAF ENSP00000488775.1 114 0 0 * * 0 0 0 0 0 0
##PAF ENSP00000488267.1 114 0 114 + A2 171471747 158375153 158387722 210 342 0 AS:i:339 ms:i:374 np:i:84 da:i:0 do:i:54 cg:Z:16M12230U97M cs:Z::2*tccP*aggG:4*gtgE*atgL:1*tgtY*gtgL:3*gG~gt12227ag-gt:1*gcgV*ggtE:7*agaT:2*gtcI:1*ggcT*agcR:1*gggQ*aagQ*gccV*gtcT:1*aaaR:1*cgtS:1*atgI:3*agaS:15*ctcI*attF:1*ctcY*tttA*ggcN*cagE*gagL*acgR:1*gagS*aaaE:2*ctcF:1*agtN*cacR:2*gttG*gaaR:2*agtH:1*gacC*agcC:1*cagE*ctgM:1*atgV:1*tccA:7*tcgL:4*agcR:2
A2 miniprot mRNA 158375154 158387722 374 + . ID=MP000002;Rank=1;Identity=0.6140;Positive=0.7368;Target=ENSP00000488267.1 1 114
A2 miniprot CDS 158375154 158375202 59 + 0 Parent=MP000002;Rank=1;Identity=0.6471;Target=ENSP00000488267.1 1 16
A2 miniprot CDS 158387430 158387722 315 + 2 Parent=MP000002;Rank=1;Identity=0.6082;Target=ENSP00000488267.1 17 114
Please consider supporting the pairwise mapping format of minimap2: https://github.com/lh3/miniasm/blob/master/PAF.md#paf-a-pairwise-mapping-format
Note that minimap2 also supports outputting SAM. It's nice not to have to do the alignments twice though (once for PAF and once for SAM).