Open jkbonfield opened 9 years ago
Other minor amendment. "For an unmapped paired-end ...".
My desire would be to avoid another change to how the bin is calculated, or a major convention change for unmapped reads.
Since the tool is smart enough to do local assembly, the tool can examine the mate unmapped flag, RNAME, and POS to determine how far back in the genome it needs to search for the missing unmapped reads. It can then do another range query for the upstream missing region and extract the missing mates. Of course, to determine how far back in the genome to look in your example the tool would only need to look for mapped mates with POS less than the 100000 to find the appropriate smallest POS for their unmapped mates. This could also be a library function in htslib with samtools. The performance concern I would have is for RNA-seq data or long-reads where we have large genomic skips (N
operator) so that the POS of the unmapped mate could be much smaller than the query's start range. There are probably other cases too.
I thought cram would return both reads regardless of cigar, or perhaps I'm missing something.
@nh13: What is "the tool"? This is a generic thing and not specific to any single tool. However you are correct that if we know a maximum read length then we can change the range query to be start-maxlen and then post-filter. It's messy though and as you say the maximum read length could be large. Putting unmapped reads into the same bin as their mapped mate is a BAM specific change and doesn't fix CRAM or indeed the common SAM text format.
@vadimzalunin: CRAM doesn't store cigar for unmapped data at all. Rather they are considered to start and end on the same base (the pos field). This means range queries can miss them whereas a range query in BAM may find them. I've seen plenty of cases where unmapped reads have a CIGAR string, due to known bugs in the aligners, which causes them to be found in range queries in BAM but not CRAM. That's not really a criticism of CRAM, but that buggy data leads to different behaviours between BAM and CRAM.
... then allow further analysis, such as local reassembly, to recover more information on the variant.
I think you can examine the reads that have POS < QSTART for their unmapped mate's POS (which is the same as their POS), and that is how much earlier you need to query. This would not need the "maximum read length". For example, if you QSTART like in your example is 100000 and a mapped read with chr1 pos 99950 with cigar 15S80M5S, then we would only need to look back as far as POS 99950.
Yes that makes sense. It's a bit cumbersome as it requires two range queries and a union of the results (hence if we have a known short maximum length then it's easier to use it to adjust a single query). Workable but clunky.
Technically indexing code may take soft clips into accounts. On 17 Jun 2015 15:22, "James Bonfield" notifications@github.com wrote:
Yes that makes sense. It's a bit cumbersome as it requires two range queries and a union of the results (hence if we have a known short maximum length then it's easier to use it to adjust a single query). Workable but clunky.
— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/87#issuecomment-112822106.
From a local assembler perspective, sam/bam/cram is not a particularly good format for local assembly of entire fragments since, in the general case, a discordantly mapped fragment could have the non-local read(s) mapped anywhere and assembly requires the full read information of the mate. In my particular case, I solved this by creating a new (non-standard) mate-coordinate sort order and copying and resorting read pairs not in proper pairs into an intermediate file. Not ideal, but the additional file is only ~10% of the original (depends on discordant mapping rate) and simultaneous traversal of both files allows for streaming local assembly (targeted local assembly would require mate-coordinate indexing as well).
On 17/06/2015 11:57 PM, James Bonfield wrote:
@nh13 https://github.com/nh13: What is "the tool"? This is a generic thing and not specific to any single tool. However you are correct that if we know a maximum read length then we can change the range query to be start-maxlen and then post-filter. It's messy though and as you say the maximum read length could be large. Putting unmapped reads into the same bin as their mapped mate is a BAM specific change and doesn't fix CRAM or indeed the common SAM text format.
@vadimzalunin https://github.com/vadimzalunin: CRAM doesn't store cigar for unmapped data at all. Rather they are considered to start and end on the same base (the pos field). This means range queries can miss them whereas a range query in BAM may find them. I've seen plenty of cases where unmapped reads have a CIGAR string, due to known bugs in the aligners, which causes them to be found in range queries in BAM but not CRAM. That's not really a criticism of CRAM, but that buggy data leads to different behaviours between BAM and CRAM.
— Reply to this email directly or view it on GitHub https://github.com/samtools/hts-specs/issues/87#issuecomment-112810383.
Current behaviour
Section 2 (recommended practices) of the SAM spec states "For a unmapped paired-end or mate-pair read whose mate is mapped, the unmapped read should have RNAME and POS identical to its mate."
Let us consider what the primary goal is behind supplying a reference and position for unmapped data. The theory is that if we have a read-pair with one end mapped and the other end unmapped, possibly due to a structural variation or being outside the parameters allowed by an aligner, then a region query can be used to fetch both ends of the sequence. These then allow further analysis, such as local reassembly, to recover more information on the variant.
This is related to https://github.com/samtools/htslib/pull/208 to fix the issue of unmapped reads with CIGAR strings (a long standing BWA bug). The cigar string is irrelevant as it represents a failed alignment, subsequently converted to unmapped. This triggers warnings in Picard, which the fix resolves.
Proposed change
Consider this example: "samtools view foo.bar chr1:100000-200000" will fetch all sequences overlapping position 100-200k into chr1 including the unmapped data, assuming no explicit filter options.
However, consider the case of a mapped read at chr1 pos 99950 with cigar 15S80M5S. It has an unmapped mate and as per recommended practices this is also at chr1 pos 99950. Our range query for chr1:100000-200000 will pull out the mapped read but not the unmapped read.
Perhaps the correct solution is to amend the spec recommendation to be "For a unmapped paired-end or mate-pair read whose mate is mapped, the unmapped read should have RNAME, POS and CIGAR identical to its mate." This would mean that a range query that pulls out the mapped read is guaranteed to also pull out the unmapped mate/pair. Unfortunately, the #208 pull request mentioned above explicitly makes this not the case. We discussed this at our weekly samtools meeting and accepted the change anyway as an interim solution, but it may need wider consideration.
Implications
Picard will start complaining again as unmapped reads have finite length, meaning the bin field changes once more.
CRAM will fail this range query, essentially falling back to existing BAM behaviour, and would need amendment or at least tacit acceptance in the inability to do this correctly. The reason is that it simply does not (and cannot) store CIGAR strings for unmapped data.
It's just a recommendation. People still need to code around the possibility of not finding a read pair when doing a range query. (This is commonly true anyway even for mapped data, without dealing with the unmapped case.)