statgen / bamUtil

http://genome.sph.umich.edu/wiki/BamUtil
89 stars 30 forks source link

Indexing the clipped bam files #16

Closed ghost closed 8 years ago

ghost commented 9 years ago

Hello, I am facing a problem with indexing the clipped bam files using samtools. I am getting errors like bellow. I don't have the same problem when indexing the original bam files (Unclipped). I am not sure if I should ignore this or this is something important and I should try to solve it. Any idea what I should do to solve this problem?

read 'HS24_57:7:2110:1654:80872' mapped to '5' at POS 180043777 to 180043776 has BIN 15670 but should be 1958 In total, there are 9 reads with incorrect BIN fields Fix this by converting BAM->SAM->BAM to force BIN recalculation

pjvandehaar commented 9 years ago

Thanks for reporting this issue.

Have you tried doing what the error says? I think samtools view -h mybamfile.bam | samtools view -b -o mybamfile.new.bam - should work.

ghost commented 9 years ago

Thanks for the comment. I didn't have much hope that this work since people with the same problem did not get any result out of conversion but when I converted bam to sam to bam the indexing of the final bam file was successful. :) I used the command bellow which is the same as what you suggested. samtools view -h -o OUTPUT_SAM INPUT_BAM samtools view -Sb INPUT_SAM > OUTPUT_BAM

mktrost commented 9 years ago

What version of samtools are you using? Are you running the most recent master of bamUtil?

Can you look at the read: 'HS24_57:7:2110:1654:80872'? You can use a command like this: bamUtil/bin/bam writeRegion --readname "HS24_57:7:2110:1654:80872" --in mybamfile.bam --out -

I'd be interested in seeing this read. The error message from samtools index appears to indicate a start position of 180043777 and end position of 180043776 - which is actually smaller than the start. In theory with those values, the bin would be 1958 as it says. But if you don't make the end position less than the start (and instead make it equal), the bin would be 15670.

I'd like to fix clipOverlap if it is supplying incorrect bins, so it would be helpful to see the read to try to determine what the correct bin would be. Also, knowing the version of samtools & bamUtil would help me to recreate the issue.

ghost commented 9 years ago

Okay let me check this. you are right, there are about 10 reads like this and all of them has the end point that is smaller than start point.

Here is the information about this read: read 'HS24_57:7:2304:17611:39905' mapped to '5' at POS 67289089 to 67289088 has BIN 8788 but should be 1098

@RG ID:151795 PL:illumina PU:C672AANXX.7 LB:PX0246 SM:N/A CN:BCCAGSC @PG ID:0 VN:0.5.7 CL:bwa aln; bwa sampe HS24_57:7:2304:17611:39905 81 5 67289089 37 125M = 67289089 -125 TCATTAATTCTTTGTCAGTTGATAGTGTCTTTTATTTCTGTTTTACTTTATATTTTTCTTTCCCTCCCTTTACCAGAAACAATTCAGCCACCAGCCAACCTAGATTAGTATCGACTCGCACTCAG GGGGGGGGGGFGGGGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGFFFGGGGBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCBBB= AM:i:37 MD:Z:125 NM:i:0 RG:Z:151795 SM:i:37 X0:i:1 X1:i:0 XG:i:0 XM:i:0 XO:i:0 XT:A:U HS24_57:7:2304:17611:39905 161 5 67289089 37 125S = 67289089 125 TCATTAATTCTTTGTCAGTTGATAGTGTCTTTTATTTCTGTTTTACTTTATATTTTTCTTTCCCTCCCTTTACCAGAAACAATTCAGCCACCAGCCAACCTAGATTAGTATCGACTCGCACTCAG BBBB@GGGGGGGGGGGGGGGGFGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGEGGGGCGGGG11B>FGDEGFGG11F@BB>GEGEGE0CGGG>:F=@FEGGGFCCE@EG>GE/CGG/FG>C>FDF AM:i:37 MD:Z:125 NM:i:0 RG:Z:151795 SM:i:37 X0:i:1 X1:i:0 XG:i:0 XM:i:0 XO:i:0 XT:A:U

Samtools : 0.1.20 1.0.14 bamUtil

mktrost commented 9 years ago

Yes, that makes sense, thank you. If you run "samtools" by itself, it typically prints the version number in the general help output. I would be interested in the version of samtools, as the version I have (1.2) does not seem to complain about bin 8788, but potentially my bam could be slightly different.

A little more info on your error: your 2nd read, is entirely soft-clipped. This could potentially cause issues with some tools since an entirely soft clipped read is a bit "odd". In bamUtil version 1.14, I added an option, --unmapped, that will mark a read unmapped if it is entirely softclipped.
If you don't want to switch your version of samtools (or that doesn't fix the issue), you can either: 1) convert BAM->SAM->BAM or 2) try running clipOverlap with the --unmapped option

ghost commented 9 years ago

The version of samtools was 1.2. Why one of the reads were soft clipped? When I was running clipOverlap there was an error mentioning that one of the reads didn't have any pair. That should be the same read. 1- How will the conversion helps when there is no read pair? Do you think converting might cause some changes in the new bam file? 2- You mentioned the clipOverlap mark the read as unmapped, How this will help the indexing and not getting the errors again?

Thank you :)

ghost commented 9 years ago

Is this normal when using "unmapped" option to get this error again after clipping:

WARNING: did not find expected overlapping mates for 2 records.

It doesn't seem the unmapped option worked or helped the problem. I am still getting the same error when indexing.

mktrost commented 9 years ago

One of the reads was completely soft clipped because it is completely overlapped by it's mate. ClipOverlap's goal is to soft clip overlapping bases in one of the 2 reads. If they completely overlap, one of the reads will be entirely soft clipped.

The WARNING message should be the same in the run with or without the "unmapped" option. The warning that you see occurs when the mate information on a record appears to be incorrect. ClipOverlap will try to find the mate at the specified position. But if the mate is not there, it can't find it and that warning is produced. One reason this can occur is if you filter the reads prior to calling clipOverlap. For example, if you use samtools view to filter out low quality reads, or reads with certain flags, then clipOverlap will not see those reads. It is possible that one read in a pair is filtered, but the other is not. The non-filtered read will be unable to find it's mate, resulting in the warning message.

I want to verify that you ran clipOverlap with the --unmapped option on the original bam, not on the previous clipOverlap output. If that isn't how you ran, sorry for the confusion, please rerun on the original bam. If you did run on the original BAM, can you run the "bam writeRegion" command on your new output that has the same index error?

ghost commented 9 years ago

Thank you for clear explanation. There might be some filtering on the bam files. I did used the original bam file but I am trying it again. This is the command I used: bam clipOverlap --in original.bam --out output.bam [--unmapped] I will update the post when it's done.

ghost commented 9 years ago

Here is the error that actually related to unmapped parameter:

WARNING: did not find expected overlapping mates for 5 records. Completed ClipOverlap Successfully.

WARNING - Problems encountered parsing command line:

Command line parameter --unmapped ignored

pjvandehaar commented 9 years ago

You need to run

bam clipOverlap --in original.bam --out output.bam --unmapped

instead of

bam clipOverlap --in original.bam --out output.bam [--unmapped]

When you paste exact input and output here on Github, please surround it with three backticks (```, top left on your keyboard), so that it will be displayed without formatting. For example, when you paste your output, put lines above and below it like this:

```
WARNING: did not find expected overlapping mates for 5 records.
Completed ClipOverlap Successfully.
WARNING - 
Problems encountered parsing command line:
Command line parameter [--unmapped] (#6) ignored
```
ghost commented 9 years ago

Thanks I realized that mistake. I am running the clipOverlap and I am trying to see what is wrong with the clipped bam files and converted bam files. I am still getting the errors and using the unmapped parameters just flagged the reads but the errors are still there with the difference that the reads are flagged unmapped.

mktrost commented 9 years ago

When you run "bam writeRegion" on the new output, what does the read look like? It should now be marked as unmapped, and I think that should clear out the Cigar and the read should be treated as length 1 when calculating the bin.

So for an unmapped read with a start position of 67289088 (67289089 minus 1 when you convert to BAM), and length 1 (footnote 18 of the spec), the bin should be 8788 per section 5.3 of https://samtools.github.io/hts-specs/SAMv1.pdf beg = 67289088, end = 67289089

Is samtools still complaining even if when you look at that read it says unmapped?

ghost commented 9 years ago

This is an example of one of the latest errors: read 'HS21_152:7:2207:5154:75361' mapped to '1' at POS 65489 to 65489 has BIN 585 but should be 4684

The unmapped certainly worked and now the reads are flagged as unmapped. I have decided to use the unmapped parameter for all the samples and ignore these error messages.