lixin6135 / pysam

Automatically exported from code.google.com/p/pysam
0 stars 0 forks source link

BAM to Bio.SeqRecord (with quality) #137

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
This is an enhancement proposal, not a defect.

Biopython is the standard bioinformatics module in Python. Sequences with 
quality scores (e.g. Sanger phred) are supported via the SeqRecord object 
(http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc32). A function that 
converts an iterable of BAM reads into SeqRecords (including quality scores) 
would be useful.

The non-trivial part is the quality annotation, because pysam keeps them as 
ASCII codes, whereas Bio.SeqRecord stores them as integers. A conversion 
function must be written with a switch for the different quality systems 
(solexa, illumina 1.3-1.7, sanger + illumina 1.8+, etc.). The relevant code can 
be taken from Biopython at 
https://github.com/biopython/biopython/blob/master/Bio/SeqIO/QualityIO.py

I attach a stub of the function below, which only works for phred sanger + 
illumina 1.8+, but can be easily extended. It is partially copy-paste from 
QualityIO.py, but adapted to our problem.

Would you find this useful?

Original issue reported on code.google.com by iosonofa...@gmail.com on 18 Sep 2013 at 8:44

Attachments:

GoogleCodeExporter commented 8 years ago
Hi Fabio,

thank you for your suggestion. On first thought I don't think this is something 
that should be part of pysam. The sam/bam format simply stores quality scores. 
I have chosen to return a string, but this can be easily converted to integer 
values:

   quality_scores = [ ord(x) for x in read.qual ]

In my opinion, interpreting these values (such as shifting/converting) is best 
left to the user - the only person knowing the encoding scheme. Wrapping a read 
with a biopython object should be just a few lines of code.

Best wishes,
Andreas

Original comment by andreas....@gmail.com on 18 Sep 2013 at 7:04

GoogleCodeExporter commented 8 years ago
Yes, it's a few lines only, but somewhat tricky to do efficiently (see the 
code). I thought this might be generally useful as SeqRecord is the standard 
format for FASTQ in Python, but if you are not willing to maintain that 
conversion code, that's also fine.

Original comment by iosonofa...@gmail.com on 19 Sep 2013 at 7:55

GoogleCodeExporter commented 8 years ago
Hi Fabio,

thanks. I would like pysam to keep concentrating on wrapping functionality 
within 
C-samtools - without adding external dependencies.

That being said, I have just now added Fastqfile that wraps Li-Heng's kseq.h.

Best wishes,
Andreas

Original comment by andreas....@gmail.com on 24 Sep 2013 at 1:00

GoogleCodeExporter commented 8 years ago
Closed.

Original comment by andreas....@gmail.com on 26 Nov 2013 at 9:33