mrvollger / SDA

Segmental Duplication Assembler (SDA).
MIT License
44 stars 6 forks source link

How to get target original subreads bam file? #11

Closed Sparkle-27 closed 4 years ago

Sparkle-27 commented 4 years ago

Dear SDA developers: I questioned the reads.orig.bam file which is the raw pacbio data. In general, the whole raw bam file usually more than 100 Gb with high coverage sequencing, using the whole file as reads.orig.bam will be time-consuming. The target collapse region is only small part of the genome which means I only need to extract the corresponding reads mapped to this region, I have got all the reads ID in this region but I have no idea to extract the target raw pacbio bam file of it. Do you have any suggestion? Thanks in advance.

mrvollger commented 4 years ago

SDA requires the reads in the particular region. So you need a whole genome alignment of all the other reads, so you can extract the relevant ones.

For example let us say your region is x:1-10000, you would run:

samtools view -F 260 whole.genome.aligned.bam 'x:1-10000' > reads.orig.bam
samtools faidx whole.genome.fasta  'x:1-10000' > ref.fasta
SDA collapse --coverage {sequencing_depth}

See SDA collapse -h for more details.

If you want to automatically run SDA on the whole genome and identify all collapses, see SDA denovo -h