loneknightpy / idba

low rate of input reads mapping to contigs #15

Open fwhelan opened 8 years ago

fwhelan commented 8 years ago

Hello,

I'd be grateful to anyone who could give me some advice or help with the following. I am running a test dataset through idba_ud. The test dataset consists of ~10 bacterial species (metagenomic) broken into 150 bp paired-end reads. I modified my idba install as explained here http://bit.ly/1P2VMlU to accommodate the input. My usage is below, but I essentially make contigs from an input file, then use bowtie2 to align the reads from that input file back to the contig.fa output of idba. When I do so, my alignment rate is roughly 65%, which is lower than I would have imagined.

Is this expected? I cannot figure out, or find documentation on, whether idba_ud would be throwing out reads. My input data has fake, high quality scores to ensure that read quality isn't an issue. I haven't played around with the bowtie2 scoring defaults.

When I do a similar test with real metagenomic data, I get a similar alignment rate. When I do a similar test with genomic data and a different assembler, the alignment rate is ~99%, but with idba_ud it is ~50%.

I would be happy to hear any and all opinions. Thank you!

--Fiona

$fq2fa --merge --filter reads1.fastq reads2.fastq reads.fa
$idba_ud -r reads.fa -o idba_out/ --min_contig 50
number of threads 24
reads 199502
long reads 0
extra reads 0
read_length 150
kmer 20
kmers 7317208 7298126
merge bubble 3751
contigs: 42819 n50: 245 max: 2210 mean: 178 total length: 7638759 n80: 131
aligned 80918 reads
confirmed bases: 688566 correct reads: 1040 bases: 28
distance mean 240.59 sd 93.1852
seed contigs 37131 local contigs 85638
kmer 40
kmers 7340190 7312415
merge bubble 772
contigs: 33047 n50: 340 max: 3134 mean: 253 total length: 8392599 n80: 150
aligned 107120 reads
confirmed bases: 992784 correct reads: 1674 bases: 29
distance mean 264.159 sd 90.9357
seed contigs 32834 local contigs 66094
kmer 60
kmers 6839939 6812671
merge bubble 369
contigs: 28675 n50: 400 max: 3422 mean: 291 total length: 8362177 n80: 150
aligned 113868 reads
confirmed bases: 1086059 correct reads: 1827 bases: 4
distance mean 270.413 sd 89.6104
seed contigs 28675 local contigs 57350
kmer 80
kmers 6241857 6216146
merge bubble 223
contigs: 25270 n50: 435 max: 3422 mean: 320 total length: 8088398 n80: 173
aligned 114768 reads
confirmed bases: 1112629 correct reads: 1899 bases: 0
distance mean 273.898 sd 88.4771
seed contigs 25270 local contigs 50540
kmer 100
kmers 5667844 5644274
merge bubble 157
contigs: 21570 n50: 470 max: 5957 mean: 354 total length: 7642588 n80: 213
reads 199502
aligned 113829 reads
distance mean 277.28 sd 87.3132
expected coverage 6.02324e-07
edgs 27
contigs: 21543 n50: 471 max: 5957 mean: 354 total length: 7638711 n80: 213
$chdir idba_out/
$bowtie2-build contig.fa contig
$bowtie2 -x contig -1 ../reads1.fastq -2 ../reads2.fastq -S test.sam
99762 reads; of these:
  99762 (100.00%) were paired; of these:
    50873 (50.99%) aligned concordantly 0 times
    48148 (48.26%) aligned concordantly exactly 1 time
    741 (0.74%) aligned concordantly >1 times
    ----
    50873 pairs aligned concordantly 0 times; of these:
      7270 (14.29%) aligned discordantly 1 time
    ----
    43603 pairs aligned 0 times concordantly or discordantly; of these:
      87206 mates make up the pairs; of these:
        73290 (84.04%) aligned 0 times
        12294 (14.10%) aligned exactly 1 time
        1622 (1.86%) aligned >1 times
63.27% overall alignment rate
$bowtie2 -x contig -f ../reads.fa -S bowtie.sam
199502 reads; of these:
  199502 (100.00%) were unpaired; of these:
    72937 (36.56%) aligned 0 times
    121259 (60.78%) aligned exactly 1 time
    5306 (2.66%) aligned >1 times
63.44% overall alignment rate
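
For anyone reading the two bowtie2 summaries above, the overall alignment rate can be reproduced from the listed counts. The sketch below is only that arithmetic, not bowtie2's own code; the function and variable names are made up for illustration, and the numbers are copied from the log.

```python
def paired_overall_rate(pairs, conc_once, conc_multi, disc_once,
                        mate_once, mate_multi):
    """Fraction of individual mates that aligned at least once."""
    # Every concordantly or discordantly aligned pair contributes two mates.
    aligned_mates = 2 * (conc_once + conc_multi + disc_once)
    # Mates from pairs that did not align as a pair are counted one by one.
    aligned_mates += mate_once + mate_multi
    return aligned_mates / (2 * pairs)


def unpaired_overall_rate(reads, once, multi):
    """Fraction of unpaired reads that aligned at least once."""
    return (once + multi) / reads


# Counts taken from the paired-end run (-1/-2) in the log above.
paired_rate = paired_overall_rate(
    pairs=99762, conc_once=48148, conc_multi=741,
    disc_once=7270, mate_once=12294, mate_multi=1622,
)
# Counts taken from the unpaired run (-f reads.fa) in the log above.
unpaired_rate = unpaired_overall_rate(reads=199502, once=121259, multi=5306)

print(f"paired run:   {paired_rate:.2%}")
print(f"unpaired run: {unpaired_rate:.2%}")
```

Both values round to the figures bowtie2 printed (63.27% and 63.44%), so in either mode a bit more than a third of the input reads have no match in contig.fa.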

loneknightpy commented 8 years ago

@fwhelan How did you generate the reads? What is the sequencing depth? If the depth is very low, you may want to set --min_count 1.
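
A rough way to answer the depth question from the numbers already in the log is sketched below. It rests on assumptions that are not from the thread: the true combined genome size is unknown, so the final assembly length (7638711 bp) is used as a stand-in, which most likely overestimates the depth; and k-mer counts are modelled as Poisson with uniform coverage and error-free reads. All function names are hypothetical.

```python
import math


def mean_depth(num_reads, read_length, genome_bp):
    """Average per-base depth: sequenced bases divided by genome bases."""
    return num_reads * read_length / genome_bp


def kmer_depth(depth, read_length, k):
    """Expected k-mer coverage; each read yields read_length - k + 1 k-mers."""
    return depth * (read_length - k + 1) / read_length


def fraction_seen_exactly_once(lam):
    """Poisson probability that a k-mer is observed exactly once (mean lam)."""
    return lam * math.exp(-lam)


NUM_READS = 199_502    # "reads" line of the idba_ud log
READ_LENGTH = 150      # "read_length" line of the idba_ud log
# ASSUMPTION: final assembly length used in place of the unknown genome size.
ASSUMED_GENOME_BP = 7_638_711

depth = mean_depth(NUM_READS, READ_LENGTH, ASSUMED_GENOME_BP)
lam = kmer_depth(depth, READ_LENGTH, k=20)
singleton_fraction = fraction_seen_exactly_once(lam)

print(f"estimated mean depth: {depth:.2f}x")
print(f"expected 20-mer coverage: {lam:.2f}")
print(f"modelled share of 20-mers seen exactly once: {singleton_fraction:.1%}")
```

Under this simplified model a noticeable share of genuine k-mers occurs only once even at a few-fold depth. A k-mer filter that requires a count of at least 2 would drop those, which is consistent with the suggestion to lower min_count to 1 for shallow data.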

fwhelan commented 8 years ago

Thanks, @loneknightpy. Setting min_count=1 improved my alignment rate to ~99% with the in silico data and 88.7% with the real data, which I'm happy with.

Is there a detailed manual covering all of idba_ud's options, beyond what's available with idba_ud -h? That might help me in the future with simple questions like these.

Thanks again!