malariagen / pipelines

Pipelines for processing malaria parasite and mosquito genome sequence data.
MIT License

Add support for cram, bam, and fastq inputs to the ShortReadAlignment.wdl #39

Closed: gbggrant closed this 4 years ago

gbggrant commented 4 years ago

Add support for cram, bam, and fastq inputs to the ShortReadAlignment.wdl

kbergin commented 4 years ago

Is there a documentation update that goes with this change? Otherwise, v exciting to see the flexibility of input types!

gbggrant commented 4 years ago

@alimanfoo I have run two pairs of tests for this: one (AV0148-C) with the small data files provided by @tnguyensanger (a cram and a pair of fastqs), and the other (AB0252-C) with a large bam and the corresponding pair of fastqs. Outputs of samtools idxstats and flagstat are below. They look okay to me; let me know if you have any concerns.

Short read alignment pipeline

Small cram: AV0148-C faab4017-18fe-4a78-ba26-96a610ec666a

Idxstats:

2R          61545105     13366    4104
3R          53200684      9177    2829
2L          49364325      8838    2649
UNKN        42389979      4551    1281
3L          41963435      9289    2419
X           24393108      8056    2166
Y_unplaced    237045        25      12
Mt             15363        46       9
*                  0         0   23022

Flagstat:

91839 + 0 in total (QC-passed reads + QC-failed reads)
11513 + 0 secondary
0 + 0 supplementary
4078 + 0 duplicates
53348 + 0 mapped (58.09% : N/A)
80326 + 0 paired in sequencing
40163 + 0 read1
40163 + 0 read2
3230 + 0 properly paired (4.02% : N/A)
26366 + 0 with itself and mate mapped
15469 + 0 singletons (19.26% : N/A)
18962 + 0 with mate mapped to a different chr
3537 + 0 with mate mapped to a different chr (mapQ>=5)
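
As a sanity check on the percentages in the flagstat output above, here is a minimal Python sketch (not part of the pipeline) that parses the text and recomputes them. The helper name `parse_flagstat` is hypothetical, and the choice of denominators (total reads for `mapped`, `paired in sequencing` for the paired categories) is an assumption made for illustration; it is consistent with the printed figures.

```python
import re

# Flagstat text copied verbatim from the small-cram run (AV0148-C) above.
FLAGSTAT = """\
91839 + 0 in total (QC-passed reads + QC-failed reads)
11513 + 0 secondary
0 + 0 supplementary
4078 + 0 duplicates
53348 + 0 mapped (58.09% : N/A)
80326 + 0 paired in sequencing
40163 + 0 read1
40163 + 0 read2
3230 + 0 properly paired (4.02% : N/A)
26366 + 0 with itself and mate mapped
15469 + 0 singletons (19.26% : N/A)
18962 + 0 with mate mapped to a different chr
3537 + 0 with mate mapped to a different chr (mapQ>=5)
"""


def parse_flagstat(text):
    """Hypothetical helper: map each flagstat label to its QC-passed count."""
    counts = {}
    for line in text.strip().splitlines():
        match = re.match(r"(\d+) \+ \d+ (.+)", line)
        if match is None:
            continue
        # Drop the trailing "(xx.xx% : N/A)" so labels are stable across runs.
        label = re.sub(r" \([\d.]+% : N/A\)$", "", match.group(2))
        counts[label] = int(match.group(1))
    return counts


counts = parse_flagstat(FLAGSTAT)
total = counts["in total (QC-passed reads + QC-failed reads)"]
paired = counts["paired in sequencing"]

# Assumed denominators: total reads for "mapped", paired reads for the rest.
pct_mapped = round(100 * counts["mapped"] / total, 2)
pct_proper = round(100 * counts["properly paired"] / paired, 2)
pct_single = round(100 * counts["singletons"] / paired, 2)
print(pct_mapped, pct_proper, pct_single)  # 58.09 4.02 19.26
```

The low properly-paired percentage here is simply what the small test file reports; the sketch only confirms that the printed percentages follow from the counts.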

Small fastqs: AV0148-C 5d511047-bed2-4c8d-9507-54c5439144e7

Idxstats:

2R          61545105     13377    4023
3R          53200684      9086    2825
2L          49364325      8919    2719
UNKN        42389979      4523    1294
3L          41963435      9231    2424
X           24393108      8136    2168
Y_unplaced    237045        21       8
Mt             15363        46       8
*                  0         0   23022

Flagstat:

91830 + 0 in total (QC-passed reads + QC-failed reads)
11504 + 0 secondary
0 + 0 supplementary
3993 + 0 duplicates
53339 + 0 mapped (58.08% : N/A)
80326 + 0 paired in sequencing
40163 + 0 read1
40163 + 0 read2
3232 + 0 properly paired (4.02% : N/A)
26366 + 0 with itself and mate mapped
15469 + 0 singletons (19.26% : N/A)
18862 + 0 with mate mapped to a different chr
3518 + 0 with mate mapped to a different chr (mapQ>=5)
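
To make the cram-versus-fastq comparison for AV0148-C easier to read, here is a minimal Python sketch that diffs the two flagstat outputs above and prints only the categories whose counts differ. It is an illustration rather than part of the pipeline; `parse_flagstat` and `diff_counts` are hypothetical helper names.

```python
import re

# Flagstat outputs copied from the two AV0148-C runs above.
CRAM_RUN = """\
91839 + 0 in total (QC-passed reads + QC-failed reads)
11513 + 0 secondary
0 + 0 supplementary
4078 + 0 duplicates
53348 + 0 mapped (58.09% : N/A)
80326 + 0 paired in sequencing
40163 + 0 read1
40163 + 0 read2
3230 + 0 properly paired (4.02% : N/A)
26366 + 0 with itself and mate mapped
15469 + 0 singletons (19.26% : N/A)
18962 + 0 with mate mapped to a different chr
3537 + 0 with mate mapped to a different chr (mapQ>=5)
"""

FASTQ_RUN = """\
91830 + 0 in total (QC-passed reads + QC-failed reads)
11504 + 0 secondary
0 + 0 supplementary
3993 + 0 duplicates
53339 + 0 mapped (58.08% : N/A)
80326 + 0 paired in sequencing
40163 + 0 read1
40163 + 0 read2
3232 + 0 properly paired (4.02% : N/A)
26366 + 0 with itself and mate mapped
15469 + 0 singletons (19.26% : N/A)
18862 + 0 with mate mapped to a different chr
3518 + 0 with mate mapped to a different chr (mapQ>=5)
"""


def parse_flagstat(text):
    """Hypothetical helper: map each flagstat label to its QC-passed count."""
    counts = {}
    for line in text.strip().splitlines():
        match = re.match(r"(\d+) \+ \d+ (.+)", line)
        if match is None:
            continue
        # Strip the percentage suffix so the same category matches in both runs.
        label = re.sub(r" \([\d.]+% : N/A\)$", "", match.group(2))
        counts[label] = int(match.group(1))
    return counts


def diff_counts(a, b):
    """Return {label: b - a} for the labels whose counts differ."""
    return {label: b[label] - a[label] for label in a if a[label] != b[label]}


diff = diff_counts(parse_flagstat(CRAM_RUN), parse_flagstat(FASTQ_RUN))
for label, delta in diff.items():
    print(f"{delta:+5d}  {label}")
```

For these two runs the input read counts (`paired in sequencing`, `read1`, `read2`) are identical, and the change of 9 in the total and mapped counts equals the change in secondary alignments. The remaining differences are in duplicates, properly paired reads, and mates mapped to a different chromosome.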

Large bam: AB0252-C 77f099d1-425b-4021-8dcc-77f4760d23d1

Idxstats:

2R          61545105  10376172   47262
3R          53200684   9047935   48973
2L          49364325   8584117   51320
UNKN        42389979   6957095   24980
3L          41963435   7154619   41619
X           24393108   4673674   47236
Y_unplaced    237045     39852     549
Mt             15363    429553     167
*                  0         0  108628

Flagstat:

47633751 + 0 in total (QC-passed reads + QC-failed reads)
1986127 + 0 secondary
0 + 0 supplementary
638900 + 0 duplicates
47263017 + 0 mapped (99.22% : N/A)
45647624 + 0 paired in sequencing
22823812 + 0 read1
22823812 + 0 read2
41954304 + 0 properly paired (91.91% : N/A)
45014784 + 0 with itself and mate mapped
262106 + 0 singletons (0.57% : N/A)
2051380 + 0 with mate mapped to a different chr
932381 + 0 with mate mapped to a different chr (mapQ>=5)

Large fastqs: AB0252-C 55ca8744-0738-48c1-830d-40b677f2dcad

Idxstats:

2R          61545105  10375873   47355
3R          53200684   9046717   49025
2L          49364325   8584887   51411
UNKN        42389979   6957552   24774
3L          41963435   7155320   41603
X           24393108   4673483   47190
Y_unplaced    237045     39865     585
Mt             15363    429562     165
*                  0         0  108628
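
The two idxstats tables for AB0252-C can also be compared contig by contig. The sketch below is a minimal Python illustration (not part of the pipeline) that assumes the standard four idxstats columns: reference name, sequence length, mapped read-segments, and unmapped read-segments. `parse_idxstats` is a hypothetical helper name.

```python
# idxstats output copied from the two AB0252-C runs above.
BAM_RUN = """\
2R 61545105 10376172 47262
3R 53200684 9047935 48973
2L 49364325 8584117 51320
UNKN 42389979 6957095 24980
3L 41963435 7154619 41619
X 24393108 4673674 47236
Y_unplaced 237045 39852 549
Mt 15363 429553 167
* 0 0 108628
"""

FASTQ_RUN = """\
2R 61545105 10375873 47355
3R 53200684 9046717 49025
2L 49364325 8584887 51411
UNKN 42389979 6957552 24774
3L 41963435 7155320 41603
X 24393108 4673483 47190
Y_unplaced 237045 39865 585
Mt 15363 429562 165
* 0 0 108628
"""


def parse_idxstats(text):
    """Hypothetical helper: map contig -> (length, mapped, unmapped)."""
    table = {}
    for line in text.strip().splitlines():
        name, length, mapped, unmapped = line.split()
        table[name] = (int(length), int(mapped), int(unmapped))
    return table


bam = parse_idxstats(BAM_RUN)
fastq = parse_idxstats(FASTQ_RUN)

# Per-contig change in mapped reads (fastq run minus bam run).
delta = {contig: fastq[contig][1] - bam[contig][1] for contig in bam}
# Relative change, skipping the "*" row, which has no mapped reads.
rel = {c: abs(d) / bam[c][1] for c, d in delta.items() if bam[c][1] > 0}

bam_mapped = sum(row[1] for row in bam.values())
fastq_mapped = sum(row[1] for row in fastq.values())

for contig, d in delta.items():
    print(f"{contig:<10} {d:+6d}")
print("largest relative change:", max(rel, key=rel.get))
print("total mapped:", bam_mapped, fastq_mapped)
```

The summed mapped column (47263017 for the bam run, 47263259 for the fastq run) equals the `mapped` count in each run's flagstat output, and every per-contig relative difference is below 0.1%.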

Flagstat:

47633995 + 0 in total (QC-passed reads + QC-failed reads)
1986371 + 0 secondary
0 + 0 supplementary
639268 + 0 duplicates
47263259 + 0 mapped (99.22% : N/A)
45647624 + 0 paired in sequencing
22823812 + 0 read1
22823812 + 0 read2
41953716 + 0 properly paired (91.91% : N/A)
45014780 + 0 with itself and mate mapped
262108 + 0 singletons (0.57% : N/A)
2051784 + 0 with mate mapped to a different chr
932352 + 0 with mate mapped to a different chr (mapQ>=5)

alimanfoo commented 4 years ago

Thanks @gbggrant, just to say these look good to me too.