uclahs-cds / pipeline-align-DNA

Nextflow pipeline to align paired DNA FASTQs and sort, mark duplicates, and index resulting alignment
https://uclahs-cds.github.io/pipeline-align-DNA/
GNU General Public License v2.0
4 stars 1 forks source link

GATK's MarkDuplicatesSpark outputs and empty bam file for an ICGC cohort of WGS FASTQs #223

Open Alfredo-Enrique opened 1 year ago

Alfredo-Enrique commented 1 year ago

Describe the bug I have an ICGC cohort of WGS FASTQs. When run through the align-DNA pipeline they output an empty bam file with no reads. The pipeline goes through completion without any error or issues. When looking at the upstream intermediates, the reads are there and the bam files are of the corresponding size, going all the way to MarkDuplicatesSpark step. Here, the output is an empty (47K) bam file with decoys, but 0 reads. This was confirmed by samtools view -c

Did some testing. Took the output from the immediate step before MarkDuplicatesSpark, which is sorting, and manually ran MarkDuplicatesSpark with reduced flags; it ran and launched, but it also outputted an empty bam file without any error flags. Tried different dockers from various GATK releases and same result, empty bam file.

So possibilities 1) it could be that there's something specific about these files that causes the GATK MarkDuplicatesSpark option to output an empty file, or 2) GATK is fine and there is a bug introduced at the preceding steps that causes MarkDuplicatesSpark to output an empty bam.

To Reproduce Steps to reproduce the behavior:

  1. Clone align-DNA 8.0.0 release into an appropriately named folder where you wish to debug
  2. Modify the template.config with the specified iput.csv above and to match the template.config from my run. Alternatively you can just copy the template.config specified above and modify the output to somewhere where you want to debug.
  3. Use python submission script to launch with desired config from above.
  4. Wait 1-2 hr to finish with no errors, and look at the outputs.

Expected behavior Will get an empty bam file with no error-outs and with 0 reads.

Screenshots N/A

Additional context

Alfredo-Enrique commented 1 year ago

Additional Context: The complete sample has 5-6 pairs of fastqs. Per Taka's recommendation we tested with one pair (R1 and R2) and still the same issue. This suggests that the problem is not related to having multiple fastqs or running out of disk space. The current input.csv used for testing has two rows corresponding to two pairs. Feel free to remove the second row so that it goes faster when testing.

tyamaguchi-ucla commented 1 year ago

@Alfredo-Enrique have you guys run the command with --VALIDATION_STRINGENCY STRICT instead of LENIENT (our default) to see if it still works? (both markduplicatesSpark and picard SortSam).

https://github.com/uclahs-cds/pipeline-align-DNA/blob/acf77d90ed11d89bade0e3b910d86cd9f04830e2/module/mark_duplicates_spark.nf#L78

Alfredo-Enrique commented 1 year ago

@Alfredo-Enrique have you guys run the command with --VALIDATION_STRINGENCY STRICT instead of LENIENT (our default) to see if it still works? (both markduplicatesSpark and picard SortSam).

https://github.com/uclahs-cds/pipeline-align-DNA/blob/acf77d90ed11d89bade0e3b910d86cd9f04830e2/module/mark_duplicates_spark.nf#L78

Just ran it manually to test out on the upstream intermediates. Input for this test is a 7GB bam file made of two fastq pairs of this specific cohort and which has already been sorted by the pipeline and is the intermediate before SparkMarkDuplicates step.

GATK version used: 4.1.9.0 and command used: gatk --java-options "-Djava.io.tmpdir=/scratch" MarkDuplicatesSpark --input C19CUACXX.1.1-1.sorted.bam --output /scratch/C19CUACXX.1.1.sorted.Spark.Strigency-strict.bam --conf 'spark.local.dir=/scratch' --tmp-dir /scratch --read-validation-stringency STRICT

Output of command above: 47k size file with 0 reads per samtools view -c

image
tyamaguchi-ucla commented 1 year ago

Hmm, can you try two more things and point me to the logs and outputs of the NF runs?

  1. For v8.0.0, Use --VALIDATION_STRINGENCY STRICT and --verbosity DEBUG for both picard SortSam and markduplicatesSpark. Also, add --spark-verbosity DEBUG to markduplicatesSpark

  2. For the main branch, we recently replaced picard SortSam with samtools sort for faster processing. Use the main branch and see if markduplicatesSpark still doesn't produce any reads. Use --VALIDATION_STRINGENCY STRICT and --verbosity DEBUG for markduplicatesSpark. Also, add --spark-verbosity DEBUG to markduplicatesSpark

Alfredo-Enrique commented 1 year ago
  1. For v8.0.0, Use --VALIDATION_STRINGENCY STRICT and --verbosity DEBUG for both picard SortSam and markduplicatesSpark. Also, add --spark-verbosity DEBUG to markduplicatesSpark

Tried it; Result: No difference. Still empty bam.

Main Nextflow log: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-v8.0.0/pipeline/Debug-align-DNA_v8.0.0_validation_STRICT_verbosity_DEBUG_01.log

Outputs Folder: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-v8.0.0/pipeline/output

Process Logs: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-v8.0.0/pipeline/output/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/align-DNA-8.0.0/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/log-align-DNA-8.0.0-20220725T013312Z/process-log

Nextlfow log folders: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-v8.0.0/pipeline/output/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/align-DNA-8.0.0/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/log-align-DNA-8.0.0-20220725T013312Z/nextflow-log

Random detail, I mislabeled the sample name suffix in the config file regards to details about the modified parameters. I changed the name of the output folder to reduce confusion, but the logs might still have the incorrect name, so just ignore it.

Correct name: ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG Incorrect name to ignore: ICGC_BOCA_FR_SP195375_Spark_off_SortSAM_allmem_Markduplicates_allmem

samtools view -c of final bam

image

samtools view -c of intermediate picartds_tool_sorted.bam

image
  1. For the main branch, we recently replaced picard SortSam with samtools sort for faster processing. Use the main branch and see if markduplicatesSpark still doesn't produce any reads. Use --VALIDATION_STRINGENCY STRICT and --verbosity DEBUG for markduplicatesSpark. Also, add --spark-verbosity DEBUG to markduplicatesSpark

Tried it; Result: Also no difference. Still empty bam.

Main Nextflow log: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/Debug-align-DNA_acf77d9_validation_STRICT_verbosity_DEBUG_01.log

Outputs Folder: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/output

Process Logs: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/output/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/align-DNA-8.0.0/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/log-align-DNA-8.0.0-20220725T022148Z/process-log

Nextlfow log folders: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/output/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/align-DNA-8.0.0/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/log-align-DNA-8.0.0-20220725T022148Z/nextflow-log

samtools view - c main_output.bam

image

samtools view -c of the samtoolsorted_intermediate.bam

image
Alfredo-Enrique commented 1 year ago

My working hypothesis. Problem originates with FASTQ headers.

Example of a fastq from this cohort:

@HWI-ST700660_163:1:1101:1243:1870#1@0/1
NAGGCTTGAAAGAAAAGACTGTGCCCTTAATTGGGTTCTTGTTCTGCATTAATTAAGTTTATTGGGACTTTATTGCTCAAATGTTTTATCCCATACCTGGA
+
#1=DDFFFHHHHHJJJIJJJJJJJJJIJGIIIHJJCGGHJIIIGGGGHHIDGIIJEHFHJIIIEE?;FHHIEHH)=AACEEFFEFFECCAA>AACCC@ABB
@HWI-ST700660_163:1:1101:1225:1890#1@0/1
NTCAAAGCCCTTACAGGAGAAGCTGTTATCACCCCGTTTCCAGGGGGCTGGGAACCCTGGGATATGCCCAGATAGGGCTGGGGGGGTCCTCTGGAGTCCCA
+
#1:DDFFFHHHGHDIIJGIIGHGIJJJCIHHGIIJHHIIBFHIJJIJEIHGHCA=1=?AC?>AC@A:A=CA9:CCCDC95<BB##################
@HWI-ST700660_163:1:1101:1246:1943#1@0/1
CCTTCAGGTTATAGTGTCCCCAGCCCTGGCCCTCCCGCCTGGGTCTTTCTGAGGGTCTGCGCTAGGGCATCAGAATCTGGGCTTGTAAAATTCCTTAGGTG
+
CCCFFFFFHHHHHJIJJIJJJGHIIJGGHJJJJJJJIIJGGII?DFGHGIGFCCHCD@HIEHDEFBC@BCCA>>>;>:@C???B?A>((4>>>C>C#####
@HWI-ST700660_163:1:1101:1141:1992#1@0/1
AAGAGATGGCTTCTTGTGACTATAGCTTTCAAAGTAAAAGTAGTGGCCAAAGAAAACTCTTCTGGCTGTAAAAGTAAAAGTGGACTTTCCTGAAGCTGGGC
+

Compared to CPCG0196

@H801:109:D274GACXX:2:1101:1117:2064 1:N:0:
AAGANGAATATTCCGTTCTGCATTGGCAATGTCAAAGGAGATGATGCTTTAGTAAAAAGATTTCTCGATAAAGCTTTTGAAACTCAATATGTTGTCCTTGA
+
CCCF#2=DHHHHDBCAFFHHGIIIIIIIGIIIEIIIIG>@FGHIIIIIIIICFHHIIDIGE@<FGFGIIIDCGFEAHHHFFFFFEDEE@@>@DCCCDCCCC
@H801:109:D274GACXX:2:1101:1205:2070 1:N:0:
CAAANTCACAGGTTCTGGGGATTAGGCTGTAGACATTGTAGGGTAAGGGGACATTATTATTGTGCTTACCAGCAAGGATAGCCAACGGTTCCCAGCACATA
+
<??D#2=DAD8?C+A<DEEIFA@E<EEEE1CDDID@AE*?@<?9DD>ADD?@BC4=4..@=).)7@C@?)))6@D;?;A:@####################
@H801:109:D274GACXX:2:1101:1235:2080 1:N:0:
TAGANAGAGAGAAATACTAAGAGTAAAAACTTAATTAATGAAATGGAAATCAAATATACAACAAAGTTACTATTTTAAAAGACTGATATTCATAAATCTAA
+
BCCF#2ADDHHGHJJJJJJJIIJAHHIJIEHIHIIJIJIGEGHGIJIJJIEHIJIDGIHEGACGEFFGIIFGHIBCGHEHBEFFEFFFEDEDDDCEDCDC3
@H801:109:D274GACXX:2:1101:1194:2089 1:N:0:
GATTNATTTCACTACTGTAAAACAAGCTGGGCACAGTGACTCACGCCTAGAATCCCAGGACGTTGTGAGGCTGAGACTGGGGGAATGCTTGAGTGCAGAAC
+

Granted, they are different illumina fastq versions.

However, what stands out to me is the "@0" that's present towards the end of the fastq if you look at the first example. That seems to be absent in the example from wikipedia and there's nothing corresponding to it as this is betwen the index numbe and the number of the pair. Maybe whatever samtools function gatk uses to pull info from the bam header sees the "@0" and throws it out or considers them unmapped reads.

I tried going through the MarkDuplicatesSpark GATK code and samtools_library , but it's like 5-6 layers deep of different java scripts and functions called so couldn't figure it out.

Alfredo-Enrique commented 1 year ago

However, what stands out to me is the "@0" that's present towards the end of the fastq if you look at the first example. That seems to be absent in the example from wikipedia and there's nothing corresponding to it as this is betwen the index numbe and the number of the pair. Maybe whatever samtools function gatk uses to pull info from the bam header sees the "@0" and throws it out or considers them unmapped reads.

Tried modifiying an input fastq by removing the unknown characters; Didn't work despite going through workflow to completion and outputting a bam file.

Details: I used the latest align-DNA main commit from the above examples and which has been modified to run with verbosity commands "DEBUG" and other flags as previously suggested.

I extracted a single fastq read (including Quality Scores) and the corresponding pair from the read1 and read2 files. Example here:

@HWI-ST700660_163:1:1101:1243:1870#1@0/1
NAGGCTTGAAAGAAAAGACTGTGCCCTTAATTGGGTTCTTGTTCTGCATTAATTAAGTTTATTGGGACTTTATTGCTCAAATGTTTTATCCCATACCTGGA
+
#1=DDFFFHHHHHJJJIJJJJJJJJJIJGIIIHJJCGGHJIIIGGGGHHIDGIIJEHFHJIIIEE?;FHHIEHH)=AACEEFFEFFECCAA>AACCC@ABB

Then I modified the "1@" as below and made an input csv through run through the pipeline.

@HWI-ST700660_163:1:1101:1243:1870#0/1
NAGGCTTGAAAGAAAAGACTGTGCCCTTAATTGGGTTCTTGTTCTGCATTAATTAAGTTTATTGGGACTTTATTGCTCAAATGTTTTATCCCATACCTGGA
+
#1=DDFFFHHHHHJJJIJJJJJJJJJIJGIIIHJJCGGHJIIIGGGGHHIDGIIJEHFHJIIIEE?;FHHIEHH)=AACEEFFEFFECCAA>AACCC@ABB

Output: Intermediate files appropiately have 2 reads as seen at the top of the image. Output files has 0.

image

To speed up testing: Here's an input.csv with the single pair unmodified raw: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/input/align-DNA.input_SP195275_single_paired_read.csv

jarbet commented 1 year ago

To speed up testing: Here's an input.csv with the single pair unmodified raw: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/input/align-DNA.input_SP195275_single_paired_read.csv

Thanks @Alfredo-Enrique. I am testing with this sample now and I'm also getting a BAM with 0 aligned reads when using mark duplicates spark. I'm working through Taka's suggestions above and will let you know if I can get any aligned reads.

yashpatel6 commented 1 year ago

Do we have any other samples/dataset that follow the header format for the failing samples? If so, we can look into testing those to confirm whether it's a header-related issue. If we don't, we could try converting the headers for the failing samples to the format that CPCG0196 uses and use that as a test to confirm whether the header is the problem.

Alfredo-Enrique commented 1 year ago

It's the header. FOR SURE. Good problem solving gang. Shoutout to @tyamaguchi-ucla for pointing me to this: https://github.com/uclahs-cds/tool-register-dataset/issues/45

Tested with the single paired read from above, and just modified the header to be like ones from SRA file in the issue 45, and changed the lenght. It worked and went through to completion with Spark On

One of the pairs of modified fastq. Same from example in issue, just header changed:

@SRR5456220.1 1 length=101
NAGGCTTGAAAGAAAAGACTGTGCCCTTAATTGGGTTCTTGTTCTGCATTAATTAAGTTTATTGGGACTTTATTGCTCAAATGTTTTATCCCATACCTGGA
+SRR5456220.1 1 length=101
#1=DDFFFHHHHHJJJIJJJJJJJJJIJGIIIHJJCGGHJIIIGGGGHHIDGIIJEHFHJIIIEE?;FHHIEHH)=AACEEFFEFFECCAA>AACCC@ABB

Samtools view -c output shows the 2 reads:

image
yashpatel6 commented 1 year ago

Do we have a plan for handling these types of FASTQs? Like some sort of processing step to convert headers? @tyamaguchi-ucla

Alfredo-Enrique commented 1 year ago

Additional update: Taka suggested I tried "samtools markdup". I tried it on the intermediate. Unlike GATK MarkDuplicatesSpark this does work and does give an output. Had to follow a different worfklow of fix mate pairs > resort based on position > samtools mark dup but it ended up working.

Ran it on the sorted intermediates. This is a fastq pair of interest without the modified headers.

image
Alfredo-Enrique commented 1 year ago

Do we have a plan for handling these types of FASTQs? Like some sort of processing step to convert headers?

Maybe we just add an additional module for a "samtools markdup" specific workflow as specified by the user? That way we don't have to bother with scrubbing fastqs.

Something to note though, It is very well possible though that we might run into the same out of memory error as when using picard mark duplicates #221 (remember same cohort) . I saw a random comment (I think here) that samtools markdup is modeled after picard tools. If that's true, it's quite possible we see the same memory issue when running the whole sample.

tyamaguchi-ucla commented 1 year ago

Do we have a plan for handling these types of FASTQs? Like some sort of processing step to convert headers?

I have a few suggestions. @Alfredo-Enrique

Maybe we just add an additional module for a "samtools markdup" specific workflow as specified by the user? That way we don't have to bother with scrubbing fastqs.

We could look into it but it'll take some time to implement it and samtools might have some edge cases.

Something to note though, It is very well possible though that we might run into the same out of memory error as when using picard mark duplicates #221 (remember same cohort) . I saw a random comment (I think here) that samtools markdup is modeled after picard tools. If that's true, it's quite possible we see the same memory issue when running the whole sample.

@Alfredo-Enrique what's the total size of the FASTQs? Maybe around 70GB but we haven't confirmed. CPCG0196-F1 is about 400GB in total but it went through the pipeline.

madisonjordan commented 1 year ago

Related GATK community post

Answer (2 years ago) from GATK Team:

You are correct, spaces and @ symbols are not allowed in read names with GATK. FASTQ format is more lenient than SAM/BAM format so following FASTQ format requirements does not guarantee your read names will work with GATK.

Could you print out an example read name from your SAM file to determine what name is being carried from your FASTQ file? Here is more information about what GATK expects in a SAM File.

You have a couple options to fix this issue. You could edit the read names in your FASTQ so that they will be in the expected SAM/BAM read name format which would be helpful for downstream processing. We don't have a GATK tool that will easily do this so you will need to look into outside solutions. Or, you can use the READ_NAME_REGEX option to specify to MarkDuplicates how to view optical duplicates in your file. This option may be more challenging to get working but it is available.

madisonjordan commented 1 year ago

Link to created GATK issue https://github.com/broadinstitute/gatk/issues/8134

madisonjordan commented 1 year ago

@tyamaguchi-ucla @jarbet There's been a response from GATK about this issue which I don't know enough to respond to.

Another thing to check. You say "FASTQ headers contain another @". The input to SparkDuplicates is the bam though. Can you check that that line actually made it into the bam? I.e. is there a read with the readname "HWI-ST700660_163:1:1101:1243:1870#1@0/1"?

they also made a suggestion for using the --READ_NAME_REGEX option

By default it assumes readnames are structured like this: "MYNAME:1:22:1193:123181". There is an option --READ_NAME_REGEX to provide a custom regex to parse with which might help in your case. You might also try setting that to null to disable optical duplicate marking.

tyamaguchi-ucla commented 1 year ago

Another thing to check. You say "FASTQ headers contain another @". The input to SparkDuplicates is the bam though. Can you check that that line actually made it into the bam? I.e. is there a read with the readname "HWI-ST700660_163:1:1101:1243:1870#1@0/1"?

Really quick response! Here's how to check the BAMs. We still have the intermediate BAMs aligned using BWA-MEM2 /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/test/output_spark_debug/align-DNA-8.0.0/SP195375/BWA-MEM2-2.2.1/intermediate/align-DNA-BWA-MEM2

On one of the F2 nodes, load SAMtools by tying module load samtools and then type samtools view C16MNACXX.1.1-1.bam | less (or head) (make sure to use the piping to avoid printing out the whole BAM)

https://samtools.github.io/hts-specs/SAMv1.pdf (if you want to know more about SAM format)

It looks like the /1 was dropped from the readnames like HWI-ST700660_163:1:1101:1225:1890#1@0 but otherwise, the answer seems yes. It looks like the input FASTQs were moved somewhere and I couldn't find the original FASTQs. @Alfredo-Enrique can you point us to the original input FASTQs?

By default it assumes readnames are structured like this: "MYNAME:1:22:1193:123181". There is an option --READ_NAME_REGEX to provide a custom regex to parse with which might help in your case. You might also try setting that to null to disable optical duplicate marking.

I haven't looked into this option before but we can try that. @Alfredo-Enrique do you want to try this out when you get a chance?

tyamaguchi-ucla commented 1 year ago

It looks like the /1 was dropped from the readnames like HWI-ST700660_163:1:1101:1225:1890#1@0 but otherwise, the answer seems yes.

I think this is actually fine because each line contains paired end information in the BAM. (both forward /1 and reverse /2 reads)

madisonjordan commented 1 year ago

Thanks!

So the original input was a FASTQ which was converted (with samtools?) into a bam that went into their MarkDuplicatesSpark tool, right? I will add that information along with the new header if that's the case since it seems important for reproducibility from their end.

tyamaguchi-ucla commented 1 year ago

So the original input was a FASTQ which was converted (with samtools?) into a bam that went into their MarkDuplicatesSpark tool, right?

We use either BWA-MEM2 or HISAT2 (aligner) to align reads against the reference genome. See https://github.com/uclahs-cds/pipeline-align-DNA

tyamaguchi-ucla commented 1 year ago

You can check what tools were used in the BAM header(@PG)

samtools view -H C16MNACXX.1.1-1.bam | grep PG

@PG     ID:bwa-mem2     PN:bwa-mem2     VN:2.2.1        CL:bwa-mem2 mem -t 56 -M -R @RG\tID:C16MNACXX.1.Seq1\tCN:HWI-ST700660_163\tLB:C16MNACXX.1.1\tPL:ILLUMINA\tPU:HWI-ST700660_163\tSM:SP195375 genome.fa d2ca6c9ab30faaaf8e3dbf3b7262a738.C16MNACXX_1_1_1.fastq.gz bdd9006c66e8655790451e89050f0004.C16MNACXX_1_1_2.fastq.gz
@PG     ID:samtools     PN:samtools     PP:bwa-mem2     VN:1.12 CL:samtools view -@ 56 -S -b
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.15.1       CL:samtools view -H C16MNACXX.1.1-1.bam
madisonjordan commented 1 year ago

I updated the information for the issue on GATK and said they will look into a possible fix.

tyamaguchi-ucla commented 1 year ago

Thank you, @madisonjordan! Very promising response!