Open Alfredo-Enrique opened 2 years ago
Additional Context: The complete sample has 5-6 pairs of fastqs. Per Taka's recommendation we tested with one pair (R1 and R2) and still the same issue. This suggests that the problem is not related to having multiple fastqs or running out of disk space. The current input.csv used for testing has two rows corresponding to two pairs. Feel free to remove the second row so that it goes faster when testing.
@Alfredo-Enrique have you guys run the command with `--VALIDATION_STRINGENCY STRICT` instead of `LENIENT` (our default) to see if it still works? (both `markduplicatesSpark` and `picard SortSam`).
Just ran it manually to test on the upstream intermediates. Input for this test is a 7GB BAM file made from two FASTQ pairs of this specific cohort, which has already been sorted by the pipeline and is the intermediate before the MarkDuplicatesSpark step.
GATK version used: 4.1.9.0
and command used:
gatk --java-options "-Djava.io.tmpdir=/scratch" MarkDuplicatesSpark --input C19CUACXX.1.1-1.sorted.bam --output /scratch/C19CUACXX.1.1.sorted.Spark.Strigency-strict.bam --conf 'spark.local.dir=/scratch' --tmp-dir /scratch --read-validation-stringency STRICT
Output of the command above: a 47K file with 0 reads per `samtools view -c`.
Hmm, can you try two more things and point me to the logs and outputs of the NF runs?
- For v8.0.0, use `--VALIDATION_STRINGENCY STRICT` and `--verbosity DEBUG` for both `picard SortSam` and `markduplicatesSpark`. Also, add `--spark-verbosity DEBUG` to `markduplicatesSpark`.
- For the main branch, we recently replaced `picard SortSam` with `samtools sort` for faster processing. Use the main branch and see if `markduplicatesSpark` still doesn't produce any reads. Use `--VALIDATION_STRINGENCY STRICT` and `--verbosity DEBUG` for `markduplicatesSpark`. Also, add `--spark-verbosity DEBUG` to `markduplicatesSpark`.
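Combining the flags above with the manual command shown earlier in the thread, the debug invocation might look roughly like this (a sketch; the input/output filenames are placeholders, and the flag names follow GATK 4.1.9.0 conventions used elsewhere in this issue):

```shell
# Sketch of the suggested debug run; filenames are placeholders.
cat > markdup_debug.sh <<'EOF'
#!/bin/bash
set -euo pipefail
gatk --java-options "-Djava.io.tmpdir=/scratch" MarkDuplicatesSpark \
    --input sample.sorted.bam \
    --output sample.sorted.spark.debug.bam \
    --conf 'spark.local.dir=/scratch' \
    --tmp-dir /scratch \
    --read-validation-stringency STRICT \
    --verbosity DEBUG \
    --spark-verbosity DEBUG
EOF
bash -n markdup_debug.sh && echo "syntax OK"
```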
> For v8.0.0, use `--VALIDATION_STRINGENCY STRICT` and `--verbosity DEBUG` for both `picard SortSam` and `markduplicatesSpark`. Also, add `--spark-verbosity DEBUG` to `markduplicatesSpark`.
Tried it. Result: no difference; still an empty BAM.
Main Nextflow log: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-v8.0.0/pipeline/Debug-align-DNA_v8.0.0_validation_STRICT_verbosity_DEBUG_01.log
Outputs Folder: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-v8.0.0/pipeline/output
Process Logs: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-v8.0.0/pipeline/output/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/align-DNA-8.0.0/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/log-align-DNA-8.0.0-20220725T013312Z/process-log
Nextflow log folders: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-v8.0.0/pipeline/output/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/align-DNA-8.0.0/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/log-align-DNA-8.0.0-20220725T013312Z/nextflow-log
Random detail: I mislabeled the sample name suffix in the config file with regard to the modified parameters. I changed the name of the output folder to reduce confusion, but the logs might still have the incorrect name, so just ignore it.
Correct name: ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG
Incorrect name to ignore: ICGC_BOCA_FR_SP195375_Spark_off_SortSAM_allmem_Markduplicates_allmem
`samtools view -c` of the final BAM:
`samtools view -c` of the intermediate picard-tools-sorted BAM:
> For the main branch, we recently replaced `picard SortSam` with `samtools sort` for faster processing. Use the main branch and see if `markduplicatesSpark` still doesn't produce any reads. Use `--VALIDATION_STRINGENCY STRICT` and `--verbosity DEBUG` for `markduplicatesSpark`. Also, add `--spark-verbosity DEBUG` to `markduplicatesSpark`.
Tried it. Result: also no difference; still an empty BAM.
Main Nextflow log: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/Debug-align-DNA_acf77d9_validation_STRICT_verbosity_DEBUG_01.log
Outputs Folder: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/output
Process Logs: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/output/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/align-DNA-8.0.0/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/log-align-DNA-8.0.0-20220725T022148Z/process-log
Nextflow log folders: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/output/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/align-DNA-8.0.0/ICGC_BOCA_FR_SP195375_Spark_ON_validation_STRICT_verbosity_DEBUG/log-align-DNA-8.0.0-20220725T022148Z/nextflow-log
`samtools view -c` of main_output.bam:
`samtools view -c` of the samtools-sorted intermediate BAM:
My working hypothesis: the problem originates with the FASTQ headers.
Example of a fastq from this cohort:
@HWI-ST700660_163:1:1101:1243:1870#1@0/1
NAGGCTTGAAAGAAAAGACTGTGCCCTTAATTGGGTTCTTGTTCTGCATTAATTAAGTTTATTGGGACTTTATTGCTCAAATGTTTTATCCCATACCTGGA
+
#1=DDFFFHHHHHJJJIJJJJJJJJJIJGIIIHJJCGGHJIIIGGGGHHIDGIIJEHFHJIIIEE?;FHHIEHH)=AACEEFFEFFECCAA>AACCC@ABB
@HWI-ST700660_163:1:1101:1225:1890#1@0/1
NTCAAAGCCCTTACAGGAGAAGCTGTTATCACCCCGTTTCCAGGGGGCTGGGAACCCTGGGATATGCCCAGATAGGGCTGGGGGGGTCCTCTGGAGTCCCA
+
#1:DDFFFHHHGHDIIJGIIGHGIJJJCIHHGIIJHHIIBFHIJJIJEIHGHCA=1=?AC?>AC@A:A=CA9:CCCDC95<BB##################
@HWI-ST700660_163:1:1101:1246:1943#1@0/1
CCTTCAGGTTATAGTGTCCCCAGCCCTGGCCCTCCCGCCTGGGTCTTTCTGAGGGTCTGCGCTAGGGCATCAGAATCTGGGCTTGTAAAATTCCTTAGGTG
+
CCCFFFFFHHHHHJIJJIJJJGHIIJGGHJJJJJJJIIJGGII?DFGHGIGFCCHCD@HIEHDEFBC@BCCA>>>;>:@C???B?A>((4>>>C>C#####
@HWI-ST700660_163:1:1101:1141:1992#1@0/1
AAGAGATGGCTTCTTGTGACTATAGCTTTCAAAGTAAAAGTAGTGGCCAAAGAAAACTCTTCTGGCTGTAAAAGTAAAAGTGGACTTTCCTGAAGCTGGGC
+
Compare to CPCG0196:
@H801:109:D274GACXX:2:1101:1117:2064 1:N:0:
AAGANGAATATTCCGTTCTGCATTGGCAATGTCAAAGGAGATGATGCTTTAGTAAAAAGATTTCTCGATAAAGCTTTTGAAACTCAATATGTTGTCCTTGA
+
CCCF#2=DHHHHDBCAFFHHGIIIIIIIGIIIEIIIIG>@FGHIIIIIIIICFHHIIDIGE@<FGFGIIIDCGFEAHHHFFFFFEDEE@@>@DCCCDCCCC
@H801:109:D274GACXX:2:1101:1205:2070 1:N:0:
CAAANTCACAGGTTCTGGGGATTAGGCTGTAGACATTGTAGGGTAAGGGGACATTATTATTGTGCTTACCAGCAAGGATAGCCAACGGTTCCCAGCACATA
+
<??D#2=DAD8?C+A<DEEIFA@E<EEEE1CDDID@AE*?@<?9DD>ADD?@BC4=4..@=).)7@C@?)))6@D;?;A:@####################
@H801:109:D274GACXX:2:1101:1235:2080 1:N:0:
TAGANAGAGAGAAATACTAAGAGTAAAAACTTAATTAATGAAATGGAAATCAAATATACAACAAAGTTACTATTTTAAAAGACTGATATTCATAAATCTAA
+
BCCF#2ADDHHGHJJJJJJJIIJAHHIJIEHIHIIJIJIGEGHGIJIJJIEHIJIDGIHEGACGEFFGIIFGHIBCGHEHBEFFEFFFEDEDDDCEDCDC3
@H801:109:D274GACXX:2:1101:1194:2089 1:N:0:
GATTNATTTCACTACTGTAAAACAAGCTGGGCACAGTGACTCACGCCTAGAATCCCAGGACGTTGTGAGGCTGAGACTGGGGGAATGCTTGAGTGCAGAAC
+
Granted, they are different illumina fastq versions.
However, what stands out to me is the "@0" that's present toward the end of the header if you look at the first example. It seems to be absent in the example from Wikipedia, and there's nothing corresponding to it, as it sits between the index number and the pair number. Maybe whatever samtools function GATK uses to pull info from the BAM header sees the "@0" and throws those reads out, or considers them unmapped reads.
I tried going through the MarkDuplicatesSpark GATK code and the samtools library, but it's like 5-6 layers deep of different Java classes and function calls, so I couldn't figure it out.
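One quick way to probe this hypothesis is to scan the FASTQ header lines for a second `@` after the leading one, since SAM/BAM read names are stricter than FASTQ about such characters. A sketch (`sample.fastq` is a stand-in for the real cohort files):

```shell
# Build a tiny stand-in FASTQ with the suspect header format.
cat > sample.fastq <<'EOF'
@HWI-ST700660_163:1:1101:1243:1870#1@0/1
NAGGCTTG
+
#1=DDFFF
EOF
# Print header lines (every 4th line, starting at line 1) that contain
# a second '@' anywhere after the leading one.
awk 'NR % 4 == 1 && index(substr($0, 2), "@") > 0' sample.fastq
# prints: @HWI-ST700660_163:1:1101:1243:1870#1@0/1
```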
Tried modifying an input FASTQ by removing the unknown characters; it didn't work, despite the workflow running to completion and outputting a BAM file.
Details: I used the latest align-DNA main commit from the examples above, modified to run with the DEBUG verbosity flags and the other flags previously suggested.
I extracted a single FASTQ read (including quality scores) and the corresponding pair from the read1 and read2 files. Example here:
@HWI-ST700660_163:1:1101:1243:1870#1@0/1
NAGGCTTGAAAGAAAAGACTGTGCCCTTAATTGGGTTCTTGTTCTGCATTAATTAAGTTTATTGGGACTTTATTGCTCAAATGTTTTATCCCATACCTGGA
+
#1=DDFFFHHHHHJJJIJJJJJJJJJIJGIIIHJJCGGHJIIIGGGGHHIDGIIJEHFHJIIIEE?;FHHIEHH)=AACEEFFEFFECCAA>AACCC@ABB
Then I modified the "1@" as below and made an input CSV to run through the pipeline.
@HWI-ST700660_163:1:1101:1243:1870#0/1
NAGGCTTGAAAGAAAAGACTGTGCCCTTAATTGGGTTCTTGTTCTGCATTAATTAAGTTTATTGGGACTTTATTGCTCAAATGTTTTATCCCATACCTGGA
+
#1=DDFFFHHHHHJJJIJJJJJJJJJIJGIIIHJJCGGHJIIIGGGGHHIDGIIJEHFHJIIIEE?;FHHIEHH)=AACEEFFEFFECCAA>AACCC@ABB
Output: the intermediate files appropriately have 2 reads, as seen at the top of the image. The output file has 0.
To speed up testing: Here's an input.csv with the single pair unmodified raw: /hot/user/alfgonzalez/pipeline/pipeline-align-DNA/pipeline-align-DNA-220722-acf77d9/input/align-DNA.input_SP195275_single_paired_read.csv
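The header edit described above can be scripted rather than done by hand. A sketch that drops the `1@` so `#1@0/1` becomes `#0/1`, touching only header lines (assumes uncompressed single-end `/1` reads; the filename is a placeholder):

```shell
# Stand-in single-read FASTQ with the original header.
cat > read.fastq <<'EOF'
@HWI-ST700660_163:1:1101:1243:1870#1@0/1
NAGGCTTG
+
#1=DDFFF
EOF
# Rewrite only header lines (every 4th line), leaving sequence and
# quality lines untouched.
awk 'NR % 4 == 1 { sub(/#1@0\/1$/, "#0/1") } { print }' read.fastq > read.modified.fastq
head -1 read.modified.fastq
# @HWI-ST700660_163:1:1101:1243:1870#0/1
```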
Thanks @Alfredo-Enrique. I am testing with this sample now and I'm also getting a BAM with 0 aligned reads when using mark duplicates spark. I'm working through Taka's suggestions above and will let you know if I can get any aligned reads.
Do we have any other samples/dataset that follow the header format for the failing samples? If so, we can look into testing those to confirm whether it's a header-related issue. If we don't, we could try converting the headers for the failing samples to the format that CPCG0196 uses and use that as a test to confirm whether the header is the problem.
It's the header. FOR SURE. Good problem solving gang. Shoutout to @tyamaguchi-ucla for pointing me to this: https://github.com/uclahs-cds/tool-register-dataset/issues/45
Tested with the single paired read from above; just modified the header to be like the ones from the SRA file in issue 45, and changed the length. It worked and ran through to completion with Spark on.
One of the modified FASTQ pairs. Same as the example in the issue, just with the header changed:
@SRR5456220.1 1 length=101
NAGGCTTGAAAGAAAAGACTGTGCCCTTAATTGGGTTCTTGTTCTGCATTAATTAAGTTTATTGGGACTTTATTGCTCAAATGTTTTATCCCATACCTGGA
+SRR5456220.1 1 length=101
#1=DDFFFHHHHHJJJIJJJJJJJJJIJGIIIHJJCGGHJIIIGGGGHHIDGIIJEHFHJIIIEE?;FHHIEHH)=AACEEFFEFFECCAA>AACCC@ABB
`samtools view -c` output shows the 2 reads:
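The SRA-style rename tested above could be scripted along these lines (a sketch; the accession `SRR5456220` is just the example value from issue 45, and the short sequence stands in for the real 101-base reads):

```shell
# Stand-in FASTQ with the problematic header.
cat > read1.fastq <<'EOF'
@HWI-ST700660_163:1:1101:1243:1870#1@0/1
NAGGCTTGAAAG
+
#1=DDFFFHHHH
EOF
# Buffer each 4-line record, then emit SRA-style headers:
#   @SRR5456220.N N length=L   (and the matching '+' line)
awk '{
  rec[NR % 4] = $0
  if (NR % 4 == 0) {
    n++
    len = length(rec[2])
    printf "@SRR5456220.%d %d length=%d\n%s\n+SRR5456220.%d %d length=%d\n%s\n",
           n, n, len, rec[2], n, n, len, rec[0]
  }
}' read1.fastq > read1.sra.fastq
head -1 read1.sra.fastq
# @SRR5456220.1 1 length=12
```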
Do we have a plan for handling these types of FASTQs? Like some sort of processing step to convert headers? @tyamaguchi-ucla
Additional update: Taka suggested I try `samtools markdup`. I tried it on the intermediate. Unlike GATK MarkDuplicatesSpark, this does work and does give an output. Had to follow a different workflow (fix mate pairs > re-sort by position > `samtools markdup`), but it ended up working.
Ran it on the sorted intermediates. This is a FASTQ pair of interest without the modified headers.
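The samtools workflow described above would look roughly like this (a sketch; filenames are placeholders, and `-m` adds the mate-score tags that `samtools markdup` relies on):

```shell
cat > samtools_markdup.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# fixmate needs name-sorted input and -m to add mate-score tags;
# markdup then needs the result coordinate-sorted again.
samtools sort -n -o namesorted.bam input.sorted.bam
samtools fixmate -m namesorted.bam fixmate.bam
samtools sort -o positionsorted.bam fixmate.bam
samtools markdup positionsorted.bam markdup.bam
samtools view -c markdup.bam
EOF
bash -n samtools_markdup.sh && echo "syntax OK"
```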
Maybe we just add an additional module for a `samtools markdup`-specific workflow, as specified by the user? That way we don't have to bother with scrubbing FASTQs.
Something to note, though: it's very possible that we run into the same out-of-memory error as when using Picard MarkDuplicates (#221; remember, same cohort). I saw a comment somewhere (I think here) that `samtools markdup` is modeled after Picard tools. If that's true, it's quite possible we see the same memory issue when running the whole sample.
> Do we have a plan for handling these types of FASTQs? Like some sort of processing step to convert headers?
I have a few suggestions. @Alfredo-Enrique
> Maybe we just add an additional module for a `samtools markdup`-specific workflow as specified by the user? That way we don't have to bother with scrubbing FASTQs.
We could look into it but it'll take some time to implement it and samtools might have some edge cases.
> Something to note, though: it's very possible that we run into the same out-of-memory error as when using Picard MarkDuplicates (#221; remember, same cohort). I saw a comment somewhere (I think here) that `samtools markdup` is modeled after Picard tools. If that's true, it's quite possible we see the same memory issue when running the whole sample.
@Alfredo-Enrique what's the total size of the FASTQs? Maybe around 70GB but we haven't confirmed. CPCG0196-F1 is about 400GB in total but it went through the pipeline.
Answer (2 years ago) from GATK Team:
You are correct, spaces and @ symbols are not allowed in read names with GATK. FASTQ format is more lenient than SAM/BAM format so following FASTQ format requirements does not guarantee your read names will work with GATK.
Could you print out an example read name from your SAM file to determine what name is being carried from your FASTQ file? Here is more information about what GATK expects in a SAM File.
You have a couple options to fix this issue. You could edit the read names in your FASTQ so that they will be in the expected SAM/BAM read name format which would be helpful for downstream processing. We don't have a GATK tool that will easily do this so you will need to look into outside solutions. Or, you can use the READ_NAME_REGEX option to specify to MarkDuplicates how to view optical duplicates in your file. This option may be more challenging to get working but it is available.
Link to created GATK issue https://github.com/broadinstitute/gatk/issues/8134
@tyamaguchi-ucla @jarbet There's been a response from GATK about this issue which I don't know enough to respond to.
Another thing to check. You say "FASTQ headers contain another @". The input to MarkDuplicatesSpark is the BAM, though. Can you check that that line actually made it into the BAM? I.e., is there a read with the read name `HWI-ST700660_163:1:1101:1243:1870#1@0/1`?
They also made a suggestion for using the `--READ_NAME_REGEX` option:
By default it assumes read names are structured like this: `MYNAME:1:22:1193:123181`. There is an option `--READ_NAME_REGEX` to provide a custom regex to parse with, which might help in your case. You might also try setting that to null to disable optical duplicate marking.
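Trying that suggestion might look like this (a sketch; filenames are placeholders, and for MarkDuplicatesSpark the flag is spelled in GATK4's kebab-case as `--read-name-regex` — worth double-checking against `gatk MarkDuplicatesSpark --help` before running):

```shell
cat > markdup_regex.sh <<'EOF'
#!/bin/bash
set -euo pipefail
# Setting the regex to null disables optical-duplicate detection,
# which sidesteps read-name parsing entirely.
gatk MarkDuplicatesSpark \
    --input sample.sorted.bam \
    --output sample.markdup.bam \
    --read-name-regex null
EOF
bash -n markdup_regex.sh && echo "syntax OK"
```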
> Another thing to check. You say "FASTQ headers contain another @". The input to MarkDuplicatesSpark is the BAM, though. Can you check that that line actually made it into the BAM? I.e., is there a read with the read name `HWI-ST700660_163:1:1101:1243:1870#1@0/1`?
Really quick response! Here's how to check the BAMs. We still have the intermediate BAMs aligned using BWA-MEM2 /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/test/output_spark_debug/align-DNA-8.0.0/SP195375/BWA-MEM2-2.2.1/intermediate/align-DNA-BWA-MEM2
On one of the F2 nodes, load SAMtools by typing `module load samtools` and then type `samtools view C16MNACXX.1.1-1.bam | less` (or `head`). (Make sure to use the pipe to avoid printing out the whole BAM.)
https://samtools.github.io/hts-specs/SAMv1.pdf (if you want to know more about SAM format)
It looks like the `/1` was dropped from the read names like `HWI-ST700660_163:1:1101:1225:1890#1@0`, but otherwise, the answer seems yes. It looks like the input FASTQs were moved somewhere and I couldn't find the originals. @Alfredo-Enrique can you point us to the original input FASTQs?
> By default it assumes read names are structured like this: `MYNAME:1:22:1193:123181`. There is an option `--READ_NAME_REGEX` to provide a custom regex to parse with, which might help in your case. You might also try setting that to null to disable optical duplicate marking.
I haven't looked into this option before but we can try that. @Alfredo-Enrique do you want to try this out when you get a chance?
> It looks like the `/1` was dropped from the read names like `HWI-ST700660_163:1:1101:1225:1890#1@0`, but otherwise, the answer seems yes.

I think this is actually fine because each line contains paired-end information in the BAM (both forward `/1` and reverse `/2` reads).
Thanks!
So the original input was a FASTQ which was converted (with samtools?) into a bam that went into their MarkDuplicatesSpark tool, right? I will add that information along with the new header if that's the case since it seems important for reproducibility from their end.
> So the original input was a FASTQ which was converted (with samtools?) into a BAM that went into their MarkDuplicatesSpark tool, right?
We use either BWA-MEM2 or HISAT2 (aligner) to align reads against the reference genome. See https://github.com/uclahs-cds/pipeline-align-DNA
You can check what tools were used in the BAM header (`@PG`):
samtools view -H C16MNACXX.1.1-1.bam | grep PG
@PG ID:bwa-mem2 PN:bwa-mem2 VN:2.2.1 CL:bwa-mem2 mem -t 56 -M -R @RG\tID:C16MNACXX.1.Seq1\tCN:HWI-ST700660_163\tLB:C16MNACXX.1.1\tPL:ILLUMINA\tPU:HWI-ST700660_163\tSM:SP195375 genome.fa d2ca6c9ab30faaaf8e3dbf3b7262a738.C16MNACXX_1_1_1.fastq.gz bdd9006c66e8655790451e89050f0004.C16MNACXX_1_1_2.fastq.gz
@PG ID:samtools PN:samtools PP:bwa-mem2 VN:1.12 CL:samtools view -@ 56 -S -b
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.15.1 CL:samtools view -H C16MNACXX.1.1-1.bam
I updated the information on the GATK issue and they said they will look into a possible fix.
Thank you, @madisonjordan! Very promising response!
Describe the bug
I have an ICGC cohort of WGS FASTQs. When run through the align-DNA pipeline, they output an empty BAM file with no reads. The pipeline runs to completion without any error or issue. When looking at the upstream intermediates, the reads are there and the BAM files are of the corresponding size, all the way to the MarkDuplicatesSpark step. Here, the output is an empty (47K) BAM file with decoys but 0 reads. This was confirmed by `samtools view -c`.
Did some testing. Took the output from the step immediately before MarkDuplicatesSpark, which is sorting, and manually ran MarkDuplicatesSpark with reduced flags; it ran and launched, but it also output an empty BAM file without any error flags. Tried Docker images from various GATK releases; same result, an empty BAM file.
So, two possibilities: 1) there's something specific about these files that causes GATK MarkDuplicatesSpark to output an empty file, or 2) GATK is fine and a bug introduced in the preceding steps causes MarkDuplicatesSpark to output an empty BAM.
python /hot/user/alfgonzalez/utility/tool-submit-nf/submit_nextflow_pipeline.py --nextflow_script /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/align-DNA.nf --nextflow_config /hot/user/alfgonzalez/pipeline/metapipeline-DNA/main/metapipeline-DNA/external/pipeline-align-DNA/pipeline/config/template.config --partition_type F72 --pipeline_run_name "TEST-align-DNA_V3.0_Metapipeline_8_Enable_true_bug_replication" --email alfgonzalez@mednet.ucla.edu
To Reproduce
Steps to reproduce the behavior: run the submission command above.
Expected behavior
You will get an empty BAM file with 0 reads and no error-outs.
Screenshots N/A
Additional context