Trouble detecting ITDs larger than 33bp

tungers commented 4 years ago

Hi,

I have been testing getitd and it has been working well at detecting FLT3 ITDs in our clinical hybrid capture NGS data. However, when I test getitd in more detail with in silico generated data to evaluate the sensitivity and size limits, it seems to be missing a good proportion of the ITDs compared to other programs like Pindel.

At first I thought it was because the ITDs I generated were not necessarily in-frame, so they were getting filtered out by getitd. However, this does not seem to be the case as it is able to detect in-frame insertions for a few shorter ITDs but nothing above 24bp.

I attached two cases below where one has a 24bp ITD and the other has a 33bp ITD (both at 0.9 VAF, starting at chr13:28608231). Getitd was able to detect the 24bp ITD but not the 33bp ITD with default settings. Both ITDs can be clearly seen in the IGV pileup (I attached the respective BAM files and getitd output files too). Do you have any insight as to why getitd is missing ITDs larger than 33bp?

Thanks in advance, Jack

tjblaette commented 4 years ago

Hi Jack,

this is interesting. I'll try to have a look later this week and get back to you as soon as I have. getITD can definitely detect larger ITDs as well, so something must be amiss, but I'm sure we'll figure it out.

Stay safe, Tamara

tungers commented 4 years ago

Hi Tamara,

Sounds great! That would be very much appreciated.

Best regards, Jack

ydy1127 commented 4 years ago

Hi Jack,

I have a problem similar to yours. getITD has been shown to be effective in the detection of clinical hybridization captured data. However, when I applied it to the simulated data, I did not identify any ITDs.

I have been thinking about the problem with the simulation data. Would it be convenient for you to communicate with me about the generation method of simulated data? I would appreciate it.

Thank you very much!

Best wishes Danyang Yuan

tungers commented 4 years ago

Hi Jack,

I have a problem similar to yours. getITD has been shown to be effective in the detection of clinical hybridization captured data. However, when I applied it to the simulated data, I did not identify any ITDs.

I have been thinking about the problem with the simulation data. Would it be convenient for you to communicate with me about the generation method of simulated data? I would appreciate it.

Thank you very much!

Best wishes Danyang Yuan

Hi Danyang,

That's interesting. I generated my data by adding different size FLT3 ITDs to hybrid capture data using a program called insilico Mutator (https://www.sciencedirect.com/science/article/pii/S1525157818301065). It seems to work as described and I can see the ITDs both visually in the IGV pileups and with other programs like Pindel, so I am also wondering why getITD is having some trouble with this data. getITD was able to detect larger ITDs in some of the clinical samples I tested with no problem.

How are you generating your in silico data?

Thanks, Jack

tjblaette commented 4 years ago

Hi Jack,

I've looked at your data and noticed a few things. Starting with the 24bp ITD:

General data quality is high and 97% of reads pass the BQS filter. That's good. However, almost all of your reads are filtered out before the actual variant calling: Less than 3% of reads remain for the analysis. While it's expected in a hybrid capture experiment that most reads won't cover the ITD site, which will therefore be filtered by getITD based on their alignment score, many reads are actually lost at the other steps. We need to adjust default parameters:

getITD was optimized for an amplicon-based assay where all amplicons are generated by the same, known primer pair. With default parameters, getITD therefore requires that all reads start or end with one of these known primers. In your hybrid-capture / simulation data, that is of course not the case. getITD is therefore needlessly discarding all reads that don't happen to start exactly where our assay's primers start. To turn this filter off, use -require_indel_free_primers False
Another implication of "all reads start on the same coordinates" is that we can expect most reads to actually have identical sequences, as long as they are not altered by sequencing errors. That is why another default parameter filters out all those reads whose sequences are unique. Again, because your reads have variable 5' coordinates, we want to turn this filter off, using: -min_read_copies 1

This gets us to ~ 750 reads covering the ITD site (instead of ~ 65) and a VAF of 50% (instead of 8%). The VAF being close to half of what's expected, got me suspicious that something was wrong with your forward/ reverse reads. getITD expects all forward and all reverse reads to be respectively located in the same FASTQ file, but I find both forward and reverse reads in your R1 fastq file, for example. Using the -infer_sense_from_alignment True parameter we can rescue these mixed reads and get more than 1000 reads in total covering the ITD site, with a VAF of 60%. This is still quite off from the expected 90% though, so we must be missing more (mutated) reads.

Two questions:

How did you extract reads from the BAM into the FASTQ files? Did you include both reads that i) map to the FLT3 reference region covered by getITD and ii) were NOT mapped by the aligner generating that BAM file? This is what I would recommend.
Many of your reads appear to have some kind of adapter sequences attached, is that right/ expected? (See for example grep GATCGGAAGAGCGT dup24.bam_R2.fastq) If you know your adapter sequences, you may pass these to getITD to help filter out potentially associated false positives, using -forward_adapter XXX and -reverse_adapter YYY. As these sequences appear to be quite long and frequent, you might consider trimming them, to check whether they are causing problems here. (While we did find adapters in our data at times, these were much much shorter, generally encompassing just one or two bp. 1/3 of your R2 reads contain this sequence up above, and mostly longer stretches of it. This could definitely be a problem.)

Anyway, the commands at this point are python getitd/getitd.py -require_indel_free_primers False -min_read_copies 1 -infer_sense_from_alignment True dup24_woutPrimerFilter_woutUniqueReadsFilter_inferSense dup24*fastq python getitd/getitd.py -require_indel_free_primers False -min_read_copies 1 -infer_sense_from_alignment True dup33_woutPrimerFilter_woutUniqueReadsFilter_inferSense dup33*fastq

Using this command then on the 33bp ITD sample, you do call the expected ITD (though VAF is 44%).

Hope this helps for a start. Let me know about questions 1 & 2! Tamara

tjblaette commented 4 years ago

Hi Danyang,

if you feel like the comment above regarding Jack's data does not help you solve your problem, feel free to open another issue. I'll be happy to help.

All the best, Tamara

tungers commented 4 years ago

Hi Jack,

I've looked at your data and noticed a few things. Starting with the 24bp ITD:

General data quality is high and 97% of reads pass the BQS filter. That's good. However, almost all of your reads are filtered out before the actual variant calling: Less than 3% of reads remain for the analysis. While it's expected in a hybrid capture experiment that most reads won't cover the ITD site, which will therefore be filtered by getITD based on their alignment score, many reads are actually lost at the other steps. We need to adjust default parameters:

getITD was optimized for an amplicon-based assay where all amplicons are generated by the same, known primer pair. With default parameters, getITD therefore requires that all reads start or end with one of these known primers. In your hybrid-capture / simulation data, that is of course not the case. getITD is therefore needlessly discarding all reads that don't happen to start exactly where our assay's primers start. To turn this filter off, use -require_indel_free_primers False

Another implication of "all reads start on the same coordinates" is that we can expect most reads to actually have identical sequences, as long as they are not altered by sequencing errors. That is why another default parameter filters out all those reads whose sequences are unique. Again, because your reads have variable 5' coordinates, we want to turn this filter off, using: -min_read_copies 1

This gets us to ~ 750 reads covering the ITD site (instead of ~ 65) and a VAF of 50% (instead of 8%). The VAF being close to half of what's expected, got me suspicious that something was wrong with your forward/ reverse reads. getITD expects all forward and all reverse reads to be respectively located in the same FASTQ file, but I find both forward and reverse reads in your R1 fastq file, for example. Using the -infer_sense_from_alignment True parameter we can rescue these mixed reads and get more than 1000 reads in total covering the ITD site, with a VAF of 60%. This is still quite off from the expected 90% though, so we must be missing more (mutated) reads.

Two questions:

How did you extract reads from the BAM into the FASTQ files? Did you include both reads that i) map to the FLT3 reference region covered by getITD and ii) were NOT mapped by the aligner generating that BAM file? This is what I would recommend.

Many of your reads appear to have some kind of adapter sequences attached, is that right/ expected? (See for example grep GATCGGAAGAGCGT dup24.bam_R2.fastq) If you know your adapter sequences, you may pass these to getITD to help filter out potentially associated false positives, using -forward_adapter XXX and -reverse_adapter YYY. As these sequences appear to be quite long and frequent, you might consider trimming them, to check whether they are causing problems here. (While we did find adapters in our data at times, these were much much shorter, generally encompassing just one or two bp. 1/3 of your R2 reads contain this sequence up above, and mostly longer stretches of it. This could definitely be a problem.)

Anyway, the commands at this point are python getitd/getitd.py -require_indel_free_primers False -min_read_copies 1 -infer_sense_from_alignment True dup24_woutPrimerFilter_woutUniqueReadsFilter_inferSense dup24*fastq python getitd/getitd.py -require_indel_free_primers False -min_read_copies 1 -infer_sense_from_alignment True dup33_woutPrimerFilter_woutUniqueReadsFilter_inferSense dup33*fastq

Using this command then on the 33bp ITD sample, you do call the expected ITD (though VAF is 44%).

Hope this helps for a start. Let me know about questions 1 & 2! Tamara

Hi Tamara,

Thanks so much for taking the time to respond! Your explanations make a lot of sense and the changes to the settings have indeed increased the sensitivity of getITD in detecting the ITDs in my hybrid capture data. I am now able to detect ITDs up to ~100bp. Although we are nearing our read length of 150bp, I'm under the impression that the maximum ITD length able to be detected by getitd has not been reached? I attached the fastq and getitd output files for a 105bp ITD (starting at the same position as before) that was not detected by getitd using the new settings.

data.zip

To answer your questions:

The fastq files I'm analyzing were generated by a program called insilico Mutator (https://github.com/thesushantpatil/insiM), which takes a normal (unmutated) BAM file as input and adds the specified ITD to the reads overlapping the specified region. The output fastq files that I am analyzing with getITD should therefore contain all the mutated reads (mapped or unmapped) that were generated.
Thanks for the tip! I will try filtering out the false positive reads I'm seeing based on the adapter sequences.

Thanks again for your helpful replies!

Best, Jack

tjblaette commented 4 years ago

Hi Jack,

I can take a look but I'm pretty sure the problem is related to those adapter sequences (or whatever else they may be). If I check their length, I find them to be 33 bp long, on average, in your R2 file. Now, the theoretical maximum ITD length that getITD could detect using 150 bp reads, would be around 144 bp (150 bp read length - 6 bp min insert length). This calculation, however, assumes that all of your 150 bp actually cover proper gene sequence. If 33 bp contain adapter sequence, then we'd only expect those reads to enable identification of 150 - 6 - 33 bp = 111 bp. This is quite close to 105 bp.

I will take a look and see, if I notice anything else in the 105 bp sample. However, I believe it would be best to solve this adapter problem already for the 24 bp case. Still about 1/3 of the ITD reads there are missed (thus VAF=60% instead of 90%). It might be a good idea to

run getITD on the raw hybrid capture data, prior to you adding in ITDs, to identify any problems arising from the data itself. Check any insertion that comes up.
check for the corresponding adapter sequence(s) in R1. If there are such long and frequent sequences in the R2 file, there is very likely also at least one present in the R1 file. (If you don't know them a priori, check insertions that come up when running the actual data, see prev. point)
try another sample to base your simulations on, if such a high average length and frequency of adapter sequences is not expected / typical
run getITD on your simulated data after trimming all of these adapter sequences, and check whether that fixes your VAF estimates for the 24 bp and 33 bp cases (if it does, this might be a viable option for any future sample of yours)
run getITD on your simulated data after filtering out all adapter-containing reads, to check whether that fixes VAFs.

Let me know if there is anything else I can do to help, Tamara

ydy1127 commented 4 years ago

Hi Jack,

Thank you very much for sharing your generation method with me. I followed the simulation idea of the following paper "Genomon ITDetector: a tool for somatic internal tandem duplication detection from cancer genome sequencing data". I think your simulation method is more efficient.

At the author's suggestion, I also identified previously unidentified itds. Another question, I would like to ask you how you observed ITD in IGV. In my case, if the length of ITD is greater than about 40bp, it will not be reported as Indels, and it cannot be observed in IGV.

Would it be convenient for you to provide your email address? I still have some questions to discuss with you. Thank you very much!

Best wishes Danyang Yuan

tjblaette / getitd

Trouble detecting ITDs larger than 33bp #6