Closed BoudSTER closed 3 years ago
Hi Augustin,
thanks for testing getITD and thanks checking in with me! I will have a look at the protocol and get back to you with some recommendations. Can you roughly estimate the coverage your assay achieves on the FLT3 target region?
It would be great if getITD worked on existing panel workflows as well!
All the best, Tamara
Hi tjblaette,
Thank you for your response.
On average over the FLT3-ITD region (chr13:28577411-28674713) I have 117868 reads. I am very far from your milion ^^.
Have a nice day :).
Augustin.
Hi Augustin,
there are two important parameters to adjust as follows:
-require_indel_free_primers False
: We don't have specific primers in the SureSelect protocol, so we cannot filter on these.-min_read_copies 1
: Sequencing reads start at variable positions in the target region, so there may not always be exact duplicates of each read, especially as your coverage is much lower than for our assay. Just be aware that this could make getITD significantly slower, especially at the alignment stage.Both of these parameters normally help to reduce false positives, so you might have to adjust filter parameters at the end, such as -filter_ins_vaf
.
One parameter that I'm not sure about is -infer_sense_from_alignment
. You can keep it at False
if your R1 FASTQ file reads all map directly to the reference amplicon while your R2 reads have to be reverse complemented to map. Setting it to True
might make things slower, but keeping it at False
when it shouldn't be can erroneously discard up to half of your reads. If you are not sure, try both ways and see if it makes a difference in the number of reads that pass the alignment stage.
Also, while it should be possible to process complete FASTQ files, I recommend extracting the relevant reads first, as getITD will otherwise likely run for an extremely long time. These will be the reads that were successfully aligned to the FLT3 target region by another aligner such as BWA, as well as the ones that could not be aligned at all. Please let me know if you need help with this. Chromosomal coordinates for the default reference section can be found in the anno/amplicon_kayser.tsv
file. Depending on your read distribution, it might be worth trying a larger reference. In that case, extent anno/amplicon.txt
and anno/amplicon_kayser.tsv
as necessary.
Please let me know how these work for you and whether there is anything else I can do to help. Were you having any specific issues so far? The title mentions "unexpected results"?
All the best, Tamara
Hi tjblaette,
After your modifications it works better !
Initially using getitd "basically": Out of 145 positive ITDs with the reference method getitd finds 18.
After your modifications of the settings: -infer_sense_from_alignment True Out of 145 positive ITDs with the getitd reference method, 41 are found.
-infer_sense_from_alignment False
getitd raise an error
Command output:
Turned OFF indel-free primer filter!
Total reads remaining for analysis: 1697 (2.1798330122029546 %)
Calculating coverage took 0.12398674301221035 s
-- Looking for insertions & ITDs --
Collecting inserts took 0.10209453999414109 s
83 insertions were found
0/83 insertions were part of adapters and filtered
Filtering inserts for adapter sequences took 0.11129739001626149 s
Annotating coverage took 0.006243465992156416 s
Found an ITD too long for the respective read (likely false positive):
ITD's WT tandem is not fully covered by the read and the ITD was therefore not called.
Report this warning if there too many; otherwise safely ignore.
ITD's WT tandem is not fully covered by the read and the ITD was therefore not called.
Report this warning if there too many; otherwise safely ignore.
ITD's WT tandem is not fully covered by the read and the ITD was therefore not called.
Report this warning if there too many; otherwise safely ignore.
Found an ITD too long for the respective read (likely false positive):
ITD's WT tandem is not fully covered by the read and the ITD was therefore not called.
Report this warning if there too many; otherwise safely ignore.
Found an ITD too long for the respective read (likely false positive):
ITD's WT tandem is not fully covered by the read and the ITD was therefore not called.
Report this warning if there too many; otherwise safely ignore.
Found an ITD too long for the respective read (likely false positive):
ITD's WT tandem is not fully covered by the read and the ITD was therefore not called.
Report this warning if there too many; otherwise safely ignore.
Found an ITD too long for the respective read (likely false positive):
ITD's WT tandem is not fully covered by the read and the ITD was therefore not called.
Report this warning if there too many; otherwise safely ignore.
Collecting ITDs took 0.5573262049874756 s
35 ITDs were found
-- Merging insertions --
47 insertions remain after merging
31 insertions remain after merging
27 insertions remain after merging
Command error:
Traceback (most recent call last):
File "/opt/getitd-1.2.1/getitd.py", line 2050, in
main(config)
File "/opt/getitd-1.2.1/getitd.py", line 2027, in main
merged_ins_anditds[type] = get_mergedinserts(inserts, type_, ref_coverage, config)
File "/opt/getitd-1.2.1/getitd.py", line 1797, in get_merged_inserts
to_merge = merge(to_merge, condition, ref_coverage, config)
File "/opt/getitd-1.2.1/getitd.py", line 1513, in merge
minsert_collection = minsert_collection.merge(insert_collection, ref_coverage)
File "/opt/getitd-1.2.1/getitd.py", line 1127, in merge
self = self.set_representative()
File "/opt/getitd-1.2.1/getitd.py", line 1110, in set_representative
self.rep = self.rep.calc_vaf()
File "/opt/getitd-1.2.1/getitd.py", line 552, in calc_vaf
assert self.vaf >= 0 and self.vaf <= 100
AssertionError
Many thanks for these precious answers! Have a nice day,
Augustin
Sorry for the CAPS lock ...
Hi Augustin,
I'm glad the parameters are already working better! As for the assertion error, that is likely related to the fix I just pushed to the develop branch. Could you try that one?
I'll try it as soon as I get out of the meeting. Do you think that the setting -infer_sense_from_alignment False will improve the results?
That depends on whether your reads are already sorted the way getITD expects it or not. By default, when the parameter is set to False
, getITD will align R1 reads directly to the reference, whereas it will reverse-complement R2 reads and then align them. If that's how your reads are sorted into FASTQ files, the parameter should have no effect. If your reads are not sorted in this way, half of them may fail alignment, as the wrong ones are reverse-complemented and vice versa. Setting -infer_sense_from_alignment True
will have getITD align each read twice, both pre and post reverse-complementing it, and always keep the one which achieves a higher score. This way, getITD can automatically determine which way around reads should be aligned. I'm just not sure how SureSelect reads will be sorted into FASTQ files.
Hi tjblaette,
Sorry for the delay ! Yesterday was a busy day :( ... I restarted on the develloper branch and it works I have no more error when the -infer_sense_from_alignment False parameter
Currently I get : With all the parameters you gave me -infer_sense_from_alignment True 41 ITDs out of 145 ITDs -infer_sense_from_alignment False 41 ITDs out of 145 ITDs
Do you think these results are due to my low read depth? I don't think the problem is due to my fastq's. I've tested other algorithms without any problem.
Thank you for your time.
Have a nice day,
Augustin.
Hi Augustin,
it's hard to say without seeing the data. With the parameters that I have suggested, getITD would require two distinct supporting reads to call an ITD as high-confidence. (ITDs supported by only a single read will be listed in the itd*trailing.tsv
file.)
Some things to consider
amplicon.txt
target region? It could be that your samples would benefit from a longer reference that extends into the adjacent introns, depending on your read distribution. Possibly getITD cannot align some of the reads overlapping exon borders because they don't map enough onto the amplicon.txt
region?Can you give me the full command you're using now, just to double-check parameters are set correctly?
All the best, Tamara
Thank you for those accurate answers!
The ITDs found with getITD have a range of [6;111] with the reference method (fragment analysis) [6;213] I enclose a histogram of the lengths. https://ibb.co/1ngbZmZ
By comparing amplicon.txt with other references that I use with other algorithms I was able to make an estimate of where these ITDs are located in amplicons.txt. They range from position 51 to 284.
During the Aligning to Reference step I lose a lot of read do you think I should modify amplicon.txt ? eg : """ ==== PROCESSING SAMPLE P20-L1702218D01_S20 ==== -- Reading FASTQ files -- Number of total reads: 82158 Number of total reads remainging after N-trimming: 74218 (90.33569463716255 %) Mean read length after N-trimming: 100.78386644749253 Number of total reads with mean BQS >= 30: 66372 (80.78580296501862 %) Number of unique reads with mean BQS >= 30: 12269 Turned OFF unique reads filter! Total reads remaining for analysis: 66372 (80.78580296501862 %)
-- Aligning to Reference -- Inferring sense from alignment! Filtering 11590 / 12269 low quality alignments with a score < 50.0 % of max Turned OFF indel-free primer filter! Total reads remaining for analysis: 4639 (5.646437352418511 %) """" By the way for some negative patients the file containing the different output files contains only the folder out_needle and nothing else is this normal?
No difference the samples are treated in the same way and as I answered you on the previous points I don't find that the lengths or positions differ much.
Of course :) my full command is : """ python3 getitd.py -nkern ${task.thread} -require_indel_free_primers True -min_read_copies 1 -infer_sense_from_alignment True $sample_name ${sample_name}_R1.fastq ${sample_name}_R2.fastq """
Many thanks for all
Augustin.
Hi Augustin,
thanks for all of those details. A few things to note:
-min_read_length
to something below 100 to include shorter trimmed reads as well, though their ability to detect long insertions will of course be limited.-min_bqs
to something below the default 30, but its' probably not a proper solution. Even with default settings we sometimes see spurious false positives come up in some of our lower-quality samples (especially many different short 6-7 bp insert sequences detected as trailing {-1}
ITDs).amplicon.txt
, never at the border. There are about 50bp at each end where no ITD was found? We need to check the coverage of the amplicon, to the see whether reads are not properly aligned closer to the exon junction region. See the bottom of my reply for an update to help with this.amplicon.txt
region within FLT3? In other words, how many reads are you expecting to align with getITD? Should it be 100%?out_needle
, stats.txt
and config.txt
. The ins*
and itd*
files will only be present if insertions or itds were found. Please let me know if the stats and config files are really missing. In that case, there's a problem with getITD.There are two minor issues with your command and the example output you showed. Your command has require_indel_free_primers
set to True
instead of False
. The example output seems to filter for an alignment score of 50%, whereas the current version of getITD should filter for 40%.
To check coverage, I have just pushed another update: With the newest version, v1.5.6, an additional "coverage.txt" file will the saved for each sample. If you use -plot_coverage True
, the data will additionally be plotted to coverage.png
(you must install the python module matplotlib
for this to work). Try running some of your samples with plotting enabled and check the coverage. If you could share a representative plot, that would be nice too. I'm curious what SureSelect looks like!
All the best, Tamara
Hi Tamara,
Sorry for the delay, I don't have access to my data on weekends ^^.
Indeed that explains for 12 of my ITDs.
Yes I think I know the cause: From the BAM I extract the fastqs covering the FLT3 region and the unmapped reads. This important quantity of N must come from the unmapped reads I think. I will restart with your setting -min_read_length 60.
Ok I'll try with -min_bqs 20
Concerning the coverage I'll try with your settings and come back to you (-plot_coverage True).
No my FASTQs contain the FLT3 region and the unmapped reads only. I made a preliminary filter not to overwork getITD.
I just checked for some patients the folder contains only the out_needle (no stats.txt nor config.txt)
Ok I'll try your update. I'll get back to you with the graphics.
My new command : """ python3 /opt/getitd-develop/getitd.py -nkern ${task.thread} -require_indel_free_primers False -min_read_copies 1 -min_read_length 60 -min_bqs 20 -infer_sense_from_alignment True $sample_name ${sample_name}_R1.fastq ${sample_name}_R2.fastq """
Once again a big thank you for your help! :)
Have a nice day.
Augustin.
Hi Augustin,
command looks good. Could you share the command and log for the samples where getITD failed to produce the stat and config files? Could you try rerunning one of those and making sure there is no error? Config and stats files are created before the out_needle folder, so I'm not sure why those two would be missing when the out_needle directory is there.
You could also try running getITD only on the reads expected to map to the two exons in the getITD target region. That would allow us to troubleshoot alignment issues better.
Looking forward to the coverage plots!
Tamara
Hi Tamara,
I created a nextflow pipeline that treats all samples in the same way. I've linked you the tree structure of my results file : https://drive.google.com/file/d/1VhMxR57z57KLiLOedK1r_6ZixrpeuxOo/view?usp=sharing For some patients not just the out_needle :/ Relaunch gives the same results.
You could also try running getITD only on the reads expected to map to the two exons in the getITD target region. That would allow us to troubleshoot alignment issues better. I'm not sure I understood you asking me to filter more precisely? i.e. the genomic position Exon14 + Exon 15 only? without the unmapped reads. But if getITD filters the unmapped reads not corresponding to FLT3 it shouldn't affect the sequel, right ?
All of my cover graphs are in this file: https://drive.google.com/drive/folders/1Wk1SQ66kDr_nMRSRV9ITJB2dVa8JlbAG?usp=sharing only a part has been generated I feel like
Thanks a lot for your help
Have a nice day.
Augustin
Hi Augustin,
Could you just manually rerun just one of the samples where getITD fails to create the stats and config files? If that still creates the out_needle folder and no stats/config, could you share the command, the log and the FASTQ files? I will try to troubleshoot it then. Just one sample, or even a portion of it will be fine, if it recreates the error.
I will have a look at your coverage data later. Thanks for sharing!
Hi Tamara,
I don't understand! in manual I don't have any missing documents ... I have been trying since this afternoon to understand where the problem comes from. As soon as I know, I'll come back to you.
For the coverage graphics it seems ok to me?
Hi Tamara,
You have succeeded with your settings has recovered almost all the patients: 103/114 patients and 125 ITDs out of 145 ITDs. The correlation between log(M/WT) fragment and log(ar) getITD R=0.57 p = >0.001%.
Sorry for my mistake with empty folder... a story of a symbolic link badly parameterized.
I'm going to pass negative patients due to the modification of parameterizations and I'll come back to you! with a specificity!
Many thanks :)
Have a nice day,
Augustin
Hi Augustin,
that's great news! I'm glad you mentioned the empty folders so that we could sort this out as well!
The number of ITDs found is now quite good, since we know that some are too large to be found with the short read length. Feel free to share a sample or two where a smaller ITD is still not found and you're not sure why. I can take a look then.
As for the correlation, this seems a bit low? In our publication we correlated VAFs from getITD with VAFs calculated from the AR reported by traditional fragment analysis and achieved R^2 > 0.95. Do you get better results correlating VAFs directly instead of log-scaled VAF and AR? (VAF = ar/(ar + 1) * 100 )
Hi Tamara,
Yes, I am still sorry for this mistake.
Thank you for this proposal! I'm preparing a folder on my drive with the filtered fastq (FLT3 + unmapped read) with an array of values corresponding to the fragment.
Yes the correlation coefficient is quite low :/. I had put it down to our read depth. The problem is that our results in fragment analysis are rendered in M/WT (according to the ENL 2017 recommendations), so I don't have the VAF ... Nevertheless, while searching by computation I had determined from the getitd output: -counts = read mutated -ar = counts / (coverage - counts) which would be equivalent to an M/WT?
I have another problem: I have analyzed my 71 negative patients and with the modifications of the settings, they all come out positive (70/71 False positive) :/ has ratios sometimes > 0.01. If I increase -min_read_copies 10 I lose 10 ITDs among the positive ones :/. Can we play on other parameters in order to compensate for this false positive bp ?
Thank you very much for your help.
Augustin.
Here is the file containing the few missing patients: https://drive.google.com/drive/folders/13D_0298Ufqx9OXYD7yaH7rUguYLuYB2Z?usp=sharing
Hi Augustin,
that is indeed a problem. What samples are you using as your negative controls? Are these healthy volunteers or FA-negative AML patients (i.e. could some of them at least potentially harbor a low allelic ITD?)?
getITD is quite sensitive to low sequencing quality when it comes to false positives, especially with some of the non-default parameters we are using for your data. As mentioned previously, what we have seen in some of our lower quality samples is many different trailing ITDs with a 6-7bp insert supported only by {-1}
reads. Is this what you see? If it is, it might be worth filtering for longer insert sequences (min 9bp). This cannot be done automatically with getITD yet, but I can add the parameter if needed. The other thing that might help, which I have been meaning to implement, is more flexible read trimming: Instead of only taking out terminal Ns, I was thinking about trimming terminal bases below a certain base quality score. Have you tried setting the overall BQS filter back to 30? Like I said earlier, I wouldn't generally recommend lowering that cutoff. If that doesn't help and you can share some of the false positive controls, I can take a look myself as well.
The other possibility is multiplexing contamination. Were the control samples sequenced together with ITD-positive samples? If so, were these multiplexed using single or double unique indices/barcodes? Illumina reports up to 0.2% of incorrectly demultiplexed reads per sample for single-barcoded samples (see white paper here), which is why we use double-unique indices for our assay AND run negative control and post-treatment patient samples with low or no ITD levels present always separate from diagnostic/relapse samples. If the ITDs you are seeing in your control samples are caused by multiplexing issues, you should see the same ITD in your negative controls AND an actual patient that was sequenced on the same run.
Also, are you seeing likely false positives in your actual patients? If it's a general problem that is causing this many false positives, I would assume you see them in all samples. Is there a difference in coverage or sequencing quality between your patients and your controls?
I hope this helps, Tamara
Hi Tamara,
that is indeed a problem. What samples are you using as your negative controls? Are these healthy volunteers or FA-negative AML patients (i.e. could some of them at least potentially harbor a low allelic ITD?)? My samples are FA-negative AMLs. I also thought it was also low ratio (FA limit of quantification 1%) but even with an after filter >0.01 70 patients are positive. These are insertions but they are very often multiples of 3 and according to the literature a multiple insertion of 3 has the same clinical consequence as a duplication (gain of function for the protein with constitutional activation of the receptor).
getITD is quite sensitive to low sequencing quality when it comes to false positives, especially with some of the non-default parameters we are using for your data. As mentioned previously, what we have seen in some of our lower quality samples is many different trailing ITDs with a 6-7bp insert supported only by {-1} reads. Is this what you see? If it is, it might be worth filtering for longer insert sequences (min 9bp). This cannot be done automatically with getITD yet, but I can add the parameter if needed. The other thing that might help, which I have been meaning to implement, is more flexible read trimming: Instead of only taking out terminal Ns, I was thinking about trimming terminal bases below a certain base quality score. Have you tried setting the overall BQS filter back to 30? Like I said earlier, I wouldn't generally recommend lowering that cutoff. If that doesn't help and you can share some of the false positive controls, I can take a look myself as well. Not really these are not small ITDs lengths the range goes from [12bp -63bp]. No I'll try to set the BQS back to 30 ;) I'll raise it again and get back to you. I'll add some fastqs with a high ratio to our file on the drive. I also added my results table of my 71 negative patients fragments.
The other possibility is multiplexing contamination. Were the control samples sequenced together with ITD-positive samples? If so, were these multiplexed using single or double unique indices/barcodes? Illumina reports up to 0.2% of incorrectly demultiplexed reads per sample for single-barcoded samples (see white paper here), which is why we use double-unique indices for our assay AND run negative control and post-treatment patient samples with low or no ITD levels present always separate from diagnostic/relapse samples. If the ITDs you are seeing in your control samples are caused by multiplexing issues, you should see the same ITD in your negative controls AND an actual patient that was sequenced on the same run. Yes, these are routine runs at the hospital so it is possible that these neg patients may have passed with positive patients. I don't think it's a multiplexing problem. In fact, we passed a series of other algorithms that do not output these false positives :/ has ratios > 0.01. I'm currently checking the runs to see if some patients are close to other positives and I'll get back to you ;).
Also, are you seeing likely false positives in your actual patients? If it's a general problem that is causing this many false positives, I would assume you see them in all samples. Is there a difference in coverage or sequencing quality between your patients and your controls? Yes there are also many more ITDs among my positive patients but since the patients were positive I didn't worry but for the AF Negative it's problematic.
I will try this new order on my 2 pools of patients after your precious advice: """ python3 /opt/getitd-master/getitd.py -nkern ${task.thread} -require_indel_free_primers False -min_read_copies 10 -min_read_length 60 -min_bqs 30 -infer_sense_from_alignment True $sample_name ${sample_name}_R1.fastq ${sample_name}_R2.fastq """
Thanks for everything :)
Have a nice day,
Augustin
Hi Augustin,
the VAF reported by getITD is already in percent. Thus, if the data was sequenced using single unique sample barcodes for multiplexing, you should use -filter_ins_vaf 0.2
. If you wanted to filter for ITDs with a minimum VAF of 1% to match the sensitivity of FA, you would have to use -filter_ins_vaf 1
.
Also note that getITD does not report inserts whose length is not a multiple of 3, unless they are "trailing" and presumably not sequenced fully. Thus, most of the reported insertions will be in-frame simply by design.
As for the false positives: I don't think I have yet seen a false positive non-trailing ITD reported by getITD. It is the trailing ITDs which are potentially less trustworthy, especially when the insert (column 9 of the output tsv files) is very short (< 9 bp). We did see in our samples also many more ITDs than were reported by FA previously. However, we believe that these are mostly true positives because i) we found the same additional ITDs in technical replicates of the same sample which were independently prepped and sequenced, ii) we found the same additional ITDs in bone marrow and peripheral blood samples in patients where both were independently prepped and sequenced and iii) we found the same additional ITDs in post-treatment follow-up samples which were prepped and sequenced independently of the original diagnostic sample. I would honestly not be surprised if low-allelic ITD clones could be found in other AML patients as well. So if the additional clones you see are non-trailing, I would assume they are mostly fine. If you see mostly trailing ITDs, especially with short insert sequences (column 9) and reverse read support only (column 10), that would be worrysome.
I'm not sure -min_read_copies 10
is the correct parameter. min_read_copies 10
will filter out all reads that do not occur at least 10 times in each sample. Our default is 2, which means we discard unique reads. Since SureSelect is not based on a few specific amplicons, I would rather set this to 1. 10 seems quite stringent. If you wanted to filter for a minimum number of total reads supporting a certain insertion, you can use -filter_ins_total_reads
. That's a good idea actually. Try setting -filter_ins_total_reads 4
to mimic getITD's default of requiring two distinct reads with at least two copies each. You can also set it to a higher number, but filtering for VAF is probably better, as it is less sensitive to variations in coverage.
Thanks for adding the files, I will have a look! Tamara
Hi Tamara,
For the false positives I understand well for the low ratios I applied a filter to eliminate them after the analysis. Nevertheless for an ITD of length 45 with a VAF at 30% M/WT at 0.4 should be detected in GS.
You are right for min_read_copies 1. I will stay on VAF filtering then.
I'm also looking on my side ;)
Thanks for your help.
Have a nice day,
Augustin
Hi Augustin,
I have run through the samples you sent me and checked the ITDs that getITD reports.
My command was for file in L*R1*; do OUT=$(echo $file | cut -f1 -d'_'); echo $OUT; python getitd/getitd.py -plot_coverage true -min_read_copies 1 -min_read_length 60 -nkern 10 -require_indel_free_primers false -infer_sense_from_alignment true $OUT ${OUT}*fastq; done
. (I forgot to add -min_ins_vaf 0.2
)
I first noticed that a few times there are two entries describing the same ITD, one as trailing, one as not trailing. This is because getITD currently does not merge trailing with non-trailing ITDs. (I was afraid small inserts would spuriously be merged into larger ones, and in our assay this is not a problem because ITDs are very rarely discovered as both trailing and non-trailing. In your assay, it occurs very often. I will have to see if there is a good way to deal with these double-entries.)
Looking at the specific ITDs themselves, the non-trailing ones look perfectly valid to me. I manually checked some of the supporting alignments and they are definitely ITDs. See attached examples for 27bp 16% VAF and 36bp 3% VAF ITDs in L1702796D01 as well as the 15bp 11% VAF ITD in L1805186D05. I don't know why GS did not detect these, but based on those reads, getITD was right in calling them. The 36bp ITD was also called as a trailing variant, which is also fine.
The two trailing ITDs in L1505897D11 (104bp 5% VAF) and L1906424D03 (97bp 0.5% VAF) look less convincing and could very well be false positives. For getITD they are edge cases, called because they could be true ITDs with sequencing errors and additional insert bases between the two tandems, or spurious alignments (which in these cases is probably more likely though). Possibly extending the reference sequence further into the intron might help. Have you tried this? (I can prepare a longer reference if you need help with that!)
All the best, Tamara
L1702796D01_27bp-non-trailing:
L1702796D01_36bp-non-trailing:
L1805186D05_15bp-non-trailing:
L1906424D03_97bp-trailing:
L1505897D11_104bp-trailing:
Hi Tamara,
Thank you for these explanations for the positives. It is the most negative patients in FA that worries me :/. I have put inside our file a "p_neg" folder including some fastq of FPs as well as the summary table for all the negative patients.
I will try with a longer reference and come back to you.
Thank you for your help,
Have a nice day,
Augustin.
Hi Augustin,
I have run through the negative samples you have provided but for me there are no high-confidence ITDs reported, so getITD does seem to handle these perfectly fine. There is no need to worry about the insertions in ins*hc*
. If you look at the coordinates, they all start at the very beginning (start=1) or end (start=329) of the amplicon and will be "overhanging" reads that go beyond the reference amplicon. These will always be there for SureSelect data, where reads are not limited to our small target region.
The command I used is for file in P*R1*; do OUT=$(echo $file | cut -f1 -d'_'); echo $OUT; python ../../getitd/getitd.py -plot_coverage true -min_read_copies 1 -min_read_length 60 -nkern 10 -require_indel_free_primers false -infer_sense_from_alignment true -filter_ins_vaf 0.2 $OUT ${OUT}*fastq; done
sample length start vaf ar coverage counts trailing seq sense
P22-L1810262D11 6 329 0.36998 0.0037135 1892 7 False GTACAG {-1}
P22-L1810262D11 21 1 0.89552 0.0090362 335 3 False GAGACTTGTGATCAAGAGACA {1}
P22-L1810262D11 27 1 1.4925 0.015151 335 5 False TAACTGACTCATCATTTCATCTCTGAA {1}
P22-L1810262D11 30 1 0.59701 0.0060060 335 2 False TCCTAACTGACTCATCATTTCATCTCTGAA {1}
P22-L1810262D11 33 1 1.4925 0.015151 335 5 False TATTCCTAACTGACTCATCATTTCATCTCTGAA {1}
P22-L1810262D11 45 329 60.840 1.5536 2620 1594 False GTACAGTATAGTGGAAGGACAGCAACAAAGATGCCCAAAAATGGG {1, -1}
P22-L1810262D11 49 1 65.548 1.9026 894 586 True TATCTGCAGAACTGCCTATTCCTAACTGACTCATCATTTCATCTCTGAA {1, -1}
P22-L1810262D11 54 1 2.3881 0.024465 335 8 False TCCTCTATCTGCAGAACTGCCTATTCCTAACTGACTCATCATTTCATCTCTGAA {1}
P22-L1810262D11 63 329 0.47569 0.0047797 1892 9 False GTACAGTATAGTGGAAGGACAGCAACAAAGATGCCCAAAAATGGGAGGCACAGTTTCCCACCC {-1}
sample length start vaf ar coverage counts trailing seq sense
P23-L1507694D15 6 329 0.25088 0.0025151 1993 5 False GTACAG {-1}
P23-L1507694D15 30 329 0.65228 0.0065656 1993 13 False GTACAGTATAGTGGAAGGACAGCAACAAAG {-1}
P23-L1507694D15 45 329 55.916 1.2684 2679 1498 False GTACAGTATAGTGGAAGGACAGCAACAAAGATGCCCAAAAATGGG {1, -1}
P23-L1507694D15 49 1 59.821 1.4889 1003 600 True TATCTGCAGAACTGCCTATTCCTAACTGACTCATCATTTCATCTCTGAA {1, -1}
P23-L1507694D15 54 1 3.0445 0.031401 427 13 False TCCTCTATCTGCAGAACTGCCTATTCCTAACTGACTCATCATTTCATCTCTGAA {1}
P23-L1507694D15 54 329 0.60211 0.0060576 1993 12 False GTACAGTATAGTGGAAGGACAGCAACAAAGATGCACAAAAATGGGAGGCACAGT {-1}
sample length start vaf ar coverage counts trailing seq sense
P85-L1808763D05 14 1 75.258 3.0417 194 146 True ATTTCATCTCTGAA {1, -1}
P85-L1808763D05 14 329 0.41347 0.0041518 1693 7 True GTCCTTTTCTCTGC {1}
P85-L1808763D05 33 329 0.70886 0.0071392 1975 14 False GTACAGTATAGTGGAAGGACAGCAACAAAGATG {-1}
P85-L1808763D05 42 1 1.4925 0.015151 134 2 False AGAACTGCCTATTCCTAACTGACTCATCATTTCATCTCTGAA {1}
P85-L1808763D05 45 329 80.221 4.0559 2447 1963 False GTACAGTATAGTGGAAGGACAGCAACAAAGATGCCCAAAAATGGG {1, -1}
sample length start vaf ar coverage counts trailing seq sense
P91-L1708688D04 12 329 0.26017 0.0026085 4228 11 False GTACAGTATAGT {-1}
P91-L1708688D04 14 1 55.885 1.2668 2328 1301 True ATTTCATCTCTGAA {1, -1}
P91-L1708688D04 21 1 0.50505 0.0050761 990 5 False GAGACTTGTGATCAAGAGACA {1}
P91-L1708688D04 21 1 3.0303 0.031250 990 30 False ACTCATCATTTCATCTCTGAA {1}
P91-L1708688D04 30 329 0.82781 0.0083472 4228 35 False GTACAGTATAGTGGAAGGACAGCAACAAAG {-1}
P91-L1708688D04 33 329 0.47304 0.0047529 4228 20 False GTACAGTATAGTGGAAGGACAGCAACAAAGATG {-1}
P91-L1708688D04 36 1 0.50505 0.0050761 990 5 False GCCTATTCCTAACTGACTCATCATTTCATCTCTGAA {1}
P91-L1708688D04 45 329 53.205 1.1370 5445 2897 False GTACAGTATAGTGGAAGGACAGCAACAAAGATGCCCAAAAATGGG {1, -1}
P91-L1708688D04 54 1 1.3131 0.013306 990 13 False TCCTCTATCTGCAGAACTGCCTATTCCTAACTGACTCATCATTTCATCTCTGAA {1}
I have also just pushed another update to getITD, which fixes a small bug in VAF calculation. Be sure to check it out and let me know if you have any other questions or problems that I can help you with!
All the best, Tamara
Thank you for everything Tamara!
I'm sorry I was in meetings all day Friday, I'm watching the VAF correction on Monday!
Augustin.
Hi tamara! It's me again ^^
I've run patients from a new cohort and I have a new error that never appeared before : """ python3 /opt/getitd-develop/getitd.py -nkern 12 -require_indel_free_primers False -min_read_copies 1 -min_read_length 60 -min_bqs 20 -infer_sense_from_alignment True P16-L1305872D01_S16 P16-L1305872D01_S16_R1.fastq P16-L1305872D01_S16_R2.fastq
Command exit status:
1
Command output:
getITD requires the WT tandem to be fully sequenced, but this ITD's WT tandem was not covered completely.
The ITD was therefore considered a likely false positive and not called.
Report this warning if you feel a true positive was missed, otherwise safely ignore.
getITD requires the WT tandem to be fully sequenced, but this ITD's WT tandem was not covered completely.
The ITD was therefore considered a likely false positive and not called.
Report this warning if you feel a true positive was missed, otherwise safely ignore.
getITD requires the WT tandem to be fully sequenced, but this ITD's WT tandem was not covered completely.
The ITD was therefore considered a likely false positive and not called.
Report this warning if you feel a true positive was missed, otherwise safely ignore.
getITD requires the WT tandem to be fully sequenced, but this ITD's WT tandem was not covered completely.
The ITD was therefore considered a likely false positive and not called.
Report this warning if you feel a true positive was missed, otherwise safely ignore.
getITD requires the WT tandem to be fully sequenced, but this ITD's WT tandem was not covered completely.
The ITD was therefore considered a likely false positive and not called.
Report this warning if you feel a true positive was missed, otherwise safely ignore.
getITD requires the WT tandem to be fully sequenced, but this ITD's WT tandem was not covered completely.
The ITD was therefore considered a likely false positive and not called.
Report this warning if you feel a true positive was missed, otherwise safely ignore.
getITD requires the WT tandem to be fully sequenced, but this ITD's WT tandem was not covered completely.
The ITD was therefore considered a likely false positive and not called.
Report this warning if you feel a true positive was missed, otherwise safely ignore.
getITD requires the WT tandem to be fully sequenced, but this ITD's WT tandem was not covered completely.
The ITD was therefore considered a likely false positive and not called.
Report this warning if you feel a true positive was missed, otherwise safely ignore.
Collecting ITDs took 16.399568350985646 s
367 ITDs were found
-- Merging insertions --
422 insertions remain after merging
215 insertions remain after merging
197 insertions remain after merging
32 insertions remain after merging
-- Merging itds --
54 itds remain after merging
Command error:
Traceback (most recent call last):
File "/opt/getitd-master/getitd.py", line 2193, in
main(config)
File "/opt/getitd-master/getitd.py", line 2170, in main
merged_ins_anditds[type] = get_mergedinserts(inserts, type_, iref_coverage, config)
File "/opt/getitd-master/getitd.py", line 1861, in get_merged_inserts
save_to_file([insert.rep for insert in tomerge], type + "_collapsed-" + suffix + ".tsv", config)
File "/opt/getitd-master/getitd.py", line 1594, in save_to_file
to_save = [insert.prep_for_save(config) for insert in inserts]
File "/opt/getitd-master/getitd.py", line 1594, in
to_save = [insert.prep_for_save(config) for insert in inserts]
File "/opt/getitd-master/getitd.py", line 1087, in prep_for_save
assert to_save.end <= len(config["REF"])
AssertionError
"""
My coverage for my patients seems fine ?
Thanks for your help. Sorry to come back with problems :(
Have a nice day,
Augustin
Hi Augustin,
Can you share that specific sample? I'll look into it then!
All the best, Tamara
Of courses :) Many many thanks
I have just added it to our folder "assert_error"
Hi Augustin,
I hope I've got it fixed. I'm running some tests overnight and if it looks ok, I will push to develop
tomorrow.
And sorry for the back and forth! It seems SureSelect is a bit more tricky than I had anticipated. I'm sure we'll get it sorted though!
All the best, Tamara
Hi Tamara,
No worries :)
Thanks to you for helping me! I look forward to your update.
Have a nice day,
Augustin.
Hi Augustin,
I have just pushed v1.5.10 to develop
. It's only run back through some of our samples, but it's looking good so far. Please let me know how it goes and whether you encounter any problems with it!
All the best, Tamara
Hi Tamara,
Yeah :) I can't wait to try it tomorrow! Thanks for your help.
good night,
Augustin.
Hi Tamara,
Well done :) the first part of my new cohort seems to be working. I'll have the rest of the results within the next week!
Thank you very much!
Augustin.
Hi Augustin,
that is wonderful news! I was a bit scared at first when I saw your message. :D Please let me know how it goes, I'll keep my fingers crossed!
Have a good weekend, Tamara
Hi,
First of all thank you for this algorithm which I hope will solve the problem of ITDs in NGS.
I'm doing a comparative study between the different algorithms in this problem. Currently, I can't get conclusive results with getITD by comparing it to the fragment method on several runs of NGS. If eventually I publish this comparison on the subject, I'm bored to underestimate your algorithm...
My NGS pipeline: SureSelect-QXT AGILENT® capture enrichment on a set of 81 genes (including FLT3-ITD) then paire-end sequencing on a MiSeq ILLUMINA® with a chemistry of 2*120bp.
Is it possible to use your algorithm in my case ? if yes, do you have some advice to give me on the parameter ?
Thank you in advance for your time.
Augustin.