Unexpected results - Githubissues

BoudSTER commented 3 years ago

Hi,

First of all thank you for this algorithm which I hope will solve the problem of ITDs in NGS.

I'm doing a comparative study between the different algorithms in this problem. Currently, I can't get conclusive results with getITD by comparing it to the fragment method on several runs of NGS. If eventually I publish this comparison on the subject, I'm bored to underestimate your algorithm...

My NGS pipeline: SureSelect-QXT AGILENT® capture enrichment on a set of 81 genes (including FLT3-ITD) then paire-end sequencing on a MiSeq ILLUMINA® with a chemistry of 2*120bp.

Is it possible to use your algorithm in my case ? if yes, do you have some advice to give me on the parameter ?

Thank you in advance for your time.

Augustin.

tjblaette commented 3 years ago

Hi Augustin,

thanks for testing getITD and thanks checking in with me! I will have a look at the protocol and get back to you with some recommendations. Can you roughly estimate the coverage your assay achieves on the FLT3 target region?

It would be great if getITD worked on existing panel workflows as well!

All the best, Tamara

BoudSTER commented 3 years ago

Hi tjblaette,

Thank you for your response.

On average over the FLT3-ITD region (chr13:28577411-28674713) I have 117868 reads. I am very far from your milion ^^.

Have a nice day :).

Augustin.

tjblaette commented 3 years ago

Hi Augustin,

there are two important parameters to adjust as follows:

-require_indel_free_primers False: We don't have specific primers in the SureSelect protocol, so we cannot filter on these.
-min_read_copies 1: Sequencing reads start at variable positions in the target region, so there may not always be exact duplicates of each read, especially as your coverage is much lower than for our assay. Just be aware that this could make getITD significantly slower, especially at the alignment stage.

Both of these parameters normally help to reduce false positives, so you might have to adjust filter parameters at the end, such as -filter_ins_vaf.

One parameter that I'm not sure about is -infer_sense_from_alignment. You can keep it at False if your R1 FASTQ file reads all map directly to the reference amplicon while your R2 reads have to be reverse complemented to map. Setting it to True might make things slower, but keeping it at False when it shouldn't be can erroneously discard up to half of your reads. If you are not sure, try both ways and see if it makes a difference in the number of reads that pass the alignment stage.

Also, while it should be possible to process complete FASTQ files, I recommend extracting the relevant reads first, as getITD will otherwise likely run for an extremely long time. These will be the reads that were successfully aligned to the FLT3 target region by another aligner such as BWA, as well as the ones that could not be aligned at all. Please let me know if you need help with this. Chromosomal coordinates for the default reference section can be found in the anno/amplicon_kayser.tsv file. Depending on your read distribution, it might be worth trying a larger reference. In that case, extent anno/amplicon.txt and anno/amplicon_kayser.tsv as necessary.

Please let me know how these work for you and whether there is anything else I can do to help. Were you having any specific issues so far? The title mentions "unexpected results"?

All the best, Tamara

BoudSTER commented 3 years ago

Hi tjblaette,

After your modifications it works better !

Initially using getitd "basically": Out of 145 positive ITDs with the reference method getitd finds 18.

After your modifications of the settings: -infer_sense_from_alignment True Out of 145 positive ITDs with the getitd reference method, 41 are found.

-infer_sense_from_alignment False getitd raise an error Command output:
Turned OFF indel-free primer filter!
Total reads remaining for analysis: 1697 (2.1798330122029546 %)
Calculating coverage took 0.12398674301221035 s

-- Looking for insertions & ITDs --
Collecting inserts took 0.10209453999414109 s 83 insertions were found
0/83 insertions were part of adapters and filtered
Filtering inserts for adapter sequences took 0.11129739001626149 s
Annotating coverage took 0.006243465992156416 s

Found an ITD too long for the respective read (likely false positive):
ITD's WT tandem is not fully covered by the read and the ITD was therefore not called.
Report this warning if there too many; otherwise safely ignore.

ITD's WT tandem is not fully covered by the read and the ITD was therefore not called.
Report this warning if there too many; otherwise safely ignore.

Found an ITD too long for the respective read (likely false positive):
ITD's WT tandem is not fully covered by the read and the ITD was therefore not called.
Report this warning if there too many; otherwise safely ignore.

Collecting ITDs took 0.5573262049874756 s
35 ITDs were found

-- Merging insertions --
47 insertions remain after merging
31 insertions remain after merging
27 insertions remain after merging

Command error:
Traceback (most recent call last):
File "/opt/getitd-1.2.1/getitd.py", line 2050, in
main(config)
File "/opt/getitd-1.2.1/getitd.py", line 2027, in main
merged_ins_anditds[type] = get_mergedinserts(inserts, type_, ref_coverage, config)
File "/opt/getitd-1.2.1/getitd.py", line 1797, in get_merged_inserts
to_merge = merge(to_merge, condition, ref_coverage, config)
File "/opt/getitd-1.2.1/getitd.py", line 1513, in merge
minsert_collection = minsert_collection.merge(insert_collection, ref_coverage)
File "/opt/getitd-1.2.1/getitd.py", line 1127, in merge
self = self.set_representative()
File "/opt/getitd-1.2.1/getitd.py", line 1110, in set_representative
self.rep = self.rep.calc_vaf()
File "/opt/getitd-1.2.1/getitd.py", line 552, in calc_vaf
assert self.vaf >= 0 and self.vaf <= 100 AssertionError

Many thanks for these precious answers! Have a nice day,

Augustin

BoudSTER commented 3 years ago

Sorry for the CAPS lock ...

tjblaette commented 3 years ago

Hi Augustin,

I'm glad the parameters are already working better! As for the assertion error, that is likely related to the fix I just pushed to the develop branch. Could you try that one?

BoudSTER commented 3 years ago

I'll try it as soon as I get out of the meeting. Do you think that the setting -infer_sense_from_alignment False will improve the results?

tjblaette commented 3 years ago

That depends on whether your reads are already sorted the way getITD expects it or not. By default, when the parameter is set to False, getITD will align R1 reads directly to the reference, whereas it will reverse-complement R2 reads and then align them. If that's how your reads are sorted into FASTQ files, the parameter should have no effect. If your reads are not sorted in this way, half of them may fail alignment, as the wrong ones are reverse-complemented and vice versa. Setting -infer_sense_from_alignment True will have getITD align each read twice, both pre and post reverse-complementing it, and always keep the one which achieves a higher score. This way, getITD can automatically determine which way around reads should be aligned. I'm just not sure how SureSelect reads will be sorted into FASTQ files.

BoudSTER commented 3 years ago

Hi tjblaette,

Sorry for the delay ! Yesterday was a busy day :( ... I restarted on the develloper branch and it works I have no more error when the -infer_sense_from_alignment False parameter

Currently I get : With all the parameters you gave me -infer_sense_from_alignment True 41 ITDs out of 145 ITDs -infer_sense_from_alignment False 41 ITDs out of 145 ITDs

Do you think these results are due to my low read depth? I don't think the problem is due to my fastq's. I've tested other algorithms without any problem.

Thank you for your time.

Have a nice day,

Augustin.

tjblaette commented 3 years ago

Hi Augustin,

it's hard to say without seeing the data. With the parameters that I have suggested, getITD would require two distinct supporting reads to call an ITD as high-confidence. (ITDs supported by only a single read will be listed in the itd*trailing.tsv file.)

Some things to consider

with 120 bp reads, getITD won't be able to find large inserts. Are the ITDs it's missing large, or is it smaller ones too which are not detected? Missing smaller ones would point to a different problem.
are all the missed ITDs, and their supporting reads, within the default amplicon.txt target region? It could be that your samples would benefit from a longer reference that extends into the adjacent introns, depending on your read distribution. Possibly getITD cannot align some of the reads overlapping exon borders because they don't map enough onto the amplicon.txt region?
looking at the stats getITD reports, is there a specific step that you seem to be losing many reads at?
is there a difference between the ITDs that are detected and the ones which are not detected, such as length or position? Is there a difference between the samples in which an ITD is detected and the ones it's not detected, like coverage or quality?

Can you give me the full command you're using now, just to double-check parameters are set correctly?

All the best, Tamara

BoudSTER commented 3 years ago

Thank you for those accurate answers!

with 120 bp reads, getITD won't be able to find large inserts. Are the ITDs it's missing large, or is it smaller ones too which are not detected? Missing smaller ones would point to a different problem.

The ITDs found with getITD have a range of [6;111] with the reference method (fragment analysis) [6;213] I enclose a histogram of the lengths. https://ibb.co/1ngbZmZ

are all the missed ITDs, and their supporting reads, within the default amplicon.txt target region? It could be that your samples would benefit from a longer reference that extends into the adjacent introns, depending on your read distribution. Possibly getITD cannot align some of the reads overlapping exon borders because they don't map enough onto the amplicon.txt region?

By comparing amplicon.txt with other references that I use with other algorithms I was able to make an estimate of where these ITDs are located in amplicons.txt. They range from position 51 to 284.

looking at the stats getITD reports, is there a specific step that you seem to be losing many reads at?

During the Aligning to Reference step I lose a lot of read do you think I should modify amplicon.txt ? eg : """ ==== PROCESSING SAMPLE P20-L1702218D01_S20 ==== -- Reading FASTQ files -- Number of total reads: 82158 Number of total reads remainging after N-trimming: 74218 (90.33569463716255 %) Mean read length after N-trimming: 100.78386644749253 Number of total reads with mean BQS >= 30: 66372 (80.78580296501862 %) Number of unique reads with mean BQS >= 30: 12269 Turned OFF unique reads filter! Total reads remaining for analysis: 66372 (80.78580296501862 %)

-- Aligning to Reference -- Inferring sense from alignment! Filtering 11590 / 12269 low quality alignments with a score < 50.0 % of max Turned OFF indel-free primer filter! Total reads remaining for analysis: 4639 (5.646437352418511 %) """" By the way for some negative patients the file containing the different output files contains only the folder out_needle and nothing else is this normal?

is there a difference between the ITDs that are detected and the ones which are not detected, such as length or position? Is there a difference between the samples in which an ITD is detected and the ones it's not detected, like coverage or quality?

No difference the samples are treated in the same way and as I answered you on the previous points I don't find that the lengths or positions differ much.

Of course :) my full command is : """ python3 getitd.py -nkern ${task.thread} -require_indel_free_primers True -min_read_copies 1 -infer_sense_from_alignment True $sample_name ${sample_name}_R1.fastq ${sample_name}_R2.fastq """

Many thanks for all

Augustin.

tjblaette commented 3 years ago

Hi Augustin,

thanks for all of those details. A few things to note:

the longest ITD that getITD can detect is always 6 bp shorter than the read length. In your case, the longest possible ITD would be 114 bp long, which is quite close to the 111 bp that you report. This explains at least a few of the ITDs which are not detected.
in your example, I actually see a high number of reads lost at the quality filtering before alignment. I'm surprised you are using Illumina but have a significant number of Ns in your sequences as it seems? 10% of reads are filtered, for the example you showed above, because they contain more than 20 N bases at their (3'?) ends. Do you know what's causing this? (What happens at this stage is that getITD trims off terminal N bases and discards reads shorter than 100 bp. Your reads are 120 bp long, so up to 20 Ns could be trimmed without an entire read being lost.) We normally expect almost all reads, > 99%, to pass N-trimming. Only 90% of reads pass this filter for your example though. You could try setting -min_read_length to something below 100 to include shorter trimmed reads as well, though their ability to detect long insertions will of course be limited.
I also see another 10% of reads filtered at the base quality score cutoff, indicating that even after trimming of terminal Ns, read quality is low in some sequences. Is the quality in this sample comparable to the others in your cohort? We typically see >90% of reads passing the BQS filter. You could try setting -min_bqs to something below the default 30, but its' probably not a proper solution. Even with default settings we sometimes see spurious false positives come up in some of our lower-quality samples (especially many different short 6-7 bp insert sequences detected as trailing {-1} ITDs).
regarding alignment, I see two things: ITDs are only detected in the middle of amplicon.txt, never at the border. There are about 50bp at each end where no ITD was found? We need to check the coverage of the amplicon, to the see whether reads are not properly aligned closer to the exon junction region. See the bottom of my reply for an update to help with this.
your FASTQ files, do they contain all the reads you sequenced, all the reads that mapped to FLT3, or all the reads that should map to the small amplicon.txt region within FLT3? In other words, how many reads are you expecting to align with getITD? Should it be 100%?
all getITD folders should contain out_needle, stats.txt and config.txt. The ins* and itd* files will only be present if insertions or itds were found. Please let me know if the stats and config files are really missing. In that case, there's a problem with getITD.

There are two minor issues with your command and the example output you showed. Your command has require_indel_free_primers set to True instead of False. The example output seems to filter for an alignment score of 50%, whereas the current version of getITD should filter for 40%.

To check coverage, I have just pushed another update: With the newest version, v1.5.6, an additional "coverage.txt" file will the saved for each sample. If you use -plot_coverage True, the data will additionally be plotted to coverage.png (you must install the python module matplotlib for this to work). Try running some of your samples with plotting enabled and check the coverage. If you could share a representative plot, that would be nice too. I'm curious what SureSelect looks like!

All the best, Tamara

BoudSTER commented 3 years ago

Hi Tamara,

Sorry for the delay, I don't have access to my data on weekends ^^.

Indeed that explains for 12 of my ITDs.
Yes I think I know the cause: From the BAM I extract the fastqs covering the FLT3 region and the unmapped reads. This important quantity of N must come from the unmapped reads I think. I will restart with your setting -min_read_length 60.
Ok I'll try with -min_bqs 20
Concerning the coverage I'll try with your settings and come back to you (-plot_coverage True).
No my FASTQs contain the FLT3 region and the unmapped reads only. I made a preliminary filter not to overwork getITD.
I just checked for some patients the folder contains only the out_needle (no stats.txt nor config.txt)

Ok I'll try your update. I'll get back to you with the graphics.

My new command : """ python3 /opt/getitd-develop/getitd.py -nkern ${task.thread} -require_indel_free_primers False -min_read_copies 1 -min_read_length 60 -min_bqs 20 -infer_sense_from_alignment True $sample_name ${sample_name}_R1.fastq ${sample_name}_R2.fastq """

Once again a big thank you for your help! :)

Have a nice day.

Augustin.

tjblaette commented 3 years ago

Hi Augustin,

command looks good. Could you share the command and log for the samples where getITD failed to produce the stat and config files? Could you try rerunning one of those and making sure there is no error? Config and stats files are created before the out_needle folder, so I'm not sure why those two would be missing when the out_needle directory is there.

You could also try running getITD only on the reads expected to map to the two exons in the getITD target region. That would allow us to troubleshoot alignment issues better.

Looking forward to the coverage plots!

Tamara

BoudSTER commented 3 years ago

Hi Tamara,

I created a nextflow pipeline that treats all samples in the same way. I've linked you the tree structure of my results file : https://drive.google.com/file/d/1VhMxR57z57KLiLOedK1r_6ZixrpeuxOo/view?usp=sharing For some patients not just the out_needle :/ Relaunch gives the same results.

You could also try running getITD only on the reads expected to map to the two exons in the getITD target region. That would allow us to troubleshoot alignment issues better. I'm not sure I understood you asking me to filter more precisely? i.e. the genomic position Exon14 + Exon 15 only? without the unmapped reads. But if getITD filters the unmapped reads not corresponding to FLT3 it shouldn't affect the sequel, right ?

All of my cover graphs are in this file: https://drive.google.com/drive/folders/1Wk1SQ66kDr_nMRSRV9ITJB2dVa8JlbAG?usp=sharing only a part has been generated I feel like

Thanks a lot for your help

Have a nice day.

Augustin

tjblaette commented 3 years ago

Hi Augustin,

Could you just manually rerun just one of the samples where getITD fails to create the stats and config files? If that still creates the out_needle folder and no stats/config, could you share the command, the log and the FASTQ files? I will try to troubleshoot it then. Just one sample, or even a portion of it will be fine, if it recreates the error.

I will have a look at your coverage data later. Thanks for sharing!

BoudSTER commented 3 years ago

Hi Tamara,

I don't understand! in manual I don't have any missing documents ... I have been trying since this afternoon to understand where the problem comes from. As soon as I know, I'll come back to you.

For the coverage graphics it seems ok to me?

BoudSTER commented 3 years ago

Hi Tamara,

You have succeeded with your settings has recovered almost all the patients: 103/114 patients and 125 ITDs out of 145 ITDs. The correlation between log(M/WT) fragment and log(ar) getITD R=0.57 p = >0.001%.

Sorry for my mistake with empty folder... a story of a symbolic link badly parameterized.

I'm going to pass negative patients due to the modification of parameterizations and I'll come back to you! with a specificity!

Many thanks :)

Have a nice day,

Augustin

tjblaette commented 3 years ago

Hi Augustin,

that's great news! I'm glad you mentioned the empty folders so that we could sort this out as well!

The number of ITDs found is now quite good, since we know that some are too large to be found with the short read length. Feel free to share a sample or two where a smaller ITD is still not found and you're not sure why. I can take a look then.

As for the correlation, this seems a bit low? In our publication we correlated VAFs from getITD with VAFs calculated from the AR reported by traditional fragment analysis and achieved R^2 > 0.95. Do you get better results correlating VAFs directly instead of log-scaled VAF and AR? (VAF = ar/(ar + 1) * 100 )

BoudSTER commented 3 years ago

Hi Tamara,

Yes, I am still sorry for this mistake.

Thank you for this proposal! I'm preparing a folder on my drive with the filtered fastq (FLT3 + unmapped read) with an array of values corresponding to the fragment.

Yes the correlation coefficient is quite low :/. I had put it down to our read depth. The problem is that our results in fragment analysis are rendered in M/WT (according to the ENL 2017 recommendations), so I don't have the VAF ... Nevertheless, while searching by computation I had determined from the getitd output: -counts = read mutated -ar = counts / (coverage - counts) which would be equivalent to an M/WT?

I have another problem: I have analyzed my 71 negative patients and with the modifications of the settings, they all come out positive (70/71 False positive) :/ has ratios sometimes > 0.01. If I increase -min_read_copies 10 I lose 10 ITDs among the positive ones :/. Can we play on other parameters in order to compensate for this false positive bp ?

Thank you very much for your help.

Augustin.

BoudSTER commented 3 years ago

Here is the file containing the few missing patients: https://drive.google.com/drive/folders/13D_0298Ufqx9OXYD7yaH7rUguYLuYB2Z?usp=sharing

tjblaette commented 3 years ago

Hi Augustin,

that is indeed a problem. What samples are you using as your negative controls? Are these healthy volunteers or FA-negative AML patients (i.e. could some of them at least potentially harbor a low allelic ITD?)?

getITD is quite sensitive to low sequencing quality when it comes to false positives, especially with some of the non-default parameters we are using for your data. As mentioned previously, what we have seen in some of our lower quality samples is many different trailing ITDs with a 6-7bp insert supported only by {-1} reads. Is this what you see? If it is, it might be worth filtering for longer insert sequences (min 9bp). This cannot be done automatically with getITD yet, but I can add the parameter if needed. The other thing that might help, which I have been meaning to implement, is more flexible read trimming: Instead of only taking out terminal Ns, I was thinking about trimming terminal bases below a certain base quality score. Have you tried setting the overall BQS filter back to 30? Like I said earlier, I wouldn't generally recommend lowering that cutoff. If that doesn't help and you can share some of the false positive controls, I can take a look myself as well.

The other possibility is multiplexing contamination. Were the control samples sequenced together with ITD-positive samples? If so, were these multiplexed using single or double unique indices/barcodes? Illumina reports up to 0.2% of incorrectly demultiplexed reads per sample for single-barcoded samples (see white paper here), which is why we use double-unique indices for our assay AND run negative control and post-treatment patient samples with low or no ITD levels present always separate from diagnostic/relapse samples. If the ITDs you are seeing in your control samples are caused by multiplexing issues, you should see the same ITD in your negative controls AND an actual patient that was sequenced on the same run.

Also, are you seeing likely false positives in your actual patients? If it's a general problem that is causing this many false positives, I would assume you see them in all samples. Is there a difference in coverage or sequencing quality between your patients and your controls?

I hope this helps, Tamara

BoudSTER commented 3 years ago

Hi Tamara,

that is indeed a problem. What samples are you using as your negative controls? Are these healthy volunteers or FA-negative AML patients (i.e. could some of them at least potentially harbor a low allelic ITD?)? My samples are FA-negative AMLs. I also thought it was also low ratio (FA limit of quantification 1%) but even with an after filter >0.01 70 patients are positive. These are insertions but they are very often multiples of 3 and according to the literature a multiple insertion of 3 has the same clinical consequence as a duplication (gain of function for the protein with constitutional activation of the receptor).
getITD is quite sensitive to low sequencing quality when it comes to false positives, especially with some of the non-default parameters we are using for your data. As mentioned previously, what we have seen in some of our lower quality samples is many different trailing ITDs with a 6-7bp insert supported only by {-1} reads. Is this what you see? If it is, it might be worth filtering for longer insert sequences (min 9bp). This cannot be done automatically with getITD yet, but I can add the parameter if needed. The other thing that might help, which I have been meaning to implement, is more flexible read trimming: Instead of only taking out terminal Ns, I was thinking about trimming terminal bases below a certain base quality score. Have you tried setting the overall BQS filter back to 30? Like I said earlier, I wouldn't generally recommend lowering that cutoff. If that doesn't help and you can share some of the false positive controls, I can take a look myself as well. Not really these are not small ITDs lengths the range goes from [12bp -63bp]. No I'll try to set the BQS back to 30 ;) I'll raise it again and get back to you. I'll add some fastqs with a high ratio to our file on the drive. I also added my results table of my 71 negative patients fragments.
The other possibility is multiplexing contamination. Were the control samples sequenced together with ITD-positive samples? If so, were these multiplexed using single or double unique indices/barcodes? Illumina reports up to 0.2% of incorrectly demultiplexed reads per sample for single-barcoded samples (see white paper here), which is why we use double-unique indices for our assay AND run negative control and post-treatment patient samples with low or no ITD levels present always separate from diagnostic/relapse samples. If the ITDs you are seeing in your control samples are caused by multiplexing issues, you should see the same ITD in your negative controls AND an actual patient that was sequenced on the same run. Yes, these are routine runs at the hospital so it is possible that these neg patients may have passed with positive patients. I don't think it's a multiplexing problem. In fact, we passed a series of other algorithms that do not output these false positives :/ has ratios > 0.01. I'm currently checking the runs to see if some patients are close to other positives and I'll get back to you ;).
Also, are you seeing likely false positives in your actual patients? If it's a general problem that is causing this many false positives, I would assume you see them in all samples. Is there a difference in coverage or sequencing quality between your patients and your controls? Yes there are also many more ITDs among my positive patients but since the patients were positive I didn't worry but for the AF Negative it's problematic.

I will try this new order on my 2 pools of patients after your precious advice: """ python3 /opt/getitd-master/getitd.py -nkern ${task.thread} -require_indel_free_primers False -min_read_copies 10 -min_read_length 60 -min_bqs 30 -infer_sense_from_alignment True $sample_name ${sample_name}_R1.fastq ${sample_name}_R2.fastq """

Thanks for everything :)

Have a nice day,

Augustin

tjblaette commented 3 years ago

Hi Augustin,

the VAF reported by getITD is already in percent. Thus, if the data was sequenced using single unique sample barcodes for multiplexing, you should use -filter_ins_vaf 0.2. If you wanted to filter for ITDs with a minimum VAF of 1% to match the sensitivity of FA, you would have to use -filter_ins_vaf 1.

Also note that getITD does not report inserts whose length is not a multiple of 3, unless they are "trailing" and presumably not sequenced fully. Thus, most of the reported insertions will be in-frame simply by design.

As for the false positives: I don't think I have yet seen a false positive non-trailing ITD reported by getITD. It is the trailing ITDs which are potentially less trustworthy, especially when the insert (column 9 of the output tsv files) is very short (< 9 bp). We did see in our samples also many more ITDs than were reported by FA previously. However, we believe that these are mostly true positives because i) we found the same additional ITDs in technical replicates of the same sample which were independently prepped and sequenced, ii) we found the same additional ITDs in bone marrow and peripheral blood samples in patients where both were independently prepped and sequenced and iii) we found the same additional ITDs in post-treatment follow-up samples which were prepped and sequenced independently of the original diagnostic sample. I would honestly not be surprised if low-allelic ITD clones could be found in other AML patients as well. So if the additional clones you see are non-trailing, I would assume they are mostly fine. If you see mostly trailing ITDs, especially with short insert sequences (column 9) and reverse read support only (column 10), that would be worrysome.

I'm not sure -min_read_copies 10 is the correct parameter. min_read_copies 10 will filter out all reads that do not occur at least 10 times in each sample. Our default is 2, which means we discard unique reads. Since SureSelect is not based on a few specific amplicons, I would rather set this to 1. 10 seems quite stringent. If you wanted to filter for a minimum number of total reads supporting a certain insertion, you can use -filter_ins_total_reads. That's a good idea actually. Try setting -filter_ins_total_reads 4 to mimic getITD's default of requiring two distinct reads with at least two copies each. You can also set it to a higher number, but filtering for VAF is probably better, as it is less sensitive to variations in coverage.

Thanks for adding the files, I will have a look! Tamara

BoudSTER commented 3 years ago

Hi Tamara,

For the false positives I understand well for the low ratios I applied a filter to eliminate them after the analysis. Nevertheless for an ITD of length 45 with a VAF at 30% M/WT at 0.4 should be detected in GS.

You are right for min_read_copies 1. I will stay on VAF filtering then.

I'm also looking on my side ;)

Thanks for your help.

Have a nice day,

Augustin

tjblaette commented 3 years ago

Hi Augustin,

I have run through the samples you sent me and checked the ITDs that getITD reports.

My command was for file in L*R1*; do OUT=$(echo $file | cut -f1 -d'_'); echo $OUT; python getitd/getitd.py -plot_coverage true -min_read_copies 1 -min_read_length 60 -nkern 10 -require_indel_free_primers false -infer_sense_from_alignment true $OUT ${OUT}*fastq; done. (I forgot to add -min_ins_vaf 0.2)

I first noticed that a few times there are two entries describing the same ITD, one as trailing, one as not trailing. This is because getITD currently does not merge trailing with non-trailing ITDs. (I was afraid small inserts would spuriously be merged into larger ones, and in our assay this is not a problem because ITDs are very rarely discovered as both trailing and non-trailing. In your assay, it occurs very often. I will have to see if there is a good way to deal with these double-entries.)

Looking at the specific ITDs themselves, the non-trailing ones look perfectly valid to me. I manually checked some of the supporting alignments and they are definitely ITDs. See attached examples for 27bp 16% VAF and 36bp 3% VAF ITDs in L1702796D01 as well as the 15bp 11% VAF ITD in L1805186D05. I don't know why GS did not detect these, but based on those reads, getITD was right in calling them. The 36bp ITD was also called as a trailing variant, which is also fine.

The two trailing ITDs in L1505897D11 (104bp 5% VAF) and L1906424D03 (97bp 0.5% VAF) look less convincing and could very well be false positives. For getITD they are edge cases, called because they could be true ITDs with sequencing errors and additional insert bases between the two tandems, or spurious alignments (which in these cases is probably more likely though). Possibly extending the reference sequence further into the intron might help. Have you tried this? (I can prepare a longer reference if you need help with that!)

All the best, Tamara

L1702796D01_27bp-non-trailing: augustin_L1702796D01_27bp-TP L1702796D01_36bp-non-trailing: augustin_L1702796D01_36bp-TP L1805186D05_15bp-non-trailing: augustin_L1805186D05_15bp-TP

L1906424D03_97bp-trailing: augustin_L1906424D03_97bp-FP L1505897D11_104bp-trailing: augustin_L1505897D11_104bp-FP

BoudSTER commented 3 years ago

Hi Tamara,

Thank you for these explanations for the positives. It is the most negative patients in FA that worries me :/. I have put inside our file a "p_neg" folder including some fastq of FPs as well as the summary table for all the negative patients.

I will try with a longer reference and come back to you.

Thank you for your help,

Have a nice day,

Augustin.

tjblaette commented 3 years ago

Hi Augustin,

I have run through the negative samples you have provided but for me there are no high-confidence ITDs reported, so getITD does seem to handle these perfectly fine. There is no need to worry about the insertions in ins*hc*. If you look at the coordinates, they all start at the very beginning (start=1) or end (start=329) of the amplicon and will be "overhanging" reads that go beyond the reference amplicon. These will always be there for SureSelect data, where reads are not limited to our small target region.

The command I used is for file in P*R1*; do OUT=$(echo $file | cut -f1 -d'_'); echo $OUT; python ../../getitd/getitd.py -plot_coverage true -min_read_copies 1 -min_read_length 60 -nkern 10 -require_indel_free_primers false -infer_sense_from_alignment true -filter_ins_vaf 0.2 $OUT ${OUT}*fastq; done

sample  length  start   vaf ar  coverage    counts  trailing    seq sense
P22-L1810262D11 6   329 0.36998 0.0037135   1892    7   False   GTACAG  {-1}
P22-L1810262D11 21  1   0.89552 0.0090362   335 3   False   GAGACTTGTGATCAAGAGACA   {1}
P22-L1810262D11 27  1   1.4925  0.015151    335 5   False   TAACTGACTCATCATTTCATCTCTGAA {1}
P22-L1810262D11 30  1   0.59701 0.0060060   335 2   False   TCCTAACTGACTCATCATTTCATCTCTGAA  {1}
P22-L1810262D11 33  1   1.4925  0.015151    335 5   False   TATTCCTAACTGACTCATCATTTCATCTCTGAA   {1}
P22-L1810262D11 45  329 60.840  1.5536  2620    1594    False   GTACAGTATAGTGGAAGGACAGCAACAAAGATGCCCAAAAATGGG   {1, -1}
P22-L1810262D11 49  1   65.548  1.9026  894 586 True    TATCTGCAGAACTGCCTATTCCTAACTGACTCATCATTTCATCTCTGAA   {1, -1}
P22-L1810262D11 54  1   2.3881  0.024465    335 8   False   TCCTCTATCTGCAGAACTGCCTATTCCTAACTGACTCATCATTTCATCTCTGAA  {1}
P22-L1810262D11 63  329 0.47569 0.0047797   1892    9   False   GTACAGTATAGTGGAAGGACAGCAACAAAGATGCCCAAAAATGGGAGGCACAGTTTCCCACCC {-1}

sample  length  start   vaf ar  coverage    counts  trailing    seq sense
P23-L1507694D15 6   329 0.25088 0.0025151   1993    5   False   GTACAG  {-1}
P23-L1507694D15 30  329 0.65228 0.0065656   1993    13  False   GTACAGTATAGTGGAAGGACAGCAACAAAG  {-1}
P23-L1507694D15 45  329 55.916  1.2684  2679    1498    False   GTACAGTATAGTGGAAGGACAGCAACAAAGATGCCCAAAAATGGG   {1, -1}
P23-L1507694D15 49  1   59.821  1.4889  1003    600 True    TATCTGCAGAACTGCCTATTCCTAACTGACTCATCATTTCATCTCTGAA   {1, -1}
P23-L1507694D15 54  1   3.0445  0.031401    427 13  False   TCCTCTATCTGCAGAACTGCCTATTCCTAACTGACTCATCATTTCATCTCTGAA  {1}
P23-L1507694D15 54  329 0.60211 0.0060576   1993    12  False   GTACAGTATAGTGGAAGGACAGCAACAAAGATGCACAAAAATGGGAGGCACAGT  {-1}

sample  length  start   vaf ar  coverage    counts  trailing    seq sense
P85-L1808763D05 14  1   75.258  3.0417  194 146 True    ATTTCATCTCTGAA  {1, -1}
P85-L1808763D05 14  329 0.41347 0.0041518   1693    7   True    GTCCTTTTCTCTGC  {1}
P85-L1808763D05 33  329 0.70886 0.0071392   1975    14  False   GTACAGTATAGTGGAAGGACAGCAACAAAGATG   {-1}
P85-L1808763D05 42  1   1.4925  0.015151    134 2   False   AGAACTGCCTATTCCTAACTGACTCATCATTTCATCTCTGAA  {1}
P85-L1808763D05 45  329 80.221  4.0559  2447    1963    False   GTACAGTATAGTGGAAGGACAGCAACAAAGATGCCCAAAAATGGG   {1, -1}

sample  length  start   vaf ar  coverage    counts  trailing    seq sense
P91-L1708688D04 12  329 0.26017 0.0026085   4228    11  False   GTACAGTATAGT    {-1}
P91-L1708688D04 14  1   55.885  1.2668  2328    1301    True    ATTTCATCTCTGAA  {1, -1}
P91-L1708688D04 21  1   0.50505 0.0050761   990 5   False   GAGACTTGTGATCAAGAGACA   {1}
P91-L1708688D04 21  1   3.0303  0.031250    990 30  False   ACTCATCATTTCATCTCTGAA   {1}
P91-L1708688D04 30  329 0.82781 0.0083472   4228    35  False   GTACAGTATAGTGGAAGGACAGCAACAAAG  {-1}
P91-L1708688D04 33  329 0.47304 0.0047529   4228    20  False   GTACAGTATAGTGGAAGGACAGCAACAAAGATG   {-1}
P91-L1708688D04 36  1   0.50505 0.0050761   990 5   False   GCCTATTCCTAACTGACTCATCATTTCATCTCTGAA    {1}
P91-L1708688D04 45  329 53.205  1.1370  5445    2897    False   GTACAGTATAGTGGAAGGACAGCAACAAAGATGCCCAAAAATGGG   {1, -1}
P91-L1708688D04 54  1   1.3131  0.013306    990 13  False   TCCTCTATCTGCAGAACTGCCTATTCCTAACTGACTCATCATTTCATCTCTGAA  {1}

I have also just pushed another update to getITD, which fixes a small bug in VAF calculation. Be sure to check it out and let me know if you have any other questions or problems that I can help you with!

All the best, Tamara

BoudSTER commented 3 years ago

Thank you for everything Tamara!

I'm sorry I was in meetings all day Friday, I'm watching the VAF correction on Monday!

Augustin.

BoudSTER commented 3 years ago

Hi tamara! It's me again ^^

I've run patients from a new cohort and I have a new error that never appeared before : """ python3 /opt/getitd-develop/getitd.py -nkern 12 -require_indel_free_primers False -min_read_copies 1 -min_read_length 60 -min_bqs 20 -infer_sense_from_alignment True P16-L1305872D01_S16 P16-L1305872D01_S16_R1.fastq P16-L1305872D01_S16_R2.fastq

Command exit status:
1

Command output:
getITD requires the WT tandem to be fully sequenced, but this ITD's WT tandem was not covered completely.
The ITD was therefore considered a likely false positive and not called.
Report this warning if you feel a true positive was missed, otherwise safely ignore.

getITD requires the WT tandem to be fully sequenced, but this ITD's WT tandem was not covered completely.
The ITD was therefore considered a likely false positive and not called.
Report this warning if you feel a true positive was missed, otherwise safely ignore.

Collecting ITDs took 16.399568350985646 s
367 ITDs were found

-- Merging insertions --
422 insertions remain after merging
215 insertions remain after merging
197 insertions remain after merging
32 insertions remain after merging

-- Merging itds --
54 itds remain after merging

Command error:
Traceback (most recent call last):
File "/opt/getitd-master/getitd.py", line 2193, in
main(config)
File "/opt/getitd-master/getitd.py", line 2170, in main
merged_ins_anditds[type] = get_mergedinserts(inserts, type_, iref_coverage, config)
File "/opt/getitd-master/getitd.py", line 1861, in get_merged_inserts
save_to_file([insert.rep for insert in tomerge], type + "_collapsed-" + suffix + ".tsv", config)
File "/opt/getitd-master/getitd.py", line 1594, in save_to_file
to_save = [insert.prep_for_save(config) for insert in inserts]
File "/opt/getitd-master/getitd.py", line 1594, in
to_save = [insert.prep_for_save(config) for insert in inserts]
File "/opt/getitd-master/getitd.py", line 1087, in prep_for_save
assert to_save.end <= len(config["REF"]) AssertionError """ My coverage for my patients seems fine ?

Thanks for your help. Sorry to come back with problems :(

Have a nice day,

Augustin

tjblaette commented 3 years ago

Hi Augustin,

Can you share that specific sample? I'll look into it then!

All the best, Tamara

BoudSTER commented 3 years ago

Of courses :) Many many thanks

BoudSTER commented 3 years ago

I have just added it to our folder "assert_error"

tjblaette commented 3 years ago

Hi Augustin,

I hope I've got it fixed. I'm running some tests overnight and if it looks ok, I will push to develop tomorrow. And sorry for the back and forth! It seems SureSelect is a bit more tricky than I had anticipated. I'm sure we'll get it sorted though!

All the best, Tamara

BoudSTER commented 3 years ago

Hi Tamara,

No worries :)

Thanks to you for helping me! I look forward to your update.

Have a nice day,

Augustin.

tjblaette commented 3 years ago

Hi Augustin,

I have just pushed v1.5.10 to develop. It's only run back through some of our samples, but it's looking good so far. Please let me know how it goes and whether you encounter any problems with it!

All the best, Tamara

BoudSTER commented 3 years ago

Hi Tamara,

Yeah :) I can't wait to try it tomorrow! Thanks for your help.

good night,

Augustin.

BoudSTER commented 3 years ago

Hi Tamara,

Well done :) the first part of my new cohort seems to be working. I'll have the rest of the results within the next week!

Thank you very much!

Augustin.

tjblaette commented 3 years ago

Hi Augustin,

that is wonderful news! I was a bit scared at first when I saw your message. :D Please let me know how it goes, I'll keep my fingers crossed!

Have a good weekend, Tamara

tjblaette / getitd

Unexpected results #9