Can getITD be used to sequence data from targeted panel based on hybrid capture technology

tjblaette / getitd

Software for FLT3-ITD detection and MRD monitoring assay

MIT License

13 stars 4 forks source link

Can getITD be used to sequence data from targeted panel based on hybrid capture technology #5

Closed ydy1127 closed 4 years ago

ydy1127 commented 4 years ago

Dear Author:

Can getITD be used to sequence data from targeted panel based on hybrid capture technology?

I see information about amplicons and primers used in your program.

Looking forward to your reply!

tjblaette commented 4 years ago

Hi @ydy1127,

getITD can analyze panel sequencing data, yes! Our own assay was initially developed to detect FLT3-ITDs only, so we directly sequnced FLT3 amplicons. There is a lot of panel data out there though that also covers FLT3, so yes, it definitely makes sense to try getITD. I am happy to help with setting the right parameters and such, so if you need any help, just let me know!

ydy1127 commented 4 years ago

Dear Author：

Thank you for your reply!

The data I used was paired-end sequencing data from panel sequencing of AML patients based on hybrid capture technology, and I wanted to identify flt3-itd.The data was sequenced using Nextseq H and reads length was 150bp.

I tried to run getitd.py with the default parameter, but the following error message was prompted:

`-- Reading FASTQ files -- Reading FASTQ files took 374.1463442109525 s Number of total reads: 27837660 Exception in thread Thread-4: Traceback (most recent call last): File "/home/yuandy/localpython/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/home/yuandy/localpython/lib/python3.8/threading.py", line 870, in run self._target(*self._args, **self._kwargs) File "/home/yuandy/localpython/lib/python3.8/multiprocessing/pool.py", line 513, in _handle_workers cls._maintain_pool(ctx, Process, processes, pool, inqueue, File "/home/yuandy/localpython/lib/python3.8/multiprocessing/pool.py", line 337, in _maintain_pool Pool._repopulate_pool_static(ctx, Process, processes, pool, File "/home/yuandy/localpython/lib/python3.8/multiprocessing/pool.py", line 326, in _repopulate_pool_static w.start() File "/home/yuandy/localpython/lib/python3.8/multiprocessing/process.py", line 121, in start self._popen = self._Popen(self) File "/home/yuandy/localpython/lib/python3.8/multiprocessing/context.py", line 276, in _Popen return Popen(process_obj) File "/home/yuandy/localpython/lib/python3.8/multiprocessing/popen_fork.py", line 19, in init self._launch(process_obj) File "/home/yuandy/localpython/lib/python3.8/multiprocessing/popen_fork.py", line 70, in _launch self.pid = os.fork()

OSError: [Errno 12] Cannot allocate memory ` Looking forward to your reply!

tjblaette commented 4 years ago

Hi @ydy1127,

getITD is trying to allocate more memory (RAM) than your computer seems to have. What operating system are you using? How much memory do you have available on that machine? How much is getITD trying to allocate? If it fails at 4 Gb, it may be a problem with a 32 bit operating system or python. Otherwise your input FASTQ file might be too large, with 27 million reads. As getITD was built around our targeted assay, with 2-4 million reads per sample, memory demands weren't such an issue then. You can reduce file size by removing reads that map outside of FLT3, using a standard aligner such as BWA. When you do this, be sure to keep both reads that do map to FLT3 using this other aligner as well as those that remain unmapped. You probably already have a BAM file generated for general analysis of the other targets, so can filter respective reads from there and save them to a smaller FASTQ file to pass to getITD.

Let me know how it goes and whether there's anything I can do to help!

ydy1127 commented 4 years ago

Dear Author:

Thank you for your reply!

The system I'm using is Ubuntu 16.04.4 LTS, 64-bit. The system has 69G of available memory.I don't know how much memory getitd is trying to allocate. According to your explanation, I think maybe my FASTQ file is too big. Do you have any other suggestions？

In addition, I see a lot of information about primers in your parameters. With the data I used based on the capture of the hybridization, does it matter that these parameters take the default values?

Looking foward to your reply!

tjblaette commented 4 years ago

Hi @ydy1127,

If you watch the htop or top command while running getITD, you can judge memory usage. It definitely shouldn't use 69 Gb though, so very likely your FASTQ files are simply too big.

This is how you can make them smaller: Align them to the full reference genome using another aligner such as BWA-MEM. From the resulting alignment file you can extract reads that a) remained unmapped and b) mapped to FLT3, using for example SAMtools. And those reads can be saved as another, smaller FASTQ file that you can then input to getITD.

So, assuming your FASTQ files are named test_R1.fastq and test_R2.fastq, and you have BWA and SAMtools in your $PATH, you could do something like this:

bwa mem -M -t $CORES_TO_USE  $HG19_REF.FASTA   test_R1.fastq test_R2.fastq | picard SortSam INPUT=/dev/stdin OUTPUT=test.bam CREATE_INDEX=true SORT_ORDER=coordinate 

samtools view -f4 test.bam  > test_reads-of-interest.sam
samtools view test.bam chr13:28577411-28674729  >> test_reads-of-interest.sam

grep -Fw -A3 --no-group-separator -f <(cut -f1 test_reads-of-interest.sam | sort | uniq) test_R1.fastq > test_reads-of-interest_R1.fastq
grep -Fw -A3 --no-group-separator -f <(cut -f1 test_reads-of-interest.sam | sort | uniq) test_R2.fastq > test_reads-of-interest_R2.fastq

The resulting test_reads-of-interest_R?.fastq should be much smaller than the original test_R?.fastq. In this example code, I set the target region to chr13:28577411-28674729, which is the full length FLT3 gene in hg19. Change those coordinates to whatever makes sense with your panel (and reference genome, if you're not using hg19).

And don't worry about the primer parameters: These are optional filters but not necessarily required. Simply unset them so that getITD doesn't discard all of your reads by setting python3 getitd.py -require_indel_free_primers True.

Let me know if this helps, and above all stay well!

ydy1127 commented 4 years ago

Dear Author:

Thank you for your reply！

The solution you provided is so effective that the program can run successfully. But there are still two problems that confuse me：

1，When I run in the directory I created, I get the following error：

@bioubuntu:/work1/workspace_yuandanyang/ITD_Detection_from_ChosenMed/getITD_result$ time python3 /home/yuandy/biosoft/getitd-master/getitd.py HD15QZAA00570 HD15QZAA00570_reads-of-interest_R1.fastq HD15QZAA00570_reads-of-interest_R2.fastq 

Annotation file was not provided or cannot be accessed!
Traceback (most recent call last):
  File "/home/yuandy/biosoft/getitd-master/getitd.py", line 2050, in <module>
    main(config)
  File "/home/yuandy/biosoft/getitd-master/getitd.py", line 1863, in main
    config["DOMAINS"] = get_domains(config["ANNO"])
  File "/home/yuandy/biosoft/getitd-master/getitd.py", line 1423, in get_domains
    for i,row in anno.iterrows():
AttributeError: 'NoneType' object has no attribute 'iterrows'_

I consider that it is the program that cannot find the anno file.This is because the default value of the parameter -reference you provided is ./anno/amplicon.txt, and the default value of the parameter -anno is _./anno/ampliconkayser.tsv.

But these two parameters are optional parameters, I don't know why there is such a problem.The meaning of -reference parameter should be reference genome. Could I use the FLT3 interval of human reference genome (hg19) as the input of this parameter?

Run the above command in the getitd installation directory.The program is running normally. The results are as follows（itds_collapsed-is-same_is-similar_is-close_is-same_trailing.tsv）:
```
sample  length  start   vaf     ar      coverage        counts  trailing        seq     sense   external_bp ...
HD15QZAA00570   50      84      37.662  0.60416 77      29      True    AATATGAATATGATC {1}     0   ... 
```
If I want to get the exact insertion location, length, and mutation frequency of the ITD, which columns in the resulting file should I care about.

Looking forward to your reply! Thank you very much!

tjblaette commented 4 years ago

Hi @ydy1127,

I'm happy to hear you were able to run getITD successfully with the smaller FASTQ files!

As for your questions:

"Optional" parameters are optional because they have a default value. So the program will try to run without you explicitly setting these, but you as a user still have to ensure that their values make sense. If the default is not suitable, you have to change it. getITD cannot know where your annotation files are. It will try the default values, but if those don't exist, it will fail - that's what happens in your example above, where you try calling it from your custom directory. There's two things you can do: Either you pass the correct path for the annotation and reference files as parameters to getITD or you copy the annotation directory from the getITD installation folder to wherever you want to use it.
The reference parameter indeed refers to the reference sequence that you want your reads to be compared to. This should not be the entire genome though, but rather the gene- and assay- specific sequence that your reads come from. You cannot pass the genomic coordinates directly here. Instead, you must pass the path to a file containing the actual reference sequence. Look at the example file supplied with getITD to understand what it should look like. To start out, you can probably use the exact files that come with getITD but depending on your panel's design, you may want to extend them. You won't be able to detect ITDs outside of the reference sequence. When you do change it, just be sure to change the annotation file as well!
The columns of interest will be length for the ITD length, vaf for its Variant Allele Frequency and start_chr13_bp / end_chr13_bp for the genomic coordinates within chr13.

As always, hope this helps!

ydy1127 commented 4 years ago

Dear Author:

Hi! Thank you very much for your reply! I have some questions about your reply:

1, I think I can use your default file amplication.txt to complete the detection of ITD. Could you please tell me the position interval on the reference genome of the sequence information in amplication.txt ?

2, In the result file that I get, the length is 50, the start_chr13_bp is 28608269 ,theend_chr13_bp is 28608220. I can't understand why the value of the start_chr13_bp is greater than the value of end_chr13_bp.

3, In addition, there is another question about ITD that I would like to discuss with you. We all know that ITD repeats a sequence on the reference genome. Is the ITD location reported in this method the original location of the repeating reference genome sequence, or the location where the repeating sequence was start inserted?

Looking forward to your reply! Thank you very much!

tjblaette commented 4 years ago

Hi @ydy1127,

and no problem :)

The coordinates are annotated in the -anno file, have a look inside. It covers exons 14 and 15 in hg19, chr13:28608352:28608023.
The FLT3 gene is encoded by the reverse strand of the DNA, whereas genomic coordinates are based on the forward strand. That's why it looks like the start comes after the end.
getITD assumes that the first tandem is the WT sequence and the insert is inserted after it. The reported genomic coordinates, however, refer to the WT tandem, as the insert has no coordinates in the reference genome. (By definition, the reference sequence and coordinates refer to WT only.)

All the best!

ydy1127 commented 4 years ago

Dear Author:

Hi! Thank you very much! Your reply is very helpful to me. If I have any questions later, I will consult you. Thanks!

Best wishes

tjblaette commented 4 years ago

I'm glad to hear that, you are very welcome and yes, please do!