stat-lab / MOPline

Detection and genotyping of structural variants
MIT License

Is it viable to utilize only three out of the seven tools provided in the MOPline-7tools suite for my analysis? #16

Open jingydz opened 4 months ago

jingydz commented 4 months ago

Hello, when selecting the input callers for MOPline, I only had results from Manta, CNVnator, and Wham. Given that MOPline requires input from seven tools, I used only these three and, for various reasons, provided empty files for the remaining four. I am wondering whether this is still acceptable?

stat-lab commented 4 months ago

Unfortunately, using only 3 tools is not acceptable for MOPline. If you can add results from inGAP-sv, you can run MOPline with merge_SV_vcf.4tools.pl, which has now been uploaded. However, MOPline with 4 tools will yield a lower number of SV calls than with 7 tools.

jingydz commented 4 months ago

Although MOPline does not provide an option for using only three tools, I provided empty files for the other tools and it seems to accept them. The program runs without any warnings or errors, and the results can still distinguish between different continents. Is it still necessary to include inGAP-sv or the other tools? inGAP-sv only accepts SAM files, which require a significant amount of storage, and the computation is also very time-consuming.

stat-lab commented 4 months ago

As you say, MOPline can take empty VCF files for several tools and complete the run without errors. However, in that case the number of SV calls is significantly reduced. For example, MOPline-7t (using all 7 tools) selects INS overlap calls from tool pairs (or high-confidence calls from a single tool) as follows: ins_set => Manta:3=inGAP:10 Wham:3=inGAP:10 Manta:3=Wham:3 inGAP:18 Manta:28 MELT:4. When only the call sets from CNVnator, Manta, and Wham are used, the ins_set is restricted to Manta:3=Wham:3 Manta:28 only. An inGAP-sv run takes less than half a day, and the intermediate per-chromosome sam files can be removed after the run (the run_inGAP.pl script removes all sam files after the run). I recommend running inGAP-sv.

jingydz commented 4 months ago

I compared against the public SV calls from your 414 1KGP individuals (1KG-SV.MOPline.vcf.gz), which contain 95,882 SVs. If I use only these three tools and run the entire MOPline pipeline, including the filtering steps, on the same 414 samples, the number of SVs detected is 53,728, which is indeed smaller.

Yes, you are correct. If three tools are used, the number of detected SVs is reduced, but the corresponding false positives are likely to decrease as well. I understand that the more tools are merged, the larger the number of SV calls is likely to be, but I also need to consider computation time and storage.

As far as I know, among SV callers for second-generation (short-read) sequencing data, the overall reliability of inGAP-sv does not seem very high, and Manta performs the best. inGAP-sv only ranks highly for insertion (INS) detection and ranks lower for the other SV types. Moreover, I am more interested in duplications and deletions. For insertions, we already have results from others who used MELT, so to avoid conflicts I did not consider integrating MELT's results. Furthermore, in your earlier software evaluation paper, MELT's insertion calls appear to be more accurate than those of inGAP-sv.

Additionally, there were several reasons why I abandoned the inGAP-sv results you suggested adding. The first is the requirement for SAM files, which consume a significant amount of storage. The second is that our SAM files may be too large (~300 GB), potentially leading to run failures. Furthermore, inGAP-sv was released in 2011 and has not been updated since; it appears to be no longer maintained, and the errors I encountered remain unresolved. Consequently, I ultimately decided to abandon this tool.

```
inGAP_3_1_1]$ java -jar ./inGAP.jar SVP
Program : inGAP
Contact : Ji Qi [jxq11@psu.edu]

Usage : SVP
----------
-i      input aln file (SAM)
-r      Reference file (FASTA)
-o      output SV file
-QUL    minimal SV quality               ,default: 10
-SE     minimal support PE read num      ,default: 4
-PE     minimal support SE read num      ,default: 4
-CL     minimal support DIFF_CH read num ,default: 4
-SIZE   maximal SV size                  ,default: 100000
-XDEV   time of standard deviation       ,default: 3
```

Running:

```
inGAP_3_1_1]$ java -jar ./inGAP.jar SVP -i /xxx/compare/HG002/HG002_bamfile/PALMER_anno/L1/chr1_1_1000000/region.sam -r /wgs/anno/hg38/v0/Homo_sapiens_assembly38.fasta -o HG002.ingap.txt
```

Result:

```
-rw-r--r-- 1 zhangjj hesm 347M Nov 22 23:09 /xxx/compare/HG002/HG002_bamfile/PALMER_anno/L1/chr1_1_1000000/region.sam
-rw-r--r-- 1 zhangjj hesm 107M Jan  3 10:38 HG002.ingap.txt
```

```
inGAP_3_1_1]$ head -n 6 HG002.ingap.txt
type    size    quality chr     start   end     q_chr   q_start q_end
unknown 356     0       chr1    11064   11419   chr1    11064   11419   homo=true r_cover=71.60;n_cover=63.31;b_cover=808.00 pair_1 type=0 chr=chr1 breaks=10642..181175 centers=10475..181429 read_peaks=10420..10750,181190..181610 num=16 save=true pair_2 type=0 chr=chr1 breaks=181600..182096 centers=181346..182263 read_peaks=181190..181590,182120..182360 num=6 save=true pair_3 type=0 chr=chr1 breaks=11064..11419 centers=10718..11629 read_peaks=10420..10750,11270..11660 num=4 save=true diff_1 type=-1 chr=chr1 breaks=10469..10601 centers=10450..10720 read_peaks=0..0,0..0 num=68 save=true
```

As you can see, when the SAM file is small, everything works. But when I tried my real data on one sample (the one with the smallest BAM file), it didn't work.

fq:

```
-r-xr-xr-x 1 20G Mar 10 2023 /xxx/sample_L1L3_124124.R1.fastq.gz
-r-xr-xr-x 1 21G Mar 10 2023 /xxx/sample_L1L3_124124.R2.fastq.gz
```

bam:

```
-rw-r--r-- 1 44G Jun 30 2020 /xxx/sample_L1L3_124124.Mdup.recal.bam
```

Running:

```
samtools view -h /xxx/sample_L1L3_124124.Mdup.recal.bam > /xxx/sample.sam
```

Result:

```
-rw-r--r-- 1 zhangjj hesm 283G Jan 10 19:58 /xxx/sample.sam
```

Running:

```
java -jar /xxx/zhangjj/software/inGAP_3_1_1/inGAP.jar SVP -i /xxx/sample.sam -r /wgs/anno/hg38/v0/Homo_sapiens_assembly38.fasta -o /xxx/sample.ingap.txt
```

Error:

```
counting [66%], mem=7309M
counting [67%], mem=5879M
counting [68%], mem=9066M
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 22286674
        at inGAP.RecombinationFinder.RecombinationFinder.countCoverage(RecombinationFinder.java:337)
        at inGAP.RecombinationFinder.RecombinationFinder.execute(RecombinationFinder.java:81)
        at inGAP.Main.Main.startSVPrediction(Main.java:484)
        at inGAP.Main.Main.main(Main.java:180)
```

Then I tried to set the Java memory size. Running:

```
java -mx2000m -jar /parastor300/work01/zhangjj/software/inGAP_3_1_1/inGAP.jar SVP -i /xxx/sample.sam -r /wgs/anno/hg38/v0/Homo_sapiens_assembly38.fasta -o /xxx/sample.ingap.txt
```

And it still didn't work. Error:

```
startSVPrediction
SAM file : /xxx/sample.sam
Out file : /xxx/sample.ingap.txt
SV qual: 10
Get reads length: /wgs/anno/hg38/v0/Homo_sapiens_assembly38.fasta
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
        at inGAP.RecombinationFinder.ContigCoverage.<init>(ContigCoverage.java:58)
        at inGAP.RecombinationFinder.RecombinationFinder.execute(RecombinationFinder.java:77)
        at inGAP.Main.Main.startSVPrediction(Main.java:484)
        at inGAP.Main.Main.main(Main.java:180)
```

And this sample is already the smallest among the BAM files I have (44 GB); many others exceed 100 GB, so the decompressed SAM files are expected to be even larger (exceeding 300 GB).

Have you encountered similar issues before? Considering this situation, and since MOPline accepts empty files, would it be possible for me to use the merged results from just the three tools?

stat-lab commented 4 months ago

inGAP-sv shows overall low precision but high recall. inGAP-sv calls with a high RSS (number of reads supporting an SV) and calls that overlap between inGAP-sv and another specific tool show high precision, so including inGAP-sv in MOPline does not increase the number of false positive calls. For the inGAP-sv run, the input bam should be split by chromosome and then converted to sam (see the section "Notes on SV calling and input bam"). If necessary, the converted sam for a chromosome may be deleted after each inGAP-sv run.
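For reference, a minimal sketch of that per-chromosome conversion with samtools (file names are placeholders, and the BAM must be coordinate-sorted and indexed):

```
# Sketch only: extract one chromosome from an indexed BAM and write it as SAM.
# A per-chromosome SAM is far smaller than the ~300 GB whole-genome SAM above.
samtools index sample.Mdup.recal.bam
samtools view -h -o sample.chr1.sam sample.Mdup.recal.bam chr1
```

Each chromosome's SAM can then be deleted as soon as its inGAP-sv run finishes, so only one chromosome's SAM needs to exist at a time.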

jingydz commented 4 months ago

Thank you for your quick reply. I will run further tests on my machine to measure the runtime and storage needed for splitting by chromosome, converting to SAM, and calling with inGAP-sv.

Additionally, I am curious whether the merged results from three tools could be worse than the results from a single tool. I understand that the downside is fewer detected sites compared with seven tools, but are there any other disadvantages I might have overlooked?

Furthermore, what are the differences between the new merge_SV_vcf.4tools.pl and the previous 7tools?

jingydz commented 4 months ago

Does inGAP-sv not support multi-threading? After I have split the BAM file into individual chromosome SAM files, do I also need to split the FASTA file into individual chromosomes? Would this make the process faster or use fewer computational resources?

stat-lab commented 4 months ago

The number of SV calls of MOPline could be lower than that of another single tool but, in most cases, the precision of MOPline calls would be higher than that of a single tool. The precision of SV calls with MOPline-4t (4-tool version: merge_SV_vcf.4tools.pl) is similar to that of MOPline-7t (7-tool version), but the recall of MOPline-4t would be lower than that of MOPline-7t (see Supplemental Figures S3-S7 of our MOPline paper).

inGAP-sv does not support multi-threading, and it requires the reference fasta split into per-chromosome files as well as the per-chromosome sam files. A run of inGAP-sv on whole-genome data will fail: this is inherent to how inGAP-sv works.
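If it helps, here is a minimal sketch of a per-chromosome loop that pairs each split SAM with a matching single-chromosome FASTA extracted with samtools faidx. The chromosome list, heap size, paths, and output naming are assumptions rather than MOPline's own wrapper; the SVP flags are those shown in the usage output above.

```
#!/bin/bash
# Sketch only: run inGAP-sv chromosome by chromosome with split FASTA and SAM files.
set -euo pipefail

BAM=sample.Mdup.recal.bam
REF=/wgs/anno/hg38/v0/Homo_sapiens_assembly38.fasta   # must have a .fai index for faidx
INGAP=/xxx/zhangjj/software/inGAP_3_1_1/inGAP.jar

for CHR in chr{1..22} chrX chrY; do
    samtools faidx "$REF" "$CHR" > "$CHR".fa          # single-chromosome reference
    samtools view -h -o "$CHR".sam "$BAM" "$CHR"      # single-chromosome SAM
    java -Xmx16g -jar "$INGAP" SVP -i "$CHR".sam -r "$CHR".fa -o sample.ingap."$CHR".txt
    rm "$CHR".sam "$CHR".fa                           # free storage before the next chromosome
done
```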

jingydz commented 4 months ago

Thank you for your reply. Yesterday, I successfully split one individual's entire BAM file into per-chromosome SAM files. However, I provided the whole-genome fasta file (Homo_sapiens_assembly38.fasta), which the tool accepted and ran with successfully, but I found that it runs very slowly.

It took approximately 18 hours to complete the detection of a single chromosome (chr1).

086981D.marked.realigned.recal.bam (149G)

```
1171685.admin  zhangjj  Fat  086981D  25641  1  4  --  4320:00:0  R  23:55:10
-rw-r--r-- 1 zhangjj hesm 927M May 10 15:42 086981D.marked.realigned.recal.bam.chr1.sam.ingapsv.result
```

Next, I will use the split fasta files to check the running speed.

jingydz commented 4 months ago

Hi, I used the split fasta files and split sam files to run inGAP-sv.

This did indeed significantly improve computational speed. With the split reference fasta, each sample completes all 24 chromosomes in about 16 hours in total, whereas with the whole reference genome a single chromosome alone took up to 18 hours.

However, the two methods of providing fasta files yield different results. When I provide the complete fasta file, the resulting files are larger, ranging in size from 800MB to 3GB. When I provide split fasta files, the resulting files are smaller, ranging from 80KB to 500KB. It seems that the former contains more results. Should I choose the results obtained from providing split fasta files?

stat-lab commented 4 months ago

You should use the results from individual fasta files.

jingydz commented 3 months ago

Hi, in your article you mention that 'inGAP is highly sensitive to PCR-duplicates in input sam files. When a sam file containing 10% PCR duplicates is used, inGAP often generates more than twice as many calls compared with sam without PCR-duplicates.' In my cohort, a small portion of the samples did not use PCR-free library construction, which could lead to inGAP-sv over-calling in those samples. Have you considered this issue in the subsequent merging steps? Is it correct that incorporating such results will not reduce accuracy?
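Since these BAM files are already duplicate-marked, one thing I could do for those samples is drop reads flagged as duplicates when converting to SAM. A minimal sketch, assuming the duplicate flag (0x400) was set when duplicates were marked:

```
# Sketch only: exclude reads carrying the SAM duplicate flag (0x400) during conversion,
# so inGAP-sv never sees the marked PCR duplicates. File names are placeholders.
samtools view -h -F 0x400 -o sample.chr1.sam sample.Mdup.recal.bam chr1
```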

Additionally, have you considered releasing a MOPline-3tools version? Building on the previous three tools, I am currently adding inGAP-sv results for my cohort, but even with split fasta and split sam files it still requires a considerable amount of time and memory, and the process is very I/O-intensive, affecting the operation of my entire cluster and even the terminal. Do you have any effective methods to recommend for alleviating this situation when processing 3k BBJ data?

```
1175099.admin  zhangjj  Fat  sample  24362  1  4  --  4320:00:0  R  35:23:33
1175256.admin  zhangjj  Fat  sample  39623  1  4  --  4320:00:0  R  33:18:45
1175257.admin  zhangjj  Fat  sample  40159  1  4  --  4320:00:0  R  33:18:15
1175258.admin  zhangjj  Fat  sample  34170  1  4  --  4320:00:0  R  33:01:44
1175263.admin  zhangjj  Fat  sample  77346  1  4  --  4320:00:0  R  32:07:02
1175277.admin  zhangjj  Fat  sample  65797  1  4  --  4320:00:0  R  30:20:41
```

stat-lab commented 3 months ago

If your inGAP-sv run completes within a day, there should be no problem with PCR duplicates. There are no plans to create a three-tool version of MOPline, as the advantages of MOPline cannot be exploited when only three tools are available.

jingydz commented 3 months ago

OK, thank you. Judging from the samples for which inGAP-sv has already completed, most samples finish within 40 hours. I will then add these results.

jingydz commented 3 months ago

I am so sorry to bother you so frequently.

In my case and control cohorts, I merged all the samples for joint re-genotyping (joint and smc). But I got some strange results: there are some sites with a very high allele frequency in the case cohort, and I cannot work out how they were generated.

For example: chrX 761501 SVTYPE=DUP, case_AF=0.979572, control_AF=0 (AF difference between case and control: 0.979572).

```
tabix /xxx/MOPline/case_control/cohort/smc/SV.merge.control.vcf.gz chrX:761501-761502 | less

chrX 761501 . . . PASS SVTYPE=DUP;SVLEN=5000;END=766214;SC=514;AF=0;AC=0;AN=number;DPR=2.3;SR=0.04;DPS=0.03;MAF=0 GT:GQ:VP:VL:DR:DS:SR:MC:AN 0/0:0:0:0:0.89:0.87:0.04:0:. 0/0:0:0:0:0.9:0.81:0:0:. 0/0:0:0:0:0.93:0.89:0:0:. 0/0:0:0:0:1.08:0.53:0:0:. 0/0:0:0:0:0.94:0.83:0.02:0:. 0/0:0:0:0:1.2:0.36:0:0:. 0/0:0:0:0:1.23:0.3:0.06:0:. 0/0:0:0:0:0.95:0.64:0:0:. 0/0:0:0:0:1.01:0.59:0.03:0:. 0/0:0:0:0:0.96:0.6:0:0:. 0/0:0:0:0:1.23:0.27:0.06:0:. 0/0:0:0:0:1.06:0.65:0.04:0:. 0/0:0:0:0:0.88:0.85:0:0:. 0/0:0:0:0:1.16:0.4:0:0:. 0/0:0:0:0:0.85:0.9:0.04:0:. 0/0:0:0:0:1.08:0.56:0:0:. 0/0:0:0:0:1.09:0.48:0:0:. 0/0:0:0:0:0.96:0.73:0:0:. 0/0:0:0:0:1.27:0.25:0:0:. 0/0:3:0:0:1.3:0.2:0.03:0:. 0/0:0:0:0:1.05:0.65:0.03:0:. 0/0:0:0:0:1.29:0.26:0.04:0:. 0/0:0:0:0:1.18:0.33:0.04:0:. 0/0:0:0:0:1.1:0.46:0.04:0:. 0/0:0:0:0:0.86:0.92:0:0:. 0/0:0:0:0:0.97:0.71:0:0:. 0/0:0:0:0:1.11:0.46:0:0:. 0/0:0:0:0:1.15:0.43:0:0:. 0/0:0:0:0:1.24:0.33:0.03:0:. 0/0:0:0:0:0.99:0.69:0:0:. 0/0:0:0:0:1.22:0.36:0:0:. 0/0:0:0:0:0.94:0.77:0:0:. 0/0:0:0:0:0.94:0.78:0:0:. 0/0:0:0:0:1.1:0.48:0:0:. 0/0:0:0:0:1.12:0.43:0:0:. 0/0:0:0:0:1.16:0.51:0:0:. 0/0:0:0:0:1.2:0.37:0:0:. 0/0:0:0:0:1.19:0.38:0:0:. 0/0:0:0:0:1.17:0.4:0:0:. 0/0:0:0:0:1.19:0.39:0:0:. 0/0:0:0:0:0.79:0.95:0:0:. 0/0:0:0:0:1.21:0.37:0.04:0:. 0/0:0:0:0:0.98:0.72:0.04:0:.
```

```
tabix /xxx/MOPline/case_control/cohort/smc/SV.merge.case.vcf.gz chrX:761501-761502 | less

chrX 761501 . . . PASS SVTYPE=DUP;SVLEN=5000;END=766214;SC=514;AF=0.979572;AC=1007;AN=1028;DPR=2.3;SR=0.04;DPS=0.03;MAF=0.020428 GT:GQ:VP:VL:DR:DS:SR:MC:AN 1/1:69:761501:5000:2.53:0.02:0.06:2:. 0/1:13:761501:4714:1.62:0.06:0.06:3:. 1/1:88:761501:5000:2.88:0.02:0.02:2:. 1/1:81:761501:5000:2.68:0.03:0.07:2:. 1/1:24:761501:5000:1.78:0.11:0.1:2:. 1/1:26:761501:5000:1.81:0.08:0.1:2:. 1/1:51:761501:5000:2.27:0.04:0.06:2:. 1/1:33:761501:5000:2.07:0:0.03:2:. 1/1:38:761501:5000:2.06:0.06:0.07:2:. 1/1:28:761501:5000:1.87:0.06:0.09:2:. 1/1:13:761501:4714:1.79:0.08:0:3:. 1/1:84:761501:5000:2.78:0.01:0.04:2:. 1/1:59:761501:5000:2.42:0.04:0.04:2:. 0/1:8:761501:4000:1.55:0.08:0.03:0:. 1/1:53:761501:5000:2.42:0:0:2:. 1/1:16:761501:5000:1.84:0.1:0:2:. 1/1:122:761501:5000:3.32:0:0.04:2:. 1/1:63:761501:5000:2.46:0.03:0.05:2:. 1/1:91:761501:5000:2.85:0.03:0.06:2:. 1/1:70:761501:5000:2.56:0.03:0.05:2:. 1/1:39:761501:5000:2.1:0:0.06:2:. 1/1:46:761501:5000:2.31:0.04:0:2:. 1/1:45:761501:5000:2.09:0.02:0.11:2:. 1/1:83:761501:5000:2.79:0:0.03:2:. 1/1:41:761501:5000:2.24:0.04:0:2:. 1/1:77:761501:5000:2.76:0.05:0:2:. 1/1:24:761501:5000:1.92:0.05:0.03:2:. 1/1:51:761501:5000:2.26:0.03:0.07:2:. 1/1:75:761501:5000:2.6:0.01:0.07:2:. 1/1:24:761501:5000:1.98:0.02:0:2:.
```

It shows that all my case samples carry this variant. But when I pick one case sample and look at its own VCF, I cannot find this variant.

```
tabix /xxx/MOPline/case_control/cohort/100A/Merge_7tools/100A.Merge.ALL.vcf.gz chrX:761501 | less -S

chrX 785501 . . . PASS SVTYPE=DUP;SVLEN=63500;END=849000;READS=10;TOOLS=CNVnator:785501:63500:10 GT:GQ:VP:VL:DR:DS:SR ./.:0:785501:63500:1.25:0.33:0
```

CNVnator result:

```
chrX 785501 DUP . . . PASS SVTYPE=DUP;SVLEN=63500;RDRATIO=2.18546;READS=10;GT=./.
```

Wham result: none

Manta result: none