hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0

[PURPLE&LINX] Need help with processing and interpretation #442

Closed nan5895 closed 1 year ago

nan5895 commented 1 year ago

Hello,

Thank you for the wonderful algorithms for cancer genomic analysis.

I am working with PURPLE and LINX.

    java ${params.hmftools_purple_java_arg} -jar /opt/hmftools/purple/purple-3.8.4.jar \
        -reference ${normal_aliquot_barcode} \
        -tumor ${tumor_aliquot_barcode} \
        -output_dir PURPLE \
        -amber ${amber_dir} \
        -cobalt ${cobalt_dir} \
        -gc_profile ${hmftools_gc_cnp} \
        -ref_genome_version 37 \
        -ref_genome ${hmftools_ref_fasta} \
        -ensembl_data_dir ${hmftools_ref_ensembl_dir} \
        -somatic_vcf  ${pave_dir}/${tumor_aliquot_barcode}.sage.pave.vcf.gz \
        -somatic_sv_vcf ${gripss_dir}/${tumor_aliquot_barcode}.gripss.filtered.vcf.gz \
        -somatic_hotspots ${hmftools_ref_hotspots} \
        -driver_gene_panel ${hmftools_ref_driver_gene_panel} \
        -circos circos

This command runs PURPLE successfully. However, it still creates inferred SVs (vcfID "purple_0 ~~") even though I didn't provide the `-sv_recovery_vcf` parameter. Is there anything I am missing?
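To see how many such records a given VCF contains, a minimal sketch follows. It only counts records whose ID column starts with a prefix and does not explain why PURPLE emits them. The function name `count_ids_with_prefix` is hypothetical, and the default prefix `purple_` is taken from the vcfID quoted above.

```shell
# Count VCF records whose ID (third column) starts with a given prefix.
# Usage: count_ids_with_prefix <file.vcf|file.vcf.gz> [prefix]
# The default prefix "purple_" comes from the IDs quoted in this post.
count_ids_with_prefix() (
    vcf=$1
    prefix=${2:-purple_}
    # Decompress only when the file name ends in .gz
    case "$vcf" in
        *.gz) gzip -dc "$vcf" ;;
        *)    cat "$vcf" ;;
    esac | awk -F'\t' -v p="$prefix" '
        /^#/ { next }                 # skip header lines
        index($3, p) == 1 { n++ }     # ID begins with the prefix
        END { print n + 0 }
    '
)
```

For example, `count_ids_with_prefix "${tumor_aliquot_barcode}.purple.sv.vcf.gz"` prints the number of records with a `purple_` ID.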

Also, when I use purple.sv.vcf.gz as input for LINX, as in the command below:

    java -jar /opt/hmftools/linx/linx-1.23.1.jar \
        -sample ${tumor_aliquot_barcode} \
        -ref_genome_version 37 \
        -sv_vcf ${PURPLE_dir}/${tumor_aliquot_barcode}.purple.sv.vcf.gz \
        -purple_dir ${PURPLE_dir} \
        -output_dir LINX \
        -ensembl_data_dir ${hmftools_ref_ensembl_dir} \
        -check_fusions \
        -known_fusion_file ${hmftools_ref_known_fusion} \
        -check_drivers \
        -driver_gene_panel ${hmftools_ref_driver_gene_panel} \
        -proximity_distance 5000 \
        -write_vis_data \
        -annotations DOUBLE_MINUTES \
        -log_debug

This works wonderfully, producing LINX results from the PURPLE output with purple.sv.vcf.gz.

I am looking for SV patterns around an amplicon region.

I then tried using the GRIPSS output VCF as input for LINX instead of purple.sv.vcf.gz:

    java -jar /opt/hmftools/linx/linx-1.23.1.jar \
        -sample ${tumor_aliquot_barcode} \
        -ref_genome_version 37 \
        -sv_vcf ${gripss_dir}/${tumor_aliquot_barcode}.gripss.filtered.vcf.gz \
        -purple_dir ${PURPLE_dir} \
        -output_dir LINX \
        -ensembl_data_dir ${hmftools_ref_ensembl_dir} \
        -check_fusions \
        -known_fusion_file ${hmftools_ref_known_fusion} \
        -check_drivers \
        -driver_gene_panel ${hmftools_ref_driver_gene_panel} \
        -proximity_distance 5000 \
        -write_vis_data \
        -annotations DOUBLE_MINUTES \
        -log_debug

However, I found that this produces odd junction copy numbers:

  1. with .purple.sv.vcf.gz

    vcfId   svId    clusterId   clusterReason   fragileSiteStart    fragileSiteEnd  isFoldback  lineTypeStart   lineTypeEnd junctionCopyNumberMin   junctionCopyNumberMax   geneStart   geneEnd localTopologyIdStart    localTopologyIdEnd  localTopologyStart  localTopologyEnd    localTICountStart   localTICountEnd
    gridss1017bb_5372o  452 197 PROXIMITY-448   false   false   false   NONE    NONE    16.3963 33.0246 JUP;HAP1;RN7SL399P      22  24  ISOLATED_BE COMPLEX_OTHER   1   0
    gridss1016fb_19097o 453 197 PROXIMITY-438   false   false   false   NONE    NONE    24.6611 42.5102 IKZF3;KRT8P34       11  25  ISOLATED_BE ISOLATED_BE 1   0
    gridss1016bf_11644o 448 197 PROXIMITY-450;MAJOR_ALLELE_JCN-444;LONG_DEL_DUP_INV-445 false   false   false   NONE    NONE    17.0780 37.2061 IKZF3   JUP;HAP1    10  22  SAME_ORIENT ISOLATED_BE 0   1
    gridss1021bb_11473o 465 197 PROXIMITY-466   false   false   false   NONE    NONE    21.6174 42.8209         32  32  ISOLATED_BE ISOLATED_BE 1   1
    gridss1016ff_13089o 466 197 PROXIMITY-439   false   false   false   NONE    NONE    9.4722  56.3062         14  32  ISOLATED_BE ISOLATED_BE 1   1
  2. with .gripss.filtered.vcf.gz

    vcfId   svId    clusterId   clusterReason   fragileSiteStart    fragileSiteEnd  isFoldback  lineTypeStart   lineTypeEnd junctionCopyNumberMin   junctionCopyNumberMax   geneStart   geneEnd localTopologyIdStart    localTopologyIdEnd
    gridss1017bb_5372o  178 29  PROXIMITY-174   false   false   false   NONE    NONE    0.0000  0.0000  JUP;HAP1;RN7SL399P      15  17  ISOLATED_BE COMPLEX_OTHER   1   0
    gridss1016fb_19097o 179 29  PROXIMITY-165   false   false   false   NONE    NONE    0.0000  0.0000  IKZF3;KRT8P34       5   18  ISOLATED_BE ISOLATED_BE 1   0
    gridss1016bf_11644o 174 29  PROXIMITY-176   false   false   false   NONE    NONE    0.0000  0.0000  IKZF3   JUP;HAP1    4   15  SAME_ORIENT ISOLATED_BE 0   1
    gridss1021bb_11473o 191 29  PROXIMITY-192   false   false   false   NONE    NONE    0.0000  0.0000          26  26  ISOLATED_BE ISOLATED_BE 1   1
    gridss1016ff_13089o 192 29  PROXIMITY-166;LONG_DEL_DUP_INV-188  false   false   false   NONE    NONE    0.0000  0.0000          8   26  ISOLATED_BE ISOLATED_BE 1   1

As the example above shows, when I used .gripss.filtered.vcf.gz directly, junctionCopyNumberMin and junctionCopyNumberMax came out as 0.0000 for these SVs.

Could you advise me on this? Could it be that the copy number segments are not supported by the structural variants in .gripss.filtered.vcf.gz?
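The comparison above can be checked across a whole LINX run rather than by eye. The sketch below counts rows where both junctionCopyNumberMin and junctionCopyNumberMax are zero. It assumes the SV table is tab-delimited with a header row containing the column names shown in the tables above. The function name `count_zero_jcn` is hypothetical.

```shell
# Count SVs whose junction copy number is zero at both ends of the range.
# Usage: count_zero_jcn <linx_sv_table.tsv>
# Assumes a tab-delimited file whose header contains the columns
# junctionCopyNumberMin and junctionCopyNumberMax.
count_zero_jcn() {
    awk -F'\t' '
        NR == 1 {
            # Locate the two columns by name instead of by position
            for (i = 1; i <= NF; i++) {
                if ($i == "junctionCopyNumberMin") min_col = i
                if ($i == "junctionCopyNumberMax") max_col = i
            }
            if (!min_col || !max_col) {
                print "junction copy number columns not found" > "/dev/stderr"
                exit 2
            }
            next
        }
        # Only count rows where both values are present and numerically zero
        $min_col != "" && $max_col != "" && $min_col + 0 == 0 && $max_col + 0 == 0 { zero++ }
        END { if (min_col && max_col) print zero + 0 }
    ' "$1"
}
```

A count close to the total number of rows, as with the .gripss.filtered.vcf.gz run above, suggests that no junction copy numbers were available to LINX at all, rather than a problem with individual SVs.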

p-priestley commented 1 year ago

LINX expects the structural variant file to be annotated by PURPLE and to match PURPLE's copy number output, which is why we use purple.sv.vcf.gz.

Can you explain why you want to use gripss.filtered.vcf.gz instead? purple.sv.vcf.gz should be the same file, but with additional copy number annotations (and rescued variants, if you are using SV recovery). The behaviour of LINX without these annotations may be unpredictable.
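One way to guard against passing an unannotated VCF to LINX is to inspect the VCF header before the run. The sketch below checks whether any `##INFO` header line declares an ID with a given prefix. The default prefix `PURPLE_` is an assumption about how PURPLE names its annotations, so confirm it against the header of your own purple.sv.vcf.gz. The function name `has_info_prefix` is hypothetical.

```shell
# Succeed (exit 0) if the VCF header declares an INFO field whose ID
# starts with the given prefix, otherwise fail (exit 1).
# Usage: has_info_prefix <file.vcf|file.vcf.gz> [prefix]
# ASSUMPTION: the default prefix "PURPLE_" is a guess at PURPLE's
# annotation naming; verify it against your own PURPLE output.
has_info_prefix() (
    vcf=$1
    prefix=${2:-PURPLE_}
    # Decompress only when the file name ends in .gz
    case "$vcf" in
        *.gz) gzip -dc "$vcf" ;;
        *)    cat "$vcf" ;;
    esac | awk -v p="##INFO=<ID=$prefix" '
        index($0, p) == 1 { found = 1 }   # header line begins with the pattern
        END { exit(found ? 0 : 1) }
    '
)
```

A pipeline step could then refuse to start LINX when the check fails, for example `has_info_prefix "$sv_vcf" || echo "SV VCF does not look PURPLE-annotated" >&2`, where `sv_vcf` is a placeholder variable.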

nan5895 commented 1 year ago

@p-priestley Thank you for your quick response.

sv_vcf Full path and filename for the SV VCF, otherwise will use the Purple SV VCF (ie SAMPLE_ID.purple.vcf.gz) in the configured Purple directory

Based on https://github.com/hartwigmedical/hmftools/tree/master/linx#optional-additional-parameters, I thought LINX would behave similarly with gripss.filtered.vcf.gz.

For me, using SAMPLE_ID.purple.vcf.gz as input for LINX works great. gripss.filtered.vcf.gz and SAMPLE_ID.purple.vcf.gz are almost the same apart from the additional copy number annotations, yet they behaved differently in LINX, so I was just curious about that.

However, I think I understand now that LINX expects the structural variant file to be annotated by PURPLE and to match the copy number output of PURPLE.

Thank you