hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0
193 stars 59 forks source link

GRIPSS 2.2 RepeatMasker parsing error #323

Closed scwatts closed 2 years ago

scwatts commented 2 years ago

I'm trying out the new SV RepeatMasker annotation feature in GRIPSS 2.2 but an error is raised during parsing of the RM repeat library file.

This seems to be caused by two entries in the file that contain 14 fields rather than the expected 15. Both the repeat library file from the HMFTools-Resources Nextcloud server and from the RM website trigger the error. Removing the offending fields resolves the problem. A set of example commands to reproduce the error are provided below.

Example commands (click to show) ### Setup ```bash mkdir -p data/{sample,reference_data}/ RM_URL_BASE=https://www.repeatmasker.org/genomes/hg38 NC_URL_BASE='https://nextcloud.hartwigmedicalfoundation.nl/s/LTiKTd8XxBqwaiC' curl "${RM_URL_BASE}/RepeatMasker-rm405-db20140131/hg38.fa.out.gz" | gzip -cd > data/reference_data/hg38_original.fa.out curl "${NC_URL_BASE}/download?path=%2FHMFTools-Resources%2FRepeatmasker%2F38&files=38.fa.out" > data/reference_data/hg38_hmf.fa.out ``` > Other necessary files are manually placed into `./data/sample/` and `./data/reference_data/` ### Error using the original file Command ```bash output_dir=1_hg38_original/ mkdir ${output_dir} docker run -v $(pwd -P):/working/ -w /working/ quay.io/biocontainers/hmftools-gripss:2.2--hdfd78af_0 \ java -Xmx4g -jar /usr/local/share/hmftools-gripss-2.2-0/gripss.jar \ -sample SEQC-II_Tumor_50pc \ -vcf data/sample/sv_vcf.vcf.gz \ -ref_genome_version V38 \ -ref_genome data/reference_data/genomes/hg38.fa \ -pon_sgl_file data/reference_data/gridss_pon_single_breakend.38.bed \ -pon_sv_file data/reference_data/gridss_pon_breakpoint.38.bedpe \ -known_hotspot_file data/reference_data/known_fusions.38.bedpe \ -output_dir ${output_dir} \ -repeat_mask_file data/reference_data/hg38_original.fa.out ``` Error ```text 01:13:12.020 [INFO ] loading reference data 01:13:18.560 [INFO ] loaded 3103381 germline SV PON records from file(data/reference_data/gridss_pon_breakpoint.38.bedpe) 01:13:20.105 [INFO ] loaded 1520513 germline SGL PON records from file(data/reference_data/gridss_pon_single_breakend.38.bed) 01:13:20.120 [INFO ] loaded 458 known hotspot records from file Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 14 out of bounds for length 14 at com.hartwig.hmftools.gripss.rm.RepeatMaskAnnotations.load(RepeatMaskAnnotations.java:234) at com.hartwig.hmftools.gripss.GripssApplication.(GripssApplication.java:93) at com.hartwig.hmftools.gripss.GripssApplication.fromCommandArgs(GripssApplication.java:104) at com.hartwig.hmftools.gripss.GripssApplication.main(GripssApplication.java:373) ``` ### Error using the HMFtools-Resource file Command ```bash output_dir=2_hg38_hmf/ mkdir ${output_dir} docker run -v $(pwd -P):/working/ -w /working/ quay.io/biocontainers/hmftools-gripss:2.2--hdfd78af_0 \ java -Xmx4g -jar /usr/local/share/hmftools-gripss-2.2-0/gripss.jar \ -sample SEQC-II_Tumor_50pc \ -vcf data/sample/sv_vcf.vcf.gz \ -ref_genome_version V38 \ -ref_genome data/reference_data/genomes/hg38.fa \ -pon_sgl_file data/reference_data/gridss_pon_single_breakend.38.bed \ -pon_sv_file data/reference_data/gridss_pon_breakpoint.38.bedpe \ -known_hotspot_file data/reference_data/known_fusions.38.bedpe \ -output_dir ${output_dir} \ -repeat_mask_file data/reference_data/hg38_hmf.fa.out ``` Error ```text 01:14:29.636 [INFO ] loading reference data 01:14:36.300 [INFO ] loaded 3103381 germline SV PON records from file(data/reference_data/gridss_pon_breakpoint.38.bedpe) 01:14:37.846 [INFO ] loaded 1520513 germline SGL PON records from file(data/reference_data/gridss_pon_single_breakend.38.bed) 01:14:37.861 [INFO ] loaded 458 known hotspot records from file Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 14 out of bounds for length 14 at com.hartwig.hmftools.gripss.rm.RepeatMaskAnnotations.load(RepeatMaskAnnotations.java:234) at com.hartwig.hmftools.gripss.GripssApplication.(GripssApplication.java:93) at com.hartwig.hmftools.gripss.GripssApplication.fromCommandArgs(GripssApplication.java:104) at com.hartwig.hmftools.gripss.GripssApplication.main(GripssApplication.java:373) ``` ### Source of error There are two records in each RM repeat library file that have a missing column ```text $ for file in data/reference_data/hg38_{original,hmf}.fa.out; do echo -e "\n${file}"; awk 'NR>3 { print NF }' ${file} | sort | uniq -c; done data/reference_data/hg38_original.fa.out 2 14 5622514 15 data/reference_data/hg38_hmf.fa.out 2 14 5622514 15 ``` Removing these specific records allows GRIPSS 2.2 to successfully complete ```bash awk 'NR < 4 || NF == 15' data/reference_data/hg38_original.fa.out > data/reference_data/hg38_original_modified.fa.out output_dir=3_hg38_original_modified/ mkdir ${output_dir} docker run -v $(pwd -P):/working/ -w /working/ quay.io/biocontainers/hmftools-gripss:2.2--hdfd78af_0 \ java -Xmx4g -jar /usr/local/share/hmftools-gripss-2.2-0/gripss.jar \ -sample SEQC-II_Tumor_50pc \ -vcf data/sample/sv_vcf.vcf.gz \ -ref_genome_version V38 \ -ref_genome data/reference_data/genomes/hg38.fa \ -pon_sgl_file data/reference_data/gridss_pon_single_breakend.38.bed \ -pon_sv_file data/reference_data/gridss_pon_breakpoint.38.bedpe \ -known_hotspot_file data/reference_data/known_fusions.38.bedpe \ -output_dir ${output_dir} \ -repeat_mask_file data/reference_data/hg38_original_modified.fa.out ```
charlesshale commented 2 years ago

Yes, we just discovered this ourselves last week.

I will update our common resource TARs with this update shortly.

thanks.

scwatts commented 2 years ago

Great, thank you!

I have a more general question about the HMF reference data. We'd like to be using the most up-to-update reference files but in some cases a given file might be available individually and also in a 'DNA-Resources' tarball. For example, the HMF driver gene panel can be obtained from DNA-Resources/hmf_pipeline_resources.38_v5.30.gz or Gene-Panel/38/DriverGenePanel.38.tsv. There are some small differences between these two driver gene panel files and so we're not sure how to best compile the reference data set - should we prefer to use files in a 'DNA-Resources' tarball and include others such as the RM repeat library file where required?

charlesshale commented 2 years ago

Good question. They should be kept in sync and so both be the latest, but it seems some of the individual ones have not been updated. The ones in the DNA-Resources in a single TAR match our production pipeline, and so are currently a more reliable set to use.

I just checked and DNA-Resources/hmf_pipeline_resources.38_v5.30.gz has the fixed repeat masker entries.

scwatts commented 2 years ago

I missed the sv/38.fa.out.gz file in the hmf_pipeline_resources.38_v5.30.gz tarball, good to know. So that will mean only the older unedited HMFTools-Resources/Repeatmasker/38/38.fa.out repeat library will cause this error in GRIPSS 2.2.

Thanks for the advice, we'll start using reference files from the DNA-Resources bundles and add in others as needed.