Closed scwatts closed 2 years ago
Yes, we just discovered this ourselves last week.
I will update our common resource TARs with this update shortly.
thanks.
Great, thank you!
I have a more general question about the HMF reference data. We'd like to be using the most up-to-update reference files but in some cases a given file might be available individually and also in a 'DNA-Resources' tarball. For example, the HMF driver gene panel can be obtained from DNA-Resources/hmf_pipeline_resources.38_v5.30.gz
or Gene-Panel/38/DriverGenePanel.38.tsv
. There are some small differences between these two driver gene panel files and so we're not sure how to best compile the reference data set - should we prefer to use files in a 'DNA-Resources' tarball and include others such as the RM repeat library file where required?
Good question. They should be kept in sync and so both be the latest, but it seems some of the individual ones have not been updated. The ones in the DNA-Resources in a single TAR match our production pipeline, and so are currently a more reliable set to use.
I just checked and DNA-Resources/hmf_pipeline_resources.38_v5.30.gz has the fixed repeat masker entries.
I missed the sv/38.fa.out.gz
file in the hmf_pipeline_resources.38_v5.30.gz
tarball, good to know. So that will mean only the older unedited HMFTools-Resources/Repeatmasker/38/38.fa.out
repeat library will cause this error in GRIPSS 2.2.
Thanks for the advice, we'll start using reference files from the DNA-Resources bundles and add in others as needed.
I'm trying out the new SV RepeatMasker annotation feature in GRIPSS 2.2 but an error is raised during parsing of the RM repeat library file.
This seems to be caused by two entries in the file that contain 14 fields rather than the expected 15. Both the repeat library file from the HMFTools-Resources Nextcloud server and from the RM website trigger the error. Removing the offending fields resolves the problem. A set of example commands to reproduce the error are provided below.
Example commands (click to show)
### Setup ```bash mkdir -p data/{sample,reference_data}/ RM_URL_BASE=https://www.repeatmasker.org/genomes/hg38 NC_URL_BASE='https://nextcloud.hartwigmedicalfoundation.nl/s/LTiKTd8XxBqwaiC' curl "${RM_URL_BASE}/RepeatMasker-rm405-db20140131/hg38.fa.out.gz" | gzip -cd > data/reference_data/hg38_original.fa.out curl "${NC_URL_BASE}/download?path=%2FHMFTools-Resources%2FRepeatmasker%2F38&files=38.fa.out" > data/reference_data/hg38_hmf.fa.out ``` > Other necessary files are manually placed into `./data/sample/` and `./data/reference_data/` ### Error using the original file Command ```bash output_dir=1_hg38_original/ mkdir ${output_dir} docker run -v $(pwd -P):/working/ -w /working/ quay.io/biocontainers/hmftools-gripss:2.2--hdfd78af_0 \ java -Xmx4g -jar /usr/local/share/hmftools-gripss-2.2-0/gripss.jar \ -sample SEQC-II_Tumor_50pc \ -vcf data/sample/sv_vcf.vcf.gz \ -ref_genome_version V38 \ -ref_genome data/reference_data/genomes/hg38.fa \ -pon_sgl_file data/reference_data/gridss_pon_single_breakend.38.bed \ -pon_sv_file data/reference_data/gridss_pon_breakpoint.38.bedpe \ -known_hotspot_file data/reference_data/known_fusions.38.bedpe \ -output_dir ${output_dir} \ -repeat_mask_file data/reference_data/hg38_original.fa.out ``` Error ```text 01:13:12.020 [INFO ] loading reference data 01:13:18.560 [INFO ] loaded 3103381 germline SV PON records from file(data/reference_data/gridss_pon_breakpoint.38.bedpe) 01:13:20.105 [INFO ] loaded 1520513 germline SGL PON records from file(data/reference_data/gridss_pon_single_breakend.38.bed) 01:13:20.120 [INFO ] loaded 458 known hotspot records from file Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Index 14 out of bounds for length 14 at com.hartwig.hmftools.gripss.rm.RepeatMaskAnnotations.load(RepeatMaskAnnotations.java:234) at com.hartwig.hmftools.gripss.GripssApplication.