hartwigmedical / hmftools

Various algorithms for analysing genomics data
GNU General Public License v3.0

[PURPLE] invalid reference data #438

Closed jennyp76 closed 11 months ago

jennyp76 commented 11 months ago

Hi hartwigmedical team, I have already run AMBER and COBALT with no errors. However, when I try to run PURPLE using their outputs, I get the exact same error every time.

java -jar ./purple_v3.8.4.jar \
  -threads 5 \
  -reference ${SAMPLE}_Normal \
  -tumor ${TUMOR} \
  -amber $AMBER_AnalysisPath \
  -cobalt $COBALT_AnalysisPath \
  -ref_genome_version V38 \
  -ref_genome $REF \ #/data/resource/reference/human/UCSC/hg38/BWAIndex/genome.fa
  -output_dir $PURPLE_AnalysisPath \
  -ensembl_data_dir /data/project/BRCA1/script/PURPLE/ensembl_data/ \
  -gc_profile /data/project/BRCA1/script/COBALT/GC_profile.1000bp.38.cnp

*The ensembl_data_dir includes the following files, which I downloaded from "HMFtools-Resources":

1) HMFtools-Resources_dna_pipeline_v5_31_38_common_ensembl_data_ensembl_gene_data.csv
2) HMFtoolsResources_dna_pipeline_v5_31_38_common_ensembl_data_ensembl_protein_features.csv
3) HMFtoolsResources_dna_pipeline_v5_31_38_common_ensembl_data_ensembl_trans_exon_data.csv
4) HMFtoolsResources_dna_pipeline_v5_31_38_common_ensembl_data_ensembl_trans_splice_data.csv

The GC_profile is the same file I used when running COBALT.

I keep getting the following error. Can you help me with this problem?

[image: screenshot of the error]

Thanks.

charlesshale commented 11 months ago

If a file is missing or a path is incorrect, PURPLE should log an ERROR message for it; the one exception is the Ensembl data cache. Can you double-check that path?

Also, try running it with -log_debug to see whether it shows any more detail.

Thanks.

jennyp76 commented 11 months ago

I double-checked "ensembl_data_dir" and can confirm that those 4 files are in the ensembl_data folder. Also, running with -log_debug gave no additional detail.

What could be the problem?

charlesshale commented 11 months ago

Did you rename the Ensembl data files? They need to be named as per the HMF resources.

jennyp76 commented 11 months ago

These are the names of the files. I did not rename the data files; they are exactly as they were downloaded from the resources folder. Is there something wrong with the names?

1) HMFtoolsResources_dna_pipeline_v5_31_38_common_ensembl_data_ensembl_gene_data.csv
2) HMFtoolsResources_dna_pipeline_v5_31_38_common_ensembl_data_ensembl_protein_features.csv
3) HMFtoolsResources_dna_pipeline_v5_31_38_common_ensembl_data_ensembl_trans_exon_data.csv
4) HMFtoolsResources_dna_pipeline_v5_31_38_common_ensembl_data_ensembl_trans_splice_data.csv

charlesshale commented 11 months ago

Those names need to have the prefix 'HMFtoolsResources_dna_pipeline_v5_31_38_common_ensembl_data_' dropped, so that they are just, for example, 'ensembl_gene_data.csv'.
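Something along these lines should strip the prefix in one go. It is only a sketch: the function name strip_prefix is made up for illustration, and the directory and prefix in the example call are simply the ones quoted earlier in this thread, so adjust them to your setup.

```shell
#!/usr/bin/env bash
# Sketch only: strip_prefix is a hypothetical helper, not part of hmftools.
# It renames every "<prefix>*.csv" in a directory to the same name without the prefix.
strip_prefix() {
  local dir="$1" prefix="$2" f base
  for f in "$dir"/"$prefix"*.csv; do
    [ -e "$f" ] || continue            # glob matched nothing, skip
    base="$(basename "$f")"
    # -n: do not overwrite a file that already has the target name
    mv -n -- "$f" "$dir/${base#"$prefix"}"
  done
}

# Example call, using the directory and prefix quoted in this thread (assumptions):
# strip_prefix /data/project/BRCA1/script/PURPLE/ensembl_data \
#   HMFtoolsResources_dna_pipeline_v5_31_38_common_ensembl_data_
```

After that, the directory should contain names such as ensembl_gene_data.csv, and -ensembl_data_dir can keep pointing at the same folder.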

The HMF resources also have them in this form, see: hmf-public/HMFtools-Resources/dna_pipeline/v5_31/38/common/ensembl_data/
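To confirm the folder is in the expected shape before rerunning PURPLE, a quick check like the one below may help. It is a sketch under assumptions: check_ensembl_dir is a hypothetical helper, and the list covers only the four files mentioned in this thread; whether PURPLE needs all four is not confirmed here.

```shell
#!/usr/bin/env bash
# Sketch only: check_ensembl_dir is a hypothetical helper, not part of hmftools.
# It reports which of the four un-prefixed file names from this thread are absent.
check_ensembl_dir() {
  local dir="$1" name missing=0
  for name in ensembl_gene_data.csv ensembl_protein_features.csv \
              ensembl_trans_exon_data.csv ensembl_trans_splice_data.csv; do
    if [ ! -f "$dir/$name" ]; then
      echo "missing: $dir/$name" >&2
      missing=1
    fi
  done
  return "$missing"                    # 0 if all four are present, 1 otherwise
}

# Example call, using the directory quoted in this thread (an assumption):
# check_ensembl_dir /data/project/BRCA1/script/PURPLE/ensembl_data && echo "names look OK"
```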