macarthur-lab / clinvar

This repo provides tools to convert ClinVar data into a tab-delimited flat file, and also provides that resulting tab-delimited flat file.

Issue with R script #19

Closed VivekTodur closed 7 years ago

VivekTodur commented 8 years ago

Hi,

I am getting the following error. Could you please walk me through how to resolve it? I am using Python 2.7 on an Ubuntu 14.10 machine. All required libraries are up to date.

180593 records processed
180594 records processed
[Nov 18 10:53:45]: Final counts of variants discarded:
[Nov 18 10:53:45]: REF == ALT: 226
[Nov 18 10:53:45]: Wrong REF: 4082
[Nov 18 10:53:45]: Invalid nucleotide: 9
[Nov 18 10:53:45]: Finished 1.3. Running time: 0:00:21.418153 sec.
[Nov 18 10:53:45]: Renamed tmp.2016-11-18_10.46.10.clinvar_table_normalized.tsv to clinvar_table_normalized.tsv
[Nov 18 10:53:45]: --> Exec 1.4: Rscript join_data.R variant_summary.txt.gz
[Nov 18 10:53:45]: Output (last mod N/A): clinvar_combined.tsv [doesn't exist yet]
[Nov 18 10:53:50]: [1] 180594 14
[Nov 18 10:53:59]: Warning message:
[Nov 18 10:53:59]: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
[Nov 18 10:53:59]: embedded nul(s) found in input
[Nov 18 10:53:59]: [1] 354049 30
[Nov 18 10:53:59]: Error in [.data.frame(x, r, vars, drop = drop) :
[Nov 18 10:53:59]: undefined columns selected
[Nov 18 10:53:59]: Calls: subset -> subset.data.frame ->
[Nov 18 10:54:16]: $subset(txt_download, assembly == "GRCh37", select = desired_columns)
[Nov 18 10:54:16]: <environment: 0x2c70db0>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $subset.data.frame(txt_download, assembly == "GRCh37", select = desired_colu
[Nov 18 10:54:16]: <environment: 0x2dd9a78>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $x[r, vars, drop = drop]
[Nov 18 10:54:16]: <environment: 0x2d9c030>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $`[.data.frame(x, r, vars, drop = drop)
[Nov 18 10:54:16]: <environment: 0x2d9c068>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: $stop("undefined columns selected")`
[Nov 18 10:54:16]: <environment: 0x2d9a5c0>
[Nov 18 10:54:16]:
[Nov 18 10:54:16]: Finished 1.4. Running time: 0:00:31.037980 sec.
[Nov 18 10:54:16]: WARNING: unable to rename tmp.2016-11-18_10.46.10.clinvar_combined.tsv to clinvar_combined.tsv. tmp.2016-11-18_10.46.10.clinvar_combined.tsv is not readable
[Nov 18 10:54:16]: --> Skipping 1.5: (cat clinvar_combined.tsv | head -1 > tmp.2016-11-18_10.46.10.clinvar_combined_sorted.tsv ) && (cat clinvar_combined.tsv | tail -n +2 | egrep -v "^[XYM]" | sort -k1,1n -k2,2n -k3,3 -k4,4 >> tmp.2016-11-18_10.46.10.clinvar_combined_sorted.tsv ) && (cat clinvar_combined.tsv | tail -n +2 | egrep "^[XYM]" | sort -k1,1 -k2,2n -k3,3 -k4,4 >> tmp.2016-11-18_10.46.10.clinvar_combined_sorted.tsv )
[Nov 18 10:54:16]: Input (last mod N/A): clinvar_combined.tsv
Error - input file not found:
Traceback (most recent call last):
[Nov 18 10:54:16]: File "/home/ecgi6/.local/lib/python2.7/site-packages/pypez.py", line 769, in does_command_need_to_run
[Nov 18 10:54:16]: try: raise Exception("File not found: " + str(input_filename))
[Nov 18 10:54:16]: Exception: File not found: clinvar_combined.tsv
[Nov 18 10:54:16]:

kristjaneerik commented 8 years ago

I believe this is the same error I was seeing with the latest ClinVar update, reported in #18. The R script tries to select columns from the variant summary table that are no longer there (but used to be).

There is currently no fix, other than to use older ClinVar data. To do that, download an older ClinVar XML release and the matching variant summary table, then run, e.g.:

python master.py \
    -R <hg19.fa> \
    -E ExAC.r0.3.1.sites.vep.vcf.gz \
    -X ClinVarFullRelease_2016-09.xml.gz \
    -S variant_summary_2016-09.txt.gz

The R script appears to be broken with data from 2016-10 onward.
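
For anyone hitting the same "undefined columns selected" error, one way to confirm the cause is to compare the header of the variant summary table against the columns the join step expects. Below is a minimal sketch under stated assumptions: it is written for Python 3, the helper names `read_header` and `missing_columns` are made up for illustration and are not part of this repo, and the list of expected columns has to be taken from `join_data.R` itself.

```python
import csv
import gzip


def read_header(path):
    """Return the column names from the first line of a gzipped TSV file."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        header = next(csv.reader(handle, delimiter="\t"))
    # Assumption: the header line may start with '#'; strip it from the first name.
    header[0] = header[0].lstrip("#")
    return header


def missing_columns(header, desired):
    """List the desired column names that are absent from the header, in order."""
    present = set(header)
    return [name for name in desired if name not in present]
```

Calling `missing_columns(read_header("variant_summary.txt.gz"), desired)` with `desired` set to the column names the R script selects should return the names that have disappeared from the newer files; an empty list would suggest the problem lies elsewhere.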

VivekTodur commented 8 years ago

Thanks, it works fine now with older data...

bw2 commented 7 years ago

#21 from @XiaoleiZ has fixed parsing for the new data files, so we're back to making up-to-date releases.