mullinyu opened this issue 5 years ago
There is a way to remove the readme and java files, but for convenience, I would suggest that you unpack the zip file to a directory, remove all unnecessary files, and then create a test file with something like

```
head -1000 dbNSFP3.5a_variant.chr1 > test.chr1
```

and use it to test your `.ann` file (e.g. `vtools use .... --files test.chr1`).

If you see `u'\uff0c'`, there might be some non-ascii characters in the annotations, which can be fixed by setting the encoding in the `.ann` file if this is the case.
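For illustration, a rough sketch of that workflow could look like the following (the archive, data file and `.ann` names are the ones discussed in this thread; adjust them to your own layout):

```bash
# unpack the official archive into a working directory
unzip dbNSFPv3.5a.zip -d dbNSFP3.5a
cd dbNSFP3.5a

# drop the readme and java files that are not needed for the import
rm -f dbNSFP3.5a.readme.txt *.java *.class

# build a small test file from the first 1000 lines of chromosome 1
head -1000 dbNSFP3.5a_variant.chr1 > test.chr1

# try the annotation description against the test file only
vtools use dbNSFP-hg38_3_5a.ann --files test.chr1
```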
Many thanks for the advice.
It's a Unicode char: https://www.compart.com/en/unicode/U+FF0C

I also found many other Unicode chars using the method suggested by https://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters:

```
grep --color='auto' -P -n "[\x80-\xFF]" *.dbNSFP
```

I also checked that the .dbNSFP files generated by vtools from the zip file downloaded from the official dataset are in ASCII encoding:

```
$ file -bi dbNSFP3.5a_variant.chr10.dbNSFP
text/plain; charset=us-ascii
```
Am I right that I should convert all the generated .dbNSFP files to a Unicode encoding, as suggested by https://www.shellhacks.com/linux-check-change-file-encoding/?

[Or actually, vtools could generate Unicode data files from the zip file (official dataset) before loading them into the database.]

If I really do convert all the dbNSFP3.5a_variant.chr[1.....22,X,Y,M].dbNSFP files, how should I modify the command? The original was:

```
vtools use dbNSFP-hg38_3_5a.ann --files dbNSFPv3.5a.zip
```

What should I run?
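For reference, I was imagining something along these lines (just a sketch: `iconv` as in the linked guide, and passing the converted per-chromosome files to `--files` as in the test command suggested above, assuming that is accepted):

```bash
# sketch only: re-encode each generated file to UTF-8 with iconv
for f in dbNSFP3.5a_variant.chr*.dbNSFP; do
    enc=$(file -b --mime-encoding "$f")
    iconv -f "$enc" -t utf-8 -o "$f.utf8" "$f" && mv "$f.utf8" "$f"
done

# ...and then point vtools at the converted per-chromosome files
# instead of the original zip (assuming --files accepts them directly)
vtools use dbNSFP-hg38_3_5a.ann --files dbNSFP3.5a_variant.chr*.dbNSFP
```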
Thanks!
Yes, I tried both options but got the same error.
Here's my `.ann` file:

```
[linked fields]
hg38=chr, pos, ref, alt
hg19=hg19_chr, hg19_pos, ref, alt

[data sources]
description=dbNSFP version 3.5a, maintained by Xiaoming Liu from UTSPH. Please cite: "Liu X, Jian X, and Boerwinkle E. 2011. dbNSFP: a lightweight database of human non-synonymous SNPs and their functional predictions. Human Mutation. 32:894-899. Liu X, Wu C, Li C and Boerwinkle E. 2016. dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Non-synonymous and Splice Site SNVs. Human Mutation. 37(3):235-241." if you find this database useful.
preprocessor=Dos2Unix()
version=hg38_3_5a
anno_type=variant
source_url=ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFPv3.5a.zip
direct_url=anno/dbNSFP-hg38_3_5a.DB.gz 9c5c3f580f32f1f336fa6130ad9a752f
source_type=txt
encoding=UTF-8
```
Error:
```
INFO: Importing annotation data from /home/mullinyu/project/exomedb/.vtools_cache/dbNSFP3.5a.readme.txt.dbNSFP
Process FileReader:
Traceback (most recent call last):
  File "/home/mullinyu/anaconda2/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/mullinyu/anaconda2/lib/python2.7/site-packages/variant_tools/importer.py", line 458, in run
    env.logger.debug('Failed to process line "{}...": {}'.format(line[:20].strip(), e))
UnicodeEncodeError: 'ascii' codec can't encode character u'\uff0c' in position 16: ordinal not in range(128)
```
I am just wondering how the existing annotation database (which has hg38 only) was created.
Hi,
I found that the source code sets "utf-8" as the default encoding, but judging from the error message I got, it does not seem to be used. Therefore, I set it at the system level based on this link: https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/

I made the changes to the vtools script (below) and that error no longer appears.
```python
import sys
try:
    import argparse
except ImportError:
    sys.exit('variant tools requires Python 2.7.2 or higher, or Python 3.2 '
             'or higher. Please upgrade your version ({}) of Python and try again.'
             .format(sys.version.split()[0]))
from variant_tools.utils import env
import subprocess
# save the command line that has been processed by the shell.
env.command_line = subprocess.list2cmdline(sys.argv[1:])
import variant_tools.project as project
import variant_tools.importer as importer
import variant_tools.update as update
import variant_tools.phenotype as phenotype
import variant_tools.annotation as annotation
import variant_tools.variant as variant
import variant_tools.compare as compare
import variant_tools.exporter as exporter
import variant_tools.pipeline as pipeline
if sys.platform != 'win32':
    import variant_tools.liftOver as liftOver
    import variant_tools.association as association

# my addition: force UTF-8 as the process-wide default encoding (Python 2 only)
reload(sys)
sys.setdefaultencoding('utf-8')


def addCommonArgs(parser):
```
However, I got lots of warnings and errors, probably while loading the data into the database (sqlite), e.g.:
```
2019-04-26 18:21:02,385: DEBUG: Failed to process field 0.0;.;0.0;0.0;.: could not convert string to float: .
2019-04-26 18:21:02,385: DEBUG: Failed to process field -1.52;.;-1.52;-1.52;.: could not convert string to float: .
2019-04-26 18:21:02,385: WARNING: Inconsistent base allele G at 9709654 on chromosome 1
2019-04-26 18:21:02,385: DEBUG: Failed to process field 0.0;.;0.0;0.0;.: could not convert string to float: .
2019-04-26 18:21:02,385: DEBUG: Failed to process field -1.42;.;-1.42;-1.42;.: could not convert string to float: .
2019-04-26 18:21:02,386: WARNING: Inconsistent base allele G at 9709654 on chromosome 1
2019-04-26 18:21:02,386: DEBUG: Failed to process field 0.0;.;0.0;0.0;.: could not convert string to float: .
2019-04-26 18:21:02,386: DEBUG: Failed to process field -1.57;.;-1.57;-1.57;.: could not convert string to float: .
2019-04-26 18:21:02,387: DEBUG: Failed to process field 0.0;.;0.0;0.0;.: could not convert string to float: .
2019-04-26 18:21:02,387: DEBUG: Failed to process field -1.57;.;-1.57;-1.57;.: could not convert string to float: .
2019-04-26 18:21:02,387: DEBUG: Failed to process field 0.23;.;0.241;0.23;.: could not convert string to float: .
2019-04-26 18:21:02,387: DEBUG: Failed to process field -0.36;.;-0.33;-0.36;.: could not convert string to float: .
2019-04-26 18:21:02,388: WARNING: Inconsistent base allele C at 9709656 on chromosome 1
2019-04-26 18:21:02,388: DEBUG: Failed to process field 0.225;.;0.232;0.225;.: could not convert string to float: .
2019-04-26 18:21:02,388: DEBUG: Failed to process field -0.29;.;-0.28;-0.29;.: could not convert string to float: .
2019-04-26 18:21:02,388: WARNING: Inconsistent base allele C at 9709656 on chromosome 1
2019-04-26 18:21:02,389: DEBUG: Failed to process field 0.037;.;0.037;0.037;.: could not convert string to float: .
2019-04-26 18:21:02,389: DEBUG: Failed to process field -0.42;.;-0.4;-0.42;.: could not convert string to float: .
```
Nearly all locations got warnings, so I had to terminate the job. Any advice?
Thanks!
Or, would it be possible to create an annotation database of dbNSFP 3.5a with both hg38 and hg19 support and put it on the web?

I can see that there is an annotation database of dbNSFP 2.4 with both hg18 and hg19 support on the web: https://bioinformatics.mdanderson.org/Software/VariantTools/repository/annoDB/dbNSFP-hg18_hg19_2_4.ann

The official dataset contains hg38, hg19 and hg18 coordinates: https://drive.google.com/file/d/0B60wROKy6OqcNGJ2STJlMTJONk0/view
Thanks!
Regards, Mullin
Ok. I will have a look today. Could you send me your .ann file?
Many thanks! Here is my version. I zipped it simply for uploading purposes: dbNSFP-hg38_3_5a.zip
Thanks. It will take a few days though, because it is very time-consuming to create large databases in sqlite, and building the indexes will take even longer.
Sure. Many thanks!!
Just downloaded the data and tried your annotation file, and noticed a bunch of

```
WARNING: hg19: Inconsistent base allele A at 114358819 on chromosome 2
```
As far as I can recall, we noticed this problem when we tried to build a hg19/hg38 version of the database. We contacted Dr. Liu, who acknowledged the problem and said that the hg19 index was obtained through some liftover mechanism from hg38, so some indexes are incorrect. This is why we did not use the hg19 indexes of this database.

So why do you want a hg19 version of dbNSFP v3_5a? I have not checked the recent versions of dbNSFP, but it might make more sense to build a database for the 4.0 version. The problem here is that dbNSFP has been getting larger and larger (v3 is 26G compressed), and downloading these files from our server has been a bit of a stretch.
Thanks for your help.
Yes, because we want the REVEL score and gnomAD data, which only exist in dbNSFP v3_5a or later. Also, we are using the hg19 human genome.

I do agree that version 4.0 is much bigger, which is why I tried to work on v3_5 first and encountered those problems. It seems it may be a problem with the dbNSFP v3_5a database itself.

Any suggestions? Thanks!
Assuming v4 has the correct index for hg19, you could create an `.ann` file with only the fields you need, which will result in MUCH faster build speed and a much smaller DB file. You do not even need to include the hg38 index if you do not need it. We cannot do this because we do not know which fields are useful to users, but you can build local databases freely.
So I think the best way to proceed would be:

* modify the `.ann` file if needed
* change the `hg38` fields to hg19, and remove all the fields that you do not need (see the sketch below)
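For illustration only, a trimmed-down `.ann` might look roughly like the sketch below. The column indexes follow the 3.5a layout discussed in this thread (hg19_chr=8, hg19_pos=9), the `REVEL_score` index and the file names are placeholders, and all of them must be checked against the readme of whichever dbNSFP version is actually downloaded.

```bash
# hypothetical trimmed-down annotation description (hg19 coordinates plus one
# score field); field indexes are placeholders from the 3.5a layout and must be
# verified against the readme of the version you download
cat > dbNSFP-hg19_custom.ann <<'EOF'
[linked fields]
hg19=hg19_chr, hg19_pos, ref, alt

[data sources]
description=Local dbNSFP subset, hg19 coordinates only
version=hg19_custom
anno_type=variant
source_type=txt
encoding=UTF-8

[hg19_chr]
index=8
type=VARCHAR(20) NULL
adj=Nullify('.')
comment=chromosome as to hg19

[hg19_pos]
index=9
type=INTEGER NULL
adj=Nullify('.')
comment=physical position on the chromosome as to hg19 (1-based coordinate)

[ref]
index=3
type=CHAR(1) NULL
comment=reference nucleotide allele

[alt]
index=4
type=CHAR(1) NULL
comment=alternative nucleotide allele

[REVEL_score]
index=70
type=FLOAT NULL
adj=Nullify('.')
comment=REVEL score (placeholder column index)
EOF

# build the local database from the unpacked per-chromosome data files
vtools use dbNSFP-hg19_custom.ann --files dbNSFP*_variant.chr*
```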
Hi,
I want to have an annotation database of dbNSFP v3_5a for hg19, but I can only find a version for hg38 at https://bioinformatics.mdanderson.org/Software/VariantTools/repository/annoDB/

So I tried to follow the steps to create one, based on this link: http://varianttools.sourceforge.net/Annotation/New

I found that, compared with the official dataset https://drive.google.com/file/d/0B60wROKy6OqcNGJ2STJlMTJONk0/view, only indexes 8, 9, 10 and 11 are missing from https://bioinformatics.mdanderson.org/Software/VariantTools/repository/annoDB/dbNSFP-hg38_3_5a.ann
In the `.ann` file, I should modify the linked fields to the following:

```
[linked fields]
hg38=chr, pos, ref, alt
hg19=hg19_chr, hg18_pos, ref, alt
```
Also, I need to add the following:
```
[hg19_chr]
index=8
type=VARCHAR(20) NULL
adj=Nullify('.')
comment=chromosome as to hg19, "." means the same as in the chr column

[hg19_pos]
index=9
type=INTEGER NULL
adj=Nullify('.')
comment=physical position on the chromosome as to hg19 (1-based coordinate)

[hg18_chr]
index=10
type=VARCHAR(20) NULL
adj=Nullify('.')
comment=chromosome as to hg18, "." means the same as in the chr column

[hg18_pos]
index=11
type=INTEGER NULL
adj=Nullify('.')
comment=physical position on the chromosome as to hg18 (1-based coordinate)
```
Then, I ran the following command to create the annotation database:

```
vtools use dbNSFP-hg38_3_5a.ann --files dbNSFPv3.5a.zip
```
But I got the following error:

Any ideas? Or is there any other place that has dbNSFP v3_5a or a later version for hg19?
Thanks!