vatlab / varianttools

software tool for the manipulation, annotation, selection, and analysis of variants in the context of next-gen sequencing analysis
https://vatlab.github.io/vat-docs/
GNU General Public License v3.0
31 stars 4 forks source link

Problem in creating a new annotation database dbNSFP v3_5a for hg19 #104

Open mullinyu opened 5 years ago

mullinyu commented 5 years ago

Hi,

I want to have an annotation database of dbNSFP v3_5a for hg19. But, I can only find a version just for hg38 only at https://bioinformatics.mdanderson.org/Software/VariantTools/repository/annoDB/

Then, I tried to follow the steps to create one based on the stated link http://varianttools.sourceforge.net/Annotation/New

I found it's only missing index 8,9,10,11 in https://bioinformatics.mdanderson.org/Software/VariantTools/repository/annoDB/dbNSFP-hg38_3_5a.ann, comparing the official dataset https://drive.google.com/file/d/0B60wROKy6OqcNGJ2STJlMTJONk0/view

In the .ann file, I should modify to the following

[linked fields] hg38=chr, pos, ref, alt hg19=hg19_chr, hg18_pos, ref, alt

Also, I need to add the following:

[hg19_chr] index=8 type=VARCHAR(20) NULL adj=Nullify('.') comment=chromosome as to hg19, "." means the same as in the chr column

[hg19_pos] index=9 type=INTEGER NULL adj=Nullify('.') comment=physical position on the chromosome as to hg19 (1-based coordinate)

[hg18_chr] index=10 type=VARCHAR(20) NULL adj=Nullify('.') comment=chromosome as to hg18, "." means the same as in the chr column

[hg18_pos] index=11 type=INTEGER NULL adj=Nullify('.') comment=physical position on the chromosome as to hg18 (1-based coordinate)

Then, I ran the following the command to create the annoatation database vtools use dbNSFP-hg38_3_5a.ann --files dbNSFPv3.5a.zip

But, I got the following error:

newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a.readme.txt with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr1 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr10 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr11 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr12 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr13 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr14 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr15 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr16 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr17 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr18 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr19 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr2 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr20 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr21 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr22 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr3 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr4 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr5 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr6 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr7 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr8 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chr9 with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chrM with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chrX with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5a_variant.chrY with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5_gene with
 newline charater to unix format.roject/exomedb/.vtools_cache/dbNSFP3.5_gene.complete with
 newline charater to unix format.roject/exomedb/.vtools_cache/LICENSE.txt with
 newline charater to unix format.roject/exomedb/.vtools_cache/search_dbNSFP35a.class with
 newline charater to unix format.roject/exomedb/.vtools_cache/search_dbNSFP35a.java with
 newline charater to unix format.roject/exomedb/.vtools_cache/search_dbNSFP35a.readme.pdf with
 newline charater to unix format.roject/exomedb/.vtools_cache/try.vcf with
 newline charater to unix format.roject/exomedb/.vtools_cache/tryhg18.in with
 newline charater to unix format.roject/exomedb/.vtools_cache/tryhg19.in with
 newline charater to unix format.roject/exomedb/.vtools_cache/tryhg38.in with
INFO: Importing annotation data from /home/mullinyu/project/exomedb/.vtools_cache/dbNSFP3.5a.readme.txt.dbNSFP
Process FileReader:
Traceback (most recent call last):
  File "/home/mullinyu/anaconda2/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/mullinyu/anaconda2/lib/python2.7/site-packages/variant_tools/importer.py", line 458, in run
    env.logger.debug('Failed to process line "{}...": {}'.format(line[:20].strip(), e))
UnicodeEncodeError: 'ascii' codec can't encode character u'\uff0c' in position 16: ordinal not in range(128)

Any ideas? Or any other place having the dbNSFP v3_5a or later version for hg19

Thanks!

BoPeng commented 5 years ago

There is a way to remove the readme and java files, but for convenience, I would suggest that you unpack the zip file to a directory, remove all unnecessary files, and then create a test file with something like

head -1000 dbNSFP3.5a_variant.chr1 > test.chr1

and use it to test your .ann file (e.g. vtools use .... --files test.chr1.

If you see u'\uff0c, there might be some non-ascii characters in the annotations, which can be fixed by setting the encoding in the.ann` file if this is the case.

mullinyu commented 5 years ago

Many thanks for advice.

It's a unicode char https://www.compart.com/en/unicode/U+FF0C

I also find many unicode other chars based on the method suggested by https://stackoverflow.com/questions/3001177/how-do-i-grep-for-all-non-ascii-characters grep --color='auto' -P -n "[\x80-\xFF]" *.dbNSFP

I also checked the dnNSFP files generated by vtools based on the zip file downloaded from the official dataset are with ascii encoding file -bi dbNSFP3.5a_variant.chr10.dbNSFP text/plain; charset=us-ascii

Am I right that I should change all the generated dnNSFP files to unicode like suggested by https://www.shellhacks.com/linux-check-change-file-encoding/

[Or actaully, vtools can generate unicode datafile from zip file (official dataset) before loading into the database.]

If I really change all the dbNSFP3.5a_variant.chr[1.....22,X,Y,M].dbNSFP, how should I modify the command? Original: vtools use dbNSFP-hg38_3_5a.ann --files dbNSFPv3.5a.zip

What I should run?

Thanks!

BoPeng commented 5 years ago

Did you try to set encoding in your .ann file as documented here?

mullinyu commented 5 years ago

Yes, I tried with 2 options but got the same error.

  1. encoding=UTF-8
  2. encoding=ISO-8859-1

Here's my ann file

[linked fields] hg38=chr, pos, ref, alt hg19=hg19_chr, hg19_pos, ref, alt

[data sources] description=dbNSFP version 3.5a, maintained by Xiaoming Liu from UTSPH. Please cite: "Liu X, Jian X, and Boerwinkle E. 2011. dbNSFP: a lightweight database of human non-synonymous SNPs and their functional predictions. Human Mutation. 32:894-899. Liu X, Wu C, Li C and Boerwinkle E. 2016. dbNSFP v3.0: A One-Stop Database of Functional Predictions and Annotations for Human Non-synonymous and Splice Site SNVs. Human Mutation. 37(3):235-241." if you find this database useful. preprocessor=Dos2Unix() version=hg38_3_5a anno_type=variant source_url=ftp://dbnsfp:dbnsfp@dbnsfp.softgenetics.com/dbNSFPv3.5a.zip direct_url=anno/dbNSFP-hg38_3_5a.DB.gz 9c5c3f580f32f1f336fa6130ad9a752f source_type=txt encoding=UTF-8

Error:

INFO: Importing annotation data from /home/mullinyu/project/exomedb/.vtools_cache/dbNSFP3.5a.readme.txt.dbNSFP
Process FileReader:
Traceback (most recent call last):
  File "/home/mullinyu/anaconda2/lib/python2.7/multiprocessing/process.py", line 267, in _bootstrap
    self.run()
  File "/home/mullinyu/anaconda2/lib/python2.7/site-packages/variant_tools/importer.py", line 458, in run
    env.logger.debug('Failed to process line "{}...": {}'.format(line[:20].strip(), e))
UnicodeEncodeError: 'ascii' codec can't encode character u'\uff0c' in position 16: ordinal not in range(128)

I am just thinking how the existing annotation database (just having hg38) was created.

mullinyu commented 5 years ago

Hi,

I found the source code set "utf-8" as the default encoding but it seems not used from the error message I got. Therefore, I set it to system-level based on the link https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/

I made changes to the vtool file and no more that error.

import sys
try:
    import argparse
except ImportError:
    sys.exit('variant tools requires Python 2.7.2 or higher, or Python 3.2 '
        'or higher. Please upgrade your version ({}) of Python and try again.'
        .format(sys.version.split()[0]))

from variant_tools.utils import env
import subprocess
# save the command line that has been processed by the shell.
env.command_line = subprocess.list2cmdline(sys.argv[1:])

import variant_tools.project as project
import variant_tools.importer as importer
import variant_tools.update as update
import variant_tools.phenotype as phenotype
import variant_tools.annotation as annotation
import variant_tools.variant as variant
import variant_tools.compare as compare
import variant_tools.exporter as exporter
import variant_tools.pipeline as pipeline

if sys.platform != 'win32':
    import variant_tools.liftOver as liftOver
    import variant_tools.association as association

reload(sys)
sys.setdefaultencoding('utf-8')

def addCommonArgs(parser):

However, I got lots of warnings and errors, probably during loading the data to database (sqlite), e.g.

2019-04-26 18:21:02,385: DEBUG: Failed to process field 0.0;.;0.0;0.0;.: could not convert string to float: .
2019-04-26 18:21:02,385: DEBUG: Failed to process field -1.52;.;-1.52;-1.52;.: could not convert string to float: .
2019-04-26 18:21:02,385: WARNING: Inconsistent base allele G at 9709654 on chromosome 1
2019-04-26 18:21:02,385: DEBUG: Failed to process field 0.0;.;0.0;0.0;.: could not convert string to float: .
2019-04-26 18:21:02,385: DEBUG: Failed to process field -1.42;.;-1.42;-1.42;.: could not convert string to float: .
2019-04-26 18:21:02,386: WARNING: Inconsistent base allele G at 9709654 on chromosome 1
2019-04-26 18:21:02,386: DEBUG: Failed to process field 0.0;.;0.0;0.0;.: could not convert string to float: .
2019-04-26 18:21:02,386: DEBUG: Failed to process field -1.57;.;-1.57;-1.57;.: could not convert string to float: .
2019-04-26 18:21:02,387: DEBUG: Failed to process field 0.0;.;0.0;0.0;.: could not convert string to float: .
2019-04-26 18:21:02,387: DEBUG: Failed to process field -1.57;.;-1.57;-1.57;.: could not convert string to float: .
2019-04-26 18:21:02,387: DEBUG: Failed to process field 0.23;.;0.241;0.23;.: could not convert string to float: .
2019-04-26 18:21:02,387: DEBUG: Failed to process field -0.36;.;-0.33;-0.36;.: could not convert string to float: .
2019-04-26 18:21:02,388: WARNING: Inconsistent base allele C at 9709656 on chromosome 1
2019-04-26 18:21:02,388: DEBUG: Failed to process field 0.225;.;0.232;0.225;.: could not convert string to float: .
2019-04-26 18:21:02,388: DEBUG: Failed to process field -0.29;.;-0.28;-0.29;.: could not convert string to float: .
2019-04-26 18:21:02,388: WARNING: Inconsistent base allele C at 9709656 on chromosome 1
2019-04-26 18:21:02,389: DEBUG: Failed to process field 0.037;.;0.037;0.037;.: could not convert string to float: .
2019-04-26 18:21:02,389: DEBUG: Failed to process field -0.42;.;-0.4;-0.42;.: could not convert string to float: .

Nearly as locations got warnings that I have to terminate the job. Any advice?

Thanks!

mullinyu commented 5 years ago

Or is it possible to create an annotation database of dbNSFP3.5a with both hg38 and hg19 support and put to the web.

I can see there's an annotation database of dbNSFP2.4 with both hg18 and hg19 support on the web https://bioinformatics.mdanderson.org/Software/VariantTools/repository/annoDB/dbNSFP-hg18_hg19_2_4.ann

The official dataset contains hg38, hg19 and hg18 coordinates https://drive.google.com/file/d/0B60wROKy6OqcNGJ2STJlMTJONk0/view

Thanks!

Regards, Mullin

BoPeng commented 5 years ago

Ok. I will have a look today. Could you send me your .ann file?

mullinyu commented 5 years ago

Many thanks! Here is my version. I zip simply for uploading purpose. dbNSFP-hg38_3_5a.zip

BoPeng commented 5 years ago

Thanks. It will take a few days though because it is very time consuming to create large databases in sqlite and build indexes will take even longer.

mullinyu commented 5 years ago

Sure. Many thanks!!

BoPeng commented 5 years ago

Just download the data and tried your annotation file and noticed a bunch of

WARNING: hg19: Inconsistent base allele A at 114358819 on chromosome 2

As far as I can recall, we noticed this problem when we tried to build a hg19/hg38 version of the database. We contacted Dr. Liu, he acknowledged the problem and said that the hg19 index was obtained from some liftover mechanism from hg38, and some indexes are incorrect. This is why we did not use the hg19 indexes of this database.

So why do you want to have a hg19 version of dbNSFP v3_5a? I have not checked the recent version of dbNSFP but it might make more sense to build a database for the 4.0 version. The problem here is that dbNSFP has been getting larger and larger (v3 is 26G compressed) and downloading these files from our server has been a bit of stretching.

mullinyu commented 5 years ago

Thanks for your help.

Yes, because we want to have REVEL score and gnomAD data which are only existing in dbNSFP v3_5a or later. And also, we are using hg19 human genome.

I do agree version 4.0 has much bigger size, so I tried to work on v3_5 before and encountered those problems. It seems maybe the problem of dbNSFTP v3_5a database itself.

Any suggestion? Thanks!

BoPeng commented 5 years ago

Assuming v4 has the correct index for hg19, you could create an .ann file with only the fields you need, which will result in MUCH faster build speed and much smaller DB file. You do not even need to include the hg38 index if you do not need it. We cannot do this because we do not know which fields are useful to users but you can build local databases freely.

So I think the best way to proceed would be:

  1. start with the existing ann file for v3 and hg38
  2. download v4 source file and check if the index of the fields you want to use has been changed. Adjust the indexes of the .ann file if needed.
  3. change the hg38 fields to hg19, and remove all the fields that you do not need
  4. use the modified .ann file to build your hg19-based annotation file.