pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

SIFT4G annotator #73

Closed yaada100 closed 1 year ago

yaada100 commented 1 year ago

Dear Pauline,

I want to use SIFT4G on chromosome 11 of Homo sapiens. To do that I need a proper gene annotation file, so I used the SIFT4G Annotator as suggested. The files I used are the provided database (https://sift.bii.a-star.edu.sg/sift4g/public/Homo_sapiens/GRCh38.83.chr/11.gz) and, as the .vcf file, variants provided by ClinVar, trimmed to chromosome 11 only.

I will include the vcf file (in two halves, as .txt) as an attachment.

To get the gene annotation file, I have used both the graphical interface of the .jar and the command-line option on Linux (java -jar SIFT4G_Annotator.jar -c -i annotator/eleven.vcf -d 11 -r annotator -t)

The procedure results in empty files and the following output:

Started Running .......
Running in Multitranscripts mode
Chromosome WithSIFT4GAnnotations WithoutSIFT4GAnnotations Progress
The following chromosomes (or scaffolds/contigs) are not found in the SIFT 4G database and will not be annotated:
11
Please contact us if you have any questions.
eleven/11.regions does not exist
11 0 92107 Completed : 1/1
Merging temp files....
SIFT4G Annotation completed !
Output directory:annotator
End Time for parallel code: Wed Dec 14 13:24:00 UTC 2022

eleven_vcf.txt_1.txt eleven_vcf.txt_2.txt

It seems like my variant file is the problem. Do you have a suggestion for how I can acquire the proper file?

Best regards, Yakup

pauline-ng commented 1 year ago

Did you download the regions file in the same folder as the 11.gz file?

https://sift.bii.a-star.edu.sg/sift4g/public/Homo_sapiens/GRCh38.83.chr/11.regions
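One way to check this quickly, as a minimal sketch: the helper below is hypothetical (it is not part of SIFT4G) and reports which database files are missing from a directory. It assumes one `<chrom>.gz` and one `<chrom>.regions` file per chromosome, which is the naming seen in the download URLs in this thread.

```python
from pathlib import Path


def missing_db_files(db_dir, chroms):
    """Return the expected database files that are absent from db_dir.

    Assumption: each chromosome needs <chrom>.gz and <chrom>.regions,
    mirroring the files offered in the GRCh38.83.chr download directory.
    """
    db = Path(db_dir)
    missing = []
    for chrom in chroms:
        for suffix in (".gz", ".regions"):
            name = f"{chrom}{suffix}"
            if not (db / name).is_file():
                missing.append(name)
    return missing
```

Calling `missing_db_files("/full/path/to/db", ["11"])` with your own directory should return an empty list when both files are in place.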

yaada100 commented 1 year ago

Yes, I did.

yaada100 commented 1 year ago

Hello Pauline,

I tried to bypass my problem by using SIFT4G directly. I acquired the necessary files from NCBI Assembly for the full Homo sapiens genome (GRCh38.p13): https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39/

However, the SIFT algorithm produces the following output and no database:

Use of uninitialized value $coord in concatenation (.) or string at make-single-records-BIOPERL.pl line 576.
Use of uninitialized value $exon_num in concatenation (.) or string at make-single-records-BIOPERL.pl line 576.
Argument "" isn't numeric in addition (+) at make-single-records-BIOPERL.pl line 524.

These lines are repeated multiple times in blocks.

As files I have used:

chr-src - the genomic FASTA sequence
gene-annotation-src - the genomic GTF file
dbSNP - freebayes -f genomic.fa genomic.bam > GRCh38.vcf

The BAM file was also acquired from the same assembly.

Can you help me locate the problem? Thank you in advance for your opinion and help.

Best regards, Yakup

pauline-ng commented 1 year ago

Going back to your initial issue, your commandline:

java -jar SIFT4G_Annotator.jar -c -i annotator/eleven.vcf -d 11 -r annotator -t

is incorrect. The -d option refers to the path of the SIFT4G database directory, not the file or chromosome (the command line would only be correct if you have a directory called 11/).

Also, please use full paths.
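As a rough sketch of what that looks like, the hypothetical helper below (not part of SIFT4G) builds the argument list with absolute paths and refuses a -d value that is not a directory. The flags are copied from the command quoted above; their meaning beyond -d is not restated here.

```python
import os


def annotator_command(jar, vcf, db_dir, out_dir):
    """Build the annotator argument list with absolute paths.

    The flags (-c, -i, -d, -r, -t) are taken from the command in this thread.
    -d must point at the database directory, not at a file or chromosome.
    """
    if not os.path.isdir(db_dir):
        raise NotADirectoryError(f"-d must be a directory: {db_dir}")
    return [
        "java", "-jar", os.path.abspath(jar),
        "-c",
        "-i", os.path.abspath(vcf),
        "-d", os.path.abspath(db_dir),
        "-r", os.path.abspath(out_dir),
        "-t",
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, and printing it shows exactly which paths were used.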

yaada100 commented 1 year ago

Yes I do, the database directory is called "11". Excuse the misunderstanding.

pauline-ng commented 1 year ago

Can you 'ls' your SIFT database path to list the files and their sizes? Can you also give me your command with the full path?

Also, did you add a .txt suffix to upload the file to GitHub, or does your VCF file actually end in .txt? VCF files should end in .vcf; if yours does not, please rename it and try again.

yaada100 commented 1 year ago

Hello Pauline,

Excuse the late response. Of course, I have attached the ls output to this message.

The command looked like this: nohup perl make-SIFT-db-all.pl -config ../test_data/test_config.txt &

No, the .vcf file has not been renamed; it ends in .vcf. The attachment was just a .txt version so that it could be uploaded to GitHub. However, the .vcf file I actually used is not the one included here. I used the vcf file included on NCBI Assembly for GRCh38.p13, their genome file, and a gtf file obtained from the assembly's gff3 file.

https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39/

[Attached image: ls output]

pauline-ng commented 1 year ago

If GRCh38.13 is empty, the database was not built correctly.

I would like to stick to one problem at a time. There is no reason you can't use https://sift.bii.a-star.edu.sg/sift4g/public/Homo_sapiens/GRCh38.83.chr

Please provide:

  1. the full command to java -jar SIFT4G_Annotator.jar, using full paths

  2. the files and their sizes for the directory passed in -d

Also, you wrote: " I used the vcf file included on NCBI Assembly for GRCh38.p13"

I'm not sure what this means; if you have a path, that would be helpful.

We are going to ignore perl make-SIFT-db-all.pl and all of that because the SIFT database is already available. It's the annotation that's giving an issue.

yaada100 commented 1 year ago

Thank you for your fast response. My purpose in using SIFT is to inspect the SIFT_alignment and SIFT_predictions files, possibly making changes. That is why I am using SIFT4G. The annotator does not deliver these as output, does it? So that issue is probably not relevant to my approach.

My main issue is with the output of the SIFT4G computation.

NCBI assembly provides a number of files relating to a specific version of a genome assembly.

My files:

vcf file: I acquired the .bam file from NCBI and used freebayes to get the .vcf file:
nohup freebayes -f parent_dir/chr-src/GCF_000001405.39_GRCh38.p13_genomic.fna parent_dir/dbSNP/GCF_000001405.39_GRCh38.p13_knownrefseq_alns.bam > GRCh38_p13.vcf &

gtf file: Genomic GTF on NCBI Assembly

chr-src: Genomic Fasta on NCBI Assembly

All files are provided here: https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.39/ Clicking Download Assembly (top right corner) lets you choose a database (RefSeq) and the files that I have referred to.

[Attached image: ncbi_assembly]

Thanks for your attention. I’m looking forward to your helpful advice.

pauline-ng commented 1 year ago

I'm not sure why you'd need a bam file.

Can you build the sample human database? As stated above, if GRCh38.13 is empty, the database was not built correctly.

yaada100 commented 1 year ago

I had no vcf file, so I converted the bam file into a vcf file. Yes, the test sample works for me.

Do you have a suggestion as to why it is not working?

pauline-ng commented 1 year ago

There are 2 steps here.

  1. You want to build the database without the protein alignments. Because your GRCh38.13 is empty, this means you haven't done this yet.

  2. Processing your VCF file. Are you sure your VCF file has coding variants?

yaada100 commented 1 year ago

To 1: How would I include the protein alignments? Does SIFT4G's algorithm not include the protein alignment step?

To 2: The rows in my vcf file include information on whether the variant is protein coding or not, if that is what you meant.

pauline-ng commented 1 year ago
  1. You don't include protein alignments. You are not building the database correctly. I don't know why; I'm just informing you that this step isn't complete.

  2. No, it doesn't need to include the annotation, but it does need to contain coding variants; otherwise there's nothing to annotate.
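To get a rough idea of whether a VCF file contains any coding variants, something along these lines could be used. It is only a sketch with hypothetical helper names, not how SIFT4G itself decides: it collects the CDS features from a GTF file (feature type in column 3, 1-based inclusive start and end in columns 4 and 5) and counts VCF records whose POS falls inside one of them. It assumes the sequence names match between the two files (for example "11" in both, not "11" in one and "chr11" in the other), and the linear scan is slow for a whole genome.

```python
def load_cds_intervals(gtf_lines):
    """Collect (start, end) CDS intervals per sequence name from GTF lines."""
    cds = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        cds.setdefault(fields[0], []).append((int(fields[3]), int(fields[4])))
    return cds


def count_coding_variants(vcf_lines, cds):
    """Count VCF records whose POS lies inside any CDS interval."""
    count = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        chrom, pos = fields[0], int(fields[1])
        # Simple linear scan; adequate for one chromosome, slow genome-wide.
        if any(start <= pos <= end for start, end in cds.get(chrom, [])):
            count += 1
    return count
```

Both functions take any iterable of lines, so open file handles can be passed directly. A count of zero would suggest there is nothing for the annotator to work on, or that the sequence names differ between the files.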

pauline-ng commented 1 year ago

Closed due to inactivity.