pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0
21 stars 7 forks source link

Issue annotating VCF files with SIFT database #64

Closed frankiefattorini closed 1 year ago

frankiefattorini commented 1 year ago

Hello,

I've created a SIFT database for chromosome 3 of Dictyostelium discoideum using fasta and gff (converted to gtf) files. Running more CHECK_GENES.LOG gave the output: Chr Genes with SIFT Scores Pos with SIFT scores Pos with Confident Scores DDB0232430 83 (3798/4567) 91 (14959896/16519059) 74(10997369/14959896) When attempting to annotate vcf files containing SNPs within my genes of interest, however, I received the output: Started Running ....... Running in Single transcripts mode Chromosome WithSIFT4GAnnotations WithoutSIFT4GAnnotations Progress test_files/output/dicty_2.7/DDB0232430.regions does not exist DDB0232430 0 450 Completed : 1/1 Both of my genes of interest seem to have been included in the various files produced for the alignment. I did not include the vcf file during in the 'dbSNP' file during the database creation process, so could this be the issue? If so, is there a way to create the SIFT database using several VCF files or a merged VCF file containing SNP information for different strains of an organism with different SNPs within the same gene sequence, as I wish to look at a few hundred VCF files and the SNP effects on protein structure. Thank you

pauline-ng commented 1 year ago

Hi,

dbSNP VCF file is optional. Weird-- so your database has been built.

When you ls test_files/output/dicty_2.7, what do you see? Do you see the regions file?

Also, I'm not familiar with Dictyosteliad discoideum. In your VCF file, the chromosome is DDB0232430 ? Or under the #CHROM column, does it say 3 or chr3?

Best, Pauline

frankiefattorini commented 1 year ago

Hi,

with ls test_files/output/dicty_2.7 I get CHECK_GENES.LOG DDB0232430.gz dictyostelium_discoideum_config.txt. The chromosome is DDB0232430 yes, not 3 or chr3.

pauline-ng commented 1 year ago

Very odd. Can you list the directories and their contents? Specifically, I'm interested in

Also, when please confirm zcat test_files/output/dicty_2.7/DDB0232430.gz shows predictions (it's not an empty file)

In the folder test_files/output/dicty_2.7, there should also be a DDB0232430_SIFTDB_stats.txt and a DDB0232430.regions, not sure why these files are missing.

frankiefattorini commented 1 year ago

I can't find directories with those names but within test_files/output there are directories 'singleRecords' and 'singleRecords_with_scores'. singleRecords_with_scores is empty while singleRecords contains: DDB0232430.invalidRecords DDB0232430.singleRecords DDB0232430.singleRecords_noncoding DDB0232430.singleRecords_noncoding.with_dbSNPid DDB0232430.singleRecords_proteins.fa Running zcat test_files/output/dicty_2.7/DDB0232430.gz does show predictions.

frankiefattorini commented 1 year ago

I'm not sure if this is relevant but Dictyostelium discoideum is a haploid organism. However, the region within which my genes of interest are located looks like it may have undergone a duplication in some strains, so vcf files for these strains contian apparently heterozygous SNPs. Could this pose a problem?

pauline-ng commented 1 year ago

I don't think that would pose a problem because SIFT works on plenty of diploid organisms. We tested it and used it extensively on humans.

The command perl make-sift-scores-db-batch.pl $metafile did not complete for some reason. This program creates the region file and the stats file. It's a super simple script, not sure why it didn't finish.

Unfortunately, I don' think you can just run the command again because there was some cleanup at the end, so input files are probably missing.

Can you rerun making the database, but comment out the lines that remove files (it'll help me debug)

comment out the following lines: line 157 in make-SIFT-db-all.pl:
Original: system ("rm $rm_dir");
New: # system ("rm $rm_dir");

lines 69 -70 in make-sift-scores-db-batch.pl Add "#" in front of the system ("rm ...."); commands

Then go ahead and remake the database.

Assuming it still doesn't finish, I would like to know the stdout of perl make-sift-scores-db-batch.pl $metafile

frankiefattorini commented 1 year ago

Thank you. Just to check, when you say comment out do you mean I should edit the text of the files where you specify, before rerunning the database making script?

pauline-ng commented 1 year ago

Yes, that's right. the "#" will make these lines comments, so they won't run, the files won't be removed, they will be kept.

I want you to keep all the files, it makes it easier to rerun parts of SIFT and debug.

Hope that's clear.

frankiefattorini commented 1 year ago

I've rerun the database creation after turning those rm lines into comments, then I ran perl make-sift-scores-db-batch.pl $metafile. The output was: Executing perl map-scores-back-to-records.pl test_files/dictyostelium_discoideum_config.txt DDB0232430 Executing perl match-replace-dbSNP-vcf.pl test_files/dictyostelium_discoideum_config.txt DDB0232430 db file test_files/output/singleRecords_with_scores/DDB0232430_scores.Srecords.with_dbSNPid noncoding test_files/output/singleRecords/DDB0232430.singleRecords_noncoding.with_dbSNPidfinal outfolder is test_files/output/dicty_2.7 python make_regions_file.py test_files/output/singleRecords_with_scores/DDB0232430_scores.Srecords.with_dbSNPid.sorted test_files/output/dicty_2.7/DDB0232430.regions Can't exec "python": No such file or directory at make-sift-scores-db-batch.pl line 66. Built single records test_files/output/singleRecords

frankiefattorini commented 1 year ago

I believe I've solved the issue. I've been using Ubuntu as a Linux portal for Windows and it seems Ubuntu uses the python3 command rather than the python command. As a result, make-sift-scores-db-batch.pl could not excecute python. I installed python-is-python3 and reran the VCF annotation. Now the output is: Chromosome WithSIFT4GAnnotations WithoutSIFT4GAnnotations Progress DDB0232430 450 0 Completed : 1/1 Thank you for your help

pauline-ng commented 1 year ago

Excellent!