pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

building database #63

Closed by pauline-ng 1 year ago

pauline-ng commented 1 year ago

Email from user

I went through the tutorial on GitHub and created my own database for mus_musculus GRCm39.106. I then compared it to the one you have online (GRCm38.83) and noticed that:

  1. For most chromosomes, the newer version has larger files, which I think makes sense because the genome was updated.
  2. The online database has a total of 137 files in its folder, while the one I created has only 119.
  3. When running SIFT4G with the new database on previously run samples, I get fewer hits. For example, in one VCF I tested, the old database gave 53 sites in the output while the new one gave only 39.

I have attached screenshots of the errors I got during database creation. It seems that some files (I am not sure what these are) weren't in the directory, so they couldn't be processed. I also compared the "CHECK_GENES.LOG" files, and the online database lists six more of these GL or JH entries. That accounts for the 18 missing files (137 - 119 = 18), as each record gets three files. My questions are:

  1. What are these files, and how important are they?
  2. Any idea why I am missing six from the new assembly?

One suspicion I had was that my internet connection was too slow when downloading the Ensembl files, so some may have been left out. I am planning to repeat the download somewhere with a better connection to test this. But I also wanted to check whether you had any additional ideas.

pauline-ng commented 1 year ago

Not downloading all the chromosomes would explain the results you are describing.

I think you can download the files manually and, once you have made sure they are complete (by comparing either file sizes or checksums against the online copies), re-run SIFT. It should not re-download the files, but go straight to running SIFT.
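As a quick local check to go alongside the size or checksum comparison, a truncated download of a gzip-compressed file can usually be caught with `gzip -t`, which tests the archive without unpacking it. The sketch below is only an illustration: the function name `check_gz_files` is made up here, and it assumes the downloaded files are `.gz` archives sitting in the `chr-src` and `gene-annotation-src` folders mentioned in this thread.

```shell
# check_gz_files is a hypothetical helper, not part of the SIFT4G scripts.
# It runs "gzip -t" on every .gz file in the given directories and
# returns non-zero if any file is truncated or corrupt.
check_gz_files() {
  status=0
  for dir in "$@"; do
    for f in "$dir"/*.gz; do
      # Skip the unexpanded pattern when a directory has no .gz files.
      [ -e "$f" ] || continue
      if gzip -t "$f" 2>/dev/null; then
        echo "OK   $f"
      else
        echo "BAD  $f"
        status=1
      fi
    done
  done
  return "$status"
}

# Assumed layout: run from the database working directory.
check_gz_files chr-src gene-annotation-src
```

Any file reported as `BAD` is worth downloading again. Passing this test only shows that each archive is intact; it does not show that every expected file was downloaded, so the comparison against the online listing is still needed.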

If you do this, remove these files and folder contents before re-running:

rm all_prot.fasta
rm fasta/*
rm gene-annotation-src/protein_coding_genes.txt
rm gene-annotation-src/noncoding.txt
rm SIFT_predictions/*
rm singleRecords/*
rm singleRecords_with_scores/*
rm subst/*

Only chr-src, dbSNP, and gene-annotation-src/*gtf.gz should be present when you re-run SIFT.
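Once the re-run has finished, one way to see which output files are still missing is to compare the file listing of the rebuilt database folder with that of a reference folder. This is a rough sketch, not part of the SIFT4G tooling: `only_in_first` is a made-up helper name, and the two directory arguments are placeholders for your own folders. Scaffold names can differ between assemblies, so a name that appears only on one side is not necessarily an error.

```shell
# only_in_first is a hypothetical helper: it prints the names of files
# that exist in the first directory but not in the second.
only_in_first() {
  a=$(mktemp)
  b=$(mktemp)
  # Sort both listings in the C locale so comm sees a consistent order.
  ls "$1" | LC_ALL=C sort > "$a"
  ls "$2" | LC_ALL=C sort > "$b"
  # -2 drops names unique to the second listing, -3 drops shared names.
  LC_ALL=C comm -23 "$a" "$b"
  rm -f "$a" "$b"
}
```

For example, `only_in_first path/to/reference_db path/to/new_db` (both paths are placeholders) lists what the reference folder has that the new one lacks, and `ls path/to/new_db | wc -l` gives the total file count to compare with the 137 and 119 figures above.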