Closed by pauline-ng 2 years ago
Not downloading all the chromosomes would explain the results you are describing.
I think you can download the files manually. Once you have made sure the files are complete (by comparing either their sizes or their checksums against the online copies), re-run SIFT. It should not re-download the files; it should go straight to running SIFT.
If you do this, remove these files and folders first:
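One quick way to spot an incomplete download, in addition to comparing sizes or checksums, is to test each compressed file's integrity. The sketch below assumes the downloaded files are gzip-compressed and sit in a single directory such as chr-src; the function name `find_bad_gz` is only illustrative and is not part of SIFT.

```shell
#!/bin/sh
# Print every .gz file in a directory that fails gzip's integrity test.
# A download that was cut short normally fails `gzip -t`.
# Assumption: the files to check are gzip-compressed (e.g. chr-src/*.gz).
find_bad_gz() {
    dir=$1
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue          # the glob matched nothing
        if ! gzip -t "$f" 2>/dev/null; then
            printf '%s\n' "$f"           # report the damaged file
        fi
    done
}

# Example: find_bad_gz chr-src
```

Any file this prints should be downloaded again. A file that passes is intact as a gzip archive, but only a size or checksum comparison against the online copy confirms that it is the right file, and this check cannot tell you whether a file is missing altogether.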
rm all_prot.fasta
rm fasta/*
rm gene-annotation-src/protein_coding_genes.txt
rm gene-annotation-src/noncoding.txt
rm SIFT_predictions/*
rm singleRecords/*
rm singleRecords_with_scores/*
rm subst/*
Only chr-src, dbSNP, and gene-annotation-src/*gtf.gz should be present when you re-run SIFT.
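The cleanup commands above can be collected into one small function so they are easy to repeat. This is only a sketch: `clean_sift_outputs` is a made-up name, and its argument is assumed to be the directory that contains the folders listed above.

```shell
#!/bin/sh
# Remove intermediate and output files from a previous SIFT database build,
# leaving chr-src, dbSNP and gene-annotation-src/*gtf.gz untouched.
# Assumption: $1 is the directory that holds the folders named in this thread.
clean_sift_outputs() {
    db=$1
    [ -n "$db" ] || { echo "usage: clean_sift_outputs DIR" >&2; return 1; }

    # Single files; -f keeps rm quiet if a file is already gone.
    rm -f -- "$db/all_prot.fasta" \
             "$db/gene-annotation-src/protein_coding_genes.txt" \
             "$db/gene-annotation-src/noncoding.txt"

    # Empty each output folder but keep the folder itself.
    for sub in fasta SIFT_predictions singleRecords \
               singleRecords_with_scores subst; do
        [ -d "$db/$sub" ] || continue
        rm -rf -- "$db/$sub"/*
    done
}

# Example: clean_sift_outputs /path/to/your/database/dir
```

Folders that do not exist are skipped, so the function can be run again safely after a partial cleanup.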
Email from user
I went through the tutorial on GitHub and created my own database for mus_musculus GRCm39.106. I then compared it to the one you have online (GRCm38.83), and I noticed that:
I have attached screenshots of the errors I got during database creation. It seems that some files (I am not sure what these are) weren't in the directory, so they couldn't be processed. I also compared the "CHECK_GENES.LOG" file, and there are six more of these GL or JH files, which accounts for the 18 missing files, as each record gets three files. My questions are:
One suspicion I had was that my internet connection was too slow when downloading the Ensembl files, so some may have been left out. I am planning to repeat this somewhere with a better connection to test it. But I also wanted to check whether you had any additional ideas.