pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

Can not build my own SIFT database #69

Closed XIAO2Mark closed 2 years ago

XIAO2Mark commented 2 years ago

Hi,

I have tried many times to build my own SIFT database with SIFT4G, but without success. Can you help me check what is going wrong? The details are below:

Searching database for candidate sequences
processing database part 1 (size ~0.25 GB): 100.00/100.00% *
Aligning queries with candidate sequences
processing database part 1 (size ~1.00 GB): 100.00/100.00% *
Selecting alignments with median threshold: 2.75
processing queries: 100.00/100.00% *
Generating SIFT predictions with sequence identity: 100.00%
processing queries: 100.00/100.00% *
done getting all the scores
populating databases
cat: /home/scripts_to_build_SIFT_db/test_files/singleRecords/Chr1.singleRecords: No such file or directory
can't open /home/scripts_to_build_SIFT_db/test_files/singleRecords/Chr1.singleRecords at map-scores-back-to-records.pl line 122.
Unable to read from /home/scripts_to_build_SIFT_db/test_files/singleRecords_with_scores/Chr1_scores.Srecords
cat: /home/scripts_to_build_SIFT_db/test_files/singleRecords/Chr1.singleRecords_noncoding.with_dbSNPid: No such file or directory
Traceback (most recent call last):
  File "make_regions_file.py", line 68, in <module>
    get_regions (chrom_file, out_file)
  File "make_regions_file.py", line 31, in get_regions
    pos = get_pos (first_line)
  File "make_regions_file.py", line 8, in get_pos
    return int (fields[0])
ValueError: invalid literal for int() with base 10: ''
rm: cannot remove '/home/scripts_to_build_SIFT_db/test_files/singleRecords_with_scores/Chr1_scores.Srecords': No such file or directory

many thx.

pauline-ng commented 2 years ago

Are you able to make a database from the test file?

If yes, then can you paste below

XIAO2Mark commented 2 years ago

Hi Ng,

thank you so much!

No, I cannot make the database when I use the example file. Details are below:


done making the fasta sequences
start siftsharp, getting the alignments
cat: './test_files/homo_sapiens_small/fasta/*.fasta': No such file or directory
/bigdrive/sift4g/bin/sift4g -d /bigdrive/SIFT_databases/uniprot_sprot.fasta -q ./test_files/homo_sapiens_small/all_prot.fasta --subst ./test_files/homo_sapiens_small/subst --out ./test_files/homo_sapiens_small/SIFT_predictions --sub-results

Additionally, PROTEIN_DB is the file that I downloaded from the UniProt database (wget https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz)

Running SIFT 4G

SIFT4G_PATH=/home/bin/sift4g
PROTEIN_DB=/SIFT_databases/uniprot_sprot.fasta

could you pls help me to check it again? Many thx.

pauline-ng commented 2 years ago

In the config file, please change to full paths. SIFT does not work with relative paths.
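
Editor's note: a quick way to spot relative paths before running the pipeline is to scan the config for values that look like paths but do not start with `/`. The following is only a minimal sketch, not part of the SIFT4G scripts. It assumes the config consists of `KEY=VALUE` lines with `#` comments; `SIFT4G_PATH` and `PROTEIN_DB` are taken from the thread, while the `PARENT_DIR` key and the helper name `find_relative_paths` are assumptions made for illustration.

```python
import posixpath


def find_relative_paths(config_text):
    """Return (key, value) pairs whose value looks like a path but is not absolute.

    Hypothetical helper: assumes KEY=VALUE lines and '#' comment lines.
    """
    offenders = []
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Skip blanks, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip()
        # URLs (e.g. download sites) contain '/' but are not filesystem paths.
        if "://" in value:
            continue
        looks_like_path = "/" in value or value.startswith(".")
        if looks_like_path and not posixpath.isabs(value):
            offenders.append((key, value))
    return offenders


# PARENT_DIR is an assumed key name; the relative path is the one seen in the log above.
example = """\
#Running SIFT 4G
SIFT4G_PATH=/home/bin/sift4g
PROTEIN_DB=/SIFT_databases/uniprot_sprot.fasta
PARENT_DIR=./test_files/homo_sapiens_small
"""

for key, value in find_relative_paths(example):
    print(f"{key} uses a relative path: {value}")
```

Any key the sketch reports would need to be rewritten as a full path starting from `/`.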

XIAO2Mark commented 2 years ago

Thanks. I followed your suggestion, but it still does not work.

done making the fasta sequences
start siftsharp, getting the alignments
cat: '/home/SIFT/scripts_to_build_SIFT_db/ET/fasta/*.fasta': No such file or directory
Checking query data and substitutions files
EXITING! No valid queries to process.

pauline-ng commented 2 years ago

This is the output of the test file? Can you list the files that were generated in the test file directory and their sizes?

Also, please resend your config file.

XIAO2Mark commented 2 years ago

Yes, the details are below:


homo_sapiens_small/* -shc
0     homo_sapiens_small/all_prot.fasta
46M   homo_sapiens_small/chr-src
25M   homo_sapiens_small/dbSNP
4.0K  homo_sapiens_small/fasta
0     homo_sapiens_small/fasta.log
13M   homo_sapiens_small/gene-annotation-src
4.0K  homo_sapiens_small/GRCh38.83
0     homo_sapiens_small/invalid.log
0     homo_sapiens_small/Log2.txt
0     homo_sapiens_small/peptide.log
4.0K  homo_sapiens_small/SIFT_alignments
4.0K  homo_sapiens_small/SIFT_predictions
4.0K  homo_sapiens_small/singleRecords
4.0K  homo_sapiens_small/singleRecords_with_scores
4.0K  homo_sapiens_small/subst
83M   total


Here is the config file, thanks: homo_sapiens-test.txt

pauline-ng commented 2 years ago

can you list all files & their sizes in each directory?

XIAO2Mark commented 2 years ago

├── [ 839] arabidopsis_config.txt
├── [1.3K] candidatus_carsonella_ruddii_pv_config.txt
├── [4.0K] homo_sapiens_small
│   ├── [   0] all_prot.fasta
│   ├── [4.0K] chr-src
│   │   ├── [ 12K] directory.index
│   │   ├── [ 45M] Homo_sapiens.GRCh38.dna.chromosome.21.fa
│   │   └── [ 17K] Homo_sapiens.GRCh38.dna.chromosome.MT.fa
│   ├── [4.0K] dbSNP
│   │   └── [ 24M] Homo_sapiens_trimmed.vcf.gz
│   ├── [4.0K] fasta
│   ├── [   0] fasta.log
│   ├── [4.0K] gene-annotation-src
│   │   ├── [514K] Homo_sapiens.GRCh38.83_trimmed.gtf.gz
│   │   └── [ 12M] Homo_sapiens.GRCh38.pep.all.fa.gz
│   ├── [4.0K] GRCh38.83
│   ├── [   0] invalid.log
│   ├── [   0] Log2.txt
│   ├── [   0] peptide.log
│   ├── [4.0K] SIFT_alignments
│   ├── [4.0K] SIFT_predictions
│   ├── [4.0K] singleRecords
│   ├── [4.0K] singleRecords_with_scores
│   └── [4.0K] subst
├── [ 883] homo_sapiens-test.txt
└── [1.3K] saccharomyces_cerevisiae-template.txt

pauline-ng commented 2 years ago

That's weird, it's not even going through the first step. Can you paste everything that shows up on the terminal when you run the command?

pauline-ng commented 2 years ago

And I assume you have perl and python installed?

XIAO2Mark commented 2 years ago

Yes, I have Perl and Python installed. I copied the genome sequence (genome.fa) to '/home/SIFT/scripts_to_build_SIFT_db/ET/fasta/*.fasta', and now it is working. I am now waiting for the final results.

pauline-ng commented 2 years ago

Great!

XIAO2Mark commented 2 years ago

Many thanks, Ng. It is still running. May I ask whether the files in the 'homo_sapiens_small/fasta/' folder are protein sequences or genome sequences? Also, the file all_prot.fasta contains the protein sequences, right?

many thanks.

pauline-ng commented 2 years ago

chr-src/ should contain DNA sequence of the genome
fasta/ should contain protein sequence. This is generated from chr-src and gene-annotation-src
all_prot.fasta should be all of the protein sequences in the genome. It comes from combining all of the protein sequences in fasta/
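
Editor's note: the dependency chain described above (chr-src/ and gene-annotation-src/ feed fasta/, which feeds all_prot.fasta) suggests a simple way to find where a run stalled: check each stage in order and report the first empty one. The sketch below is an illustration under assumptions, not code from the repository. The function name `diagnose` and the wording of its messages are hypothetical; the directory and file names are the ones listed in this thread.

```python
import glob
import os
import tempfile


def diagnose(parent_dir):
    """Return a list of problems found in the working directory, in pipeline order.

    Hypothetical helper: only checks that each stage produced some output.
    """
    problems = []

    # Stage 0: genome DNA supplied by the user or downloaded.
    if not glob.glob(os.path.join(parent_dir, "chr-src", "*")):
        problems.append("chr-src/ is empty: no genome DNA to start from")

    # Stage 1: protein sequences generated from chr-src/ and gene-annotation-src/.
    if not glob.glob(os.path.join(parent_dir, "fasta", "*.fasta")):
        problems.append("fasta/ has no *.fasta files: protein sequences were not generated")

    # Stage 2: all protein sequences combined into one query file.
    all_prot = os.path.join(parent_dir, "all_prot.fasta")
    if not os.path.isfile(all_prot) or os.path.getsize(all_prot) == 0:
        problems.append("all_prot.fasta is missing or empty: there are no queries to align")

    return problems


# Recreate the situation from the listing above: DNA present, fasta/ empty,
# and a 0-byte all_prot.fasta.
with tempfile.TemporaryDirectory() as parent:
    os.makedirs(os.path.join(parent, "chr-src"))
    os.makedirs(os.path.join(parent, "fasta"))
    with open(os.path.join(parent, "chr-src", "Homo_sapiens.GRCh38.dna.chromosome.21.fa"), "w") as fh:
        fh.write(">21\nACGT\n")
    open(os.path.join(parent, "all_prot.fasta"), "w").close()
    problems = diagnose(parent)

for problem in problems:
    print(problem)
```

In the listing posted earlier in the thread, fasta/ was empty and all_prot.fasta was 0 bytes, which is the pattern this check reports. Because fasta/ is meant to hold generated protein sequences, an empty fasta/ points at the protein-generation step rather than at SIFT4G itself.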