pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0
25 stars 7 forks source link

Use of uninitialized value $fasta_subseq in concatenation (.) or string at generate-fasta-subst-files-BIOPERL.pl line 446, <IN_TX> line 62209. #68

Closed thuynnguyen closed 2 years ago

thuynnguyen commented 2 years ago

Hi pauline-ng,

Can you please help me? I am getting the warning in the subject line repeatedly when attempting to generate a SIFT database for the Asian elephant genome. The program does not successfully generate a database. I believe it doesn't move past running "make-single-records-BIOPERL.pl".

My genome fasta, gtf, and peptide fasta files are downloaded from NCBI here: https://www.ncbi.nlm.nih.gov/assembly/GCF_024166365.1. Attached you will find my config file and the file tree of the parent directory.

Best regards, Thuy

elemax1_config.txt

elemax_tree.txt

pauline-ng commented 2 years ago

A lot of warnings are printed when it runs make-single-records-BIOPERL.pl. I would ignore them. When I look at your list of files, it looks to me it's stopping at the sift4g step.

Looking at your config file, is your SIFT4G_PATH truly /sift4g/bin/sift4g ?

Did you run the test files and were you able to make a SIFT4G database from the test file?

thuynnguyen commented 2 years ago

I just double checked the path for sift4g. And yes it is in /sift4g/bin/sift4g.

I was able to successfully run the test files for both Candidatus carsonella and Homo sapiens chromosomes. These generated databases and high percentage of genes, positions, and predicted SIFT scores in the CHECK_GENES.LOG.

pauline-ng commented 2 years ago

Can you send the sizes of your files in elemax_tree.txt?

Can you also give me:

grep ">" chr-src/asian_elephant.elemax1.fa | head -10

thuynnguyen commented 2 years ago
root@tnn-toolbox:/home/thuynguyen/220830_siftdb_elemax1/SIFT_databases/asian_elephant# grep '>' chr-src/asian_elephant.elemax1.fa |head -10 
>NC_064819.1 Elephas maximus indicus isolate mEleMax1 chromosome 1, mEleMax1 primary haplotype, whole genome shotgun sequence
>NC_064820.1 Elephas maximus indicus isolate mEleMax1 chromosome 2, mEleMax1 primary haplotype, whole genome shotgun sequence
>NC_064821.1 Elephas maximus indicus isolate mEleMax1 chromosome 3, mEleMax1 primary haplotype, whole genome shotgun sequence
>NW_026060228.1 Elephas maximus indicus isolate mEleMax1 chromosome 3 unlocalized genomic scaffold, mEleMax1 primary haplotype SUPER_3_unloc_1, whole genome shotgun sequence
>NW_026060229.1 Elephas maximus indicus isolate mEleMax1 chromosome 3 unlocalized genomic scaffold, mEleMax1 primary haplotype SUPER_3_unloc_2, whole genome shotgun sequence
>NC_064822.1 Elephas maximus indicus isolate mEleMax1 chromosome 4, mEleMax1 primary haplotype, whole genome shotgun sequence
>NC_064823.1 Elephas maximus indicus isolate mEleMax1 chromosome 5, mEleMax1 primary haplotype, whole genome shotgun sequence
>NC_064824.1 Elephas maximus indicus isolate mEleMax1 chromosome 6, mEleMax1 primary haplotype, whole genome shotgun sequence
>NC_064825.1 Elephas maximus indicus isolate mEleMax1 chromosome 7, mEleMax1 primary haplotype, whole genome shotgun sequence
>NW_026060230.1 Elephas maximus indicus isolate mEleMax1 chromosome 7 unlocalized genomic scaffold, mEleMax1 primary haplotype SUPER_7_unloc_1, whole genome shotgun sequence

root@tnn-toolbox:/home/thuynguyen/220830_siftdb_elemax1/SIFT_databases/asian_elephant# head -10 gene-annotation-src/protein_coding_genes.txt 
NC_064825.1:-1:112634176:112635094:1:112634176:112635094:XM_049891905.1:LOC126080726::XP_049747862.1::protein_coding
NC_064846.1:-1::0:0:::XR_007515748.1:LOC126068633::::protein_coding
NC_064827.1:1:81216063:81376203:98:81216063,81235777,81237097,81238149,81246412,81247238,81248276,81250993,81253607,81254264,81256067,81257504,81258141,81259008,81259409,81267869,81269197,81269590,81271922,81275550,81275778,81277589,81286626,81289487,81290584,81295135,81296503,81296964,81297574,81300541,81301723,81303867,81304879,81305176,81306234,81306785,81307650,81308677,81310048,81311187,81312753,81313342,81313585,81315679,81316787,81318431,81319840,81320468,81321995,81323060,81323371,81323972,81324826,81326303,81328663,81329202,81330135,81331069,81331392,81332655,81333579,81335281,81337917,81338151,81338399,81338987,81341064,81341902,81342743,81343337,81343511,81344240,81344464,81345361,81346080,81347242,81348242,81350362,81350790,81351314,81352860,81353497,81354602,81355748,81358227,81361014,81365168,81366902,81367136,81367706,81369222,81369408,81370283,81371572,81372222,81373464,81374653,81375857:81216313,81235848,81237256,81238272,81246584,81247345,81248397,81251257,81253752,81254386,81256340,81257646,81258269,81259122,81259559,81268052,81269293,81269718,81272067,81275663,81275938,81277755,81286754,81289742,81290671,81295330,81296634,81297112,81297701,81300696,81302002,81304146,81304988,81305346,81306389,81306882,81307746,81308854,81310206,81311311,81312929,81313439,81313705,81315841,81317069,81318599,81319954,81320555,81322187,81323234,81323488,81324115,81324977,81326522,81328777,81329390,81330232,81331231,81331506,81332858,81333661,81335413,81338061,81338311,81338524,81339171,81341162,81342169,81342928,81343419,81343648,81344382,81344609,81345489,81346237,81347364,81348407,81350470,81350925,81351449,81353051,81353711,81354737,81355939,81358309,81361284,81365318,81367040,81367258,81367944,81369285,81369531,81370418,81371701,81372543,81373584,81374780,81376203:XM_049896977.1:LOC126083338::XP_049752934.1::protein_coding
NC_064837.1:1:66218775:66231367:3:66218775,66221998,66231245:66218867,66222156,66231367:XM_049859437.1:LOC126062206::XP_049715394.1::protein_coding
NC_064823.1:1:156200371:156252615:15:156200371,156200810,156216019,156220656,156222551,156233650,156234495,156234917,156239299,156241325,156243013,156245922,156248257,156250436,156252508:156200387,156200932,156216110,156220804,156222732,156233728,156234576,156235151,156239392,156241477,156243101,156246030,156248431,156250581,156252615:XM_049886505.1:LOC126077303::XP_049742462.1::protein_coding
NC_064821.1:-1:22125038:22125974:1:22125038:22125974:XM_049881421.1:LOC126074089::XP_049737378.1::protein_coding
NC_064822.1:1:108960128:108979914:12:108960128,108960787,108960902,108961496,108963291,108965006,108966212,108970642,108971257,108972233,108973707,108979877:108960197,108960811,108960935,108961613,108963423,108965134,108966300,108970715,108971296,108972289,108973877,108979914:XM_049883294.1:LOC126075491::XP_049739251.1::protein_coding
NC_064838.1:1:47368827:47375645:5:47368827,47370513,47371260,47373522,47374623:47368917,47370640,47371374,47373624,47375645:XM_049862300.1:LOC126063849::XP_049718257.1::protein_coding
NC_064825.1:-1:45700199:45701273:1:45700199:45701273:XM_049890639.1:LOC126079547::XP_049746596.1::protein_coding
NC_064821.1:1:194001105:194042714:5:194001105,194038208,194039336,194040765,194042660:194001266,194038314,194039437,194040961,194042714:XM_049880776.1:LOC126073752::XP_049736733.1::protein_coding
thuynnguyen commented 2 years ago

Apologies for the terrible formatting. Here are the first 10 lines of the headers of the fasta file and the first 10 lines of the protein coding genes txt file.

thuynnguyen commented 2 years ago

Here is the ls -lh output of the tree elemax_tree_info.txt

thuynnguyen commented 2 years ago

Thank you so much for taking the time to help me!

pauline-ng commented 2 years ago

Assuming the files in singleRecords/ are not empty, it looks like everything is being generated.

Maybe you didn't wait long enough for SIFT to run? SIFT takes time -- so if you want to check things are running, you couldcd singleRecords; ls -lth and see that a file is being updated. SIFT can take ~24 hours for 1 genome, also depends on your CPU/GPU.

Your file structure and contents look good to me--- based on what I've seen so far, I would say you just didn't wait long enough. Can you try running it again and wait 48 hours before checking the results?

thuynnguyen commented 2 years ago

Thank you for the handy command. It looks like the single records files are not empty.

I have a really dumb question though, are you asking me to wait 48 hours after seeing "All done" printed out in the terminal window? Should I leave my VM open for that long after the program is done?

Either way, I'll go ahead and run it again, wait, and then get back to you.

pauline-ng commented 2 years ago

Yes. I'm expecting it to be done after 24 hours, but I don't know how powerful your GPU is, so 48 hours just to be sure.

The sample files are meant to run fast to test everything is in place. hen running whole genomes it would take us 24 hours per genome on our hardware.

thuynnguyen commented 2 years ago

Ok thank you so much for clarifying and thank you again for taking the time to help me. I've restarted the run a couple hours ago. Fingers crossed I have a database by the end of the weekend.

thuynnguyen commented 2 years ago

Hi Pauline, I hope you had a good weekend. So it turns out that sift4g fails because of insufficient disk space. My VM doesn't have any gpu, only 16 vCPU and 16gb memory. What specifications do you recommend?

pauline-ng commented 2 years ago

When you compiled SIFT, did you do

make

or

make gpu

You should have used make since you don't have a GPU. Instructions are here.

thuynnguyen commented 2 years ago

I used make

pauline-ng commented 2 years ago

Please open issue at : https://github.com/rvaser/sift4g, maybe Robert can help.

Alternatively, you can break the gtf file into parts by chromosome, and try running each chromosome separately?

Run

rm fasta/*
rm all_prot.fasta 

between each chromosome run so it'll be restricted to the proteins for that chromosome.