Closed thuynnguyen closed 2 years ago
A lot of warnings are printed when it runs make-single-records-BIOPERL.pl. I would ignore them. When I look at your list of files, it looks to me it's stopping at the sift4g step.
Looking at your config file, is your SIFT4G_PATH truly /sift4g/bin/sift4g
?
Did you run the test files and were you able to make a SIFT4G database from the test file?
I just double checked the path for sift4g. And yes it is in /sift4g/bin/sift4g.
I was able to successfully run the test files for both Candidatus carsonella and Homo sapiens chromosomes. These generated databases and high percentage of genes, positions, and predicted SIFT scores in the CHECK_GENES.LOG.
Can you send the sizes of your files in elemax_tree.txt?
Can you also give me:
First 10 lines of protein_coding_genes.txt
First 10 chromosome header lines of the genome:
grep ">" chr-src/asian_elephant.elemax1.fa | head -10
root@tnn-toolbox:/home/thuynguyen/220830_siftdb_elemax1/SIFT_databases/asian_elephant# grep '>' chr-src/asian_elephant.elemax1.fa |head -10
>NC_064819.1 Elephas maximus indicus isolate mEleMax1 chromosome 1, mEleMax1 primary haplotype, whole genome shotgun sequence
>NC_064820.1 Elephas maximus indicus isolate mEleMax1 chromosome 2, mEleMax1 primary haplotype, whole genome shotgun sequence
>NC_064821.1 Elephas maximus indicus isolate mEleMax1 chromosome 3, mEleMax1 primary haplotype, whole genome shotgun sequence
>NW_026060228.1 Elephas maximus indicus isolate mEleMax1 chromosome 3 unlocalized genomic scaffold, mEleMax1 primary haplotype SUPER_3_unloc_1, whole genome shotgun sequence
>NW_026060229.1 Elephas maximus indicus isolate mEleMax1 chromosome 3 unlocalized genomic scaffold, mEleMax1 primary haplotype SUPER_3_unloc_2, whole genome shotgun sequence
>NC_064822.1 Elephas maximus indicus isolate mEleMax1 chromosome 4, mEleMax1 primary haplotype, whole genome shotgun sequence
>NC_064823.1 Elephas maximus indicus isolate mEleMax1 chromosome 5, mEleMax1 primary haplotype, whole genome shotgun sequence
>NC_064824.1 Elephas maximus indicus isolate mEleMax1 chromosome 6, mEleMax1 primary haplotype, whole genome shotgun sequence
>NC_064825.1 Elephas maximus indicus isolate mEleMax1 chromosome 7, mEleMax1 primary haplotype, whole genome shotgun sequence
>NW_026060230.1 Elephas maximus indicus isolate mEleMax1 chromosome 7 unlocalized genomic scaffold, mEleMax1 primary haplotype SUPER_7_unloc_1, whole genome shotgun sequence
root@tnn-toolbox:/home/thuynguyen/220830_siftdb_elemax1/SIFT_databases/asian_elephant# head -10 gene-annotation-src/protein_coding_genes.txt
NC_064825.1:-1:112634176:112635094:1:112634176:112635094:XM_049891905.1:LOC126080726::XP_049747862.1::protein_coding
NC_064846.1:-1::0:0:::XR_007515748.1:LOC126068633::::protein_coding
NC_064827.1:1:81216063:81376203:98:81216063,81235777,81237097,81238149,81246412,81247238,81248276,81250993,81253607,81254264,81256067,81257504,81258141,81259008,81259409,81267869,81269197,81269590,81271922,81275550,81275778,81277589,81286626,81289487,81290584,81295135,81296503,81296964,81297574,81300541,81301723,81303867,81304879,81305176,81306234,81306785,81307650,81308677,81310048,81311187,81312753,81313342,81313585,81315679,81316787,81318431,81319840,81320468,81321995,81323060,81323371,81323972,81324826,81326303,81328663,81329202,81330135,81331069,81331392,81332655,81333579,81335281,81337917,81338151,81338399,81338987,81341064,81341902,81342743,81343337,81343511,81344240,81344464,81345361,81346080,81347242,81348242,81350362,81350790,81351314,81352860,81353497,81354602,81355748,81358227,81361014,81365168,81366902,81367136,81367706,81369222,81369408,81370283,81371572,81372222,81373464,81374653,81375857:81216313,81235848,81237256,81238272,81246584,81247345,81248397,81251257,81253752,81254386,81256340,81257646,81258269,81259122,81259559,81268052,81269293,81269718,81272067,81275663,81275938,81277755,81286754,81289742,81290671,81295330,81296634,81297112,81297701,81300696,81302002,81304146,81304988,81305346,81306389,81306882,81307746,81308854,81310206,81311311,81312929,81313439,81313705,81315841,81317069,81318599,81319954,81320555,81322187,81323234,81323488,81324115,81324977,81326522,81328777,81329390,81330232,81331231,81331506,81332858,81333661,81335413,81338061,81338311,81338524,81339171,81341162,81342169,81342928,81343419,81343648,81344382,81344609,81345489,81346237,81347364,81348407,81350470,81350925,81351449,81353051,81353711,81354737,81355939,81358309,81361284,81365318,81367040,81367258,81367944,81369285,81369531,81370418,81371701,81372543,81373584,81374780,81376203:XM_049896977.1:LOC126083338::XP_049752934.1::protein_coding
NC_064837.1:1:66218775:66231367:3:66218775,66221998,66231245:66218867,66222156,66231367:XM_049859437.1:LOC126062206::XP_049715394.1::protein_coding
NC_064823.1:1:156200371:156252615:15:156200371,156200810,156216019,156220656,156222551,156233650,156234495,156234917,156239299,156241325,156243013,156245922,156248257,156250436,156252508:156200387,156200932,156216110,156220804,156222732,156233728,156234576,156235151,156239392,156241477,156243101,156246030,156248431,156250581,156252615:XM_049886505.1:LOC126077303::XP_049742462.1::protein_coding
NC_064821.1:-1:22125038:22125974:1:22125038:22125974:XM_049881421.1:LOC126074089::XP_049737378.1::protein_coding
NC_064822.1:1:108960128:108979914:12:108960128,108960787,108960902,108961496,108963291,108965006,108966212,108970642,108971257,108972233,108973707,108979877:108960197,108960811,108960935,108961613,108963423,108965134,108966300,108970715,108971296,108972289,108973877,108979914:XM_049883294.1:LOC126075491::XP_049739251.1::protein_coding
NC_064838.1:1:47368827:47375645:5:47368827,47370513,47371260,47373522,47374623:47368917,47370640,47371374,47373624,47375645:XM_049862300.1:LOC126063849::XP_049718257.1::protein_coding
NC_064825.1:-1:45700199:45701273:1:45700199:45701273:XM_049890639.1:LOC126079547::XP_049746596.1::protein_coding
NC_064821.1:1:194001105:194042714:5:194001105,194038208,194039336,194040765,194042660:194001266,194038314,194039437,194040961,194042714:XM_049880776.1:LOC126073752::XP_049736733.1::protein_coding
Apologies for the terrible formatting. Here are the first 10 lines of the headers of the fasta file and the first 10 lines of the protein coding genes txt file.
Here is the ls -lh output of the tree elemax_tree_info.txt
Thank you so much for taking the time to help me!
Assuming the files in singleRecords/ are not empty, it looks like everything is being generated.
Maybe you didn't wait long enough for SIFT to run? SIFT takes time -- so if you want to check things are running, you couldcd singleRecords; ls -lth
and see that a file is being updated. SIFT can take ~24 hours for 1 genome, also depends on your CPU/GPU.
Your file structure and contents look good to me--- based on what I've seen so far, I would say you just didn't wait long enough. Can you try running it again and wait 48 hours before checking the results?
Thank you for the handy command. It looks like the single records files are not empty.
I have a really dumb question though, are you asking me to wait 48 hours after seeing "All done" printed out in the terminal window? Should I leave my VM open for that long after the program is done?
Either way, I'll go ahead and run it again, wait, and then get back to you.
Yes. I'm expecting it to be done after 24 hours, but I don't know how powerful your GPU is, so 48 hours just to be sure.
The sample files are meant to run fast to test everything is in place. hen running whole genomes it would take us 24 hours per genome on our hardware.
Ok thank you so much for clarifying and thank you again for taking the time to help me. I've restarted the run a couple hours ago. Fingers crossed I have a database by the end of the weekend.
Hi Pauline, I hope you had a good weekend. So it turns out that sift4g fails because of insufficient disk space. My VM doesn't have any gpu, only 16 vCPU and 16gb memory. What specifications do you recommend?
When you compiled SIFT, did you do
make
or
make gpu
You should have used make
since you don't have a GPU. Instructions are here.
I used make
Please open issue at : https://github.com/rvaser/sift4g, maybe Robert can help.
Alternatively, you can break the gtf file into parts by chromosome, and try running each chromosome separately?
Run
rm fasta/*
rm all_prot.fasta
between each chromosome run so it'll be restricted to the proteins for that chromosome.
Hi pauline-ng,
Can you please help me? I am getting the warning in the subject line repeatedly when attempting to generate a SIFT database for the Asian elephant genome. The program does not successfully generate a database. I believe it doesn't move past running "make-single-records-BIOPERL.pl".
My genome fasta, gtf, and peptide fasta files are downloaded from NCBI here: https://www.ncbi.nlm.nih.gov/assembly/GCF_024166365.1. Attached you will find my config file and the file tree of the parent directory.
Best regards, Thuy
elemax1_config.txt
elemax_tree.txt