pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

make-single-records error #15

Closed sloux closed 4 years ago

sloux commented 4 years ago

Hi Pauline,

I'm having a few more issues with the script. I'm getting the warnings below. The script continues to run, but it ultimately fails. Thanks for your help!

converting gene format to use-able input
done converting gene format
making single records file

Possible precedence issue with control flow operator at /opt/ohpc/pub/libs/gnu8/ccs/perl/5.30.0/lib/site_perl/5.30.0/Bio/DB/ line 845.
Subroutine Bio::DB::IndexedBase::_strip_crnl redefined at /opt/ohpc/pub/libs/gnu8/ccs/perl/5.30.0/lib/site_perl/5.30.0/Bio/DB/ line 304.
Use of uninitialized value $aa in string eq at line 274.
Use of uninitialized value $aa in concatenation (.) or string at line 280.

With an ultimate failure at: Aligning queries with candidate sequences

pauline-ng commented 4 years ago

@sloux -- Can you check that all your protein sequences are real amino acids? (no stops * or X)
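A minimal sketch of one way to run that check, assuming the proteins are in a plain FASTA file. The function names `parse_fasta` and `nonstandard_residues` are hypothetical helpers, not part of SIFT4G_Create_Genomic_DB. Anything outside the 20 standard one-letter codes (including `*` and `X`) is reported per sequence.

```python
from collections import Counter

# The 20 standard amino-acid one-letter codes; anything else (X, *, B, Z, ...) is flagged.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def nonstandard_residues(lines):
    """Map each flagged header to a Counter of its non-standard residues."""
    flagged = {}
    for header, seq in parse_fasta(lines):
        odd = Counter(c for c in seq.upper() if c not in STANDARD_AA)
        if odd:
            flagged[header] = odd
    return flagged


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("your_proteins.fasta") (placeholder path).
    demo = [">p1", "MKTAYIAK", ">p2", "MKXXXA*", ">p3", "MQ", "LV"]
    for name, odd in nonstandard_residues(demo).items():
        print(name, dict(odd))
```

To scan a real file, pass an open file handle instead of the demo list; the file is read line by line, so large protein sets should be fine.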

cc-ing @rvaser so he's aware of the SIFT4G message

Thanks, Pauline

sloux commented 4 years ago

Looking through the protein database, at first glance, all appear to be real amino acids.

Looking more closely, there are no * in the database; however, 74 proteins (out of 44,934) contain at least 3 X's. When I run the code, I get 1,214 of the "Use of uninitialized value ... at line 274" warnings.

The strangest part is that the number of errors changes from run to run. An earlier run gave me 8 of those errors, but also had the following error 10 times:

Use of uninitialized value $orig_aa in string eq at line 551.
Use of uninitialized value $mutated_aa in string eq at line 551.
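Since the counts differ between runs, it may help to tally the warnings per variable and line number from a saved log rather than counting by eye. The sketch below assumes the script's stderr was captured to a text file. `tally_warnings` is a hypothetical helper, and the regular expression is only written against the warning format quoted in this thread.

```python
import re
from collections import Counter

# Matches Perl warnings of the form
# "Use of uninitialized value $name in <operation> at ... line N."
WARNING_RE = re.compile(r"Use of uninitialized value (\$\w+) in .*? line (\d+)\.")


def tally_warnings(log_text):
    """Count warnings keyed by (variable name, line number as a string)."""
    return Counter(WARNING_RE.findall(log_text))


if __name__ == "__main__":
    sample = (
        "Use of uninitialized value $aa in string eq at line 274.\n"
        "Use of uninitialized value $aa in concatenation (.) or string at line 280.\n"
        "Use of uninitialized value $orig_aa in string eq at line 551.\n"
    )
    for (var, line_no), n in sorted(tally_warnings(sample).items()):
        print(f"{var} at line {line_no}: {n}")
```

Comparing the tallies from two runs side by side shows whether the same lines are affected each time or only the counts change.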

sloux commented 4 years ago

I just wanted to add that I removed all proteins containing an aa value of X from my protein file and am still getting the same error.

Use of uninitialized value $aa in string eq at line 274.
Use of uninitialized value $aa in concatenation (.) or string at line 280.

I'm using the most recent version of perl, if that helps with troubleshooting (5.30.0).

pauline-ng commented 4 years ago

Hi @sloux ,

The uninitialized value means that some proteins failed to translate -- I ignore these warnings and am able to build a database.

Some troubleshooting tips:

If you ran this more than once without deleting the intermediate files, then this can cause a problem. There is a step in the script that appends to an existing file -- if the script was run more than once, then that file will have duplicate sequences that will cause problems with the SIFT 4G algorithm.

To check this, find the file $meta_hash{"PARENT_DIR"}/all_prot.fasta

Can you run this in that directory: grep ">" all_prot.fasta | sort | uniq -d

If this prints anything, the file contains duplicate sequence headers, which is incorrect.
Incidentally, all_prot.fasta is the file that cannot have X's in the protein sequences.
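For anyone who prefers a script to a shell pipeline, the same duplicate check can be sketched in Python. `duplicate_headers` is a hypothetical helper, not part of the SIFT 4G tooling. It compares whole header lines and also reports how many times each duplicated header occurs.

```python
from collections import Counter


def duplicate_headers(lines):
    """Return {header_line: count} for FASTA headers that occur more than once."""
    counts = Counter(line.strip() for line in lines if line.startswith(">"))
    return {header: n for header, n in counts.items() if n > 1}


if __name__ == "__main__":
    # In-memory example; for a real check use open("all_prot.fasta") in the parent directory.
    demo = [">a\n", "MK\n", ">b\n", "MQ\n", ">a\n", "MK\n"]
    dups = duplicate_headers(demo)
    if dups:
        for header, n in dups.items():
            print(f"{header} appears {n} times")
    else:
        print("no duplicate headers")
```

An empty result corresponds to the pipeline printing nothing.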

If all_prot.fasta does contain duplicates, then delete all the files in the folder, read instructions, and start fresh.

If it does not have duplicates, then run the attached script. I removed all steps prior to the SIFT 4G command, so it calls the SIFT 4G algorithm right from the start. We can troubleshoot better from there.

Thanks, Pauline

sloux commented 4 years ago

Thanks Pauline,

I did have duplicates in the all_prot.fasta file, so I deleted all of the files in the folder except the original files and restarted. This genome (EquCab3.0) failed during the original run because of a lack of memory, so I restarted it on a higher memory partition. All comments so far (10 minutes) appear to be normal.

I also created a SIFT database for the current bovine genome (ARS-USDA1.2), which ran successfully except for the contigs from chromosome Un, so I am hopeful that this will solve the problem.

sloux commented 4 years ago

The script failed again. It had 5296 lines of the same types of error as before, but continued until the "Aligning queries with candidate sequences" stage (output shown below). Running the script you attached had the same result, minus the errors.

Checking query data and substitutions files

Searching database for candidate sequences

Aligning queries with candidate sequences
Alignment score and position are not consensus.
82.50/100.00% **

pauline-ng commented 4 years ago

Hi @sloux ,

These are errors that are generated from @rvaser 's SIFT 4G program. As a start, can you send @rvaser your all_prot.fasta file?

@rvaser -- Can you help @sloux out?

Thanks, Pauline

rvaser commented 4 years ago

Hi @sloux, could you please send me your data by email so I can investigate locally? The error you are getting comes from the alignment algorithm, but I have no idea yet what could be causing it.

Best regards, Robert

pauline-ng commented 4 years ago

Issue transferred to