sloux closed this issue 4 years ago
@sloux -- Can you check that all your protein sequences are real amino acids? (no stops * or X)
cc-ing @rvaser so he's aware of the SIFT4G message
Thanks, Pauline
Looking through the protein database, at first glance, all appear to be real amino acids.
Looking more closely, there are no * in the database; however, 74 of the 44,934 proteins contain at least 3 X's. When I run the code, I get 1,214 of the "Use of uninitialized value ... at make-single-records-BIOPERL.pl line 274" warnings.
The strangest part is that different runs of the code produce a different number of errors. An earlier run gave me 8 of those errors, but also produced the following errors 10 times:
Use of uninitialized value $orig_aa in string eq at make-single-records-BIOPERL.pl line 551.
Use of uninitialized value $mutated_aa in string eq at make-single-records-BIOPERL.pl line 551.
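For reference, the kind of check requested above (no stops * or X in the protein sequences) can be sketched in Python roughly as follows. This is only an illustration, not part of the SIFT4G pipeline: the function names `read_fasta` and `find_flagged` are made up for this example, and it assumes a plain FASTA file where each record starts with a ">" header line.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_flagged(lines, flagged="X*"):
    """Map each header to the counts of flagged characters in its sequence.

    Records without any flagged character are left out of the result.
    (Hypothetical helper: the default "X*" mirrors the check asked for
    in this thread.)
    """
    hits = {}
    for header, seq in read_fasta(lines):
        counts = Counter(ch for ch in seq.upper() if ch in flagged)
        if counts:
            hits[header] = dict(counts)
    return hits
```

To use it, open the protein FASTA file and pass the file handle to `find_flagged`; every header in the returned dictionary belongs to a sequence that contains at least one X or *.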
I just wanted to add that I removed all proteins containing an aa value of X from my protein file and am still getting the same error.
Use of uninitialized value $aa in string eq at make-single-records-BIOPERL.pl line 274.
Use of uninitialized value $aa in concatenation (.) or string at make-single-records-BIOPERL.pl line 280.
I'm using the most recent version of Perl (5.30.0), if that helps with troubleshooting.
Hi @sloux ,
The uninitialized value means that some proteins failed to translate -- I ignore these warnings and am able to build a database.
Some troubleshooting tips:
If you ran this more than once without deleting the intermediate files, then this can cause a problem. There is a step in the script that appends to an existing file -- if the script was run more than once, then that file will have duplicate sequences that will cause problems with the SIFT 4G algorithm.
To check this, find the file $meta_hash{"PARENT_DIR"}/all_prot.fasta and run the following on it:
grep ">" all_prot.fasta | sort | uniq -d
If this prints anything, then all_prot.fasta contains duplicate sequences, which is incorrect.
Incidentally, all_prot.fasta is the file that cannot have X's in the protein sequences.
If all_prot.fasta does contain duplicates, then delete all the files in the folder, read instructions, and start fresh.
If it does not have duplicates, then run the script attached. I took make-SIFT-db.pl and removed all steps prior to the SIFT 4G command, and it calls the SIFT 4G algorithm from the start. We can troubleshoot better from there.
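The duplicate check above can also be sketched in Python, for anyone who prefers that to the shell pipeline. This is a rough equivalent rather than anything shipped with the scripts: `duplicate_headers` is a name invented here, and unlike grep ">" it only looks at lines that start with ">".

```python
from collections import Counter


def duplicate_headers(lines):
    """Return the sorted list of FASTA header lines that occur more than once."""
    counts = Counter(line.strip() for line in lines if line.startswith(">"))
    return sorted(header for header, n in counts.items() if n > 1)
```

Calling it on an open handle to all_prot.fasta and getting an empty list back corresponds to the grep command printing nothing.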
Thanks, Pauline
Thanks Pauline,
I did have duplicates in the all_prot.fasta file, so I deleted all of the files in the folder except the original files and restarted. This genome (EquCab3.0) failed during the original run because of a lack of memory, so I restarted it on a higher memory partition. All comments so far (10 minutes) appear to be normal.
I also created a SIFT database for the current bovine genome (ARS-USDA1.2), which ran successfully except for the contigs from chromosome Un, so I am hopeful that this will solve the problem.
The script failed again. It produced 5296 lines of the same types of error as before, but continued until the "Aligning queries with candidate sequences" stage (output shown below). Running the script make-SIFT-db-starting_from_SIFT4G.pl had the same result, minus the errors.
Checking query data and substitutions files
Searching database for candidate sequences
Aligning queries with candidate sequences
Alignment score and position are not consensus.
82.50/100.00% **
Hi @sloux ,
These are errors that are generated from @rvaser 's SIFT 4G program. As a start, can you send @rvaser your all_prot.fasta file?
@rvaser -- Can you help @sloux out?
Thanks, Pauline
Hi @sloux, could you please send me your data over mail so I can investigate locally? The error you are getting is in the alignment algorithm for some reason. No idea what could be causing it.
Best regards, Robert
Issue transferred to https://github.com/rvaser/sift4g
Hi Pauline,
I'm having a few more issues with the script. I'm getting the errors shown below. The script continues to run, but ultimately fails. Thanks for your help!
converting gene format to use-able input
done converting gene format
making single records file
Possible precedence issue with control flow operator at /opt/ohpc/pub/libs/gnu8/ccs/perl/5.30.0/lib/site_perl/5.30.0/Bio/DB/IndexedBase.pm line 845.
Subroutine Bio::DB::IndexedBase::_strip_crnl redefined at /opt/ohpc/pub/libs/gnu8/ccs/perl/5.30.0/lib/site_perl/5.30.0/Bio/DB/IndexedBase.pm line 304.
Use of uninitialized value $aa in string eq at make-single-records-BIOPERL.pl line 274.
Use of uninitialized value $aa in concatenation (.) or string at make-single-records-BIOPERL.pl line 280.
With an ultimate failure at: Aligning queries with candidate sequences