make-single-records error

pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.

GNU General Public License v3.0


make-single-records error #15

Closed sloux closed 4 years ago

sloux commented 4 years ago

Hi Pauline,

I'm having a few more issues with the script. I'm getting the errors below. The script continues to run, but it ultimately fails. Thanks for your help!

converting gene format to use-able input
done converting gene format
making single records file

Possible precedence issue with control flow operator at /opt/ohpc/pub/libs/gnu8/ccs/perl/5.30.0/lib/site_perl/5.30.0/Bio/DB/IndexedBase.pm line 845.
Subroutine Bio::DB::IndexedBase::_strip_crnl redefined at /opt/ohpc/pub/libs/gnu8/ccs/perl/5.30.0/lib/site_perl/5.30.0/Bio/DB/IndexedBase.pm line 304.
Use of uninitialized value $aa in string eq at make-single-records-BIOPERL.pl line 274.
Use of uninitialized value $aa in concatenation (.) or string at make-single-records-BIOPERL.pl line 280.

With an ultimate failure at: Aligning queries with candidate sequences

processing database part 39 (size ~1.00 GB): 7.50/100.00% Alignment score and position are not consensus.15.00/100.00% *

pauline-ng commented 4 years ago

@sloux -- Can you check that all your protein sequences are real amino acids? (no stops * or X)

cc-ing @rvaser so he's aware of the SIFT4G message

Thanks, Pauline

sloux commented 4 years ago

Looking through the protein database, at first glance, all appear to be real amino acids.

Looking more closely, there are no * characters in the database; however, 74 of the 44,934 proteins contain at least 3 X's. When I run the code, I get 1,214 instances of the "Use of uninitialized value ... at make-single-records-BIOPERL.pl line 274" warning.

The strangest part is that the number of errors differs from run to run. An earlier run gave me 8 of those errors, but also produced the following error 10 times:

Use of uninitialized value $orig_aa in string eq at make-single-records-BIOPERL.pl line 551.
Use of uninitialized value $mutated_aa in string eq at make-single-records-BIOPERL.pl line 551.
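
Since the counts change between runs, it may help to tally the warnings by variable and line number and compare the totals. A minimal sketch, assuming the script's stderr has been captured as text; the function name `tally_warnings` is hypothetical and not part of the SIFT4G scripts:

```python
import re
from collections import Counter

# Matches Perl warnings of the form quoted in this thread, e.g.
#   Use of uninitialized value $aa in string eq at make-single-records-BIOPERL.pl line 274.
WARNING_RE = re.compile(
    r"Use of uninitialized value (\$\w+) in .*? at (\S+) line (\d+)\."
)

def tally_warnings(text):
    """Count uninitialized-value warnings by (variable, script, line number)."""
    counts = Counter()
    for var, script, line in WARNING_RE.findall(text):
        counts[(var, script, int(line))] += 1
    return counts

# Example using two warnings copied from this thread.
sample = (
    "Use of uninitialized value $orig_aa in string eq at make-single-records-BIOPERL.pl line 551.\n"
    "Use of uninitialized value $mutated_aa in string eq at make-single-records-BIOPERL.pl line 551.\n"
)
for (var, script, line), n in sorted(tally_warnings(sample).items()):
    print(f"{n}\t{var}\t{script}:{line}")
```

Running this over the logs of two separate runs and comparing the tallies would show whether the same lines are affected each time or only the counts differ.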

sloux commented 4 years ago

I just wanted to add that I removed all proteins containing an aa value of X from my protein file and am still getting the same error.

Use of uninitialized value $aa in string eq at make-single-records-BIOPERL.pl line 274.
Use of uninitialized value $aa in concatenation (.) or string at make-single-records-BIOPERL.pl line 280.

I'm using the most recent version of Perl (5.30.0), if that helps with troubleshooting.
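
For anyone repeating the filtering step described above, here is a minimal sketch of dropping proteins that contain X or a stop (*) from a FASTA file. The helper names `read_fasta` and `drop_ambiguous` are hypothetical, and this is not necessarily how the filtering was done in this case:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def drop_ambiguous(records, banned="X*"):
    """Keep only records whose sequence contains none of the banned characters."""
    return [
        (header, seq)
        for header, seq in records
        if not any(ch in seq.upper() for ch in banned)
    ]

# Small in-memory example; a real run would pass an open file handle instead.
example = [">p1", "MKV", "LLA", ">p2", "MXXXK", ">p3", "MK*"]
kept = drop_ambiguous(read_fasta(example))
print(kept)
```

Passing an open protein FASTA file to `read_fasta` in place of the example list works the same way, since a file handle iterates line by line.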

pauline-ng commented 4 years ago

Hi @sloux ,

The uninitialized value means that some proteins failed to translate -- I ignore these warnings and am able to build a database.

Some troubleshooting tips:

If you ran this more than once without deleting the intermediate files, then this can cause a problem. There is a step in the script that appends to an existing file -- if the script was run more than once, then that file will have duplicate sequences that will cause problems with the SIFT 4G algorithm.

To check this, find the file $meta_hash{"PARENT_DIR"}/all_prot.fasta and run:

grep ">" all_prot.fasta | sort | uniq -d

If this prints any header lines, the file contains duplicate sequences, which is incorrect.
Incidentally, all_prot.fasta is the file that cannot have X's in the protein sequences.

If all_prot.fasta does contain duplicates, then delete all the files in the folder, reread the instructions, and start fresh.

If it does not have duplicates, then run the script attached. I took make-SIFT-db.pl and removed all steps prior to the SIFT 4G command, and it calls the SIFT 4G algorithm from the start. We can troubleshoot better from there.
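
The duplicate-header check on all_prot.fasta described above can also be sketched in Python, for systems where the shell pipeline is inconvenient. This is only an illustration; the function name `duplicate_headers` is hypothetical, and it counts lines that start with ">" rather than any line containing ">":

```python
from collections import Counter

def duplicate_headers(lines):
    """Return FASTA header lines that occur more than once, with their counts."""
    counts = Counter(line.strip() for line in lines if line.startswith(">"))
    return {header: n for header, n in counts.items() if n > 1}

# Small in-memory example standing in for the contents of all_prot.fasta.
example = [">geneA\n", "MKV\n", ">geneB\n", "MLL\n", ">geneA\n", "MKV\n"]
print(duplicate_headers(example))

# Against the real file (path is whatever PARENT_DIR points to):
#   with open("all_prot.fasta") as handle:
#       print(duplicate_headers(handle))
```

An empty result corresponds to the shell pipeline printing nothing, meaning no header is repeated.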

Thanks, Pauline

make-SIFT-db-starting_from_SIFT4G.pl.txt

sloux commented 4 years ago

Thanks Pauline,

I did have duplicates in the all_prot.fasta file, so I deleted all of the files in the folder except the original files and restarted. This genome (EquCab3.0) failed during the original run because of a lack of memory, so I restarted it on a higher memory partition. All output so far (10 minutes) appears to be normal.

I also created a SIFT database for the current bovine genome (ARS-USDA1.2), which ran successfully except for the contigs from chromosome Un, so I am hopeful that this will solve the problem.

sloux commented 4 years ago

The script failed again. It produced 5296 lines of the same types of error as before, but continued until the "Aligning queries with candidate sequences" stage (output shown below). Running make-SIFT-db-starting_from_SIFT4G.pl gave the same result, minus those errors.

Checking query data and substitutions files

processing queries: 100.00/100.00% *

Searching database for candidate sequences

processing database part 194 (size ~0.25 GB): 100.00/100.00% *

Aligning queries with candidate sequences Alignment score and position are not consensus.82.50/100.00% **

pauline-ng commented 4 years ago

Hi @sloux ,

These errors are generated by @rvaser's SIFT 4G program. As a start, can you send @rvaser your all_prot.fasta file?

@rvaser -- Can you help @sloux out?

Thanks, Pauline

rvaser commented 4 years ago

Hi @sloux, could you please send me your data over mail so I can investigate locally? The error you are getting is in the alignment algorithm for some reason. No idea what could be causing it.

Best regards, Robert

pauline-ng commented 4 years ago

Issue transferred to https://github.com/rvaser/sift4g