pauline-ng / SIFT4G_Create_Genomic_DB

Create genomic databases with SIFT predictions. Input is an organism's genomic DNA (.fa) file and the gene annotation file (.gtf). Output will be a database that can be used with SIFT4G_Annotator.jar to annotate VCF files.
GNU General Public License v3.0

Issues creating database for new genome version #7

Closed durwa004 closed 5 years ago

durwa004 commented 5 years ago

Hi,

I am trying to create a database for the EquCab3 version of our genome which is only present on NCBI. This is the code I am running:

perl make-SIFT-db-all.pl -config test_files/EquCab3_config.txt

This is the error that I get:

done making the fasta sequences
start siftsharp, getting the alignments
/home/mccuem/durwa004/.conda/envs/sift4g/bin/sift4g -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/Variant_interpretation/sift4g/UniRef90/uniref90.fasta -q /home/mccuem/shared/Projects/HorseGenomeProject/Data/Variant_interpretation/sift4g/EquCab3_db_121018/all_prot.fasta --subst /home/mccuem/shared/Projects/HorseGenomeProject/Data/Variant_interpretation/sift4g/EquCab3_db_121018/subst --out /home/mccuem/shared/Projects/HorseGenomeProject/Data/Variant_interpretation/sift4g/EquCab3_db_121018/SIFT_predictions --sub-results 
** Checking query data and substitutions files **
terminate called after throwing an instance of 'std::regex_error'
  what():  regex_error

Prior to this error - I get a mixture of these 2 errors multiple times:

Use of uninitialized value $exon_num in concatenation (.) or string at generate-fasta-subst-files-BIOPERL.pl line 912.
Argument "" isn't numeric in addition (+) at generate-fasta-subst-files-BIOPERL.pl line 860.
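
The two Perl warnings above would be consistent with exon or CDS records whose ninth GTF column carries no usable exon_number value, although nothing in this thread confirms that as the cause. The following is a minimal Python sketch for counting such records before running the pipeline. The function names are hypothetical and are not part of SIFT4G_Create_Genomic_DB, and the choice of exon_number as the attribute to check is an assumption drawn from the variable name $exon_num in the warning.

```python
import re
from collections import Counter

# Matches GTF attributes written either as  key "value";  or  key value;
# (the trailing semicolon is optional on the last attribute).
ATTRIBUTE_RE = re.compile(r'(\w+)\s+(?:"([^"]*)"|([^";\s]+))\s*(?:;|$)')


def parse_attributes(field):
    """Turn the ninth GTF column into a dict of attribute name -> value."""
    attrs = {}
    for key, quoted, bare in ATTRIBUTE_RE.findall(field):
        # Keep the first occurrence of a key; later duplicates are ignored.
        attrs.setdefault(key, quoted if quoted else bare)
    return attrs


def count_missing_attribute(gtf_lines, attribute="exon_number",
                            features=("exon", "CDS")):
    """Count records of the given feature types that lack `attribute`.

    Assumption: an absent or empty value is what matters here.
    Lines that do not have exactly nine tab-separated columns are
    tallied under the key "malformed".
    """
    missing = Counter()
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            missing["malformed"] += 1
            continue
        if cols[2] in features and not parse_attributes(cols[8]).get(attribute):
            missing[cols[2]] += 1
    return missing


# Example use on a file (path taken from the listing in this issue):
#     with open("EquCab3.gene.gtf") as handle:
#         print(count_missing_attribute(handle))
```

A non-empty result only shows that some records lack the attribute; whether the Perl scripts actually depend on it would still need to be checked against their source.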

It looks like some of the files are being produced - this is the total size of the files in my home directory:

32K fasta.log
32K Log2.txt
48K EquCab3
48K SIFT_alignments
48K SIFT_predictions
48K singleRecords_with_scores
64K invalid.log
12M .panfs.1b2d1f0a.1544496885706510000
30M all_prot.fasta
37M peptide.log
66M gene-annotation-src
209M EquCab3.gene.gtf
595M fasta
2.1G subst
2.7G chr-src
90G singleRecords
96G total

I have been having issues with the .gtf but think I have it in the right format (I have attached it to the issue). EquCab3.gtf.gz

Thanks in advance for your assistance.

Sian

rvaser commented 5 years ago

Hello Sian, from the log above I can see that SIFT4G has crashed, probably due to a compiler problem. Which version of the gcc/g++ compiler are you using?

Best regards, Robert
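
The compiler version is relevant because libstdc++ only shipped a complete std::regex implementation from GCC 4.9 onward; code built with earlier releases compiles but can throw std::regex_error at run time. Below is a minimal Python sketch for checking a reported version string against that threshold. The function names are hypothetical, and the check only inspects the string it is given: the version that matters is the one that built the sift4g binary, which this thread does not state.

```python
import re

# First GCC release series whose libstdc++ ships a complete <regex>.
MIN_REGEX_GCC = (4, 9)


def parse_gcc_version(text):
    """Extract the first dotted version number from text such as '4.9.2'."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def supports_std_regex(version_text):
    """Return True if the version is at least GCC 4.9."""
    return parse_gcc_version(version_text) >= MIN_REGEX_GCC


# Example: feed it the output of `gcc -dumpversion`.
#     print(supports_std_regex("4.9.2"))
```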

durwa004 commented 5 years ago

Hi Robert,

Thank you so much for getting back to me. I have been using gcc version 4.9.2.

I have resubmitted the job with version 7.2.0 to see if this is the issue.

Thanks,

Sian

durwa004 commented 5 years ago

Hi Robert,

It runs for longer but errors out with the same error:

Use of uninitialized value $exon_num in concatenation (.) or string at generate-fasta-subst-files-BIOPERL.pl line 912.
Argument "" isn't numeric in addition (+) at generate-fasta-subst-files-BIOPERL.pl line 860.
Argument "" isn't numeric in addition (+) at generate-fasta-subst-files-BIOPERL.pl line 860.
Argument "" isn't numeric in addition (+) at generate-fasta-subst-files-BIOPERL.pl line 860.
Argument "" isn't numeric in addition (+) at generate-fasta-subst-files-BIOPERL.pl line 860.
Checking query data and substitutions files
terminate called after throwing an instance of 'std::regex_error'
what(): regex_error

It looks like it happened while it was getting the alignments - this is the .o output:

converting gene format to use-able input
done converting gene format
making single records file
done making single records template
making noncoding records file
done making noncoding records
make the fasta sequences
done making the fasta sequences
start siftsharp, getting the alignments
/home/mccuem/durwa004/.conda/envs/sift4g/bin/sift4g -d /home/mccuem/shared/Projects/HorseGenomeProject/Data/Variant_interpretation/sift4g/UniRef90/uniref90.fasta -q /home/mccuem/shared/Projects/HorseGenomeProject/Data/Variant_interpretation/sift4g/EquCab3_db_121018/all_prot.fasta --subst /home/mccuem/shared/Projects/HorseGenomeProject/Data/Variant_interpretation/sift4g/EquCab3_db_121018/subst --out /home/mccuem/shared/Projects/HorseGenomeProject/Data/Variant_interpretation/sift4g/EquCab3_db_121018/SIFT_predictions --sub-results

Thanks,

Sian

pauline-ng commented 5 years ago

Sian,

Can I confirm you can run the examples and create those databases correctly?

If no, then the code is not correctly installed. If yes, then it's likely your input files are the problem.

Thanks, Pauline

durwa004 commented 5 years ago

Hi Pauline,

The examples run just fine. I will try and figure out what the issue is with the input files.

Thanks for your help,

Sian


-- Sian Durward-Akhurst, BVMS, MS, Diplomate ACVIM, MRCVS PhD Candidate in Equine Genetics

University of Minnesota Equine Genetics and Genomics Laboratory 225 Veterinary Population Medicine 1365 Gortner Avenue Saint Paul, MN 55108

durwa004@umn.edu

LipengKang commented 5 years ago

@pauline-ng Hi Pauline, I encountered a problem similar to Sian's: a big log file with repetitive warnings like "Use of uninitialized value in concatenation (.) or string at make-single-records-BIOPERL.pl line 301."

All input data are from Ensembl Plants (ftp://ftp.ensemblgenomes.org/pub/plants/release-42/gtf/triticum_aestivum/ and ftp://ftp.ensemblgenomes.org/pub/plants/release-42/fasta/triticum_aestivum/dna/).

Any advice?

Thanks, Lipeng

pauline-ng commented 5 years ago

Hi Lipeng,

I am on vacation right now and will look at it when I return next week (next Thursday or Friday.)

Thank you, Pauline

durwa004 commented 5 years ago

Hi Lipeng,

I never figured out what the issue with my data was, but was able to successfully build the database using the gff from Ensembl.

Sorry I can't be any help.

Sian
