sarah872 commented 2 years ago

Hi, I'm running snapt on a closed bacterial genome. It's running fine until the CURATE NON_CODING TRANSCRIPTS step, for which prodigal compains about a sequence being too short. Do you have any idea how to troubleshoot?

Thank you!

snapt -1 /tmp/slurm-5397966/R1.fq -2 /tmp/slurm-5397966/R2.fq -g genome.fasta -a /tmp/slurm-5397966/CDS.gff -l 3000 -o /tmp/slurm-5397966/SNAPT_OUT -t 64 --nr /tmp/slurm-5397966/nr.dmnd --rfam /tmp/slurm-5397966/Rfam.cm -r RF

########################################################################################################################
#####                                                BEGIN PIPELINE!                                               #####
########################################################################################################################

------------------------------------------------------------------------------------------------------------------------
-----                         setting up output folder and copying relevant information...                         -----
------------------------------------------------------------------------------------------------------------------------

########################################################################################################################
#####                                     ALIGN RNA READS TO GENOME WITH HISAT2                                    #####
########################################################################################################################

------------------------------------------------------------------------------------------------------------------------
-----                   Making sure the Perl environment is linked to the conda Perl libraries...                  -----
------------------------------------------------------------------------------------------------------------------------

/apps/hisat2/2.1.0/hisat2-align-s version 2.1.0
64-bit
Built on login-node03
Wed Jun  7 15:53:42 EDT 2017
Compiler: gcc version 4.8.2 (GCC)
Options: -O3 -m64 -msse2 -funroll-loops -g3 -DPOPCNT_CAPABILITY
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}
Looks like the perl libraries are set correctly. Continue...

------------------------------------------------------------------------------------------------------------------------
-----                                  Building Hisat2 index from reference genome                                 -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                       Aligning /tmp/slurm-5397966/R1.fq and /tmp/slurm-5397966/R2.fq to                      -----
-----             genome.fasta with             -----
-----                                                    hisat2                                                    -----
------------------------------------------------------------------------------------------------------------------------

<...>

------------------------------------------------------------------------------------------------------------------------
-----                       Sorting hisat2 SAM alignment file and converting it to BAM format                      -----
------------------------------------------------------------------------------------------------------------------------

[bam_sort_core] merging from 320 files and 64 in-memory blocks...

------------------------------------------------------------------------------------------------------------------------
-----             Building IGV index from hisat2 i/tmp/slurm-5397966/SNAPT_OUT/hisat2/alignment.bam and            -----
-----                                                faidx index of                                                -----
-----           genome.fasta for IGV            -----
-----                                                  inspection                                                  -----
------------------------------------------------------------------------------------------------------------------------

########################################################################################################################
#####                                             ASSEMBLE TRANSCRIPTS                                             #####
########################################################################################################################

------------------------------------------------------------------------------------------------------------------------
-----                       sorting /tmp/slurm-5397966/CDS.gff annotation to guide Stringtie                       -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                           There are a total of 2410 transcripts in the annotation.                           -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                   Assembling reference-based transcripts for antisense ncRNA calls (-c 10)                   -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                      Assembled 4828 transcripts from the alignment with Stringtie -c 10.                     -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                   Assembling reference-based transcripts for intergenic ncRNA calls (-c 5)                   -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                      Assembled 4828 transcripts from the alignment with Stringtie -c 5.                      -----
------------------------------------------------------------------------------------------------------------------------

########################################################################################################################
#####                                               RUNNING PRODIGAL                                               #####
########################################################################################################################

------------------------------------------------------------------------------------------------------------------------
-----                                 Predicting open reading frames with Prodigal                                 -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                                          sorting Prodigal annotation                                         -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                                   There are 2410 Prodigal ORFs identified.                                   -----
------------------------------------------------------------------------------------------------------------------------

########################################################################################################################
#####                                       REMOVE PROTEIN-CODING TRANSCRIPTS                                      #####
########################################################################################################################

------------------------------------------------------------------------------------------------------------------------
-----           Intersecting the transcripts with the ORFs found with Prodigal to remove transcripts that          -----
-----                                           are from coding regions                                            -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----             Out of 4828 -c 10 assembled transcripts, identified 3 putative non-coding transcripts            -----
-----                                        using the Prodigal annotation.                                        -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----             Out of 4828 -c 5 assembled transcripts, identified 3 putative non-coding transcripts             -----
-----                                        using the Prodigal annotation.                                        -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                         Intersecting the transcripts with the provided annotation in                         -----
-----                              /tmp/slurm-5397966/SNAPT_OUT/sorted_annotation.gff                              -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----             Out of 4828 -c 10 assembled transcripts, identified 3 putative non-coding transcripts            -----
-----                   using the /tmp/slurm-5397966/SNAPT_OUT/sorted_annotation.gff annotation.                   -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----             Out of 4828 -c 5 assembled transcripts, identified 3 putative non-coding transcripts             -----
-----                   using the /tmp/slurm-5397966/SNAPT_OUT/sorted_annotation.gff annotation.                   -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                  Consolidating intergenic and antisense transcripts predicted from ORFs and                  -----
-----                              /tmp/slurm-5397966/SNAPT_OUT/sorted_annotation.gff                              -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                           Filtered down to 3 putative -c 10 non-coding transcripts                           -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                            Filtered down to 3 putative -c 5 non-coding transcripts                           -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----             Selecting and combining intergenic ncRNAs identified with -c 5 pileup, and antisense             -----
-----                                         ncRNAs identified with -c 10                                         -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                                           Sorting concatinated set                                           -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----                         Extracted 0 antisense and 3 intergenic non-coding transcripts                        -----
------------------------------------------------------------------------------------------------------------------------

########################################################################################################################
#####                                         CURATE NON_CODING TRANSCRIPTS                                        #####
########################################################################################################################

------------------------------------------------------------------------------------------------------------------------
-----                    re-running Prodigal on isolated transcripts to check for coding regions                   -----
------------------------------------------------------------------------------------------------------------------------

Error:  Sequence must be 20000 characters (only 978 read).
(Consider running with the -p meta option or finding more contigs from the same genome.)

************************************************************************************************************************
*****                                      Failed to run Prodigal. Exiting...                                      *****
************************************************************************************************************************

dgelsin commented 2 years ago

Hi Sarah,

Thanks for bringing this to our attention. It looks like an issue with running prodigal an input fasta less than 20000 bp. Prodigal is normally used to predict coding sequences from genomes, and since snapt only called 3 ncRNAs in the previous steps then the input fasta for re-running prodigal is <20000 bp.

@ursky, what do you think? I think this issue hasn't arose for us before because we usually get 100+ ncRNA calls which yields 20000+ bp of sequence. Found similar issue here: https://github.com/hyattpd/Prodigal/issues/51

Looks like a quick fix is to run (a) prodigal separately in anonymous mode or meta mode (-p option). If you wanted to complete the snapt pipeline with prodigal, then it looks like you can (b) edit prodigal itself by change line 32:

define MIN_SINGLE_GENOME 20000

Not sure I recommend this route though since you could break prodigal that way. Will see if I can update snapt to include -p option for prodigal and see if that solves it.

sarah872 commented 2 years ago

Hi, thanks for suggesting a quick fix. I modified the script to include the -p meta flag with prodigal. However, there's another issue now: No antisense nc transcripts have been found, which leads to this:

------------------------------------------------------------------------------------------------------------------------
-----              running DIAMOND blastx on /tmp/slurm-5401156/SNAPT_OUT/blastx_search/intergenic.fa              -----
-----                                  against the /tmp/slurm-5401156/nr database                                  -----
------------------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------------------------------
-----           running DIAMOND blastx on /tmp/slurm-5401156/SNAPT_OUT/blastx_search/antisense.fa against          -----
-----                                      the /tmp/slurm-5401156/nr database                                      -----
------------------------------------------------------------------------------------------------------------------------

Error: Error detecting input file format. First line seems to be blank.

************************************************************************************************************************
*****                         Failed DIAMOND Blastx search against the database. Exiting...                        *****
************************************************************************************************************************

The file antisense.fa was indeed empty. I suppose a simple if to check whether the file is empty or not before executing diamond would suffice? What do you think?

ursky / SnapT

Prodigal error #3

define MIN_SINGLE_GENOME 20000