mrmckain / Fast-Plast

Automated de novo assembly of whole chloroplast genomes.
MIT License
43 stars 14 forks source link

Fast plast unable to orientate the plastome #44

Closed o-william-white closed 4 years ago

o-william-white commented 4 years ago

Hello,

Not really an issue with the program itself but I was wondering I might be able to get some feedback on a plastome assembly based on downloaded SRA data.

It seems to run for most of the script without issue. However, I get a message saying that it could not properly orientate the genome and I wanted to check that it was not due to an error on my part, or how I might optimise the assembly.

I ran the assembly with the command as follows

fast-plast.pl -1 fastq-dump-Bedadeti1-r1.fq.gz -2 fastq-dump-Bedadeti1-r2.fq.gz --name Bedadeti1 --bowtie_index Zingiberales --coverage_analysis --clean light --threads 12

Below is the Fast-Plast_Progress.log

Tue Feb 11 10:21:45 2020        Starting Fast-Plast v.1.2.8.
                                Assemblying plastome with 0 single end libraries and 1 paired end libraries.
Tue Feb 11 10:21:45 2020        Determining best kmer sizes.
                                K-mer sizes for SPAdes set at 55,87,121.
Tue Feb 11 10:21:45 2020        Starting read trimming with Trimmomatic.
                                Using /data/home/mpx469/software/Trimmomatic/Trimmomatic-0.39//trimmomatic-0.39.jar.
Tue Feb 11 10:28:14 2020        Starting read mapping with bowtie2.
                                Using /share/apps/centos7/bowtie2/2.3.4/bin/bowtie2.
                                Samples for Zingiberales used to make bowtie2 indices.
Tue Feb 11 11:13:11 2020        Starting initial assembly with SPAdes.
                                Using /share/apps/centos7/spades/3.11.1/bin/spades.py.
Tue Feb 11 11:15:42 2020        Starting improved assembly with afin.
                                Using command /data/home/mpx469/software/Fast-Plast/afin/afin -c filtered_spades_contigs.fsa -r ../1_Trimmed_Reads/Bedadeti1.trimmed* -l 150,50,50 -f .1 -d 100 -x 1065 -p 20,15,10 -i 2,1,1 -o Bedadeti1_afin.
                                After afin, there are 0 contigs with a maximum size of 0 and a minimum size of 100000000.
Tue Feb 11 11:17:12 2020        Removing nested contigs.
Tue Feb 11 11:17:12 2020        Checking chloroplast gene recovery in contigs.
                                Checking coverage of afin output with 0 contigs after contamination removal.
                                0% of known angiosperm chloroplast genes were recovered in Bedadeti1_afin_iter2.fa.
Tue Feb 11 11:17:14 2020        Starting plastome finishing.
                                Using /data/home/mpx469/software/Fast-Plast/bin/Angiosperm_Chloroplast_Genes.fsa for LSC, SSC, and IR identification.
Tue Feb 11 11:17:14 2020        Checking chloroplast gene recovery in contigs.
                                Checking coverage of final assembly. Final assembly is the last afin iteration.
                                0% of known angiosperm chloroplast genes were recovered in Bedadeti1_afin_iter2.fa.
                                Could not properly orientate the plastome. Either your plastome does not have an IR or there was an issue with the assembly. Best contigs are in /data/scratch/mpx469/fast-plast/Bedadeti1/Final_Assembly/Bedadeti1_afin_iter2.fa. A list of genes in each contig can be found in "Chloroplast_gene_composition_of_final_contigs.txt".

I checked the file /data/scratch/mpx469/fast-plast/Bedadeti1/Final_Assembly/Bedadeti1_afin_iter2.fa but found it was empty.

Below is the results_error.log file

TrimmomaticPE: Started with arguments:
 -threads 12 /data/scratch/mpx469/sra/fastq-dump-Bedadeti1-r1.fq.gz /data/scratch/mpx469/sra/fastq-dump-Bedadeti1-r2.fq.gz Bedadeti1_0.trimmed_P1.fq Bedadeti1_0.trimmed_U1.fq Bedadeti1_0.trimmed_P2.fq Bedadeti1_0.trimmed_U2.fq ILLUMINACLIP:/data/home/mpx469/software/Fast-Plast/bin/adapters/NEB-PE.fa:1:30:10 SLIDINGWINDOW:10:20 MINLEN:40
Using Long Clipping Sequence: 'ACACTCTTTCCCTACACGACGCTCTTCCGATC'
Using Long Clipping Sequence: 'GATCGGAAGAGCACACGTCTGAACTCCAGTC'
ILLUMINACLIP: Using 0 prefix pairs, 2 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Read Pairs: 89162199 Both Surviving: 85174052 (95.53%) Forward Only Surviving: 3415778 (3.83%) Reverse Only Surviving: 317878 (0.36%) Dropped: 254491 (0.29%)
TrimmomaticPE: Completed successfully
Building a SMALL index
88907708 reads; of these:
  85174052 (95.80%) were paired; of these:
    82183103 (96.49%) aligned concordantly 0 times
    74218 (0.09%) aligned concordantly exactly 1 time
    2916731 (3.42%) aligned concordantly >1 times
    ----
    82183103 pairs aligned concordantly 0 times; of these:
      3380 (0.00%) aligned discordantly 1 time
    ----
    82179723 pairs aligned 0 times concordantly or discordantly; of these:
      164359446 mates make up the pairs; of these:
        156209564 (95.04%) aligned 0 times
        48521 (0.03%) aligned exactly 1 time
        8101361 (4.93%) aligned >1 times
  3733656 (4.20%) were unpaired; of these:
    3457824 (92.61%) aligned 0 times
    1864 (0.05%) aligned exactly 1 time
    273968 (7.34%) aligned >1 times
8.28% overall alignment rate
readline() on closed filehandle $afin_file at /data/home/mpx469/software/Fast-Plast/fast-plast.pl line 1375.
Command line argument error: Argument "query". File is not accessible:  `Bedadeti1_afin_iter2.fa'
readline() on closed filehandle $file at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 21.
Use of uninitialized value $seqlen in subtraction (-) at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 38.
Use of uninitialized value $current_start in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 88.
Use of uninitialized value $current_end in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 88.
Argument "" isn't numeric in sort at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 90.
Argument "" isn't numeric in subtraction (-) at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 92.
Use of uninitialized value $final_start in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 137.
Use of uninitialized value $final_end in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 137.
Argument "" isn't numeric in sort at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 140.
Argument "" isn't numeric in subtraction (-) at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 142.
BLAST options error: File Bedadeti1_regions_split0.fsa is empty
BLAST Database error: No alias or index file found for nucleotide database [Bedadeti1_regions_split0.fsa] in search path [/data/scratch/mpx469/fast-plast/Bedadeti1/5_Plastome_Finishing::]
readline() on closed filehandle $file at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 21.
Use of uninitialized value $seqlen in subtraction (-) at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 38.
Use of uninitialized value $current_start in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 88.
Use of uninitialized value $current_end in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 88.
Argument "" isn't numeric in sort at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 90.
Argument "" isn't numeric in subtraction (-) at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 92.
Use of uninitialized value $final_start in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 137.
Use of uninitialized value $final_end in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 137.
Argument "" isn't numeric in sort at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 140.
Argument "" isn't numeric in subtraction (-) at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 142.
BLAST options error: File Bedadeti1_regions_split1.fsa is empty
BLAST Database error: No alias or index file found for nucleotide database [Bedadeti1_regions_split1.fsa] in search path [/data/scratch/mpx469/fast-plast/Bedadeti1/5_Plastome_Finishing::]
readline() on closed filehandle $file at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 21.
Use of uninitialized value $seqlen in subtraction (-) at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 38.
Use of uninitialized value $current_start in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 88.
Use of uninitialized value $current_end in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 88.
Argument "" isn't numeric in sort at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 90.
Argument "" isn't numeric in subtraction (-) at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 92.
Use of uninitialized value $final_start in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 137.
Use of uninitialized value $final_end in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 137.
Argument "" isn't numeric in sort at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 140.
Argument "" isn't numeric in subtraction (-) at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 142.
BLAST options error: File Bedadeti1_regions_split2.fsa is empty
BLAST Database error: No alias or index file found for nucleotide database [Bedadeti1_regions_split2.fsa] in search path [/data/scratch/mpx469/fast-plast/Bedadeti1/5_Plastome_Finishing::]
readline() on closed filehandle $file at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 21.
Use of uninitialized value $seqlen in subtraction (-) at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 38.
Use of uninitialized value $current_start in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 88.
Use of uninitialized value $current_end in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 88.
Argument "" isn't numeric in sort at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 90.
Argument "" isn't numeric in subtraction (-) at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 92.
Use of uninitialized value $final_start in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 137.
Use of uninitialized value $final_end in hash element at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 137.
Argument "" isn't numeric in sort at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 140.
Argument "" isn't numeric in subtraction (-) at /data/home/mpx469/software/Fast-Plast/bin/sequence_based_ir_id.pl line 142.
BLAST options error: File Bedadeti1_regions_split3.fsa is empty
BLAST Database error: No alias or index file found for nucleotide database [Bedadeti1_regions_split3.fsa] in search path [/data/scratch/mpx469/fast-plast/Bedadeti1/5_Plastome_Finishing::]
Command line argument error: Argument "query". File is not accessible:  `Final_Assembly/Bedadeti1_afin_iter2.fa'
Died at /data/home/mpx469/software/Fast-Plast/fast-plast.pl line 876.

Note that I found three filtered contigs in the directory 4_Afin_Assembly/filtered_spades_contigs.fsa. I annotated these contigs using GeSeq and these look like the LSC, IR and SSC.

GeSeqJob-20200202-190949_NODE_1_length_88719_cov_37 327195_OGDRAW

GeSeqJob-20200202-190949_NODE_2_length_34519_cov_72 692453_OGDRAW

GeSeqJob-20200202-190949_NODE_3_length_11076_cov_37 978119_OGDRAW

Was fast-plast unable to complete the assembly simply because the data to link the contigs was missing? Or is there a way I can optimise the assembly?

Best wishes Oliver

mrmckain commented 4 years ago

Hi Oliver, Afin did not run. That would put your spades contigs together. Check that you have things correctly installed (or more likely that the needed libs for afin are installed). afin needs a c++ complier with c++11 support and zlib.h. I use the GNU compile 4.8.4 (at least). You can check /data/home/mpx469/software/Fast-Plast/afin/afin to make sure it is working.

o-william-white commented 4 years ago

Hello,

Thanks for the reply. I thought it might have something to do with my installation.

I could get a help message from afin so I thought it was working ok

/data/home/mpx469/software/Fast-Plast/afin/afin -h
Usage: /data/home/mpx469/software/Fast-Plast/afin/afin -c contigsfile(s) -r readsfile(s) [-o outfile] [-m sort_char] [-s sub_len]
          [-l search_loops] [-i min_cov] [-p min_overlap] [-t max_threads]
          [-d initial_trim] [-e max_missed] [-f stop_ext] [-g mismatch] [-x extend_len]
          [--silent] [--no_log] [--no_fusion] [--verbose] [--print_fused]

       /data/home/mpx469/software/Fast-Plast/afin/afin -h [--help]

  -c,--contigsfiles       Space (or comma) separated list of files containing contigs
  -r,--readsfiles         Space (or comma) separated list of files containing reads
  -o,--outfile            Output will be printed to the outfile specified, with a .fa extension for the contigs and .log extension for the logfile
  -m,--sort_char          [default:   4] Sorts the reads by the first max_sort_char characters
  -s,--sub_len            [default: 100] Will focus on the current last contig_sub_len characters of the contig in each search
  -l,--search_loops       [default:  10] Will search against each contig a maximum of max_search_loops times to attempt extension
  -i,--min_cov            [default:   3] Will stop adding bp's once the coverage falls below min_cov
  -p,--min_overlap        [default:  20] Only those reads overlapping the contig by at least min_overlap bp's will be returned in each search
  -t,--max_threads        [default:   4] Will only run max_threads threads at a time
  -d,--initial_trim       [default:   0] Length to trim off the beginning and end of each contig at the start of the program
  -e,--max_missed         [default:   5] Maximum allowable mismatched bp's for each read when checking troubled contig fusions
  -f,--stop_ext           [default:  .5] During extension, if the percentage of reads remaining after cleaning is below stop_ext, do not extend here
  -g,--mismatch           [default:  .1] maximum percentage of mismatches allowed when fusing two contigs
  -x,--extend_len         [default:  40] Will add a max of extend_len bp's each search loop
  --silent                Suppress screen output
  --no_log                Suppress log file creation
  --no_fusion             Only extend, no attempt will be made to fuse contigs
  --verbose               Output additional information to logfile and/or screen (except if output to that location is suppressed)
  --print_fused           Print to file (_fused.fasta) fused contigs just before fusion, for inspecting the fusion locations

I initially had some trouble with my installation so I specified the paths in the perl script manually. See below:

grep "###directories" -A 13 /data/home/mpx469/software/Fast-Plast/fast-plast.pl
###directories
my $FPROOT = "$FindBin::RealBin";
my $AFIN_DIR = "$FPROOT/afin";
my $COVERAGE_DIR = "$FPROOT/Coverage_Analysis";
my $FPBIN = "$FPROOT/bin";
my $TRIMMOMATIC="/data/home/mpx469/software/Trimmomatic/Trimmomatic-0.39//trimmomatic-0.39.jar"; #path to trimmomatic executable
my $BOWTIE2="/share/apps/centos7/bowtie2/2.3.4/bin/bowtie2"; #path to bowtie2 executable
my $SPADES="/share/apps/centos7/spades/3.11.1/bin/spades.py"; #path to spades executable
my $BLAST="/share/apps/centos7/blast+/2.7.1/bin/"; #path to blast executable
my $SSPACE="/data/home/mpx469/software/sspace_basic/SSPACE_Basic.pl/SSPACE_Basic.pl"; #path to sspace exectuable
my $BOWTIE1="/share/apps/centos7/bowtie/1.2.0/bin/bowtie/bowtie"; #path to bowtie1 executable
my $JELLYFISH="/share/apps/centos7/jellyfish/2.2.6/bin/jellyfish/jellyfish"; #path to jellyfish2 excecutable
$ENV{'PATH'} = $PATH.':'.$BOWTIE1;

Does the input data need to be in the same directory as the script? Perhaps it wasn't able to find $FPROOT?

I am running the script in a different directory with the following input. perhaps I am missing a specific library as you mentioned above

module load perl
module load tbb/2018_U2
module load gcc
export PATH=/data/home/mpx469/software/Fast-Plast/:$PATH

fast-plast.pl -1 fastq-dump-Bedadeti1-r1.fq.gz -2 fastq-dump-Bedadeti1-r2.fq.gz --name Bedadeti1 --bowtie_index Zingiberales --coverage_analysis --clean light --threads 12

I am not sure about the libraries you mentioned so perhaps I should get in touch with my university support

Best wishes Oliver

o-william-white commented 4 years ago

Hello again,

Just to update you, I installed zlib locally and provided the path to the directory in my script as follows

export PATH=/data/home/mpx469/software/zlib/zlib-1.2.11/:$PATH

However, I got the same error. If I wanted to run afin independently to check it is working ok, is it possible to do this using the output I have already generated?

For example if I ran the following command, what would I input as the reads files?

/data/home/mpx469/software/Fast-Plast/afin/afin -c Bedadeti1/4_Afin_Assembly/filtered_spades_contigs.fsa -r <readsfiles>

Best wishes Ollie

mrmckain commented 4 years ago

Hi Ollie, If you run afin like this and use full paths to the files, you should be able to tell if it is working. /data/home/mpx469/software/Fast-Plast/afin/afin -c filtered_spades_contigs.fsa -r ../1_Trimmed_Reads/Bedadeti1.trimmed* -l 50 -f .1 -d 100 -x 1065 -p 10 -i 1 -o Bedadeti1_afin.

Data does not need to be in the same directory as Fast-Plast.

All outfiles (not just the error) should be looked at. Some of these programs print useful information to STDOUT and STDERR.

Best, Michael

o-william-white commented 4 years ago

Hi Michael,

Using the afin command you suggested I found that this step was quite memory intensive and it was higher than the limit set on my university computing facility. When I increase the limit it runs without issue.

Many thanks for your thoughts and sharing the software

Best wishes Ollie

mrmckain commented 4 years ago

Glad it worked out. Afin can be memory intensive; it depends on the data being used since it reads it all into memory.

Best, Michael