mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

flye usage #690

Closed emilydolivo97 closed 5 months ago

emilydolivo97 commented 6 months ago

I chose "nanopolish" to call variants. When I read the "nanopolish" documentation, I found that I should provide my filtered FASTQ files in FASTA format (I don't have FAST5 files). For this purpose I used "seqtk" to convert each FASTQ file (I have 10 files) into a FASTA file, and now I'm using "flye" for genome assembly. The problem is that the program is taking too long.
Is there a way to stop it at a specific stage, or to speed it up? The stage must be suitable for my nanopolish analysis.

This is my script:

import os
import subprocess


class FastqToAssembledFasta:
    def __init__(self, input_folder, output_folder):
        self.input_folder = input_folder
        self.output_folder = output_folder

    def convert_to_fasta(self):
        # Create the output folder if it doesn't exist
        if not os.path.exists(self.output_folder):
            os.makedirs(self.output_folder)

        # Get a list of all FASTQ files in the input folder
        fastq_files = [f for f in os.listdir(self.input_folder) if f.endswith('.fastq.gz')]

        # Convert each FASTQ file to FASTA format
        for fastq_file in fastq_files:
            input_path = os.path.join(self.input_folder, fastq_file)
            output_path = os.path.join(self.output_folder, fastq_file.replace('.fastq.gz', '.fasta'))
            seqtk_cmd = f'seqtk seq -a {input_path} > {output_path}'
            subprocess.run(seqtk_cmd, shell=True)

        return self.output_folder  # Return the output folder containing the converted FASTA files

    def assemble_reads(self, fasta_folder):
        # Assemble the reads using Flye
        assembly_output = os.path.join(self.output_folder, 'assembly.fasta')
        flye_cmd = f'flye --nano-raw {fasta_folder}/*.fasta --out-dir {self.output_folder} -t 8 --keep-haplotypes'
        subprocess.run(flye_cmd, shell=True)
        # Rename the assembly output to a more descriptive name
        os.rename(os.path.join(self.output_folder, 'assembly.fasta'), assembly_output)


input_folder = '/data/filtred_reads'
output_folder = '/data/converted_assembled_reads'

# Convert FASTQ files to FASTA format
converter = FastqToAssembledFasta(input_folder, output_folder)
fasta_folder = converter.convert_to_fasta()

# Assemble reads into a single FASTA file using Flye
converter.assemble_reads(fasta_folder)

mikolmogorov commented 5 months ago

Please see the manual for estimated running times for different datasets. It also explains how to stop and resume at different stages. If anything is unclear, feel free to follow up in this topic.
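
For readers landing on this thread, here is a minimal sketch of how stopping and resuming could be wired into a script like the one above. It relies on Flye's `--stop-after` and `--resume` options. The helper `build_flye_command`, the file names, and the stage name `contigger` are illustrative assumptions, so please check `flye --help` for the stage names your installed version accepts.

```python
from typing import List, Optional, Sequence


def build_flye_command(
    reads: Sequence[str],
    out_dir: str,
    threads: int = 8,
    stop_after: Optional[str] = None,
    resume: bool = False,
) -> List[str]:
    """Build a Flye argument list (hypothetical helper, not part of Flye)."""
    cmd = [
        "flye",
        "--nano-raw", *reads,       # same read type as in the original script
        "--out-dir", out_dir,
        "--threads", str(threads),
        "--keep-haplotypes",
    ]
    if resume:
        # Continue from the last completed stage found in out_dir
        cmd.append("--resume")
    if stop_after is not None:
        # Stage name is passed through unchecked; verify it with `flye --help`
        cmd += ["--stop-after", stop_after]
    return cmd


if __name__ == "__main__":
    # First run: stop early ("contigger" is an assumed stage name)
    first_run = build_flye_command(["reads.fasta"], "flye_out", stop_after="contigger")
    print(" ".join(first_run))

    # Later run: same reads and output directory, picking up where it stopped
    second_run = build_flye_command(["reads.fasta"], "flye_out", resume=True)
    print(" ".join(second_run))

    # To actually launch Flye, pass the list to subprocess.run, for example:
    # subprocess.run(first_run, check=True)
```

Passing the command as a list rather than a shell string avoids quoting problems with paths, but it also means a `*.fasta` glob is not expanded by the shell, so the read files have to be listed explicitly (for example with `glob.glob`).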