linzhi2013 / MitoZ

MitoZ: A toolkit for assembly, annotation, and visualization of animal mitochondrial genomes
https://doi.org/10.1093/nar/gkz173
GNU General Public License v3.0

Parameter combination recommendations #215

Open kiran-lee opened 4 months ago

kiran-lee commented 4 months ago

On which platform/server? (Windows? Windows Subsystem for Linux? macOS? Ubuntu? etc.)

Linux

MitoZ version?

3.6

How did you install MitoZ? (e.g. Docker, Udocker, Singularity, Conda-Pack, Conda, or source code)

Conda

Did you run a test after your installation, and was the test run okay?

Yes. OK.

How much data (roughly) did you use for mitogenome assembly? e.g. 5Gbp?

25 Gbp.

The command you used?

mitoz all \
    --outprefix sw \
    --thread_number 20 \
    --clade Chordata \
    --requiring_taxa Chordata \
    --genetic_code 2 \
    --species_name "Seychelles warbler" \
    --fq1 102_ACTTAGATCG-CGGAATTCTT_L002trimmed_paired_R1.fastq.gz \
    --fq2 102_ACTTAGATCG-CGGAATTCTT_L002trimmed_paired_R2.fastq.gz \
    --fastq_read_length 151 \
    --data_size_for_mt_assembly 25,0 \
    --assembler megahit \
    --kmers_megahit 39 59 79 99 119 141 \
    --memory 100 \
    --requiring_taxa Chordata \
    --min_abundance 0

Problem description

From your experience, do you have suggestions for parameter combinations to use on a sample of raw paired-end reads with a mean read depth of 15x?

I have tried 13 combinations that vary in 1) the sample used (either a ~17x or a 10x coverage sample), 2) the assembler used (megahit or spades), 3) the data size used for assembly (5, 25, 50 and 80), 4) the k-mers ("Large": 39 59 79 99 119 141, or "Small": 21 31 41 51 61 71 81 91) and 5) whether the reads were trimmed or not. The attached table summarises the combinations I have tried (MitoZ_combinations.xlsx).
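A sweep like this can be written as a dry-run loop that only prints the commands, so each combination can be checked before anything is submitted. This is a minimal sketch, not the script actually used: it covers only the megahit runs, because those are the only assembler flags shown in this issue. The `--outprefix` naming scheme and the function name `print_sweep` are made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Input files as given in the command in this issue.
fq1="102_ACTTAGATCG-CGGAATTCTT_L002trimmed_paired_R1.fastq.gz"
fq2="102_ACTTAGATCG-CGGAATTCTT_L002trimmed_paired_R2.fastq.gz"

# The two k-mer sets described in the issue.
large_kmers="39 59 79 99 119 141"
small_kmers="21 31 41 51 61 71 81 91"

# Print (do not run) one mitoz command per data size and k-mer set.
print_sweep() {
    local size label kmers
    for size in 5 25 50 80; do
        for label in large small; do
            if [ "$label" = large ]; then
                kmers="$large_kmers"
            else
                kmers="$small_kmers"
            fi
            # The outprefix pattern sw_<size>G_<label> is a hypothetical naming choice.
            echo "mitoz all --outprefix sw_${size}G_${label}" \
                 "--thread_number 20 --clade Chordata --requiring_taxa Chordata" \
                 "--genetic_code 2 --species_name \"Seychelles warbler\"" \
                 "--fq1 $fq1 --fq2 $fq2 --fastq_read_length 151" \
                 "--data_size_for_mt_assembly ${size},0" \
                 "--assembler megahit --kmers_megahit $kmers" \
                 "--memory 100 --min_abundance 0"
        done
    done
}

print_sweep
```

Each printed line can be reviewed and then pasted into a job script. Separate output prefixes keep the runs from overwriting one another.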

The command that works best (shown above) finds all genes, but the assembly is non-circular and produces two seq_ids (combo5summary.txt). The read depth across the genome looks OK apart from the beginning (combo5circos.depth.txt). This is the command:

mitoz all \
    --outprefix sw \
    --thread_number 20 \
    --clade Chordata \
    --requiring_taxa Chordata \
    --genetic_code 2 \
    --species_name "Seychelles warbler" \
    --fq1 102_ACTTAGATCG-CGGAATTCTT_L002trimmed_paired_R1.fastq.gz \
    --fq2 102_ACTTAGATCG-CGGAATTCTT_L002trimmed_paired_R2.fastq.gz \
    --fastq_read_length 151 \
    --data_size_for_mt_assembly 25,0 \
    --assembler megahit \
    --kmers_megahit 39 59 79 99 119 141 \
    --memory 100 \
    --requiring_taxa Chordata \
    --min_abundance 0

The raw paired-end reads can be found here:
102: https://cgr.liv.ac.uk/illum/LIMS26629_51a15827930a0b65/Raw/Sample_102/
53: https://cgr.liv.ac.uk/illum/LIMS25133_4f8b5ec41474a239/Raw/Sample_53-11998DH0147L01_4879/

Log messages from MitoZ (stdout and stderr, e.g., both m.log and m.err files)

Attached as combo5.log and combo5errorsummaryval.txt

linzhi2013 commented 4 months ago

Hi @kiran-lee ,

Thanks for your detailed explanation!

Based on my experience with mammals (your samples are birds), 2-5 Gbp, or 8 Gbp, of data is usually enough to assemble a circular mitogenome.
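For scale, these data sizes can be converted into numbers of read pairs. The following is only a rough sketch: it assumes 151 bp paired-end reads, as in the command in this issue, and takes 1 Gbp to mean 10^9 bases. The function name `pairs_for_gbp` is made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Number of read pairs needed to reach a given data size.
# $1: data size in Gbp (integer), $2: read length in bp.
# Each pair contributes 2 * read_length bases; the result is rounded down.
pairs_for_gbp() {
    local gbp="$1" read_len="$2"
    echo $(( gbp * 1000000000 / (2 * read_len) ))
}

for gbp in 2 5 8 25; do
    echo "${gbp} Gbp at 151 bp paired-end: $(pairs_for_gbp "$gbp" 151) read pairs"
done
```

With these assumptions, 5 Gbp corresponds to roughly 16.6 million read pairs, and 25 Gbp to roughly 82.8 million.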

I have no better recommendations right now. But maybe you could map all the raw data to the mitogenomes of some closely related species, using a loose cutoff so that many alignable reads are kept, and then assemble the mitogenome from the mapped reads with MitoZ?
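One possible way to do this read-baiting step uses bwa and samtools. It is a sketch under stated assumptions, not a tested recipe from the MitoZ author. The file names (`related_mitogenomes.fasta`, `raw_R1.fastq.gz`, `raw_R2.fastq.gz`, `mito_R1.fastq.gz`, `mito_R2.fastq.gz`) are hypothetical. The "loose cutoff" is approximated by lowering the `bwa mem` seed length (`-k`) and minimum output score (`-T`) below their defaults, and the particular values are arbitrary. `samtools view -G 12` drops only pairs in which both mates are unmapped, so a pair is kept if either mate aligns.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical inputs: mitogenomes of closely related species and the raw reads.
ref="related_mitogenomes.fasta"
fq1="raw_R1.fastq.gz"
fq2="raw_R2.fastq.gz"
threads=20

map_and_extract() {
    # Build the BWA index for the reference mitogenomes.
    bwa index "$ref"

    # Map with relaxed settings (defaults are -k 19 and -T 30), discard pairs
    # where both mates are unmapped, and write the remaining pairs as FASTQ.
    # bwa mem emits mates next to each other, so no name sorting is done here.
    bwa mem -t "$threads" -k 15 -T 20 "$ref" "$fq1" "$fq2" \
        | samtools view -b -G 12 - \
        | samtools fastq -n \
            -1 mito_R1.fastq.gz -2 mito_R2.fastq.gz \
            -0 /dev/null -s /dev/null -
}

# Run only if the tools and the reference are actually present.
if command -v bwa >/dev/null && command -v samtools >/dev/null && [ -s "$ref" ]; then
    map_and_extract
else
    echo "bwa, samtools or $ref not found; nothing was run" >&2
fi
```

The resulting `mito_R1.fastq.gz` and `mito_R2.fastq.gz` would then be passed to `--fq1` and `--fq2` of the `mitoz all` command in place of the full read set.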

Best