linzhi2013 / MitoZ

MitoZ: A toolkit for assembly, annotation, and visualization of animal mitochondrial genomes
https://doi.org/10.1093/nar/gkz173
GNU General Public License v3.0
114 stars 39 forks source link

Endless read from file #218

Open HaoqingDu opened 1 month ago

HaoqingDu commented 1 month ago

I installed mitoz v3.6 using conda. The test run was ok. I wanted to generate mitogenome from a SE data. Below is my command,

mitoz all --outprefix PhoSp1_trimmed --clade Arthropoda --genetic_code 5 --species_name "Photinus sp1" --fq1 PhoSp1_trimmed.fastq.gz --fastq_read_length 151 --assembler mitoassemble --kmers 51 --requiring_taxa Arthropoda

In the assemble part, read_from_file seems to never stop unless I kill it. Below is the related log info,

2024-08-08 11:37:24,164 - mitoz.utility.utility - INFO - 
/home/hdu/miniconda3/envs/mitoz_env/lib/python3.8/site-packages/mitoz/assemble/script/mitoAssemble/mitoAssemble all  -K 51 -o PhoSp1_trimmed.mitoAssemble.K51 -s PhoSp1_trimmed.mitoAssemble.K51.mitoAssemble.lib -p 8

mitoAssemble is based on SoapTrans v1.03 and modified to fit mitogenome assembly by BGI-ENVers

pregraph -s PhoSp1_trimmed.mitoAssemble.K51.mitoAssemble.lib -K 51 -p 8 -o PhoSp1_trimmed.mitoAssemble.K51 
In PhoSp1_trimmed.mitoAssemble.K51.mitoAssemble.lib, 1 libs, max seq len 151, max name len 256

8 thread created
read from file:
 /home/hdu/Documents/firefly_genomes/results/mitoz/clean_data/PhoSp1_trimmed.clean_R1.fq.gz
--- 100000000th reads
--- 200000000th reads
--- 300000000th reads
--- 400000000th reads
...

I don't know how to solve it.

linzhi2013 commented 1 month ago

Hi @HaoqingDu ,

/home/hdu/miniconda3/envs/mitoz_env/lib/python3.8/site-packages/mitoz/assemble/script/mitoAssemble/mitoAssemble all  -K 51 -o PhoSp1_trimmed.mitoAssemble.K51 -s PhoSp1_trimmed.mitoAssemble.K51.mitoAssemble.lib -p 8

Could you please have a look at the file: PhoSp1_trimmed.mitoAssemble.K51.mitoAssemble.lib and check if everything inside is okay?

HaoqingDu commented 1 month ago

Hi @linzhi2013

Thanks for reply. Here is the lines in the lib file

max_rd_len=151
[LIB]
avg_ins=250
reverse_seq=0
asm_flags=3
map_len=32
q=/home/hdu/Documents/firefly_genomes/results/mitoz/clean_data/PhoSp1_trimmed.clean_R1.fq.gz

I also compared it with the lib file for the test data, below is the test data lib file

max_rd_len=150
[LIB]
avg_ins=250
reverse_seq=0
asm_flags=3
map_len=32
q=/home/hdu/Documents/firefly_genomes/data/mitoz_test/test.R1.fq.gz

They look alike, so I assume this may not be the reason?

And the difference between my run and the successful test run is, at this step, inside the K51 directory /home/hdu/Documents/firefly_genomes/results/mitoz/mt_assembly/mitoAssemble/K51, there is only this lib file, no other files. Is this normal?

linzhi2013 commented 1 month ago

check the content of /home/hdu/Documents/firefly_genomes/results/mitoz/clean_data/PhoSp1_trimmed.clean_R1.fq.gz

and check if it is broken:

gzip -t /home/hdu/Documents/firefly_genomes/results/mitoz/clean_data/PhoSp1_trimmed.clean_R1.fq.gz

You can also run the following command directly and check what will happen:

/home/hdu/miniconda3/envs/mitoz_env/lib/python3.8/site-packages/mitoz/assemble/script/mitoAssemble/mitoAssemble all  -K 51 -o PhoSp1_trimmed.mitoAssemble.K51 -s PhoSp1_trimmed.mitoAssemble.K51.mitoAssemble.lib -p 8
HaoqingDu commented 1 month ago

Hi,

I just tried both.

First, gzip -t returns nothing, which means the file is complete, right?

Second, when I run mitoAssemble, same thing happens

(mitoz_env) hdu@wla110:~/Documents/firefly_genomes/results/mitoz/mt_assembly/mitoAssemble/K51> /home/hdu/miniconda3/envs/mitoz_env/lib/python3.8/site-packages/mitoz/assemble/script/mitoAssemble/mitoAssemble all  -K 51 -o PhoSp1_trimmed.mitoAssemble.K51 -s PhoSp1_trimmed.mitoAssemble.K51.mitoAssemble.lib -p 8

mitoAssemble is based on SoapTrans v1.03 and modified to fit mitogenome assembly by BGI-ENVers

pregraph -s PhoSp1_trimmed.mitoAssemble.K51.mitoAssemble.lib -K 51 -p 8 -o PhoSp1_trimmed.mitoAssemble.K51 
In PhoSp1_trimmed.mitoAssemble.K51.mitoAssemble.lib, 1 libs, max seq len 151, max name len 256

8 thread created
read from file:
 /home/hdu/Documents/firefly_genomes/results/mitoz/clean_data/PhoSp1_trimmed.clean_R1.fq.gz
--- 100000000th reads
--- 200000000th reads
--- 300000000th reads
--- 400000000th reads
--- 500000000th reads
--- 600000000th reads
--- 700000000th reads
--- 800000000th reads
--- 900000000th reads
--- 1000000000th reads
--- 1100000000th reads
--- 1200000000th reads
--- 1300000000th reads
--- 1400000000th reads
--- 1500000000th reads
--- 1600000000th reads
--- 1700000000th reads
--- 1800000000th reads
linzhi2013 commented 1 month ago

1800000000 * 150 = 270,000,000,000 bp

You are using 270 Gbp for mitochondrial genome assembly, which is too much, the program cannot handle such a big data.

Try to use 3-5 Gbp data instead.

HaoqingDu commented 1 month ago

The raw reads data (fq.gz format) is 30 Gb. I trimmed it with trimmomatic. So the input file fed to Mitoz is only 13 Mb large and contains around 37k reads.

I also tried to run mitoz all with --data_size_for_mt_assembly 3,0 to subsample the data if it were too large. But the same thing happened in read from file.

linzhi2013 commented 1 month ago

The raw reads data (fq.gz format) is 30 Gb. I trimmed it with trimmomatic. So the input file fed to Mitoz is only 13 Mb large and contains around 37k reads.

But the log says

read from file:
 /home/hdu/Documents/firefly_genomes/results/mitoz/clean_data/PhoSp1_trimmed.clean_R1.fq.gz
....
--- 1,800,000,000th reads

which is 1,800 M reads.

I also tried to run mitoz all with --data_size_for_mt_assembly 3,0 to subsample the data if it were too large. But the same thing happened in read from file.

I have no idea what happened there ...

HaoqingDu commented 1 month ago

I tried to run with megahit.

mitoz all \
--outprefix PhoSp1_trimmed \
--clade Arthropoda \
--genetic_code 5 \
--species_name "Photinus sp1" \
--fq1 PhoSp1_trimmed.fastq.gz \
--fastq_read_length 151 \
--data_size_for_mt_assembly 3,0 \
--assembler megahit \
--kmers_megahit 51 \
--requiring_taxa Arthropoda \
--min_abundance 0

The good thing is mitoz generated the mitogenome. So I guess the issue may result from mitoAssembler.

But Another thing is the generated genome is only 2450bp, super short. So my data might be not good enough either. However, it seems to be contradictory, because from the short genome I would say my data doesn't have enough reads, while mitoAssembler was loading infinite number of reads from my data.

linzhi2013 commented 1 month ago

For megahit, please make use of its multi-kmer feature, i.e., to use several kmers at the same time, which might be helpful.

In your code, you used only a single kmer, --kmers_megahit 51.

HaoqingDu commented 1 month ago

Thanks for the advice. Using multiple kmers does help! However, the generated mito-genome is still less than 5k bp. So I really need to improve my data.

linzhi2013 commented 1 month ago

Perhaps try small kmer sizes like 31, 35, 39 etc