kangxiongbin / StrainXpress

StrainXpress is a de novo assembly method based on the overlap-layout-consensus (OLC) paradigm that can quickly and accurately assemble high-complexity metagenome sequencing data at strain resolution.
GNU General Public License v3.0

Sequence names in the output file "all.contigs_15000.fasta" are repeated numbers #9

Closed Yaqiao-Li closed 1 year ago

Yaqiao-Li commented 1 year ago

When testing with both the example fastq and my own fastq, the contigs are exported to the file "all.contigs_15000.fasta". The sequences are named with repeated numbers.

[Attached screenshot: Screen Shot 2023-04-13 at 5 00 07 PM]

kangxiongbin commented 1 year ago

This is normal: these small contigs come from many small clusters, so duplicate names occur. Also, this is not the final file; the final result file is final_contigs.fasta in the stageb folder. You can also rename the contigs in this intermediate file yourself.
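
For anyone who wants unique names in the intermediate file, here is a minimal sketch of one way to rename the headers. It is not part of StrainXpress; the function name `uniquify_fasta_headers` and the `_<n>` suffix scheme are assumptions made for illustration.

```python
from collections import Counter


def uniquify_fasta_headers(lines):
    """Yield FASTA lines, appending a per-name running index to every header.

    Hypothetical helper, not part of StrainXpress. A header ">0" seen twice
    becomes ">0_1" and ">0_2"; sequence lines pass through unchanged.
    """
    seen = Counter()
    for line in lines:
        if line.startswith(">"):
            # Use the full header text as the name; fall back to "contig" if empty.
            name = line[1:].strip() or "contig"
            seen[name] += 1
            yield f">{name}_{seen[name]}\n"
        else:
            # Normalise line endings so the output is always newline-terminated.
            yield line.rstrip("\n") + "\n"
```

It could be applied by opening all.contigs_15000.fasta for reading, passing the file handle to `uniquify_fasta_headers`, and writing the yielded lines to a new file with `writelines`.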

Yaqiao-Li commented 1 year ago

It looks like StrainXpress was not installed correctly. I followed the installation instructions on a Linux machine, but when I test with the example fq, it reports the following error:

[04:47:31 mms ~/StrainXpress/example]$ python ../scripts/strainxpress.py -fq all_reads.fq
successfully execute: split all_reads.fq -l 600 -d -a 2 sub
successfully execute: cat cmd_overlap.sh | xargs -i -P 10 bash -c "{}";
successfully execute: for X in sub.map; do sort -k3 -nr < $X > sorted-$X; done;
successfully execute: sort -k3 -nr -m sorted-sub.map > all_reads_sort.map;
successfully execute: rm sub;
successfully execute: python /home/ubuntu/StrainXpress/scripts/get_readnames.py all_reads.fq readnames.txt
successfully execute: python /home/ubuntu/StrainXpress/scripts/bin_pointer_limited_filechunks_shortpath.py all_reads_sort.map readnames.txt 15000 strainxpress 10
pid 20313's current affinity mask: ffff
pid 20313's new affinity mask: ff
successfully execute: python /home/ubuntu/StrainXpress/scripts/getclusters.py strainxpress_max15000_final 10
begin...
##################################################
the 1/1 part start...
this is the: 0 for 100w lines
the 1/1 part finished...

##################################################
successfully execute: python /home/ubuntu/StrainXpress/scripts/get_fq_cluster.py strainxpress_max15000_final_clusters_grouped.json all_reads.fq /home/ubuntu/StrainXpress/example/fq_15000
successfully execute: rm -rf Chunkfile; rm strainxpress_max15000_final_clustersizes.json strainxpress_max15000_final_clusters_unchained.json strainxpress_max15000_final_clusters.json
successfully execute: cat cmd_polyte.sh | xargs -i -P 10 bash -c "{}";
successfully execute: cat /home/ubuntu/StrainXpress/example/fq_15000//contigs.fasta > all.contigs_15000.fasta
successfully execute: mkdir -p stageb
[M::mm_idx_gen::0.0280.16] collected minimizers
[M::mm_idx_gen::0.0420.23] sorted minimizers
[M::main::0.0420.23] loaded/built the index for 72 target sequence(s)
[M::mm_mapopt_update::0.0420.24] mid_occ = 10
[M::mm_idx_stat] kmer size: 21; skip: 11; is_hpc: 0; #seq: 72
[M::mm_idx_stat::0.0430.24] distinct minimizers: 5250 (69.41% are singletons); average occurrences: 1.397; average spacing: 6.283; total length: 46076
[M::worker_pipeline::0.0570.37] mapped 72 sequences
[M::main] Version: 2.24-r1122
[M::main] CMD: minimap2 -t 10 --sr -X -c -k 21 -w 11 -s 60 -m 30 -n 2 -r 0 -A 4 -B 2 --end-bonus=100 ../contigs_b.fastq ../contigs_b.fastq
[M::main] Real time: 0.058 sec; CPU: 0.022 sec; Peak RSS: 0.007 GB
successfully execute: cd stageb; minimap2 -t 10 --sr -X -c -k 21 -w 11 -s 60 -m 30 -n 2 -r 0 -A 4 -B 2 --end-bonus=100 ../contigs_b.fastq ../contigs_b.fastq | python /home/ubuntu/StrainXpress/scripts/filter_trans_ovlp_inline_v3.py -len 100 -iden 0.99 -oh 2 -sfo > sfoverlaps.out;
successfully execute: cd stageb; python /home/ubuntu/StrainXpress/scripts/sfo2overlaps.py --in sfoverlaps.out --out sfoverlap.out.savage --num_singles 72 --num_pairs 0; mkdir -p fastq; cp ../contigs_b.fastq ./fastq/singles.fastq;
pipeline_per_stage.py
Stage b done in 3 iterations
Maximum read length per iteration: [3024, 3057, 3057]
Number of contigs per iteration: [63, 61, 61]
Number of overlaps per iteration: [69, 45, 42, 42]
rm: cannot remove 'contigs_b.fasta': No such file or directory
rm: cannot remove 'contigs_b.fastq': No such file or directory
rm: cannot remove 'contigs_b.fa': No such file or directory
successfully execute: cd stageb; python /home/ubuntu/StrainXpress/scripts/pipeline_per_stage.v3.py --no_error_correction --remove_branches true --stage b --min_overlap_len 100 --min_overlap_perc 0 --edge_threshold 1 --overlaps ./sfoverlap.out.savage --fastq ./fastq --max_tip_len 1000 --num_threads 10; python /home/ubuntu/StrainXpress/scripts/fastq2fasta.py ./singles.fastq ./final_contigs.fasta; rm -r contigs_b.fasta contigs_b.fastq contigs_b.fa fastq graph p s*;

Could you help to look at this? Thank you!

kangxiongbin commented 1 year ago

We are delighted that you are interested in StrainXpress. Based on the log, there seem to be no issues other than the failure to find redundant files during deletion. Could you please confirm whether the final result file final_contigs.fasta was generated? If so, is it empty?
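
A quick way to answer both questions at once is sketched below. It is a hypothetical helper, not something shipped with StrainXpress, and the path stageb/final_contigs.fasta is assumed to be relative to the directory where the pipeline was run.

```python
import os


def fasta_summary(path):
    """Return (exists, size_in_bytes, number_of_records) for a FASTA file.

    Hypothetical helper for checking pipeline output; records are counted
    as lines starting with ">".
    """
    if not os.path.isfile(path):
        return (False, 0, 0)
    with open(path) as handle:
        n_records = sum(1 for line in handle if line.startswith(">"))
    return (True, os.path.getsize(path), n_records)
```

Calling `fasta_summary("stageb/final_contigs.fasta")` from the example directory would show whether the file exists, whether it is empty, and how many contigs it holds.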

Yaqiao-Li commented 1 year ago

The final_contigs.fasta has not been generated. There is a file "all.contigs_15000.fasta".

Yaqiao-Li commented 1 year ago
[Attached screenshot: Screen Shot 2023-04-17 at 11 07 47 PM]

kangxiongbin commented 1 year ago

Can you go to the stageb folder and run the following commands:

cd stageb; minimap2 -t 10 --sr -X -c -k 21 -w 11 -s 60 -m 30 -n 2 -r 0 -A 4 -B 2 --end-bonus=100 ../contigs_b.fastq ../contigs_b.fastq | python /home/ubuntu/StrainXpress/scripts/filter_trans_ovlp_inline_v3.py -len 100 -iden 0.99 -oh 2 -sfo > sfoverlaps.out;

python /home/ubuntu/StrainXpress/scripts/sfo2overlaps.py --in sfoverlaps.out --out sfoverlap.out.savage --num_singles 72 --num_pairs 0; mkdir -p fastq; cp ../contigs_b.fastq ./fastq/singles.fastq;

Then let me see what files are generated in the "stageb" folder.

Yaqiao-Li commented 1 year ago
[Attached screenshot: Screen Shot 2023-04-17 at 11 27 15 PM]

Please see the attached screenshot of running the commands in the stageb folder.

kangxiongbin commented 1 year ago

It seems to have generated the final result file "final_contigs.fasta". I am confused as to why it would throw an error when running directly in StrainXpress but runs fine separately. Could you show me the code from lines 135 to 171 in the strainxpress.py file that you have installed? I assume you haven't made any modifications to it yourself, right?

Yaqiao-Li commented 1 year ago
[Attached screenshot: Screen Shot 2023-04-18 at 9 08 51 AM]

I didn't make any changes after installation.

kangxiongbin commented 1 year ago

I'm very surprised and don't know why it runs the rm first. Can you delete the command on line 152: "rm -r contigs_b.fasta contigs_b.fastq contigs_b.fa fastq graph p s*;" and rerun StrainXpress? Hope it works!

Yaqiao-Li commented 1 year ago

Hi Xiongbin, I changed line 151 in strainxpress.py from

151 --fastq ./fastq --max_tip_len 1000 --num_threads %s; python %s/fastq2fasta.py ./singles.fastq \

to

151 --fastq ./fastq --max_tip_len 1000 --num_threads %s; python %s/fastq2fasta.py ./fastq/singles.fastq \

I just added 'fastq/' before singles.fastq, and it now runs with no errors reported. Hope this helps.

Thank you very much for helping me debug.
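
A side note on why the cleanup ran at all: in the logged command the steps are joined with `;`, and a shell runs the next command after `;` regardless of whether the previous one succeeded. Below is a minimal sketch of a fail-fast alternative, where a failed step stops the chain before any cleanup. `run_steps` is a hypothetical name and this is not how strainxpress.py is written.

```python
import subprocess


def run_steps(steps, cwd=None):
    """Run each command (a list of arguments) in order, stopping at the first failure.

    Hypothetical helper. check=True makes subprocess.run raise
    CalledProcessError on a non-zero exit status, so later steps
    (for example an rm cleanup) never run after a failed step.
    """
    for step in steps:
        subprocess.run(step, cwd=cwd, check=True)
```

With this pattern, a conversion step that cannot find its input would raise an error instead of being followed silently by the removal of intermediate files.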

Yaqiao-Li commented 1 year ago

In lines 114 and 115:

114 if os.path.exists('contigs_b.fastq'):
115     execute("rm contigs_b.fastq")

Maybe the error occurred because the file 'contigs_b.fastq' has already been deleted, and then in line 152 it tried to delete this file again, but could not find it.
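
If the double deletion is the cause, one way to make the cleanup tolerant of files that are already gone is sketched below. `remove_if_present` is a hypothetical helper, not code from strainxpress.py.

```python
import os
import shutil


def remove_if_present(paths):
    """Delete each existing file or directory; return the paths that were missing.

    Hypothetical helper: missing paths are reported rather than raising,
    so a file removed earlier in the pipeline does not cause an error.
    """
    missing = []
    for path in paths:
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # remove a directory and its contents
        elif os.path.lexists(path):
            os.remove(path)  # remove a regular file or a symlink
        else:
            missing.append(path)
    return missing
```

For example, `remove_if_present(["contigs_b.fasta", "contigs_b.fastq", "contigs_b.fa", "fastq"])` would delete whatever is still there and return the names that had already been removed.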