shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
https://bioinf.shenwei.me/seqkit
MIT License
1.29k stars 158 forks source link

Fail to write out all the splited output #391

Closed PanZiwei closed 1 year ago

PanZiwei commented 1 year ago

I am using seqkit v2.4.0 and I tried to split my custom fasta by 4 parts with the command seqkit split2 -p 4 -O ${output_path} ${input_fasta}. However, the write function doesn't work. It only saved 1 single file instead of all parts.

Would really appreciate it if any hints can be provided to solve the problem. Thanks!

Describe your issue

(nanotest) [c-panz@sumner082]$ seqkit split2 -p 4 -O ${output_path} ${input_fasta}
[INFO] split seqs from /Ziwei/data/test.fasta
[INFO] split into 4 parts
[INFO] write 1 sequences to file: /Ziwei/B168.ecDNA_updated.part_001.fasta
(nanotest) [c-panz@sumner082]$ head -1 ${input_fasta}
>test.series 
shenwei356 commented 1 year ago

Is there only one sequence in the input file? Can you please check with the command below?

seqkit stats ${input_fasta}

seqkit split/split2 splits multiple sequences into some partitions/chunks/parts. e.g.,

$ seqkit stats ../tests/hairpin.fa
file                 format  type  num_seqs    sum_len  min_len  avg_len  max_len
../tests/hairpin.fa  FASTA   RNA     28,645  2,949,871       39      103    2,35

$ seqkit split2 -p 4 ../tests/hairpin.fa -O a
[INFO] split seqs from ../tests/hairpin.fa
[INFO] split into 4 parts
[INFO] write 7162 sequences to file: a/hairpin.part_001.fa
[INFO] write 7161 sequences to file: a/hairpin.part_002.fa
[INFO] write 7161 sequences to file: a/hairpin.part_003.fa
[INFO] write 7161 sequences to file: a/hairpin.part_004.fa

Did you intend to cut the input sequence into 4 segments? If so, you can use seqkit sliding, but you have to compute the sliding window (-W) and step size (-s). Another method way is using kmcp utils split-genomes:

$ seqkit stats t.fa 
file  format  type  num_seqs  sum_len  min_len  avg_len  max_len
t.fa  FASTA   RNA          1       99       99       99       99

# overlap of adjacent segments can be set with --split-overlap
$ kmcp utils split-genomes -m 1 -k 1 -n 4 t.fa -O b --force
09:34:17.731 [INFO] splitting the reference genome ...
09:34:17.732 [INFO]   genome size: 99, splitNumber: 4, splitOverlap: 0, splitSize: 25, step: 25, greedy: true
09:34:17.742 [INFO]   seq length   of the 4 chunks: 25, 25, 25, 24

$ seqkit stats b/*
processed files:  4 / 4 [======================================] ETA: 0s. done
file              format  type  num_seqs  sum_len  min_len  avg_len  max_len
b/chunk001.fa.gz  FASTA   RNA          1       25       25       25       25
b/chunk002.fa.gz  FASTA   RNA          1       25       25       25       25
b/chunk003.fa.gz  FASTA   RNA          1       25       25       25       25
b/chunk004.fa.gz  FASTA   RNA          1       24       24       24       24
PanZiwei commented 1 year ago

Hi Wei, Thanks for the prompt reply. Yeah I only have 1 sequence (2.2Mbp) in the input file. I want to divide the fasta into multiple fasta files with the seq_length=0.5Mbp, i.e., I want to divide the reference genome into 5 fragments (0-0.5Mbp, 0.5Mbp-1Mbp, 1Mbp-1.5Mbp, 1.5Mbp-2.0Mbp, 2.0Mbp-end). How can I achieve it? I tried kmcp utils split-genomes -m 1 -k 1 -n 5 ${input_fasta} -O ${output_path} --force, but it will divide the sequence evenly. I also tried seqkit sliding -s 5000000 -W 5000000 ${input_fasta} > ${output_fasta}, but the output didn't contain the region 2.0Mbp~end.

(nanotest) [c-panz@sumner-log2 ~]$  seqkit stats ${input_fasta}
file                     format  type  num_seqs    sum_len    min_len    avg_len    max_len
custom.fasta  FASTA   DNA          1  2,161,954  2,161,954  2,161,954  2,161,954
shenwei356 commented 1 year ago

output didn't contain the region 2.0Mbp~end

Please add the flag -g:

  -g, --greedy            greedy mode, i.e., exporting last subsequences even shorter than windows size
PanZiwei commented 1 year ago

Thanks for the hint! I can successfully achieve my goal with the command below:

seqkit sliding -s 5000000 -W 5000000 ${input_fasta} -g > ${output_fasta}
seqkit split2 --by-size 1 ${output_fasta} -O ${output_path}

One more question, for the example Generate GC content for ploting in seqkit sliding, can you explain a ilttle bit more about the seqkit sliding -s 5 -W 30? What is the 2nd column?

$ zcat hairpin.fa.gz \
    | seqkit sliding -s 5 -W 30 \
    | seqkit fx2tab -n -g
cel-let-7_sliding:1-30          50.00
cel-let-7_sliding:6-35          46.67
cel-let-7_sliding:11-40         43.33
cel-let-7_sliding:16-45         36.67
cel-let-7_sliding:21-50         33.33
cel-let-7_sliding:26-55         40.00
shenwei356 commented 1 year ago

It's just the GC content (percentage of bases G and C)

  -g, --gc                     print GC content
PanZiwei commented 1 year ago

I assume (C+G)% = (number of C + number of G) in the sequence of the window size / window size (30 in the example)?

shenwei356 commented 1 year ago

Yes, I think so. Am I missing something?

$ zcat hairpin.fa.gz \
    | seqkit sliding -s 5 -W 30 \
    | seqkit seq --rna2dna \
    | seqkit fx2tab -g -C G,C,A,T \
    | head -n 10 \
    | csvtk add-header -Ht -n 'name,seq,qual,GC-content,#G,#C,#A,#T' \
    | csvtk pretty -t

name                      seq                              qual   GC-content   #G   #C   #A   #T
-----------------------   ------------------------------   ----   ----------   --   --   --   --
cel-let-7_sliding:1-30    TACACTGTGGATCCGGTGAGGTAGTAGGTT          50.00        11   4    6    9
cel-let-7_sliding:6-35    TGTGGATCCGGTGAGGTAGTAGGTTGTATA          46.67        12   2    6    10
cel-let-7_sliding:11-40   ATCCGGTGAGGTAGTAGGTTGTATAGTTTG          43.33        11   2    6    11
cel-let-7_sliding:16-45   GTGAGGTAGTAGGTTGTATAGTTTGGAATA          36.67        11   0    8    11
cel-let-7_sliding:21-50   GTAGTAGGTTGTATAGTTTGGAATATTACC          33.33        8    2    8    12
cel-let-7_sliding:26-55   AGGTTGTATAGTTTGGAATATTACCACCGG          40.00        8    4    8    10
cel-let-7_sliding:31-60   GTATAGTTTGGAATATTACCACCGGTGAAC          40.00        7    5    9    9
cel-let-7_sliding:36-65   GTTTGGAATATTACCACCGGTGAACTATGC          43.33        7    6    8    9
cel-let-7_sliding:41-70   GAATATTACCACCGGTGAACTATGCAATTT          36.67        5    6    10   9
cel-let-7_sliding:46-75   TTACCACCGGTGAACTATGCAATTTTCTAC          40.00        4    8    8    10
PanZiwei commented 1 year ago

Thanks for the clarification and the code! Really helpful. Everything looks fine to me. I will close the issue then.