Closed PanZiwei closed 1 year ago
Is there only one sequence in the input file? Can you please check with the command below?
seqkit stats ${input_fasta}
seqkit split/split2
splits multiple sequences into some partitions/chunks/parts. e.g.,
$ seqkit stats ../tests/hairpin.fa
file format type num_seqs sum_len min_len avg_len max_len
../tests/hairpin.fa FASTA RNA 28,645 2,949,871 39 103 2,35
$ seqkit split2 -p 4 ../tests/hairpin.fa -O a
[INFO] split seqs from ../tests/hairpin.fa
[INFO] split into 4 parts
[INFO] write 7162 sequences to file: a/hairpin.part_001.fa
[INFO] write 7161 sequences to file: a/hairpin.part_002.fa
[INFO] write 7161 sequences to file: a/hairpin.part_003.fa
[INFO] write 7161 sequences to file: a/hairpin.part_004.fa
Did you intend to cut the input sequence into 4 segments? If so, you can use seqkit sliding
, but you have to compute the sliding window (-W
) and step size (-s
). Another method way is using kmcp utils split-genomes:
$ seqkit stats t.fa
file format type num_seqs sum_len min_len avg_len max_len
t.fa FASTA RNA 1 99 99 99 99
# overlap of adjacent segments can be set with --split-overlap
$ kmcp utils split-genomes -m 1 -k 1 -n 4 t.fa -O b --force
09:34:17.731 [INFO] splitting the reference genome ...
09:34:17.732 [INFO] genome size: 99, splitNumber: 4, splitOverlap: 0, splitSize: 25, step: 25, greedy: true
09:34:17.742 [INFO] seq length of the 4 chunks: 25, 25, 25, 24
$ seqkit stats b/*
processed files: 4 / 4 [======================================] ETA: 0s. done
file format type num_seqs sum_len min_len avg_len max_len
b/chunk001.fa.gz FASTA RNA 1 25 25 25 25
b/chunk002.fa.gz FASTA RNA 1 25 25 25 25
b/chunk003.fa.gz FASTA RNA 1 25 25 25 25
b/chunk004.fa.gz FASTA RNA 1 24 24 24 24
Hi Wei,
Thanks for the prompt reply. Yeah I only have 1 sequence (2.2Mbp) in the input file. I want to divide the fasta into multiple fasta files with the seq_length=0.5Mbp, i.e., I want to divide the reference genome into 5 fragments (0-0.5Mbp, 0.5Mbp-1Mbp, 1Mbp-1.5Mbp, 1.5Mbp-2.0Mbp, 2.0Mbp-end). How can I achieve it?
I tried kmcp utils split-genomes -m 1 -k 1 -n 5 ${input_fasta} -O ${output_path} --force
, but it will divide the sequence evenly.
I also tried seqkit sliding -s 5000000 -W 5000000 ${input_fasta} > ${output_fasta}
, but the output didn't contain the region 2.0Mbp~end.
(nanotest) [c-panz@sumner-log2 ~]$ seqkit stats ${input_fasta}
file format type num_seqs sum_len min_len avg_len max_len
custom.fasta FASTA DNA 1 2,161,954 2,161,954 2,161,954 2,161,954
output didn't contain the region 2.0Mbp~end
Please add the flag -g
:
-g, --greedy greedy mode, i.e., exporting last subsequences even shorter than windows size
Thanks for the hint! I can successfully achieve my goal with the command below:
seqkit sliding -s 5000000 -W 5000000 ${input_fasta} -g > ${output_fasta}
seqkit split2 --by-size 1 ${output_fasta} -O ${output_path}
One more question, for the example Generate GC content for ploting in seqkit sliding
, can you explain a ilttle bit more about the seqkit sliding -s 5 -W 30
? What is the 2nd column?
$ zcat hairpin.fa.gz \
| seqkit sliding -s 5 -W 30 \
| seqkit fx2tab -n -g
cel-let-7_sliding:1-30 50.00
cel-let-7_sliding:6-35 46.67
cel-let-7_sliding:11-40 43.33
cel-let-7_sliding:16-45 36.67
cel-let-7_sliding:21-50 33.33
cel-let-7_sliding:26-55 40.00
It's just the GC content (percentage of bases G and C)
-g, --gc print GC content
I assume (C+G)% = (number of C + number of G) in the sequence of the window size / window size (30 in the example)
?
Yes, I think so. Am I missing something?
$ zcat hairpin.fa.gz \
| seqkit sliding -s 5 -W 30 \
| seqkit seq --rna2dna \
| seqkit fx2tab -g -C G,C,A,T \
| head -n 10 \
| csvtk add-header -Ht -n 'name,seq,qual,GC-content,#G,#C,#A,#T' \
| csvtk pretty -t
name seq qual GC-content #G #C #A #T
----------------------- ------------------------------ ---- ---------- -- -- -- --
cel-let-7_sliding:1-30 TACACTGTGGATCCGGTGAGGTAGTAGGTT 50.00 11 4 6 9
cel-let-7_sliding:6-35 TGTGGATCCGGTGAGGTAGTAGGTTGTATA 46.67 12 2 6 10
cel-let-7_sliding:11-40 ATCCGGTGAGGTAGTAGGTTGTATAGTTTG 43.33 11 2 6 11
cel-let-7_sliding:16-45 GTGAGGTAGTAGGTTGTATAGTTTGGAATA 36.67 11 0 8 11
cel-let-7_sliding:21-50 GTAGTAGGTTGTATAGTTTGGAATATTACC 33.33 8 2 8 12
cel-let-7_sliding:26-55 AGGTTGTATAGTTTGGAATATTACCACCGG 40.00 8 4 8 10
cel-let-7_sliding:31-60 GTATAGTTTGGAATATTACCACCGGTGAAC 40.00 7 5 9 9
cel-let-7_sliding:36-65 GTTTGGAATATTACCACCGGTGAACTATGC 43.33 7 6 8 9
cel-let-7_sliding:41-70 GAATATTACCACCGGTGAACTATGCAATTT 36.67 5 6 10 9
cel-let-7_sliding:46-75 TTACCACCGGTGAACTATGCAATTTTCTAC 40.00 4 8 8 10
Thanks for the clarification and the code! Really helpful. Everything looks fine to me. I will close the issue then.
I am using
seqkit v2.4.0
and I tried to split my custom fasta by 4 parts with the commandseqkit split2 -p 4 -O ${output_path} ${input_fasta}
. However, the write function doesn't work. It only saved 1 single file instead of all parts.Would really appreciate it if any hints can be provided to solve the problem. Thanks!
Describe your issue