[1.2.0 - Ancient Destiny] Chunk FASTA reads for better parallelization of Revio PacBio data

sanger-tol / treeval

Pipelines for the production of Treeval data

https://pipelines.tol.sanger.ac.uk/treeval


Open DLBPointon opened 3 months ago

DLBPointon commented 3 months ago

The Revio data is huge, so the reads need to be split into n = (reads / 10 million) files. Map each file separately, then merge the output.
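A minimal sketch of the chunk-count arithmetic described above. The function names are hypothetical, and rounding up (so a partial final chunk still gets its own file) is an assumption; the comment only gives n = reads / 10 million.

```python
import gzip
import math

READS_PER_CHUNK = 10_000_000  # threshold taken from the comment above


def count_reads(fasta_path):
    """Count FASTA records by counting header lines (those starting with '>')."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    with opener(fasta_path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def n_chunks(n_reads, reads_per_chunk=READS_PER_CHUNK):
    """Number of files to split into.

    Rounds up (an assumption) and never returns less than 1, so small
    inputs are simply left as a single file.
    """
    return max(1, math.ceil(n_reads / reads_per_chunk))
```

With these definitions, 25 million reads would give 3 files, and anything up to 10 million reads stays in one file.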

yumisims commented 3 months ago

If the fasta size is > 10 GB, split the fasta.gz into N chunks, where N = round(size_of_fasta_in_GB / 10): pyfasta split -n N {sample}.fasta.gz
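The size-based rule above could be sketched as follows. The helper name is hypothetical, and treating 1 GB as 10^9 bytes is an assumption; the comment does not say whether the size is measured on the compressed or uncompressed file.

```python
import os

GB = 10**9  # assumption: decimal gigabytes


def chunks_for_size(size_bytes, threshold_gb=10):
    """Return N = round(size_in_GB / threshold) above the threshold, else 1.

    Note that Python's round() rounds exact halves to the nearest even
    integer, so a 25 GB file gives round(2.5) == 2.
    """
    size_gb = size_bytes / GB
    if size_gb <= threshold_gb:
        return 1  # small enough to map without splitting
    return max(1, round(size_gb / threshold_gb))


def chunks_for_file(fasta_path, threshold_gb=10):
    """Apply the same rule to the on-disk size of a file."""
    return chunks_for_size(os.path.getsize(fasta_path), threshold_gb)
```

The resulting N is what would be passed as `-n N` to the split command.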

mcshane commented 3 months ago

yumisims commented 3 months ago

or just zcat {sample}.fasta.gz | awk -v N=10000000 '/^>/{n++} { print > ("chunk_" int((n-1)/N) ".fasta") }' (here N is the number of reads per chunk, not the number of chunks). Let's see.
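For comparison, here is a rough Python equivalent of the streaming approach in the awk one-liner: read the gzipped FASTA once and start a new output file every `records_per_chunk` headers. The function name and the `chunk_<i>.fasta` naming are illustrative assumptions, and the output is left uncompressed, as in the awk version.

```python
import gzip
from pathlib import Path


def split_fasta(fasta_gz, out_dir, records_per_chunk):
    """Stream a gzipped FASTA into files of at most records_per_chunk records.

    Returns the list of chunk paths in the order they were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    out = None
    n = 0  # number of headers seen so far
    with gzip.open(fasta_gz, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                # Open a new chunk whenever the current one is full.
                if n % records_per_chunk == 0:
                    if out is not None:
                        out.close()
                    path = out_dir / f"chunk_{n // records_per_chunk}.fasta"
                    out = open(path, "w")
                    paths.append(path)
                n += 1
            # Lines before the first header (if any) are skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return paths
```

Because only one output handle is open at a time, this avoids hitting open-file limits when the number of chunks is large.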

mcshane commented 3 months ago

seqkit split2 is multithreaded and will output gzipped chunks
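A hedged sketch of how the seqkit route might be driven from Python. The wrapper functions are hypothetical and the 10 million reads-per-chunk default is borrowed from earlier in the thread. The `--by-size`, `--out-dir`, and `--threads` options are standard seqkit flags, but check them against your installed version.

```python
import subprocess


def seqkit_split2_cmd(fasta_gz, reads_per_chunk=10_000_000,
                      out_dir="chunks", threads=4):
    """Build the argument list for a `seqkit split2` run.

    --by-size gives the number of sequences per output file.
    """
    return [
        "seqkit", "split2",
        "--by-size", str(reads_per_chunk),
        "--out-dir", str(out_dir),
        "--threads", str(threads),
        str(fasta_gz),
    ]


def run_split(fasta_gz, **kwargs):
    """Run the split; raises CalledProcessError if seqkit exits non-zero."""
    subprocess.run(seqkit_split2_cmd(fasta_gz, **kwargs), check=True)
```

Building the command as a list rather than a shell string avoids quoting problems with unusual sample names.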