sanger-tol / treeval

Pipelines for the production of Treeval data
https://pipelines.tol.sanger.ac.uk/treeval

[1.2.0 - Ancient Destiny] Chunk FASTA reads for better parallelization of Revio PacBio data #285

Open DLBPointon opened 3 months ago

DLBPointon commented 3 months ago

Description of feature

Revio data is huge, so the reads need to be split into n = (reads / 10 million) files. Map each file separately, then merge the output.
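As a rough sketch of the arithmetic above: the issue does not say how to round, so this example assumes rounding up (so no file exceeds 10 million reads) and a minimum of one file. The function and constant names are illustrative, not part of the pipeline.

```python
import math

# From the issue: aim for roughly 10 million reads per file.
READS_PER_CHUNK = 10_000_000


def chunk_count(total_reads: int, reads_per_chunk: int = READS_PER_CHUNK) -> int:
    """Return how many files are needed so that none holds more than reads_per_chunk reads.

    Assumption: round up, and always produce at least one file.
    """
    if total_reads < 0 or reads_per_chunk < 1:
        raise ValueError("total_reads must be >= 0 and reads_per_chunk >= 1")
    return max(1, math.ceil(total_reads / reads_per_chunk))
```

For example, `chunk_count(25_000_000)` gives 3 files under these assumptions.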

yumisims commented 3 months ago

if fasta size > 10G, then split the fasta.gz into N chunks, where N = round(size_of_fasta / 10) with the size in G:
pyfasta split -n N {sample}.fasta.gz
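A minimal sketch of that size rule, assuming "G" means 10^9 bytes and that the size is taken from the compressed file on disk (the comment does not specify either). The helper names are hypothetical. Note that Python's `round` rounds exact halves to the nearest even integer, so a 25 G file gives 2 chunks, not 3.

```python
import os

BYTES_PER_GB = 10**9  # assumption: "G" in the comment means decimal gigabytes


def chunks_for_size(size_bytes: int, threshold_gb: float = 10.0) -> int:
    """Return 1 if the file is at or below the threshold, else round(size / threshold)."""
    size_gb = size_bytes / BYTES_PER_GB
    if size_gb <= threshold_gb:
        return 1  # small enough: do not split
    return max(1, round(size_gb / threshold_gb))


def chunks_for_file(path: str, threshold_gb: float = 10.0) -> int:
    """Apply the same rule to a file on disk, using its size in bytes."""
    return chunks_for_size(os.path.getsize(path), threshold_gb)
```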

mcshane commented 3 months ago

@yumisims @DLBPointon. Maybe use https://nf-co.re/modules/seqkit_split2 ?

yumisims commented 3 months ago

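or just use awk (here N is the number of records per chunk and has to be passed in with -v, otherwise it is unset inside awk):
zcat {sample}.fasta.gz | awk -v N="$N" '/^>/{n++} { print > ("chunk_" int(n/N) ".fasta") }'
let's see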
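For comparison, the same record-counting idea can be sketched in Python. This is only an illustration, not the pipeline's implementation: the function name and output naming are made up, and it assumes a gzipped FASTA where every record starts with a `>` header line. Unlike the awk one-liner it keeps a single output file open at a time and writes gzipped chunks.

```python
import gzip
from pathlib import Path


def split_fasta_gz(path, out_dir, records_per_chunk):
    """Stream a gzipped FASTA and write chunk_<i>.fasta.gz files of records_per_chunk records each."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be >= 1")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    outputs = []   # paths of the chunks written, in order
    out = None     # currently open chunk, if any
    n_records = 0  # headers seen so far
    try:
        with gzip.open(path, "rt") as handle:
            for line in handle:
                if line.startswith(">"):
                    # Start a new chunk every records_per_chunk headers.
                    if n_records % records_per_chunk == 0:
                        if out is not None:
                            out.close()
                        index = n_records // records_per_chunk
                        chunk_path = out_dir / f"chunk_{index}.fasta.gz"
                        out = gzip.open(chunk_path, "wt")
                        outputs.append(chunk_path)
                    n_records += 1
                # Lines before the first header (if any) are skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return outputs
```

Because records are never split across files, concatenating the chunks in order reproduces the original sequence data.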

mcshane commented 3 months ago

seqkit split2 is multithreaded and will output gzipped chunks
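A hedged sketch of how such a call might be assembled from Python. The helper name, the default thread count, and the example paths are assumptions; the options used (`--by-part`, `--out-dir`, `--threads`) are taken from the seqkit split2 usage, and should be checked against the installed version. The function only builds the command; running it requires seqkit on the PATH.

```python
import shlex


def seqkit_split2_cmd(fasta: str, parts: int, out_dir: str, threads: int = 4) -> list:
    """Build the argument list for splitting one FASTA into `parts` files with seqkit split2."""
    if parts < 1 or threads < 1:
        raise ValueError("parts and threads must be >= 1")
    return [
        "seqkit", "split2",
        "--by-part", str(parts),    # number of output files
        "--out-dir", out_dir,       # where the chunks are written
        "--threads", str(threads),  # seqkit's global thread option
        fasta,
    ]


if __name__ == "__main__":
    # Hypothetical file names, printed rather than executed.
    print(shlex.join(seqkit_split2_cmd("sample.fasta.gz", 8, "chunks")))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)`.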