adthrasher closed 5 years ago
From the description and the code, it looks to me like, in paired mode, the number of buckets ("-m") is required unless "-l" is specified.
$ java.sh org.stjude.compbio.sam.SplitSam
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/research/rgs01/scratch_lsf/java -XX:ParallelGCThreads=1
usage: java org.stjude.compbio.sam.SplitSam [OPTION]... PREFIX EXT
Splits a sam/bam file into smaller pieces. Output is
named with prefix PREFIX and extension EXT, with
distinguishing parts in between. The extension
determines the output format. PREFIX may contain path.
You may split by reference name, read group, and/or by
number of reads.
If your data is paired-end, and you would like to preserve
sort order while keeping mates in the same file, then you
can use -m with -b (and optionally -l). In this mode, all
inputs are divided by read name into B buckets. You can
specify the number of buckets using -b, or compute by
taking chromosome length divided by the -l value. If you use
-l, you still need -b for the no-ref case.
-a
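To make the paired-mode description in the usage text concrete, here is a rough sketch of how reads could be assigned to buckets by read name, and how a bucket count could be derived from a chromosome length and the -l value. This is not SplitSam's actual code: the function names, the CRC32 hash, and the round-up behavior are all assumptions for illustration only.

```python
import math
import zlib


def bucket_for_read(read_name: str, num_buckets: int) -> int:
    """Map a read name to a bucket index in [0, num_buckets).

    Mates of a pair share the same read name (QNAME), so they always
    land in the same bucket. CRC32 is an assumed, stable hash choice;
    the real tool may use something different.
    """
    return zlib.crc32(read_name.encode("utf-8")) % num_buckets


def buckets_from_length(chrom_length: int, bases_per_bucket: int) -> int:
    """Derive a bucket count from chromosome length and an -l style value.

    Rounding up and the minimum of one bucket are assumptions.
    """
    return max(1, math.ceil(chrom_length / bases_per_bucket))
```

With this sketch, both mates of a pair get the same bucket index because the index depends only on the shared read name, and the no-ref case would still need an explicit bucket count since there is no chromosome length to divide.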
I guess the better question is, shouldn't this be determined based on input and not hard-coded?
This runs on a single instance. The number of buckets is more relevant to the instance type. The default instance type is azure:mem3_ssd1_x16. So I set the default number of buckets to 15.
BTW, the above is for the cloud. I'd like users to choose the number of buckets based on the number of cores on their computing platform.
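One way to pick a default from the core count, rather than hard-coding it, might look like the sketch below. It assumes the 15 above comes from leaving one core free on a 16-core instance; that rule, the function name, and the `reserved` parameter are hypothetical and not taken from the XenoCP code.

```python
import os
from typing import Optional


def default_bucket_count(cores: Optional[int] = None, reserved: int = 1) -> int:
    """Suggest a bucket count from the number of available cores.

    `reserved` cores are left free for other work (an assumption), and
    the result is never less than one bucket.
    """
    if cores is None:
        # os.cpu_count() can return None on some platforms.
        cores = os.cpu_count() or 1
    return max(1, cores - reserved)
```

Under these assumptions, `default_bucket_count(16)` gives 15, which matches the default chosen for the 16-core cloud instance.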
No longer necessary post-refactor.
Doesn't SplitSam have logic to determine the appropriate number of buckets automatically? Does the default need to be set here? https://github.com/adamdingliang/XenoCP/blob/32eab634de6d556d54a26403ff9c1607bb651278/cwl/xenocp.cwl#L27