stjude / XenoCP

A cloud-based tool for mouse read cleansing in xenograft samples
Apache License 2.0

Does this need a default? #1

Closed by adthrasher 5 years ago

adthrasher commented 5 years ago

Doesn't SplitSam have logic to determine the appropriate number of buckets automatically? Does the default need to be set here? https://github.com/adamdingliang/XenoCP/blob/32eab634de6d556d54a26403ff9c1607bb651278/cwl/xenocp.cwl#L27

adamdingliang commented 5 years ago

From the description and the code, it looks to me that in paired mode ("-m"), the number of buckets ("-b") is required unless "-l" is specified.

$ java.sh org.stjude.compbio.sam.SplitSam
Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/research/rgs01/scratch_lsf/java -XX:ParallelGCThreads=1
usage: java org.stjude.compbio.sam.SplitSam [OPTION]... PREFIX EXT

Splits a sam/bam file into smaller pieces. Output is named with prefix
PREFIX and extension EXT, with distinguishing parts in between. The
extension determines the output format. PREFIX may contain path.

You may split by reference name, read group, and/or by number of reads.

If your data is paired-end, and you would like to preserve sort order
while keeping mates in the same file, then you can use -m with -b (and
optionally -l). In this mode, all inputs are divided by read name into B
buckets. You can specify the number of buckets using -b, or compute by taking
chromosome length divided by the -l value. If yo use -l, you still need -b
for the no-ref case.

 -a                             suffix length for -n default 3
 --add-unique-rgid-to-records   add RGID to records where missing, if and only if there is exactly one RG in header
 -b                             number of buckets for -m splitting
 -c                             split by ref name (usually chromosome)
 -i                             input sam/bam file if not stdin
 -l                             seq len divisor to compute b on the fly
 -m                             keep mates together
 -M                             strips mate suffixes /1 and /2
 -n                             split every 'arg' records
 -p                             filter out non-primary records
 -r                             split by read group
 -V                             validation stringency: STRICT, LENIENT, or SILENT (default: SILENT)
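
To illustrate the "-m" with "-b" mode described in the usage text, here is a minimal Python sketch of dividing reads into B buckets by read name so that mates end up in the same file. This is not SplitSam's actual code: the function name `bucket_for_read`, the choice of CRC32 as the hash, and the suffix stripping are all assumptions made for illustration. The only property taken from the usage text is that the bucket depends on the read name alone.

```python
import zlib


def bucket_for_read(read_name: str, num_buckets: int) -> int:
    """Map a read name to a bucket index in [0, num_buckets).

    Hypothetical helper: SplitSam's real hashing scheme is not shown in
    this thread, so CRC32 is used here only as a stable stand-in.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be >= 1")
    # Drop a trailing /1 or /2 mate suffix so both mates hash identically
    # (similar in spirit to the -M option; an assumption in this sketch).
    if read_name.endswith(("/1", "/2")):
        read_name = read_name[:-2]
    return zlib.crc32(read_name.encode("utf-8")) % num_buckets
```

Because the bucket is computed from the read name only, both mates of a pair always get the same index, whatever their alignment positions.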

adthrasher commented 5 years ago

I guess the better question is, shouldn't this be determined based on input and not hard-coded?

adamdingliang commented 5 years ago

This runs on a single instance. The number of buckets is more relevant to the instance type. The default instance type is azure:mem3_ssd1_x16. So I set the default number of buckets to 15.

BTW, above is for the cloud. I'd like users to select the number of buckets based on the number of cores of their computing platform.
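
As a rough sketch of picking the bucket count from the core count, the snippet below leaves one core free, which matches the default of 15 on the 16-core azure:mem3_ssd1_x16 instance. The "cores minus one" rule and the function name `default_num_buckets` are assumptions inferred from that single data point, not something XenoCP is confirmed to implement.

```python
import os
from typing import Optional


def default_num_buckets(cores: Optional[int] = None) -> int:
    """Suggest a bucket count for a machine with the given number of cores.

    Hypothetical rule: use all cores but one, and never fewer than one bucket.
    """
    if cores is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        cores = os.cpu_count() or 1
    return max(1, cores - 1)
```

For example, `default_num_buckets(16)` gives 15, and the result could be passed to SplitSam as the "-b" value.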

adthrasher commented 5 years ago

No longer necessary post-refactor.