Clarify use of `--samples` and `--partition` in WGSCoveragePlotter

lindenb / jvarkit

Java utilities for Bioinformatics

https://jvarkit.readthedocs.io/

Clarify use of `--samples` and `--partition` in WGSCoveragePlotter #177

Closed pvanheus closed 3 years ago

pvanheus commented 3 years ago

Subject of the issue

I am writing a Galaxy wrapper for jvarkit WGSCoveragePlotter, but I am confused about how to use --samples and --partition. Do you perhaps have some sample data for which these flags are appropriate? My main use case is pathogen genomics, where I typically have a single sample per BAM, so I am holding back on supporting these flags for now, but I would like to include them in the future.

Your environment

version of jvarkit: commit id: 1f97a3401f679ffc187281bcf2eaac9399254ed9
version of java: java 11
the value of ${JAVA_HOME} (not used)
which OS: Ubuntu Linux 20.04

lindenb commented 3 years ago

--partition answers "what is a sample?": should we use the SM field of the RG header (the default), or another RG field such as LB (library), etc.? --samples limits the result above when you have more than one result (e.g. multiple LB values but you only want to display one).

wm75 commented 3 years ago

So if --samples is a comma-separated list of identifiers (e.g. several libraries or several sample names), multiple figures will be generated within a single plot file?

lindenb commented 3 years ago

> multiple figures will be generated within a single plot file?

no, it's just for filtering, only reads with matching --samples will be used.
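The filtering behaviour described above can be sketched as follows. This is a minimal illustration, not jvarkit code: the function name `filter_read_groups`, the dictionary representation of `@RG` records, and the choice to treat an empty list as "no filtering" are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


def filter_read_groups(
    read_groups: Iterable[Dict[str, str]],
    tag: str,
    wanted: Iterable[str],
) -> List[Dict[str, str]]:
    """Keep only read groups whose `tag` value is listed in `wanted`.

    Hypothetical helper. An empty `wanted` list is assumed to mean
    "no filtering", which may differ from what the real tool does.
    """
    wanted_set = set(wanted)
    if not wanted_set:
        return list(read_groups)
    return [rg for rg in read_groups if rg.get(tag) in wanted_set]


# Two libraries for the same sample; partition on LB and keep only lib1.
read_groups = [
    {"ID": "rg1", "SM": "patientA", "LB": "lib1"},
    {"ID": "rg2", "SM": "patientA", "LB": "lib2"},
]
kept = filter_read_groups(read_groups, "LB", ["lib1"])
print([rg["ID"] for rg in kept])
```

Only reads belonging to the kept read groups would then contribute to the plot; nothing here produces a separate figure per identifier.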

lindenb commented 3 years ago

but I think you can ignore both options for a simple wrapper.

wm75 commented 3 years ago

Agreed, users could also prefilter their data to the read groups they want with e.g. samtools view. Thanks for the clarification!
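One way to do such a prefiltering step is sketched below. It only builds the command line; the file names and the read group ID `rg1` are placeholders, and the helper `samtools_prefilter_command` is hypothetical. It relies on the `-r` option of `samtools view`, which restricts output to a single read group, together with `-b` (BAM output) and `-o` (output file).

```python
import shlex
from typing import List


def samtools_prefilter_command(in_bam: str, out_bam: str, read_group_id: str) -> List[str]:
    """Build a `samtools view` call that keeps one read group.

    Hypothetical helper: paths and the read group ID are supplied by the caller.
    """
    return ["samtools", "view", "-b", "-r", read_group_id, "-o", out_bam, in_bam]


cmd = samtools_prefilter_command("input.bam", "rg1_only.bam", "rg1")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where samtools is installed; the filtered BAM would then be given to the plotter without --samples or --partition.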

pvanheus commented 3 years ago

Thanks @lindenb - so in terms of partition options:

samples means match on SM, library means match on LB, platform means match on PL

What do sample_by_platform, sample_by_platform_by_center, any and readgroup mean?

lindenb commented 3 years ago

same logic as GATK: https://gatk.broadinstitute.org/hc/en-us/articles/360051307491-DepthOfCoverage-BETA-#--partition-type

sample_by_platform: RG/SM + RG/PL
sample_by_platform_by_center: RG/SM + RG/PL + RG/CN
readgroup: RG/ID
any: everything
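To make the mapping concrete, here is a small sketch of how a grouping key could be derived from one `@RG` record for each partition type. It is not taken from jvarkit or GATK: the table `PARTITION_TAGS`, the function `partition_key`, the `_` separator, the `"unknown"` fallback, and the `"all"` label for `any` are assumptions for illustration, and the option name `sample` is spelled in the singular here, which should be checked against the tool's help.

```python
from typing import Dict, Tuple

# Partition type -> RG tags combined into the grouping key (per this thread).
PARTITION_TAGS: Dict[str, Tuple[str, ...]] = {
    "sample": ("SM",),
    "library": ("LB",),
    "platform": ("PL",),
    "sample_by_platform": ("SM", "PL"),
    "sample_by_platform_by_center": ("SM", "PL", "CN"),
    "readgroup": ("ID",),
    "any": (),  # everything ends up in a single group
}


def partition_key(read_group: Dict[str, str], partition: str) -> str:
    """Return the grouping key of one read group (hypothetical helper)."""
    tags = PARTITION_TAGS[partition]
    if not tags:
        return "all"  # assumed placeholder label for the `any` partition
    # Missing tags fall back to "unknown" (an assumption, not tool behaviour).
    return "_".join(read_group.get(tag, "unknown") for tag in tags)


rg = {"ID": "rg1", "SM": "patientA", "LB": "lib1", "PL": "ILLUMINA", "CN": "centerX"}
for name in PARTITION_TAGS:
    print(name, "->", partition_key(rg, name))
```

With a single read group per BAM, as in the pathogen use case above, every partition type yields exactly one group, which is why the options can be left out of a simple wrapper.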

pvanheus commented 3 years ago

I'll include that link in the documentation. I must admit I am having trouble getting the any option to work (with this data) but I'll leave it in there for now.