kraemer-lab / vneyard

Repository for GRAPEVNE modules
MIT License

(Sampler_Brito) Add sampler #35

Closed jsbrittain closed 3 months ago

jsbrittain commented 3 months ago

Add the Anderson Brito subsampler (https://github.com/andersonbrito/subsampler)

To test the module:

[Screenshot: 2024-07-25 at 10 50 13]

resolves #2


Improvements that can be made:

jsbrittain commented 3 months ago

Docstring currently reads:

Subsampler (Anderson Brito)

A pipeline for subsampling genomic data based on epidemiological time series data. This pipeline was developed by Anderson Brito and is available at: https://github.com/andersonbrito/subsampler/tree/master.

Ports:
  data: Input data (metadata and case data)
  config: Configuration files (keep, remove and filter files), default none if unconnected

Params:
  Metadata file (str): Name of the metadata file
  Case data file (str): Name of the case data file
  Subsampler:
  keep file (str): File designating sequences to keep. This should be a plain-text (.txt) file containing one ID (e.g. EPI_ISL_402125) per line. Leave blank for none.
  remove file (str): File designating sequences to remove. This should be a plain-text (.txt) file containing one ID (e.g. EPI_ISL_12345678) per line. Leave blank for none.
  filter file (str): File containing sequence filters. This should be a tab-delimited file (.tsv) with columns action, column and value. action can contain "exclude", the column should contain the name of the column to filter on, and the value column should contain the value to filter on. Leave blank for none.
  ID Column (str): Column name for the ID
  GEO Column (str): Column name for the geographic location
  Date column (str): Column name for the date
  Baseline (decimal):
  Reference genome size (int): Reference genome size
  Max missing (int): Maximum number of missing data
  Seed number (int): Seed
  Start date (str): Start date
  End date (str): End date
  Unit (str): Time unit
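For illustration, here is a minimal Python sketch of how the keep, remove and filter files described in the docstring could be generated. The file names, the helper functions `write_id_list` and `write_filter_file`, and the `country` / `ExampleCountry` filter values are hypothetical and not part of the module. Only the file layouts (one ID per line; tab-delimited with columns action, column, value) come from the docstring.

```python
import csv
import tempfile
from pathlib import Path


def write_id_list(path, ids):
    """Write one sequence ID per line (layout described for the keep/remove files)."""
    Path(path).write_text("".join(f"{seq_id}\n" for seq_id in ids))


def write_filter_file(path, rules):
    """Write a tab-delimited file with the header: action, column, value."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["action", "column", "value"])
        writer.writerows(rules)


# Hypothetical output location and file names, chosen for illustration only.
out_dir = Path(tempfile.mkdtemp())
write_id_list(out_dir / "keep.txt", ["EPI_ISL_402125"])
write_id_list(out_dir / "remove.txt", ["EPI_ISL_12345678"])
# "country" and "ExampleCountry" are placeholder values, not real metadata.
write_filter_file(out_dir / "filters.tsv", [("exclude", "country", "ExampleCountry")])
```

The resulting file names would then be entered in the keep file, remove file and filter file parameters, or left blank for none.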

jsbrittain commented 3 months ago

@joetsui1994 Could you take a look at the docstring in particular, to ensure that it provides sufficient information for the end user to run the module? Thanks!

joetsui1994 commented 3 months ago

@jsbrittain Looks great! I just tested it and it works.

I would suggest a few changes to the docstring:

1) Give either an example for the start/end date or the required format (seems to be YYYY-MM-DD).
2) List the options for the time unit (seems to be ['week', 'month', 'year', 'full']).
3) Change "Baseline" to "Baseline sampling proportion".
4) Change "Max" to "Max.".

Otherwise this looks good to me!
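To make the date and unit suggestions concrete, here is a rough Python sketch of the kind of check a user could run on their parameters before launching the module. It rests on several assumptions: the YYYY-MM-DD format and the four unit options are only what the module "seems to" accept, both dates are assumed to be provided, and the start-before-end check and the function names `parse_date` and `check_params` are hypothetical additions, not part of the module.

```python
import re
from datetime import date, datetime

# Assumption: options as observed in the review comment, not confirmed by the module.
ASSUMED_UNITS = ("week", "month", "year", "full")

# Strict zero-padded YYYY-MM-DD pattern (strptime alone would accept "2020-1-5").
_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_date(text):
    """Return a date for a YYYY-MM-DD string, raising ValueError otherwise."""
    if not _DATE_PATTERN.fullmatch(text):
        raise ValueError(f"expected YYYY-MM-DD, got {text!r}")
    return datetime.strptime(text, "%Y-%m-%d").date()


def check_params(start_date, end_date, unit):
    """Validate the start date, end date and time unit parameters."""
    start = parse_date(start_date)
    end = parse_date(end_date)
    # Assumption: the start date should not fall after the end date.
    if start > end:
        raise ValueError("start date is after end date")
    if unit not in ASSUMED_UNITS:
        raise ValueError(f"unit must be one of {ASSUMED_UNITS}, got {unit!r}")
    return start, end, unit
```

For example, `check_params("2020-01-01", "2020-12-31", "week")` passes under these assumptions, while `check_params("01/01/2020", "2020-12-31", "week")` raises a `ValueError`.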