kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

Amalgkit 0.6.0.0 changelog #49

Closed by Hego-CCTB 2 years ago

Hego-CCTB commented 3 years ago

This is going to be a bigger update, affecting multiple currently open issues. So I'll post the changelog in here and refer to this from the other issues.

Hego-CCTB commented 3 years ago

Changelog Amalgkit ver. 0.6.0.0

## `amalgkit csca`

--curate_group \ 'root,flower,leaf' \


- Note: This was tested on a nine-species plant dataset retrieved, quantified, and curated by `amalgkit`. Further testing is still needed; the gene name format in particular can cause issues.
- Note: `dir_uncorrected_curate_group_mean`, `dir_curate_group_mean`, `dir_sra`, and `dir_tc` all point to the same directory if the input is unchanged `curate` output. These arguments therefore default to `inferred`: if there is a `curate/tables` folder in the `--out_dir` path, amalgkit will find those files on its own.
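
The directory inference described in the note above could look roughly like the following. This is a minimal sketch, not amalgkit's actual code: the function name `infer_curate_tables_dir` and the `explicit` parameter are hypothetical, and only the `inferred` default and the `curate/tables` folder under `--out_dir` come from the changelog.

```python
from pathlib import Path
from typing import Optional


def infer_curate_tables_dir(out_dir: str, explicit: str = "inferred") -> Optional[Path]:
    """Pick the directory to read curate tables from.

    Hypothetical helper: an explicitly given directory wins; otherwise
    look for <out_dir>/curate/tables and return None if it is absent.
    """
    if explicit != "inferred":
        # The user passed a directory, so no inference is needed.
        return Path(explicit)
    candidate = Path(out_dir) / "curate" / "tables"
    return candidate if candidate.is_dir() else None
```

Returning `None` instead of raising leaves it to the caller to decide whether a missing folder is an error.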

## `amalgkit curate`

- Now throws a warning when transforming with TPM 
- Now throws an error when `cstmm` output files are detected (parsed from path) in combination with TPM transformation
- Now includes the option `--one_outlier_per_iter yes|no`, which allows at most one sample per bioproject or per tissue to be removed in each iteration of the outlier removal
- `check_within_tissue_correlation()` now removes samples with a Pearson's r below 0.2 (currently hard-coded, but this can be made an optional input in the future)
- `--cleanup 0|1` is now `plot_intermediate yes|no`. "yes" calculates and prints SVA correction after every single iteration of outlier removal. This can drastically increase runtimes.
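
To illustrate how the 0.2 threshold and `--one_outlier_per_iter` might interact, here is a rough Python sketch. It is not a port of the R function `check_within_tissue_correlation()`. It assumes that each sample is compared against the mean profile of the other samples from the same tissue, and that "one per bioproject or tissue" means a flagged sample is skipped if its tissue or its bioproject already lost a sample in this iteration. The names `select_outliers`, `pearson_r`, and `MIN_PEARSON_R` are hypothetical.

```python
from math import sqrt

MIN_PEARSON_R = 0.2  # threshold quoted in the changelog


def pearson_r(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat constant profiles as uncorrelated
    return cov / (sx * sy)


def select_outliers(expression, metadata, one_outlier_per_iter=True):
    """Return the samples to drop in one iteration.

    expression: sample -> list of expression values (same gene order)
    metadata:   sample -> (tissue, bioproject)
    """
    scores = {}
    for sample, values in expression.items():
        tissue = metadata[sample][0]
        peers = [expression[s] for s in expression
                 if s != sample and metadata[s][0] == tissue]
        if not peers:
            continue  # nothing to compare against
        peer_mean = [sum(col) / len(col) for col in zip(*peers)]
        scores[sample] = pearson_r(values, peer_mean)

    # Worst-correlated samples first.
    flagged = sorted((s for s in scores if scores[s] < MIN_PEARSON_R),
                     key=scores.get)
    if not one_outlier_per_iter:
        return flagged

    chosen, seen_tissues, seen_projects = [], set(), set()
    for s in flagged:
        tissue, project = metadata[s]
        if tissue in seen_tissues or project in seen_projects:
            continue  # this tissue or bioproject already lost a sample
        chosen.append(s)
        seen_tissues.add(tissue)
        seen_projects.add(project)
    return chosen
```

Removing only the worst sample per group and then recomputing keeps one bad sample from distorting the group mean that its neighbours are judged against, at the cost of more iterations.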

## `amalgkit getfastq`

- Truncated the updated_metadata output files to only the columns essential for `curate`. This has two benefits: smaller file size (which very slightly improves `curate` performance) and, more importantly, the same number of columns across all individual files
- Obsoleted `--ascp` and all related options
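
The metadata truncation could be sketched as below. The changelog does not list the essential columns, so `ESSENTIAL_COLUMNS` is a made-up placeholder, and `truncate_metadata` is a hypothetical helper rather than anything in `getfastq`. The point is only that every output file ends up with the same header, with missing columns left empty.

```python
import csv
import io

# Hypothetical column set; the real list of "essential" columns is not
# given in the changelog.
ESSENTIAL_COLUMNS = ["run", "scientific_name", "curate_group", "bioproject"]


def truncate_metadata(tsv_text, columns=ESSENTIAL_COLUMNS):
    """Keep only `columns` from a tab-separated metadata table.

    Columns absent from the input are written as empty fields, so all
    truncated files share the same number of columns.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow({c: row.get(c) or "" for c in columns})
    return out.getvalue()
```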

## `amalgkit`

- Added `amalgkit csca` subparsers

This should go up later today. I'm still debugging and I have to merge with the other updates today.

kfuku52 commented 3 years ago

Is there an option like `--curate_group all` to include every curate_group in the metadata table?

Hego-CCTB commented 3 years ago

If `--curate_group` is left as none, it should parse out all unique values from the curate_group column and use those as input.
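
A minimal sketch of that fallback, assuming a tab-separated metadata table with a `curate_group` column and a comma-separated argument like `'root,flower,leaf'`. The function name `resolve_curate_groups` is hypothetical and this is not the code in the commit.

```python
import csv
import io


def resolve_curate_groups(arg_value, metadata_tsv):
    """Return the list of curate groups to process.

    arg_value:    the --curate_group string, or None if it was not given
    metadata_tsv: the metadata table as tab-separated text
    """
    if arg_value is not None:
        # Explicit list such as 'root,flower,leaf'.
        return [g.strip() for g in arg_value.split(",") if g.strip()]
    # Fallback: unique non-empty values in order of first appearance.
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    groups = []
    for row in reader:
        group = (row.get("curate_group") or "").strip()
        if group and group not in groups:
            groups.append(group)
    return groups
```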

kfuku52 commented 3 years ago

Sounds good!

Hego-CCTB commented 3 years ago

Update is now live. https://github.com/kfuku52/amalgkit/commit/cbd6852060319083283ca9f062a106709c97e63d

kfuku52 commented 3 years ago

The curate_group column is missing in the metadata table. Could you update amalgkit metadata?

Hego-CCTB commented 3 years ago

Ah, it seems the column doesn't survive the last metadata step. There are 3 metadata sheets as output. curate_group is in the second output, but not in the third.

I'll investigate that.

kfuku52 commented 3 years ago

It seems that curate_group isn't used at all in transcriptome_curation.r. Am I missing something?

Hego-CCTB commented 3 years ago

Yeah, you are right. I'm gonna need to replace any reference to tissue with curate_group.

Amalgkit ver. 0.6.2.3