Closed vladimirsouza closed 6 months ago
Per discussion in lab meeting and on Slack here, I think we should update the Snakemake-integrated script that generates the various vst-matrix-Hugo_Symbol_tidy.tsv
files and exclude the genome_sample_id
and capture_sample_id
columns. These will continue to cause issues because there are biopsies with multiple sample IDs, so data duplication will be an issue.
To preserve backwards functionality, the join_with
argument can be preserved, but its effect should be to join the metadata table provided with these_samples_metadata
to the tidy expression table by biopsy_id
and patient_id
. If join_with
is set to genome_sample_id
then the function can add the genome_sample_id
column based on the provided metadata.
The function should also identify when a biopsy_id
is repeated for any given ensembl_gene_id
column. Users will have to determine which mrna_sample_id
to prioritize based on sample metadata that doesn't (and shouldn't) exist in the vst-matrix-Hugo_Symbol_tidy.tsv
table. The updated function could have an argument for a user to prioritize a protocol (Ribodepletion vs PolyA) and a preservation (any of the values in the ffpe_or_frozen column), which would ensure that any RNAseq data returned matches these specifications and to pre-empt data duplication per biopsy.
This issue was fixed by the Fix get_gene_expression #33 PR.
The error is in
get_gene_expression
.get_gene_cn_and_expression
usesget_gene_expression
internally.To reproduce the error:
The error happens because
grep -w -F -e Hugo_Symbol -e MYC /projects/nhl_meta_analysis_scratch/gambl/results_local/icgc_dart/DESeq2-0.0_salmon-1.0/mrna--gambl-icgc-all/vst-matrix-Hugo_Symbol_tidy.tsv
returns a table whereingenome_sample_id
column contains only NAs --- the file doesn't contain agenome_sample_id
column.capture_sample_id
column is also missing.For more details, see the discussion in this Slack thread.