Closed mshadbolt closed 6 years ago
Hi Marion,
Sorry about that. You're right: sample_id/sampleID is reserved, and we convert sample_id to sampleID if it exists. If you rename that column to something else in your metadata, it should fix this.
Hi Rory, thanks for the reply.
What format does the alternate metadata file specified with the sampleMetadataFile argument need to be in?
I tried modifying the config CSV that I used to configure my bcbio run so that it no longer has a column named sample_id, but I am still getting the same error. Can it be a CSV, or does it need to be in YAML format?
Hi @mshadbolt, when using the sampleMetadataFile argument in the bcbioRNASeq() load call, the file should be either CSV or Excel. It needs to contain a description column that matches the sample directory names in the bcbio run, which were specified with description in the bcbio YAML. R and Python differ in what they consider valid names, which is why we set sampleID internally to sanitize the column names of the count matrices. Alternatively, you can edit the bcbio project-summary.yaml file in the output, but that's a little trickier and I don't generally recommend it for fixing sample metadata.
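For anyone hitting the same error, the two requirements described above (a description column, and no column using a reserved name) can be checked before calling bcbioRNASeq(). The following is only a minimal sketch in base R: check_metadata_columns() is a hypothetical helper, not part of bcbioRNASeq, and the reserved names are taken solely from this thread, so the package may reserve others.

```r
# Hypothetical helper (not part of bcbioRNASeq): report whether a metadata
# table has a "description" column and whether any column uses a name that
# this thread identifies as reserved.
check_metadata_columns <- function(metadata,
                                   reserved = c("sample_id", "sampleID")) {
  cols <- colnames(metadata)
  list(
    has_description = "description" %in% cols,
    reserved_hits = intersect(cols, reserved)
  )
}

# Example using a subset of the column names reported in this thread.
# In practice the table would come from read.csv() on the metadata file.
original_config <- data.frame(
  description = "patient1",
  genotype = "FFPE",
  sample_id = "patient_1",
  stringsAsFactors = FALSE
)

result <- check_metadata_columns(original_config)
print(result$has_description)
print(result$reserved_hits)
```

A non-empty reserved_hits means the column should be renamed in the metadata file. As the rest of this thread shows, the same key may also need to be renamed in the YAML that bcbioRNASeq() parses.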
Hi, thanks for the reply. I tried specifying the sample CSV with the sampleMetadataFile argument, but it is still giving me the same error.
colnames(original_config)
[1] "description" "genotype" "sample_id" "sex" "pool"
[6] "qc_status" "batch" "phenotype"
So I understand this is giving the error because of the sample_id column.
I then edited the original config CSV and saved it with the following column names:
colnames(renamed_config)
[1] "description" "genotype" "sid" "sex" "pool" "qc_status" "batch" "phenotype"
But when I specify the new config file, I get the same error:
> bcb <- bcbioRNASeq(
+ uploadDir = final_path,
+ sampleMetadataFile = path_to_renamed_config,
+ interestingGroups = "genotype",
+ caller = "salmon",
+ level = "genes",
+ organism = "Homo sapiens")
Error in .returnSampleData(.) :
are_disjoint_sets : "sampleID" and colnames(data) have common elements: sampleID.
So it must still be reading the original csv/YAML for some reason even though I specify a new metadata file?
Currently the bcbioRNASeq() generator function parses the YAML file and will set columns that have been updated from the sampleMetadataFile, but I wrote it to keep the original columns if they're only defined in the YAML. You're right that this behavior should be improved. In the meantime, you can edit the project summary YAML and remove the sample_id lines. That should fix the issue.
Hi, thanks for the reply
It is still not working for me though. This is what I tried:
In $RUN_NAME/config/$RUN_NAME.yaml, I used find/replace to change all sample_id to patient_id. I then changed the column in the config metadata file ($RUN_NAME/config/$RUN_NAME.csv) from sample_id to patient_id:
> colnames(renamed_config)
[1] "description" "genotype" "patient_id" "sex" "pool" "qc_status" "batch"
[8] "phenotype"
Snippet from the first patient in the updated YAML (I have censored this to remove real IDs):
details:
- algorithm:
    aligner: star
  analysis: RNA-seq
  description: patient1
  files:
  - ~/patient1_1.fastq.gz
  - ~/patient1_2.fastq.gz
  genome_build: hg19
  metadata:
    batch: batch1
    genotype: FFPE
    phenotype: tumor
    pool: pool_1
    qc_status: Failed
    patient_id: patient_1
    sex: male
...
Or are you saying I have to not have that column at all in my data? (Less than ideal since I'm mainly interested in comparing 2 libraries with the same sample id...)
I just figured out you were talking about $RUN_NAME/final/project-summary.yaml, which I didn't really know existed before. I used find/replace to change sample_id to patient_id there and provided the path to the renamed_config, and it appears to be working.
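The find/replace step that resolved this can also be scripted. Below is a rough sketch in base R, assuming the key appears in the YAML as an ordinary "sample_id: value" mapping entry, which is inferred from the config snippet earlier in this thread rather than checked against every bcbio output. rename_yaml_key() is a hypothetical helper, not a bcbio or bcbioRNASeq function. It treats the file as plain text and does not parse the YAML.

```r
# Hypothetical helper: rename a mapping key in a YAML file by plain text
# substitution, keeping a ".bak" copy of the original next to it.
# Assumes `old` contains no regular-expression metacharacters.
rename_yaml_key <- function(path, old = "sample_id", new = "patient_id") {
  lines <- readLines(path)
  file.copy(path, paste0(path, ".bak"), overwrite = TRUE)
  # Only touch lines where the key starts a mapping entry, e.g.
  # "    sample_id: patient_1"; values containing the text are left alone.
  pattern <- paste0("^([[:space:]]*)", old, ":")
  updated <- sub(pattern, paste0("\\1", new, ":"), lines)
  writeLines(updated, path)
  # Return the number of changed lines (invisibly).
  invisible(sum(updated != lines))
}

# Demonstration on a temporary file rather than a real project-summary.yaml.
tmp <- tempfile(fileext = ".yaml")
writeLines(c("metadata:", "  batch: batch1", "  sample_id: patient_1"), tmp)
n_changed <- rename_yaml_key(tmp)
print(n_changed)
print(readLines(tmp))
```

Unlike a blanket find/replace, this only rewrites the key itself, so a value that happens to contain the text sample_id is left untouched. On a real run the path would be $RUN_NAME/final/project-summary.yaml.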
That's correct about project-summary.yaml. I'll improve this handling in the code and add some notes to the bcbioRNASeq() documentation. Glad it's working!
Mike
Hi
I would like to use your R package to read in a bcbio RNA-seq run, but it fails with the following error:
I don't really understand the error or how I could go about fixing it. My best guess is that something is going wrong because I have a column named "sample_id" in my metadata CSV, but I don't understand enough to know what it is clashing with.
Thanks