are_disjoint_sets error when loading bcbio RNAseq run

hbc / bcbioRNASeq

R package for bcbio RNA-seq analysis.

https://bioinformatics.sph.harvard.edu/bcbioRNASeq

GNU Affero General Public License v3.0

are_disjoint_sets error when loading bcbio RNAseq run #116

Closed: mshadbolt closed this issue 6 years ago

mshadbolt commented 6 years ago

I would like to use your R package to read in a bcbio rnaseq run but it fails with the following error:

> bcb <- bcbioRNASeq(
+     uploadDir = "$PATH_TO_MY_RUN/final",
+     interestingGroups = "genotype",
+     level = "genes",
+     organism = "Homo sapiens"
+ )
Error in .returnSampleData(.) : 
  are_disjoint_sets : "sampleID" and colnames(data) have common elements: sampleID.

I don't really understand the error or how I could go about fixing it. My best guess could be that something is going wrong because I have a column named "sample_id" in my metadata csv, but don't understand enough to know what it is clashing with.

Thanks

roryk commented 6 years ago

Hi Marion,

Sorry about that-- you're right; sample_id/sampleID is reserved, we convert sample_id to sampleID if it exists. If you can rename that column to something else in your metadata, it should fix this.
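A quick way to spot the clash before loading is to compare the metadata column names against the reserved names. This is only a minimal base R sketch: the helper `find_reserved_columns` is hypothetical, and the reserved set contains just the two names mentioned above (the package may reserve others).

```r
# Hypothetical helper: report metadata columns that clash with the names
# mentioned in this thread as reserved ("sample_id" / "sampleID").
# The reserved set is an assumption, not taken from the package source.
find_reserved_columns <- function(metadata,
                                  reserved = c("sample_id", "sampleID")) {
    intersect(colnames(metadata), reserved)
}

# Toy metadata using column names reported in this issue.
metadata <- data.frame(
    description = c("patient1", "patient2"),
    genotype = c("FFPE", "FFPE"),
    sample_id = c("patient_1", "patient_1"),
    stringsAsFactors = FALSE
)

clashes <- find_reserved_columns(metadata)
print(clashes)

# Rename the clashing column to something unreserved, e.g. "patient_id".
colnames(metadata)[colnames(metadata) == "sample_id"] <- "patient_id"
print(find_reserved_columns(metadata))
```

After the rename the second check should come back empty, and the renamed table can be written out with `write.csv(metadata, path, row.names = FALSE)`.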

mshadbolt commented 6 years ago

Hi Rory, thanks for the reply. What format does the alternate metadata file need to be that is specified with the sampleMetadataFile argument?

I tried modifying the config csv that I used to configure my bcbio run so it no longer has a column named sample_id but I am still coming up with the same error. Can it be a csv or does it need to be in yaml format?

mjsteinbaugh commented 6 years ago

Hi @mshadbolt, when using the sampleMetadataFile argument in the bcbioRNASeq() load call, the file should be either CSV or Excel. It needs to contain a description column that matches the sample directory names in the bcbio run, which were specified with description in the bcbio YAML. R has some differences in what it considers valid names from Python, which is why we set sampleID internally to sanitize the column names of the count matrices. Alternatively, you can edit the bcbio project-summary.yaml file in the output, but that's a little trickier and I don't generally recommend it for fixing sample metadata.
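The description requirement can be checked up front. The sketch below is base R only and assumes the sample directories sit directly under the upload directory; it builds a throwaway directory and CSV so it runs anywhere, and all paths and sample names are made up for illustration.

```r
# Build a throwaway upload directory with two toy sample folders.
upload_dir <- file.path(tempfile("run_"), "final")
dir.create(file.path(upload_dir, "patient1"), recursive = TRUE)
dir.create(file.path(upload_dir, "patient2"), recursive = TRUE)

# Toy metadata CSV; "patient3" deliberately has no matching directory.
metadata_file <- tempfile(fileext = ".csv")
write.csv(
    data.frame(
        description = c("patient1", "patient3"),
        genotype = c("FFPE", "FFPE")
    ),
    metadata_file,
    row.names = FALSE
)

metadata <- read.csv(metadata_file, stringsAsFactors = FALSE)

# The metadata file must carry a description column.
stopifnot("description" %in% colnames(metadata))

# Names of the immediate subdirectories of the upload directory.
sample_dirs <- list.dirs(upload_dir, full.names = FALSE, recursive = FALSE)

# Descriptions without a matching sample directory.
missing_dirs <- setdiff(metadata$description, sample_dirs)
print(missing_dirs)
```

Anything listed in `missing_dirs` is a description value with no sample directory of the same name, which is worth fixing in the CSV before calling bcbioRNASeq().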

mshadbolt commented 6 years ago

Hi, thanks for the reply. I tried specifying the sample csv with the sampleMetadataFile argument but it is still giving me the same error.

colnames(original_config)
[1] "description" "genotype"    "sample_id"   "sex"         "pool"       
[6] "qc_status"   "batch"       "phenotype"

So I understand this is giving the error because of the sample_id column. I then edited the original config csv and saved it with the following column names:

colnames(renamed_config)
[1] "description" "genotype"    "sid"         "sex"         "pool"        "qc_status"   "batch"       "phenotype"

But when I specify the new config file I get the same error

> bcb <- bcbioRNASeq(
+         uploadDir = final_path,
+         sampleMetadataFile = path_to_renamed_config,
+         interestingGroups = "genotype",
+         caller = "salmon",
+         level = "genes",
+         organism = "Homo sapiens")
Error in .returnSampleData(.) : 
  are_disjoint_sets : "sampleID" and colnames(data) have common elements: sampleID.

So it must still be reading the original csv/YAML for some reason even though I specify a new metadata file?

mjsteinbaugh commented 6 years ago

Currently the bcbioRNASeq() generator function parses the YAML file and will set columns that have been updated from the sampleMetadataFile, but I wrote it to keep the original columns if they're only defined in the YAML. You're right that this behavior should be improved. In the meantime, you can edit the project summary YAML and remove the sample_id lines...that should fix the issue.
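To see why renaming the column in the CSV alone does not help, here is a toy base R illustration of the behavior described above. It is not the package's actual code; the data frames and the merge step are assumptions chosen only to mimic "columns defined only in the YAML are kept".

```r
# Metadata as parsed from the YAML (still contains sample_id).
yaml_meta <- data.frame(
    description = c("patient1", "patient2"),
    genotype = c("FFPE", "FFPE"),
    sample_id = c("patient_1", "patient_1"),
    stringsAsFactors = FALSE
)

# Metadata from the user-supplied file (column renamed to sid).
user_meta <- data.frame(
    description = c("patient1", "patient2"),
    genotype = c("FFPE", "FFPE"),
    sid = c("patient_1", "patient_1"),
    stringsAsFactors = FALSE
)

# Columns that exist only in the YAML are carried over unchanged.
yaml_only <- setdiff(colnames(yaml_meta), colnames(user_meta))
merged <- merge(
    user_meta,
    yaml_meta[, c("description", yaml_only), drop = FALSE],
    by = "description"
)
print(colnames(merged))
```

In this toy version sample_id survives the merge because the user file never mentions it, which matches the symptom of the same error appearing with the renamed CSV.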

mshadbolt commented 6 years ago

Hi, thanks for the reply

It is still not working for me though. This is what I tried:

In $RUN_NAME/config/$RUN_NAME.yaml I used find/replace to change all sample_id to patient_id. I then changed the column in the config metadata file ($RUN_NAME/config/$RUN_NAME.csv) from sample_id to patient_id.

> colnames(renamed_config)
[1] "description" "genotype"    "patient_id"  "sex"         "pool"        "qc_status"   "batch"      
[8] "phenotype"

snippet from first patient in updated yaml (I have censored this to remove real ids)

details:
- algorithm:
    aligner: star
  analysis: RNA-seq
  description: patient1
  files:
  - ~/patient1_1.fastq.gz
  - ~/patient1_2.fastq.gz
  genome_build: hg19
  metadata:
    batch: batch1
    genotype: FFPE
    phenotype: tumor
    pool: pool_1
    qc_status: Failed
    patient_id: patient_1
    sex: male
...

Or are you saying I have to not have that column at all in my data? (Less than ideal since I'm mainly interested in comparing 2 libraries with the same sample id...)

mshadbolt commented 6 years ago

I just figured out you were talking about $RUN_NAME/final/project-summary.yaml, which I didn't know existed before. I used find/replace to change sample_id to patient_id there and provided the path to the renamed_config, and it appears to be working.
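For anyone else hitting this, the find/replace on project-summary.yaml can be sketched in base R as below. The file here is a temporary stand-in with a toy fragment modelled on the YAML snippet earlier in the thread, not a real project summary, and the replacement is narrower than a plain find/replace: it only renames the key, not values that happen to contain the same text.

```r
# Stand-in for $RUN_NAME/final/project-summary.yaml (toy contents).
yaml_file <- tempfile(fileext = ".yaml")
writeLines(c(
    "  description: patient1",
    "  metadata:",
    "    genotype: FFPE",
    "    sample_id: patient_1",
    "    sex: male"
), yaml_file)

# Keep a backup before editing in place.
backup_file <- paste0(yaml_file, ".bak")
stopifnot(file.copy(yaml_file, backup_file))

lines <- readLines(yaml_file)

# Rename only the metadata key: anchor on the leading indentation and the
# trailing colon so that values are left untouched.
renamed <- sub("^([[:space:]]*)sample_id:", "\\1patient_id:", lines)
writeLines(renamed, yaml_file)

print(readLines(yaml_file))
```

With the key renamed in the real file, the load call is the same one shown earlier in the thread, with sampleMetadataFile pointing at the renamed CSV.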

mjsteinbaugh commented 6 years ago

That's correct about project-summary.yaml. I'll improve this handling in the code and add some notes to the bcbioRNASeq() documentation. Glad it's working!

Mike