mskilab-org / dryclean

Irons out wrinkles in noisy coverage data using robust PCA

prepare_detergent failing when using all samples #13

Open pblaney opened 1 year ago

pblaney commented 1 year ago

Hello,

After collecting a test set of fragCounter coverage profiles for 4 normal samples, I attempted to run the dryclean workflow. I encountered the following error while trying the first step of creating the PoN in prepare_detergent:

pon_detergent <- prepare_detergent(normal.table.path = "/drycleanRun/test_ton.rds",
                                   use.all = TRUE,
                                   num.cores = 2,
                                   build = "hg38",
                                   path.to.save = "drycleanRun/",
                                   nochr = T,
                                   save.pon = T)

### OUTPUT ###
Starting the preparation of Panel of Normal samples a.k.a detergent
4 samples available
Using all samples
PAR file not provided, using hg38 default. If this is not the correct build, please provide a GRange object delineating for corresponding build
PAR read
Checking for existence of files
4 files present
  |=====================================================================================================================| 100%, Elapsed 07:21
Error in setattr(ans, "names", c(keep.names, paste0("V", seq_len(length(ans) -  : 
  'names' attribute [1] must be the same length as the vector [0]

While troubleshooting, I found that others have encountered the same error, but at a different stage of the workflow (#2). Based on the output messages, the error appears to occur within the pbmclapply call at line 259, although I am not sure exactly where.

I then tested prepare_detergent with the other sample-selection options instead of using all samples. Interestingly, both alternatives, choose.randomly = TRUE and choose.by.clustering = TRUE, ran without an error.

Here using choose.randomly = TRUE and selecting 2 of the 4 samples:

pon_detergent <- prepare_detergent(normal.table.path = "/drycleanRun/test_ton.rds",
                                   use.all = FALSE,
                                   choose.randomly = TRUE,
                                   number.of.samples = 2,
                                   choose.by.clustering = FALSE,
                                   num.cores = 2,
                                   build = "hg38",
                                   path.to.save = "drycleanRun/",
                                   nochr = T,
                                   save.pon = T)

### OUTPUT ###
Starting the preparation of Panel of Normal samples a.k.a detergent
4 samples available
Selecting 2 normal samples randomly
PAR file not provided, using hg38 default. If this is not the correct build, please provide a GRange object delineating for corresponding build
PAR read
Checking for existence of files
2 files present
  |============================================================================================================| 100%, Elapsed 03:28
Starting decomposition
This is version 2
Warning: Item 1 has 3031053 rows but longest item has 15155223; recycled with remainder.
Finished making the PON or detergent and saving it to the path provided

And here using choose.by.clustering = TRUE:

pon_detergent <- prepare_detergent(normal.table.path = "/drycleanRun/test_ton.rds",
                                   use.all = FALSE,
                                   choose.randomly = FALSE,
                                   number.of.samples = 2,
                                   choose.by.clustering = TRUE,
                                   num.cores = 2,
                                   build = "hg38",
                                   path.to.save = "drycleanRun/",
                                   nochr = T,
                                   save.pon = T)

### OUTPUT ###
Starting the preparation of Panel of Normal samples a.k.a detergent
4 samples available
Starting the clustering
Starting decomposition on a small section of genome
This is version 2
Starting clustering
PAR file not provided, using hg38 default. If this is not the correct build, please provide a GRange object delineating for corresponding build
PAR read
Checking for existence of files
2 files present
  |============================================================================================================| 100%, Elapsed 01:52
Starting decomposition
This is version 2
Warning: Item 1 has 3031053 rows but longest item has 15155223; recycled with remainder.
Finished making the PON or detergent and saving it to the path provided

The output detergent.rds is in working order, as I was able to run start_wash_cycle with it without any problems. I will likely use the clustering method for further analysis, but I wanted to report this issue for others who encounter it.

Best, Patrick

zining01 commented 1 year ago

Hi Patrick,

Thanks for letting us know about the error. I have not encountered this before on our samples. What happens if you set number.of.samples to the total number of available samples when choosing randomly?

Zi-Ning

pblaney commented 1 year ago

Hello Zi-Ning,

I finally had some time to test your suggestion. Unfortunately, using choose.randomly with number.of.samples set to the total number of samples leads to the same error as use.all.

Furthermore, choose.randomly works when I set the number of samples to 2 out of 4 but it fails when I use 3 out of 4. The same occurs with choose.by.clustering.
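To make the pattern across sample counts easier to report, the runs can be wrapped so that each one records either success or its error message. This is only a sketch: try_sizes is a hypothetical helper, not part of dryclean, and the commented-out call simply reuses the arguments from the runs above.

```r
# Hypothetical helper (not part of dryclean): call `run(n)` for each n in
# `sizes` and record "ok" or the error message instead of stopping.
try_sizes <- function(run, sizes) {
  res <- vapply(sizes, function(n) {
    tryCatch({
      run(n)
      "ok"
    }, error = function(e) conditionMessage(e))
  }, character(1))
  setNames(res, sizes)
}

# Usage with the call from this thread (requires dryclean and the input files):
# try_sizes(function(n) {
#   prepare_detergent(normal.table.path = "/drycleanRun/test_ton.rds",
#                     use.all = FALSE,
#                     choose.randomly = TRUE,
#                     number.of.samples = n,
#                     choose.by.clustering = FALSE,
#                     num.cores = 2,
#                     build = "hg38",
#                     path.to.save = "drycleanRun/",
#                     nochr = TRUE,
#                     save.pon = FALSE)
# }, 2:4)
```

Because choose.randomly picks different samples on each run, repeating this a few times may show whether the failure depends on the count or on a particular sample.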

I'll keep testing to see if I can find a pattern or provide more information for debugging in case others hit the same issue. I also plan to greatly increase the number of input samples, which may help resolve it.
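One thing that may be worth ruling out is whether all coverage profiles have the same number of bins, since the successful runs above print a "recycled with remainder" warning about items of different lengths. The sketch below is a guess at a useful diagnostic, not a confirmed cause: count_bins is a hypothetical helper, and the normal_cov column name is an assumption about the layout of the normal table.

```r
# Hypothetical helper (not part of dryclean): number of elements stored in
# each .rds file, or NA if the file is missing.
# For GRanges coverage, load GenomicRanges first so length() counts ranges.
count_bins <- function(paths) {
  vapply(paths, function(p) {
    if (!file.exists(p)) return(NA_integer_)
    length(readRDS(p))
  }, integer(1))
}

# Usage (column name `normal_cov` is assumed):
# library(GenomicRanges)
# normal_table <- readRDS("/drycleanRun/test_ton.rds")
# bins <- count_bins(normal_table$normal_cov)
# table(bins, useNA = "ifany")   # more than one value means mismatched inputs
```

If the counts differ between samples, that would at least identify which inputs to inspect first.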