markowetzlab / scDNAseq-workflow

calling absolute copy numbers on single cell DNA sequencing data - stagging repo for public release
4 stars 2 forks source link

running scAbsolute #7

Closed ljwharbers closed 11 months ago

ljwharbers commented 11 months ago

Hi,

I saw the biorxiv preprint and was interested in trying out scAbsolute to test out the cell cycle prediction / replication status. I've been trying to run it through the snakemake, however I run into some issues pulling the container from AWS. I'm not sure whether thats because of some configurations issues with my aws cli or because it is in a protected storage.

Since I don't need it working in a workflow and am currently only interested in the cell cycle prediction, I was trying to run the script run_scAbsolute.R, tweaking it slightly for my input bam files.

Now when it gets to the function scAbsolute(), it succesfully extract reads however then it crashes. See below the output from running it on 20 test bam files.

## run the core algorithm - using scAbsolute wrapper function
    scaledCN = scAbsolute(filePaths, method=method, globalModel=globalModel, binSize=binSize, genome=genome,
               minLength=minLength, batchSize=20, testStatistic=testStatistic, limitPloidy=limitPloidy,
               minPloidy=minPloidy, maxPloidy=maxPloidy, ploidyWindow=ploidyWindow, ploidyRegion=ploidyRegion, selectRegion=selectRegion,
               splitPerChromosome=splitPerChromosome, optimizeSegmentation=optimizeSegmentation,
               max_iterations=max_iterations, hmm_path=hmm_path, change_prob=change_prob, max_states=max_states,
               randomSeed=randomSeed, outputPath=RESULTPATH, outputSegmentation=FALSE, debug=TRUE)
    MS101_AACAGAGTGCA.dedup_q30 (1 of 20): extracting reads ... binning ...
    MS101_AACAGCCGCGA.dedup_q30 (2 of 20): extracting reads ... binning ...
    MS101_AACGACAGTGG.dedup_q30 (3 of 20): extracting reads ... binning ...
    MS101_AACTGCTGAAG.dedup_q30 (4 of 20): extracting reads ... binning ...
    MS101_AAGAATTCCTC.dedup_q30 (5 of 20): extracting reads ... binning ...
    MS101_AAGACATACGA.dedup_q30 (6 of 20): extracting reads ... binning ...
    MS101_AAGCAGCCGTT.dedup_q30 (7 of 20): extracting reads ... binning ...
    MS101_AAGCGGATTCG.dedup_q30 (8 of 20): extracting reads ... binning ...
    MS101_AAGCTCGCCAA.dedup_q30 (9 of 20): extracting reads ... binning ...
    MS101_AAGGCAGTCAT.dedup_q30 (10 of 20): extracting reads ... binning ...
    MS101_AAGGTACCTGG.dedup_q30 (11 of 20): extracting reads ... binning ...
    MS101_AAGTCTAGGTG.dedup_q30 (12 of 20): extracting reads ... binning ...
    MS101_AAGTCTGATAC.dedup_q30 (13 of 20): extracting reads ... binning ...
    MS101_AATACCGACCG.dedup_q30 (14 of 20): extracting reads ... binning ...
    MS101_AATCGCTCCGC.dedup_q30 (15 of 20): extracting reads ... binning ...
    MS101_AATGAGGTGGC.dedup_q30 (16 of 20): extracting reads ... binning ...
    MS101_AATGCGCGCAC.dedup_q30 (17 of 20): extracting reads ... binning ...
    MS101_AATGTAAGGCG.dedup_q30 (18 of 20): extracting reads ... binning ...
    MS101_AATGTCTCACA.dedup_q30 (19 of 20): extracting reads ... binning ...
    MS101_AATTAAGCCTG.dedup_q30 (20 of 20): extracting reads ... binning ...
                                                                                                                                                                                     Note: Residual filter missing for chromosomes: X, Y
[1] "AA readData runtime 1.15203013817469"
Error in dyn.load(file.path(BASEDIR, "data/changepoint/PELT.so")) : 
  unable to load shared object '/home/luukharbers/scAbsolute/data/changepoint/PELT.so':
  /home/luukharbers/scAbsolute/data/changepoint/PELT.so: cannot open shared object file: No such file or directory

I am not sure where I am able to find the PELT.so file or if I will be running into more missing files without the docker container.

Thanks, Luuk

leachim commented 11 months ago

Apologies. I just updated the image, we are now hosting it on a public repository. You should be able to run it with that image (v3.5.1, see config file). Please let me know if there are any other issues, we just updated the workflow to support a mouse genome, and maybe there have been some regressions. @shafighi can you try to rerun the latest master just to make sure it's still working. I had to merge quite a lot of changes from you and Austin, so we need to make sure it's all still working okay.

leachim commented 11 months ago

Also, I highly recommend to only run this with the container. Unfortunately, we have quite a few dependencies (C, Python, R), so it's much less hassle if you just run the snakemake workflow and work with the QDNAseq object that it produces.

ljwharbers commented 11 months ago

Thanks for the quick reply!

The updated image worked nicely indeed. Managed to run it on a few test bam files from start to finish, thanks! Looking through the output data and reading through the manuscript methods. You define S-phase cells based on a threshold (e.g. 1.5 sd) from the median. Is this based off of cellcycle.cmi_yrT or a different parameter?

Many thanks again, Luuk

PS: Think it's good to keep in mind that not everyone will have non-deduplicated bam files as input, Since my bamfiles are already deduplicated and I don't want to re-preprocess my files, I had to change some scripts in scAbsolute and run that part out of the containers.

leachim commented 11 months ago

Hi Luuk, the method to read through is predict_replicating in scAbsolute/R/core.R, it essentially does the analysis. https://github.com/markowetzlab/scAbsolute/blob/main/R/core.R

As to your question, the variable used mainly for cellcycle detection is cellcycle.kendall.repTime.weighted.median.cor.corrected, but we also use outliers (based of boxplot whiskers or iqr) with the cellcycle.cmi_yrT variable. But the main variable, that we refer to as cycling activity is cellcycle.kendall.repTime.weighted.median.cor.corrected. If you want to check your data in detail, you can plot both of them, they are somewhat correlated, but you will get better performance if you use both.

You can use the predict_replicating to predict which cells are S-phase, but it's worth plotting the distribution similar to the paper to see if the cutoff is sensible.

Regarding your message about deduplicated bam files, I am not sure I understand. Is there an issue if you run it with de-duplicated reads? Ideally, it should run with both, but I don't see how running deduplication on already deduplicated files would make a difference.

ljwharbers commented 11 months ago

Great, thanks so much for the info!

Regarding the deduplicated reads, you're right. If your reads are already deduplicated it shouldn't really matter. However, some methods (like ours) work with UMIs to deduplicate and deduplication with MarkDuplicates would (for technical reasons) remove almost all of the reads. Of course, I understand that this is only the case for a minority of people.