Standard single-cell DNA-seq data

raphael-group / chisel

CHISEL -- Copy-number Haplotype Inference in Single-cell by Evolutionary Links

BSD 3-Clause "New" or "Revised" License

37 stars 11 forks source link

Standard single-cell DNA-seq data #9

Closed tamaraprieto closed 4 years ago

tamaraprieto commented 4 years ago

Dear Simone, Congratulations for the tool, it is great being able to call allele-specific copy number in single cells and get the most of the data. At our lab, we have generated standard single-cell WGS data (not from 10X Genomics) at intermediate sequencing depths (>5X). I was wondering if you are planning to implement the functionality of working with the standard BAM format in CHISEL, or you recommend a bulk standard allele-specific copy number caller as Sequenza instead (our data is not really low coverage). Any advice/suggestions would be very much appreciated. Tamara

simozacca commented 4 years ago

Thank you for the interest in CHISEL!

Several collaborators and us have successfully applied CHISEL to several datasets where the sequencing reads of each cell are reported in a distinct BAM file, for example DOP-PCR datasets (I assume this is the "standard BAM format" you mentioned). The procedure that we applied to do this is to merge all the BAM files into a unique BAM file after appending a unique barcode to each cell. More specifically, this is a two-step procedure:

Appending barcode: You should append a unique barcode to each cell and you do this by:
- Generate a list of unique barcodes for all cells; for example in *nix systems you can do this for ${N} cells with the following command: for BARCODE in {A,C,G,T}{A,C,G,T}{A,C,G,T}{A,C,G,T}{A,C,G,T}{A,C,G,T}{A,C,G,T}{A,C,G,T}; do echo ${BARCODE}; done | shuf -n ${N}
- Append each barcode ${BARCODE} to all the sequencing reads in a cell-specific BAM file ${BAM}; you can do this with your favorite scripting tool, however a very easy option is to use addreplacerg command of SAMtools as follows samtools addreplacerg -r 'ID:CB:Z:${BARCODE}\tSM:CELL_BARCODE' ${BAM} -o ${BAM}.barcoded. Note that, until the new release of CHISEL will be available very soon, you always need to have CB:Z: as a prefix of every barcode.
Merging cells: You should merge all the barcoded BAM files ${BAM1}...${BAMN} above into a unique BAM file; this can be done very efficiently with the merge command of SAMtools: samtools merge merged.bam ${BAM1}...${BAMN}

For your specific data, the higher coverage per cell should allow you to use a smaller bin size, for example 1M bins; so you could try to change this with the flag -b 1Mb of CHISEL. However, this also depends on the total number of cells that you are analyzing; more information about this will be soon available in the final version of the manuscript. Please note that the new release of CHISEL will be available very soon, with improvements and complete guidelines for QC and recommendations. I will also soon add a CHISEL command to automatically perform this barcoding and merging procedure.

Lastly, I would strongly recommend that you use single-cell specific methods to call CNAs from single-cell sequencing data: calling CNAs from bulk tumor samples is a very different problem than calling CNAs from single cells since tumor purity and subclonal mixtures are important factors in bulk samples. Therefore, the assumptions and models of bulk-tumor methods do not hold for single cells and may lead to misleading results.

Please do not hesitate to let me know if you have any additional question or doubt, I will temporarily keep the issue open in case of further questions

tamaraprieto commented 4 years ago

This is so useful! I will try to implement it in the next days and let you know once I have it. Thank you very much for the prompt response and advice!

tamaraprieto commented 4 years ago

Hi again, I have been trying to follow your steps and I am uncertain about how to continue. samtools addreplacerg is not performing variable expansion without double quotes and I can easily change it but I am not sure how the RG should exactly look like

I have seen in one of the demos that the reads look like this (CB:Z is not in RG ID) but this info does not help me: A00228:193:HCKNNDMXX:2:2112:9444:31438 99 chr1 10033 0 77M1I6M = 10146 138 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAAACCCTAACCCAAACCCTAACCCAGACCCTA FFFFFFFFFFFFFFFFFF:FFFFFFFFFF,FFFF:FFFF:::,:::,,:,FF,,FF::,,,,:F,:,,FF,,,,,,,,:,,:,, NM:i:3 MD:Z:52T11T18 AS:i:67 XS:i:66 CR:Z:CATTCGCGTGTGTATC CY:Z:FFFFFFFFFFFFFFFF CB:Z:CATTCGCGTGTGTATC-1 BC:Z:AGCTGGAT QT:Z:FFFFFFFF GP:i:10032 MP:i:10170 RG:Z:breast_tissue_E_2k:MissingLibrary:1:HCKNNDMXX:2

Below there is one of my reads and I would be most grateful if you could tell me how should look like. If you're busy do not worry, I can try with different small examples. A00500:57:HGTY7DSXX:4:1272:7654:21605 99 21 48119822 2 67M81S = 48119790 27 GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGATAGGTTTAGGCTTAGCGAAAGGCAGAGTATAAAGCGCGTGGGATGTAGTGTGAGGGATAGGGTGAGGGATAGGGTGAGGGTTAGGGTGAGG 9;<<>:<<<<><>===?=>===?=?>>>@>?>>>@>?>>>@>?>>>A-.??A8@:.+??@.-..7;./@A?./'+,@../*.*--(,(+9-.&.&,,>-*:,>(:,,9,>:=-*,7--8,,*>--*,>--8:,*--8:-7?9>--*?= MC:Z:89S59M MD:Z:56T4G5 PG:Z:MarkDuplicates.W RG:Z:CLL04.BM.03-HGTY7DSXX_4_12 NM:i:2 MQ:i:2 AS:i:57 XS:i:57

Thanks in advance

simozacca commented 4 years ago

In the barcoded BAM files generated by 10X Genomics Cell Ranger DNA pipeline (as the one in your example), the barcode of each sequencing read is indicated by the tag CB:Z: (for example you can see the tag CB:Z:CATTCGCGTGTGTATC-1 for the first read). So you could simply add a similar tag to every read of each cell (using a unique barcode) using a script of your preference.

My suggestion was a trick to simplify this process (while waiting the new version of CHISEL). The trick is that the command samtools addreplacerg is a very efficient command to add an RG:Z: tag to every sequencing read in a BAM file. For example, the command:

samtools addreplacerg -r 'ID:EXAMPLE\tSM:CELL_BARCODE' ${BAM} -o ${BAM}.barcoded

will generate a new BAM file ${BAM}.barcoded by appending the tag RG:Z:EXAMPLE to every sequencing read in ${BAM}. Therefore, if you run the following command:

samtools addreplacerg -r 'ID:CB:Z:CATTCGCGTGTGTATC-1\tSM:CELL_BARCODE' ${BAM} -o ${BAM}.barcoded

then every read in the new BAM file will have the tag RG:Z:CB:Z:CATTCGCGTGTGTATC-1 at the end. Since CHISEL will ignore the prefix RG:Z: your BAM file can be thus run through the CHISEL pipeline. Note that this is just an example and you should pick a different barcode for each cell. Moreover, in your example your read should look like the following, assuming that the read is from a cell for which you chose the unique barcode AAAAAA:

A00500:57:HGTY7DSXX:4:1272:7654:21605 99 21 48119822 2 67M81S = 48119790 27 GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGATAGGTTTAGGCTTAGCGAAAGGCAGAGTATAAAGCGCGTGGGATGTAGTGTGAGGGATAGGGTGAGGGATAGGGTGAGGGTTAGGGTGAGG 9;<<>:<<<<><>===?=>===?=?>>>@>?>>>@>?>>>@>?>>>A-.??A8@:.+??@.-..7;./@A?./'+,@../*.*--(,(+9-.&.&,,>-*:,>(:,,9,>:=-*,7--8,,*>--*,>--8:,*--8:-7?9>--*?= MC:Z:89S59M MD:Z:56T4G5 PG:Z:MarkDuplicates.W RG:Z:CLL04.BM.03-HGTY7DSXX_4_12 NM:i:2 MQ:i:2 AS:i:57 XS:i:57   CB:Z:AAAAAA

or, equivalently using the trick above, it could look like:

A00500:57:HGTY7DSXX:4:1272:7654:21605 99 21 48119822 2 67M81S = 48119790 27 GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTGGGTTAGGGATAGGTTTAGGCTTAGCGAAAGGCAGAGTATAAAGCGCGTGGGATGTAGTGTGAGGGATAGGGTGAGGGATAGGGTGAGGGTTAGGGTGAGG 9;<<>:<<<<><>===?=>===?=?>>>@>?>>>@>?>>>@>?>>>A-.??A8@:.+??@.-..7;./@A?./'+,@../*.*--(,(+9-.&.&,,>-*:,>(:,,9,>:=-*,7--8,,*>--*,>--8:,*--8:-7?9>--*?= MC:Z:89S59M MD:Z:56T4G5 PG:Z:MarkDuplicates.W RG:Z:CLL04.BM.03-HGTY7DSXX_4_12 NM:i:2 MQ:i:2 AS:i:57 XS:i:57   RG:Z:CB:Z:AAAAAA

As I said, this is a trick to make the process faster, but you can use your favorite script to simply add CB:Z:${BARCODE} at the end of each read.

Is this explanation clearer?

tamaraprieto commented 4 years ago

Very clear, thank you!And just a quick question, do you recommend running Chisel independently for each chromosome (-c CHROMOSOMES parameter) for parallelizing the process?Or is there a global estimation of parameters or something similar?

simozacca commented 4 years ago

CHISEL is fully parallelized and will automatically split independent computations, for example across chromosome, bins, or cells according to the different steps. By default CHISEL uses all the CPUs on the machine but you can change this using the option -j,--jobs. In general, the higher the number of CPUs is the faster the process is.

Instead, the option -c should be use only if you want to limit the CHISEL's analysis to some chromosomes but if you want to analyze the whole genome than you should NOT use/change -c (by default all automosomes will be considered). This is because not all the computations of CHISEL are independent across chromosomes; there are some steps (e.g. the global clustering or the clone inference) which leverage the global information across all chromosomes and the more information you have the better is.

tamaraprieto commented 4 years ago

Hi again, I am getting an error when running CHISEL on my dataset (Log.txt). Demo ran without problems. Could be related to the use of a reference genome without 'chr' within the chromosome names?I am using hs37d5 but I don't know if it is supported. Thanks in advance.

simozacca commented 4 years ago

From the log, the issue seems that no cell has been found ("[2020-Jun-15 18:58:29]Number of selected cells: 0").

First, you should check whether this is due to a too high threshold on the minimum number of reads: the default value is of 100,000 sequencing reads per cell, which is the standard used for data similar to those generated by 10X Genomics. However, you should vary the value according to your setup, in particular what is the expected sequencing coverage per cell of your data? Also, is it WES or WGS? My suggestion is to try to set -m 0 to see if you can recover any cell; if this solves the issue than I would suggest to adjust that threshold according to the number of reds that you observed in you data per cell.

Second, if you still have an error when using -m 0, this could be an error in the barcoded BAM file. If this happens could you please share with me the first 30 lines of your BAM file by using the following command:

samtools view ${BAM} | head -n 30

tamaraprieto commented 4 years ago

Thank you!This is WGS with an average sequencing depth of 10.5X per cell. I am still getting the error with -m 1 (-m 0 says The minimum number of reads must be positive!) and should have millions of reads.

So as you say this can be an error in the BAM file. These are the first 30 lines HeadBam.txt

simozacca commented 4 years ago

It seems that the barcodes are properly present in the BAM file and CHISEL is able to extract those, so the error could be related to some other property of the BAM file.

I would be happy to check this if you could please send to me the first lines of the BAM file plus the header, which you can get with the following command:

samtools view -h ${BAM} | head -n 100

tamaraprieto commented 4 years ago

Thanks!Here it is HeaderBAM.txt By the way, my matched-normal BAM does not have any barcode, I was wondering if this could be problematic.

simozacca commented 4 years ago

I have tested your example by appending the reads from the first file that you sent to the header in the second file and I have found an error in the SAM file that did not allow to sort the BAM file.

Were you able to sort and index the merged BAM file (using samtools sort and index)? When I tried to sort the BAM file I received a "truncated file error" due to the presence of one sequencing read that had an error, i.e. the field PNEXT is missing (https://en.wikipedia.org/wiki/SAM_(file_format)):

A00500:57:HGTY7DSXX:3:2616:15944:7279   113     1       10000   0       65S26M4D14M4D43M        12      66451373        0CCTCACCCTATCAGAAACCTAACCCTCACCCTACCGCTAACCCTAACCACACCCTAACCCTAACCATAACCCTAACCCTAACCCTAA\
CCCTCTAACCCTAACCCTCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC   -??->>>-(:*?2-@-:-:7-9-->>,,---=,-6.->-,--:7-,-?,,>9-:>-,->:>@,.,,8A-:.@8<??:@)<?.?@)A?:?;@@?A?.;@?A?:>?+;)@>>>?>@>>>?>@>>>?=?===>=?===><><<<<:><<;9XA:Z:20,+62918662,41M2I10M1D4M5D25M1D16M50S,9;6,-147837,36S14M2D15M1D24M2I14M5D22M1D21M,13;X,+155260360,43M1I6M8D7M1I6M1D6M1D33M45S,13; MC:Z:6S91M51S   MD:Z:26^AACC14^AACC43   NM:i:8  AS:i:63 XS:i:63 \
RG:Z:CB:Z:CAATTTAG

After removing this read, I was able to sort and index the BAM file and CHISEL ran smoothly on your example.

My suggestion is therefore to try to sort the BAM file and remove the erroneous sequencing reads that do not allow to perform such operation. Sorting and indexing are required prerequisites for running CHISEL on the corresponding BAM file.

Please let me know if this will fix your issue. In case you were able to successfully sort and index the merged BAM file, I would be happy to schedule a video call to identify together your issue.

tamaraprieto commented 4 years ago

Thanks you very much!You were right, I had not sorted the bam files, sorry about that. Now that error is solved, but I got a different one. I attach the Log here in case you can take a look at it Log.txt

simozacca commented 4 years ago

The error is probably due to the presence of a chromosome which is not an autosome chr1, ..., chr22 in the file with phased SNPs; for example, do you have chrX or chrY or chrM in the file of phased SNPs? If you can create a copy of the phased SNP file excluding these chromosomes, everything should work properly.

I will integrate the automatic skipping of such chromosome in the next release in order to avoid the need of this preprocessing, and I am reopening this issue until this integration is completed.

tamaraprieto commented 4 years ago

It seems the error was indeed due to the presence of sex chromosomes within the phased SNPs and the next step ran properly. Thank you! However, I got a different error later. I paste a few lines from stderr below:

Progress: |████████████████████████████████████████| 100.0% Complete [[2020-Jun-29 03:31:27]Found clustering with obj: 0.146110889837 [Iterations: 7]] [2020-Jun-29 03:31:27]Objective value for 42: 0.139464303813 [2020-Jun-29 03:31:27]Computing for 41: [2020-Jun-29 03:31:27]Objective value for 41: 0.142646938193 [2020-Jun-29 03:31:27]Computing for 42: [2020-Jun-29 03:31:27]Objective value for 42: 0.139464303813 [2020-Jun-29 03:31:27]Estimating RDR and BAF of every cluster [2020-Jun-29 03:31:27]Selecting ploidies Progress: |████████████████████████████████████████| 100.0% Complete [[2020-Jun-29 03:31:31]Called cell TATAATAG] [2020-Jun-29 03:31:32]Number of cells for every ploidy' level: Cells with base ploidy 4: 23 [2020-Jun-29 03:31:32]Inferring copy numbers [2020-Jun-29 03:31:32]Phasing copy-number states along the genome Progress: |████████████████████████████████████████| 100.0% Complete [[2020-Jun-29 03:31:32]Phased chromosome 18] [2020-Jun-29 03:31:32]Writing results [2020-Jun-29 03:31:32]Cloning [2020-Jun-29 03:31:32]Parsing and checking arguments [2020-Jun-29 03:31:32]Arguments: minsize : 14 refinement : 0.0 seed : 25 maxdiff : 0.06 input : /mnt/lustre/scratch/home/uvi/be/posadalustre/CLL/CLL04/RESULTS//Chisel/Results/calls/calls.tsv linkage : single [2020-Jun-29 03:31:32]Reading input [2020-Jun-29 03:31:32]Clustering cells in clones [2020-Jun-29 03:31:32]Selecting clones [2020-Jun-29 03:31:32]Number of identified clones: 0 [2020-Jun-29 03:31:32]Refining clustering Traceback (most recent call last): File "/mnt/netapp1/posadalab/APPS/CommonCondaEnvironments/chisel/bin/../src/Cloner.py", line 158, in main() File "/mnt/netapp1/posadalab/APPS/CommonCondaEnvironments/chisel/bin/../src/Cloner.py", line 76, in main clones, clus = refining(cns, clus, clones, args['refinement']) File "/mnt/netapp1/posadalab/APPS/CommonCondaEnvironments/chisel/bin/../src/Cloner.py", line 143, in refining ref = {c : closest(c) for c in clus if c not in chosen.keys()} File "/mnt/netapp1/posadalab/APPS/CommonCondaEnvironments/chisel/bin/../src/Cloner.py", line 143, in ref = {c : closest(c) for c in clus if c not in chosen.keys()} File "/mnt/netapp1/posadalab/APPS/CommonCondaEnvironments/chisel/bin/../src/Cloner.py", line 142, in closest = (lambda c : min([(i, weight(i, c)) for i in clones], key=(lambda x : x[1]))) ValueError: min() arg is an empty sequence [2020-Jun-29 03:31:32]Plotting [2020-Jun-29 03:31:39]Parsing and checking arguments [2020-Jun-29 03:31:39]Arguments: format : png plotsize : (5.0, 1.5) sample : 20 xmin : None clonemap : /mnt/lustre/scratch/home/uvi/be/posadalustre/CLL/CLL04/RESULTS//Chisel/Results/clones/mapping.tsv nonoisy : False gridsize : (12.0, 6.0) ymax : None clussize : (5.0, 3.0) xmax : None ymin : None input : /mnt/lustre/scratch/home/uvi/be/posadalustre/CLL/CLL04/RESULTS//Chisel/Results/calls/calls.tsv [2020-Jun-29 03:31:39]Reading input [2020-Jun-29 03:31:39]Number of cells: 23 [2020-Jun-29 03:31:39]Number of bins: 293 [2020-Jun-29 03:31:39]Setting style [2020-Jun-29 03:31:39]Reading clonemap Traceback (most recent call last): File "/mnt/netapp1/posadalab/APPS/CommonCondaEnvironments/chisel/bin/../src/Plotter.py", line 513, in main() File "/mnt/netapp1/posadalab/APPS/CommonCondaEnvironments/chisel/bin/../src/Plotter.py", line 94, in main index, clones, selected = clonemap_to_index(args['clonemap'], cells) File "/mnt/netapp1/posadalab/APPS/CommonCondaEnvironments/chisel/bin/../src/Plotter.py", line 216, in clonemap_to_index mapc = [(clonemap[e], e) for e in cells] KeyError: 'AAAGACAT' 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22

simozacca commented 4 years ago

CHISEL completed the inference successfully, the error that you see is because no clone has been identified from the inferred results (we will correct the error with a warning message in the upcoming version).

This is expected because the default parameters of the method for clone inference (through clustering of cells) are defined for datasets with number of cells closer to those observed in 10X Genomics (>500 cells), instead your dataset only has 23 cells. Therefore, you will need to run the clone inference with different parameters: this can be done extremely efficiently and easily using the CHISEL command chisel-cloning.py, which will only re-run the clone inference. Please take a look of this guide. Specifically, you should try to decrease the minimum number of cells in a clone (i.e. -s) to a reasonable value for your case, e.g. -s 3, and then you should investigate the results by varying the maximum difference between cells in the same clone (i.e. -f) with values of -f between 0.06 and 0.3. You can analyse the corresponding results by looking at the plots.

simozacca commented 4 years ago

More details about generating a barcoded BAM file from multiple BAM files of singles cells are in https://github.com/raphael-group/chisel/issues/12#issuecomment-694323622