vallotlab / scChIPseq_DataEngineering

This pipeline creates single-cell count matrices from paired-end raw FASTQ files generated by single-cell ChIP-seq experiments.

Low no. of reads when re-analyzing Grosselin et al. 2019 data #1

Closed vivekbhr closed 2 years ago

vivekbhr commented 2 years ago

Hello

I have processed the data from Grosselin et al. 2019 from here through this pipeline, but I seem to recover far fewer reads than I'd expect from the paper (887 barcodes detected in total, with a mean count of ~5 and a max count of ~200). I mapped the data to the human and mouse genomes separately using STAR indexes.

What I see is that while there are enough uniquely mapping reads, the number of barcoded reads that I keep is extremely low, even though the number of barcodes detected by the bowtie mapping seems reasonable (I attached the numbers per sample).

Could you help me figure out what the issue could be?

uniq_mapped.txt

barcode_matches.txt

flagged_bam_reads.txt

Pacomito commented 2 years ago

Hello Vivek, yes indeed, these numbers seem very low; we retrieved more reads and cells per sample.

Which CONFIG file did you use?

For the samples from Grosselin et al., the CONFIG_Hifibio_shortIndex should be used.

What is important is that the barcode conformation and sequences are not the same as in the 'LBC' configuration. If you used the 'LBC' config file, then it makes sense that very few barcodes were matched.

Here are the important parameters:

BARCODE_LENGTH = 68
BARCODE_LINKER_LENGTH = 92

and

BARCODE_BOWTIE_IDX_PATH

This path should be modified in the config file to point to the index on your computer, which you get by downloading Barcodes_HiFiBio.zip.
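A quick way to sanity-check a config before launching a run is sketched below. This is only an illustration, not part of the pipeline: it assumes the CONFIG file consists of plain `KEY = VALUE` lines with optional `#` comments, and the helper names `parse_config` and `check_hifibio_config` are hypothetical.

```python
# Expected values for the Grosselin et al. samples, as given in this thread.
EXPECTED = {"BARCODE_LENGTH": "68", "BARCODE_LINKER_LENGTH": "92"}


def parse_config(text):
    """Parse 'KEY = VALUE' lines, ignoring blank lines and '#' comments.

    Assumption: the CONFIG file uses this simple layout.
    """
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


def check_hifibio_config(params):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for key, expected in EXPECTED.items():
        found = params.get(key)
        if found != expected:
            problems.append(f"{key} is {found!r}, expected {expected!r}")
    if not params.get("BARCODE_BOWTIE_IDX_PATH"):
        problems.append("BARCODE_BOWTIE_IDX_PATH is not set")
    return problems
```

Calling `check_hifibio_config(parse_config(open(path).read()))` on your own config file (with `path` pointing to it) returns an empty list when the three parameters look as described above.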

Best regards, Pacome

vivekbhr commented 2 years ago

Hi Pacome

Thanks for the reply. I actually modified the CONFIG_template from the devel branch, as I saw that it was recommended in an earlier thread. I see that the barcode and linker lengths there are different, so that could indeed be the issue here. I am going to try with the new config and get back to you. Would you suggest using the workflow from the master branch instead?

Pacomito commented 2 years ago

Okay, sorry about that, I thought you were using the master branch. No, you did the right thing: I recommend using the 'devel' branch, which is more up to date.

Yes, you have to make sure that BARCODE_LENGTH = 68 and BARCODE_LINKER_LENGTH = 92, and that BARCODE_BOWTIE_IDX_PATH points to the HiFiBio bowtie barcode indexes in 'scChIPseq_DataEngineering/tree/devel/Barcodes/Barcodes_HiFiBio/index_barcode/bowtie_2_index_short/'.

Please let me know if it works out for you.

vivekbhr commented 2 years ago

Hi. So it seems I now get counts in the range that's reported in the paper. How should I combine the counts from the human and mouse mappings? For the Jurkat-Ramos sample it's all human, but for the others (HBCx-22/95/tamR/capR), should I take the union of barcodes from the two mappings and then apply a count cutoff?

Edit: I only found 60 conflicting barcodes between mouse and human with a unique count cutoff of >= 1600, so I removed them. Thanks for your help!
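For anyone wanting to do the same, the filtering can be sketched roughly as follows. This is a minimal illustration under assumptions rather than the exact procedure used here: it takes per-barcode unique read counts from each mapping as plain dictionaries, treats a barcode as conflicting when it passes the cutoff in both the human and the mouse mapping, and the function name `split_by_species` is hypothetical.

```python
def split_by_species(human_counts, mouse_counts, cutoff=1600):
    """Separate barcodes by species and drop the conflicting ones.

    human_counts, mouse_counts: dict mapping barcode -> unique read count
    from the human and mouse mappings respectively.
    A barcode is 'conflicting' (assumption) if it reaches the cutoff
    in both mappings.
    """
    human_pass = {bc for bc, n in human_counts.items() if n >= cutoff}
    mouse_pass = {bc for bc, n in mouse_counts.items() if n >= cutoff}
    conflicting = human_pass & mouse_pass
    return human_pass - conflicting, mouse_pass - conflicting, conflicting


# Toy example with made-up barcodes and counts.
human = {"BC1": 2500, "BC2": 1700, "BC3": 40}
mouse = {"BC2": 1900, "BC3": 3000, "BC4": 10}
human_cells, mouse_cells, dropped = split_by_species(human, mouse)
```

In the toy example, `BC2` passes the cutoff in both mappings and is dropped, `BC1` is kept as human and `BC3` as mouse.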

Pacomito commented 2 years ago

Hi, yes, there were very few cells that contained a large number of reads from both human and mouse. Nice that you managed to find a way to do it.

Cheers, Pacôme

PS: If you need tools to re-analyse the dataset, we designed ChromSCape specifically for handling single-cell ChIP-seq data. It runs in a Shiny application that you launch from R and you simply need to input the human or mouse count matrices.