statgen / popscle

A suite of population-scale analysis tools for single-cell genomics data, including implementations of the Demuxlet / Freemuxlet methods and auxiliary tools
https://github.com/statgen/popscle/wiki
Apache License 2.0

Combining dsc-pileup outputs? #60

Closed heathergeiger closed 1 year ago

heathergeiger commented 1 year ago

This is, I suppose, more of a feature request than an issue, but would you be able to provide a helper script to combine the results of dsc-pileup? Currently, dsc-pileup requires quite a lot of memory for our samples, even after filtering to only reads covering variants and carrying valid cell barcodes (117G for a recent sample, granted one with over 30,000 cells due to overloading with hashing). It seems the ideal way to reduce both memory usage and runtime would be to run dsc-pileup in parallel on a few separate subsets of cells, created beforehand with subset-bam. But it is not obvious how to recreate the output files ("cel/plp/umi/var") for use with freemuxlet from these separate outputs, for example how to map the "DROPLET_ID" back to the barcode sequences. Any help you could provide would be much appreciated (I could even help write something if you help me understand the file structure a bit more). Thanks.
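To make the proposal concrete, here is a minimal sketch of the kind of split-and-pileup workflow meant above, assuming GNU split, 10x Genomics' subset-bam, and the standard popscle dsc-pileup options (--sam, --vcf, --group-list, --out); all file names and the chunk count are placeholders:

# Split the full cell barcode list (one barcode per line) into 4 similarly sized chunks.
split -n l/4 -d barcodes.tsv barcodes.part.

# For each chunk, subset the BAM to those barcodes and pile it up independently.
for part in barcodes.part.*; do
    (
        subset-bam \
            --bam possorted_genome_bam.bam \
            --cell-barcodes "${part}" \
            --out-bam "subset.${part}.bam"
        popscle dsc-pileup \
            --sam "subset.${part}.bam" \
            --vcf ref_variants.vcf.gz \
            --group-list "${part}" \
            --out "pileup.${part}"
    ) &
done
wait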

ghuls commented 1 year ago

@heathergeiger I added popscle_dsc_pileup_merge_splitted.py in https://github.com/aertslab/popscle_helper_tools/, which should be able to do this.

# Input filenames should match this glob: popscle_dsc_pileup_outputs/popscle_dsc_pileup_parts.*.pileup.{cel,plp,umi,var}.gz
./popscle_dsc_pileup_merge_splitted.py \
    -i popscle_dsc_pileup_outputs/popscle_dsc_pileup_parts \
    -o popscle_dsc_pileup_merged_output/popscle_dsc_pileup_parts

Use filter_bam_file_for_popscle_dsc_pileup.sh to create multiple BAM files for different sublists of cell barcodes.

https://github.com/aertslab/popscle_helper_tools/blob/master/popscle_dsc_pileup_merge_splitted.py

josemovi commented 1 year ago

Hi @ghuls thanks for sharing the script. I'm trying it and getting this error:

File "/Users/jose/Documents/ucl_postdoc/genotyping_sc/freemuxlet_evaluation/merger_tool/popscle_dsc_pileup_merge_splitted.py", line 370, in main() File "/Users/jose/Documents/ucl_postdoc/genotyping_sc/freemuxlet_evaluation/merger_tool/popscle_dsc_pileup_merge_splitted.py", line 353, in main write_popscle_pileup_plp_full_filename( File "/Users/jose/Documents/ucl_postdoc/genotyping_sc/freemuxlet_evaluation/merger_tool/popscle_dsc_pileup_merge_splitted.py", line 122, in write_popscle_pileup_plp_full_filename pl.read_csv( File "/Users/jose/Library/Python/3.9/lib/python/site-packages/polars/io/csv/functions.py", line 355, in read_csv df = pli.DataFrame._read_csv( File "/Users/jose/Library/Python/3.9/lib/python/site-packages/polars/dataframe/frame.py", line 777, in _read_csv self._df = PyDataFrame.read_csv( exceptions.ComputeError: Could not parse 101110111011101111011010001010001000101110001111101 as dtype i64 at column 'ALLELES' (column number 3). The current offset in the file is 16232 bytes.

You might want to try:

I added the argument infer_schema_length=10000 to the pl.read_csv call but am still getting the same error. Any idea how to fix this? Thanks

ghuls commented 1 year ago

@josemovi Could you try with infer_schema_length=0 ?

I am very surprised to see 101110111011101111011010001010001000101110001111101 in the ALLELES column. Can you output those lines:

$ zcat that_file | grep -C 5 101110111011101111011010001010001000101110001111101

josemovi commented 1 year ago

I tried with infer_schema_length=0 but got the same error.

This is the output from one of the pileup files I want to merge:

260 57 011 :FF
261 57 111111 F:FFFF
263 57 01 FF
265 57 1 F
266 57 0 F
267 57 101110111011101111011010001010001000101110001111101 FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
270 57 0 F
271 57 01 FF
272 57 11000 FFFFF
273 57 111001 FFFFFF
276 57 1 F

These long strings of 0s and 1s in the ALLELES column are puzzling me, because in the variant file *pileup.var.gz SNP 57 reads:

57 chr1 1013541 T C 0.95190

And the same SNP in the VCF file that I used initially reads:

chr1 1013541 1:948921 T C . PASS AC=58;AF=0.9519;AN=60;MAF=0.0481;R2=0.73447

I tried to ask @hyunminkang about this, but I think I didn't communicate the issue properly.

Thanks for any help

hyunminkang commented 1 year ago

There is nothing wrong with the output. I think polars is just not inferring the column type as intended.

ghuls commented 1 year ago

@josemovi infer_schema_length=0 should have worked, but you probably added it to the wrong read_csv call. The latest git version now solves this in a better way. Let me know how it goes.

josemovi commented 1 year ago

@ghuls thanks, the new version does merge the files. However, when running the output through freemuxlet I'm getting an error similar to the one I got when I tried to merge the files with my own scripts:


NOTICE [2023/04/21 14:36:39] - Reading barcode information from F1828VB-merged.pileup.cel.gz..

FATAL ERROR - [E:int sc_dropseq_lib_t::load_from_plp(const char*, BCFFilteredReader*, const char*, double, double, const char*, bool)] Observed DROPLET_ID 9203 is different from expected DROPLET_ID. Did you modify the digital pileup files by yourself?

terminate called after throwing an instance of 'pException'
  what():  Exception was thrown
Aborted


DROPLET_ID 9203 is the first droplet of the second file I'm merging. Thanks
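For what it's worth, here is a quick sanity check on a merged cel file; it assumes (based on the error above) that the first column of *.pileup.cel.gz holds DROPLET_IDs that must be consecutive integers starting at 0, and that the file is tab-separated with a single header line:

# Print the first DROPLET_ID that breaks the expected 0,1,2,... sequence (if any).
zcat F1828VB-merged.pileup.cel.gz \
    | tail -n +2 \
    | awk -F'\t' '$1 != NR - 1 { print "Unexpected DROPLET_ID " $1 " on data line " NR; exit 1 }'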

ghuls commented 1 year ago

@josemovi You probably didn't create your split files correctly.

The output works fine for me with popscle freemuxlet.

1. Get a list of all your cell barcodes.
2. Split it into x (e.g. 10) files with a similar number of cell barcodes each.
3. Run filter_bam_file_for_popscle_dsc_pileup.sh on your BAM file for each cell barcode file.
4. Run popscle dsc-pileup on each BAM file from the previous step.
5. Run the popscle_dsc_pileup_merge_splitted.py script on the resulting outputs.
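A minimal sketch of those steps, assuming the filter script takes its arguments positionally as input BAM, barcode list, VCF, output BAM (check the script's own usage message to confirm) and using placeholder file names:

mkdir -p pileup_parts pileup_merged

# 1. + 2. Get all cell barcodes and split them into ~10 similarly sized lists.
zcat filtered_feature_bc_matrix/barcodes.tsv.gz > barcodes.tsv
split -n l/10 -d barcodes.tsv barcodes.part.

# 3. + 4. Filter the BAM per barcode list, then pile up each part.
for part in barcodes.part.*; do
    ./filter_bam_file_for_popscle_dsc_pileup.sh \
        possorted_genome_bam.bam "${part}" ref_variants.vcf.gz "filtered.${part}.bam"
    popscle dsc-pileup \
        --sam "filtered.${part}.bam" \
        --vcf ref_variants.vcf.gz \
        --group-list "${part}" \
        --out "pileup_parts/pileup.${part}"
done

# 5. Merge all partial outputs into one set of cel/plp/umi/var files (as shown earlier).
./popscle_dsc_pileup_merge_splitted.py \
    -i pileup_parts/pileup \
    -o pileup_merged/pileup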

josemovi commented 1 year ago

It worked! I'm actually using two sets of pileup files from two different samples (two different tissue types & two Chromium runs). My aim is to merge several pileup outputs of different tissues from the same patient. I just needed to change the barcode tag ({barcode}-1) to a unique tag per sample to avoid duplicated barcodes. Big thanks
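In case it helps anyone else, one possible way to do that renaming (a sketch only, not necessarily how it was done here) is to give each sample its own barcode suffix, both in the barcode list and in the CB tags of the BAM. This assumes barcodes end in "-1" and are stored in CB:Z: tags; the "-S1" suffix is a placeholder:

# Rename the '-1' suffix to a per-sample suffix in the barcode list.
sed 's/-1$/-S1/' barcodes.tsv > barcodes.S1.tsv

# Do the same for the CB:Z: tags in the BAM before running dsc-pileup.
samtools view -h sample1.bam \
    | sed 's/CB:Z:\([ACGTN]*\)-1/CB:Z:\1-S1/' \
    | samtools view -b -o sample1.renamed.bam -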