Closed · heathergeiger closed this 1 year ago
@heathergeiger I added popscle_dsc_pileup_merge_splitted.py
in https://github.com/aertslab/popscle_helper_tools/, which should be able to do this.
# Input filenames should match this glob: popscle_dsc_pileup_outputs/popscle_dsc_pileup_parts.*.pileup.{cel,plp,umi,var}.gz
./popscle_dsc_pileup_merge_splitted.py \
-i popscle_dsc_pileup_outputs/popscle_dsc_pileup_parts \
-o popscle_dsc_pileup_merged_output/popscle_dsc_pileup_parts
Use filter_bam_file_for_popscle_dsc_pileup.sh
to create multiple BAM files for different sublists of cell barcodes.
https://github.com/aertslab/popscle_helper_tools/blob/master/popscle_dsc_pileup_merge_splitted.py
Hi @ghuls thanks for sharing the script. I'm trying it and getting this error:
File "/Users/jose/Documents/ucl_postdoc/genotyping_sc/freemuxlet_evaluation/merger_tool/popscle_dsc_pileup_merge_splitted.py", line 370, in 101110111011101111011010001010001000101110001111101
as dtype i64
at column 'ALLELES' (column number 3).
The current offset in the file is 16232 bytes.
You might want to try:
infer_schema_length
(e.g. infer_schema_length=10000
),dtypes
argumentignore_errors
to True
,101110111011101111011010001010001000101110001111101
to the null_values
list.I added the argument infer_schema_length=10000
in the pl.read_csv but still getting the same error. Any idea how to fix this? Thanks
@josemovi Could you try with infer_schema_length=0?
I am very surprised to see 101110111011101111011010001010001000101110001111101 in the ALLELES column.
Can you output those lines:
$ zcat that_file | grep -C 5 101110111011101111011010001010001000101110001111101
I tried with infer_schema_length=0 but got the same error.
This is the output from one of the pileup files I want to merge:
260 57 011 :FF
261 57 111111 F:FFFF
263 57 01 FF
265 57 1 F
266 57 0 F
267 57 101110111011101111011010001010001000101110001111101 FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
270 57 0 F
271 57 01 FF
272 57 11000 FFFFF
273 57 111001 FFFFFF
276 57 1 F
These long strings of 0s and 1s in the ALLELES column are puzzling me, since in the variant file (*.pileup.var.gz) SNP 57 reads:
57 chr1 1013541 T C 0.95190
And the same SNP in the VCF file that I used initially reads:
chr1 1013541 1:948921 T C . PASS AC=58;AF=0.9519;AN=60;MAF=0.0481;R2=0.73447
I tried to ask @hyunminkang about this, but I don't think I communicated the issue properly.
Thanks for any help
There is nothing wrong with the output: each character in the ALLELES string is one read's allele call for that SNP (0 = REF, 1 = ALT), so a SNP covered by many reads produces a long string of 0s and 1s. I think polars is just not recognizing the column as the intended string type.
@josemovi infer_schema_length=0 should have worked, but you probably added it to the wrong read_csv call. The latest git version now solves this in a better way. Let me know how it goes.
@ghuls thanks, the new version does merge the files. However, when running the merged output through freemuxlet I'm getting a similar error to the one I got when I tried to merge the files with my own scripts:
NOTICE [2023/04/21 14:36:39] - Reading barcode information from F1828VB-merged.pileup.cel.gz..
FATAL ERROR - [E:int sc_dropseq_lib_t::load_from_plp(const char*, BCFFilteredReader*, const char*, double, double, const char*, bool)] Observed DROPLET_ID 9203 is different from expected DROPLET_ID. Did you modify the digital pileup files by yourself?
terminate called after throwing an instance of 'pException'
  what():  Exception was thrown
Aborted
DROPLET_ID 9203 is the first droplet of the second file I'm merging. Thanks
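For context: freemuxlet expects the DROPLET_IDs in a pileup to be consecutive, starting from 0, so per-part pileups cannot simply be concatenated; each part's droplet ids have to be shifted. A toy sketch of that renumbering idea (this is an illustration, not the helper script's actual code):

```python
def renumber_droplets(parts):
    """Concatenate per-part pileup rows (droplet_id, snp_id, alleles, ...),
    shifting each part's droplet ids so the merged ids stay consecutive.

    Toy illustration of the merge idea, not the helper script's actual code.
    """
    merged = []
    offset = 0
    for rows in parts:
        # Map this part's droplet ids onto the next free consecutive range.
        ids = sorted({row[0] for row in rows})
        remap = {old: offset + i for i, old in enumerate(ids)}
        merged.extend((remap[row[0]],) + row[1:] for row in rows)
        offset += len(ids)
    return merged
```

For example, merging a part with droplets 0 and 1 with a second part whose first droplet is also 0 remaps the latter to droplet 2, instead of producing the id clash that freemuxlet rejects.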
@josemovi Probably you didn't create your split pileup files correctly. The merged output works fine for me with popscle freemuxlet:
1. First get a list of all your cell barcodes.
2. Split them into x (e.g. 10) files with a similar number of cell barcodes each.
3. Run filter_bam_file_for_popscle_dsc_pileup.sh on your BAM file once per cell barcode file.
4. Run popscle dsc-pileup on each BAM file from the previous step.
5. Then run the popscle_dsc_pileup_merge_splitted.py script.
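The steps above could be sketched as a small driver. This assumes popscle and the helper scripts are on your PATH, that the barcode chunk files have already been written, and that the filter script takes (bam, barcodes, vcf, output_bam) as positional arguments; the output naming scheme is also an assumption, so check each tool's usage before running:

```python
import subprocess
from pathlib import Path

def build_commands(bam, vcf, barcode_lists, out_dir):
    """Per-chunk filter + dsc-pileup commands, then the final merge command.
    The filter script's argument order and the output prefixes are
    assumptions; verify them against the helper tools' documentation."""
    out = Path(out_dir)
    cmds = []
    for i, barcodes in enumerate(barcode_lists):
        part_bam = out / f"part.{i}.bam"
        # Keep only reads for this chunk's barcodes (and covering variants).
        cmds.append(["filter_bam_file_for_popscle_dsc_pileup.sh",
                     str(bam), str(barcodes), str(vcf), str(part_bam)])
        # Pile up each filtered BAM independently.
        cmds.append(["popscle", "dsc-pileup",
                     "--sam", str(part_bam), "--vcf", str(vcf),
                     "--group-list", str(barcodes),
                     "--out", str(out / f"popscle_dsc_pileup_parts.{i}")])
    # Merge all per-chunk pileups into one set of cel/plp/umi/var files.
    cmds.append(["popscle_dsc_pileup_merge_splitted.py",
                 "-i", str(out / "popscle_dsc_pileup_parts"),
                 "-o", "popscle_dsc_pileup_merged/popscle_dsc_pileup_parts"])
    return cmds

if __name__ == "__main__":
    barcode_chunks = [f"barcodes.{i}.txt" for i in range(10)]  # pre-split lists
    for cmd in build_commands("sample.bam", "snps.vcf",
                              barcode_chunks, "popscle_dsc_pileup_outputs"):
        subprocess.run(cmd, check=True)
```

Each chunk's filter + pileup pair is independent, so the per-chunk commands can also be submitted to a cluster in parallel before the final merge.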
It worked! I'm actually using two sets of pileup files from two different samples (two different tissue types and two Chromium runs). My aim is to merge several pileup outputs of different tissues from the same patient. I just needed to change the barcode tag ({barcode}-1) to a unique tag per sample to avoid duplicated barcodes. Big thanks!
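Making the barcodes unique per sample, as described above, could look like the sketch below; the trailing "-1" Cell Ranger suffix and the renaming scheme are assumptions about the data:

```python
def add_sample_suffix(barcodes, sample):
    """Replace the trailing '-1' Cell Ranger GEM-well suffix with a
    per-sample tag, so barcodes stay unique when pileups from several
    samples are merged. Barcodes without the suffix are left unchanged."""
    return [bc.removesuffix("-1") + f"-{sample}" if bc.endswith("-1") else bc
            for bc in barcodes]

# e.g. add_sample_suffix(["AAACCCAAGAAACACT-1"], "tissueA")
#      -> ["AAACCCAAGAAACACT-tissueA"]
```

The same renaming must be applied consistently everywhere the barcode appears (barcode lists and the merged cel file), or the ids will no longer match up.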
This is, I suppose, more of a feature request than an issue, but would you be able to provide a helper script to combine the results of dsc-pileup?

Currently, dsc-pileup requires quite a lot of memory for our samples, even after filtering to only reads covering variants and reads with valid cell barcodes (a recent sample required 117G, granted with over 30,000 cells due to overloading with hashing). The ideal way to reduce both memory usage and runtime would seem to be running dsc-pileup in parallel on a few separate subsets of cells, subsetted beforehand with subset-bam.

But it is not obvious how to recreate the output files (cel/plp/umi/var) for use with freemuxlet from these separate outputs, for example, how to map the DROPLET_ID values back to barcode sequences. Any help you could provide would be much appreciated (I could even help write something if you help me understand the file structure a bit more). Thanks.