raphael-group / chisel

CHISEL -- Copy-number Haplotype Inference in Single-cell by Evolutionary Links
BSD 3-Clause "New" or "Revised" License
37 stars 11 forks source link

AssertionError: There is a bin with a BAF shift > 0.5, likely BAF was not mirrored between 0 and 0.5 #12

Closed markutz556 closed 4 years ago

markutz556 commented 4 years ago

Hi Team,

Thanks for the CHISEL tools! We've been trying to apply this tool to our data and we got this error in the Selecting ploidies step: AssertionError: There is a bin with a BAF shift > 0.5, likely BAF was not mirrored between 0 and 0.5. And then the rest steps all got into errors. I wonder if you have any ideas what might cause this and how we can fix this. Any ideas will be appreciated. Thanks!

Best,

simozacca commented 4 years ago

Thank you for your interest in CHISEL! I would be very happy to help with your issue.

First, could you please provide some details about the data that you are using (e.g. barcoded BAM file, how many cells, sequencing coverage per cell, etc.) and which version of CHISEL you applied?

Second, could you please provide the full log message of CHISEL, i.e. all the log messages that have been printed out by CHISEL during the execution?

Last, could you please provide the first 30 lines of the file bb/combo.tsv (head -n 30 bb/combo.tsv > log) that has been produced by CHISEL in the running directory?

Thank you for your cooperation!

markutz556 commented 4 years ago

Thanks for your reply. We have around 800 cells in the barcoded bam file but they are not sequenced in 10x platform and I did the processing step you mentioned in the other issue. The version of CHISEL is 0.0.4 installed via conda.

Here is the log file for the run: log.txt.

The first 30 lines of the file bb/combo.tsv: log_combo.txt

We had a problem when submitting the Y chromosome mpileup result to the Michigan Imputation Server and thus we left out the Y chromosome. Does this matter?

Thanks for your help!

simozacca commented 4 years ago

Thanks for this information!

Unfortunately there seems to be some issues with the cell names/barcodes:

These observations may be due to some errors in the barcoded BAM file, however I would be very happy to look into this too and to help fixing any potential issue in the generation of such barcoded BAM file.

To start to do this, could you please provide some details on how you generated the barcoded BAM file and could you please extract the first 3 sequencing reads (SAM records) from the barcoded BAM file ${BAM} with the following command:

samtools view ${BAM} | head -n 3
markutz556 commented 4 years ago

Thanks for the information.

I used this shell script to append the barcodes to the bams:

cust_func(){
        samtools addreplacerg -r 'ID:CB:Z:'"${i}"'\tSM:CELL_BARCODE' ${i}.bam -o ${i}_barcoded.bam -O BAM -@ 8
}

for i in $(cat barcodes.txt); do
         echo $i
         cust_func ${i} &
done

where barcodes.txt contains a list of barcodes and each of barcodes has a corresponding bam file with the name ${barcode}.bam.

Then I merge all the barcoded.bam fils using samtools merge

The head 3 reads for the barcoded bam file: barcoded_bam_head3.txt

Thanks for your help.

simozacca commented 4 years ago

Thanks for the provided information!

The problem is that CHISEL requires each cell barcode to be a string only composed of the letters A,C,G,T preceded by CB:Z: (see details in required data). Thus, the cell barcode RG:Z:CB:Z:TFRIPAIR4_FL-A98167-R09-C26 in your example is not a valid barcode, while RG:Z:CB:Z:TATTGGCA would be.

You can generate an arbitrary list of 800 cell barcodes of 8 letters with the following script:

for BARCODE in {A,C,G,T}{A,C,G,T}{A,C,G,T}{A,C,G,T}{A,C,G,T}{A,C,G,T}{A,C,G,T}{A,C,G,T}; do echo "CB:Z:"${BARCODE}; done | shuf -n 800

and you can use these barcodes for your cells. Using these barcodes instead of the current names in your barcodes.txt everything should work correctly. Also, if barcodes.txt contains file names you can append the barcodes as a second column of the file and in the command you can simply use the first field of each line as the file name and the second field as the barcode to add as an RG flag.

Just one additional note, I think that you may need ${1} instead of ${i} in your function cust_func() to access the first argument.

Please let me know if this procedure would work for you and if you have any further question.

markutz556 commented 4 years ago

Thanks for your help. I changed the barcode names as you suggested and CHISEL can identify the correct number of cells now. But I still got into the an error on the extracting chromosome step: ValueError: invalid literal for int() with base 10: ''.

I've been trying to find the problem but with no success. I wonder if you can help with this error. I included the error log for this: log.txt

Just for reference, these are the first 5 lines for our phased snp file:

1 54676         1|0
1 91536         1|0
1 546697    1|0
1 705882    1|0
1 757394    1|0

Thank you!

simozacca commented 4 years ago

The error seems due to the presence of a non autosome (chr1, ..., chr22) in the list of phased SNPs. Could you please try to create a new list of phased SNPs by grepping only the lines that start with an automosome (i.e. chr1, ..., chr22) and run CHISEL on this list?

I am sorry for this inconvenience, in the next release we are adding the filtering of auotosomes as an automatic feature.

simozacca commented 4 years ago

We assume that the issue has been solved; please feel free to re-open the issue if this is not the case or there is any other related issue.