raphael-group / chisel

CHISEL -- Copy-number Haplotype Inference in Single-cell by Evolutionary Links
BSD 3-Clause "New" or "Revised" License
37 stars 11 forks source link

Doesn't work with hg38 reference genome #27

Closed aboffelli closed 1 year ago

aboffelli commented 1 year ago

Hello,

I am trying to run chisel on whole genome sequencing data, but I keep getting this error in the alignment of simulated sequencing reads step:

[2023-Feb-14 15:55:51]Aligning simulated sequencing reads
[2023-Feb-14 15:55:58]Alignment failed with messages:

[E::sam_parse1] SEQ and QUAL are of different length
samtools sort: truncated file. Aborting

The steps I did so far were:

The chisel command that I am using is this:

chisel_nonormal --chromosomes "chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chr23" \
    --reference $REF\
    --minreads 30000 --size 10Mb --blocksize 0 --seed 12 \
    --simcov 0.02 --nophasecorr --missingsnps 5,5 --maxploidy 5\
    --tumor merged.progeny.barcoded.bam --listphased phases.tsv

Where my bam file looks like this:

NB501840:480:H3HLKBGXM:3:12509:14149:15902  16  chr1    10000   0   19S55M  *   0   0   AACAGCTCTTCCGATCTCAATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC  AEA/</EEEE///E/EE6E/E</E//<EE/E/EE///EAE///EEEEE/A/EEE/AE/EE/6EEAEAEE/AA/A  NM:i:0  MD:Z:55 AS:i:55 XS:i:54 RG:Z:CB:Z:CTTTCGTACATA-0
NB501840:480:H3HLKBGXM:4:12512:25884:13466  16  chr1    10000   0   26S48M  *   0   0   CGCTCTTCCGATCTAGTGACGCATCCATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC  EEAEAEAAEEEEEEEEEEEEEEEEEEEEA<AEEE/EEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAAAAA  NM:i:0  MD:Z:48 AS:i:48 XS:i:47 RG:Z:CB:Z:CCGGCAAAAAAC-0
NB501840:481:H3H3VBGXM:2:21206:22318:4622   16  chr1    10000   0   44S34M  *   0   0   CCCAACGTCAACGCAATCCTCAATCCCTCAGCATAGCCATACCCATAACCCTAACCCTAACCCTAACCCTAACCCTAA  ///E////////<////////////A////////</E/////E/<//EEEAAE/EEEAE/EEEEEEEEEEEEEAAAAA  NM:i:0  MD:Z:34 AS:i:34 XS:i:34 RG:Z:CB:Z:GAGCTAAGTTCG-0
NB501840:471:HVKKMBGXL:3:21411:26492:9120   16  chr1    10000   0   31S43M  *   0   0   CCATCCCGTAAGTACATCTCTTACTATATCCATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC  ///////////E////////E/E/////////AE////E//E/AE</E/EEE<E/AEEA/EEAEEAE//A<A66  NM:i:0  MD:Z:43 AS:i:43 XS:i:42 RG:Z:CB:Z:TGCTCGCCCCCT-0
NB501840:480:H3HLKBGXM:1:11206:9854:3372    0   chr1    10040   0   70M1I3M *   0   0   CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCATACCACTAACCCTAACCCAGACC  AAAAAAEEEEEEEE<EEEA/AAEEEA/EEEEA/E/E6AA/////6/A/E///E/E/6//E//////E<///E//  NM:i:4  MD:Z:50C2A1C17  AS:i:55 XS:i:54 RG:Z:CB:Z:GATGTGTTTCCG-0

and my phases.tsv file looks like this:

chr1    817341  1|0
chr1    822764  1|0
chr1    894801  0|1
chr1    900119  0|1
chr1    909894  0|1

This is the complete stdout of the command:

[2023-Feb-14 16:14:55]Parsing and checking arguments
[2023-Feb-14 16:14:55]Arguments:
        art : art_illumina
        reference : /home/boffelli/mystorage/Data/RefGenome/GRCh38_full_analysis_set_plus_decoy_hla.fa.gz
        missingsnps : 5,5
        blocksize : 0
        binstats : None
        seed : 12
        cellsuffix : None
        jobs : 20
        rundir : /crex/proj/snic2022-22-1233/private/running
        samtools : samtools
        binsize : 10000000
        gccorr : True
        keeptmpdir : False
        chromosomes : chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chr23
        upperk : 100
        simcov : 0.02
        cellprefix : CB:Z:
        bwa : bwa
        minimumsnps : 0.08
        listphased : phases.tsv
        maxploidy : 5
        minreads : 30000
        tumor : /crex/proj/snic2022-22-1233/private/running/merged.barcoded.progeny.bam
        phasecorr : False
        bcftools : bcftools

[2023-Feb-14 16:14:55]Setting directories
[2023-Feb-14 16:14:55]Some of the working folders already exist in the running directory and content will be overwritten, please interrupt the process if this was not intended.
[2023-Feb-14 16:14:55]Simulating DNA sequencing reads for mappability correction
[2023-Feb-14 16:14:55]Retrieving sequencing stats
Progress: |██████████████████████████████| 100.0% Complete [[2023-Feb-14 16:15:44]Stats retrieved for chr8:120000000-130000000]
[2023-Feb-14 16:15:44]BAM has been identified as paired-end sequencing with read length 74
[2023-Feb-14 16:15:44]Simulating sequencing reads
[2023-Feb-14 16:16:45]Aligning simulated sequencing reads
[2023-Feb-14 16:16:51]Alignment failed with messages:

[E::sam_parse1] SEQ and QUAL are of different length
samtools sort: truncated file. Aborting

The demo complete-nonormal works perfectly. I would appreciate is someone could help me understand what I am doing wrong. Please let me know if you need more info!

Thank you in advance!

Arthur

aboffelli commented 1 year ago

Hi again,

I figure it out and apparently the problem was that I was using the hg38 reference genome. Using hg19 works fine. The problem seemed to be on art_illumina, where it would generate a problematic simulated fastq file.

All the best, Arthur

simozacca commented 1 year ago

Thank you for your feedback and for investigating the issue. CHISEL actually supports hg38, as hg38 was the reference genome build used for the analysis of patient S0 in the related manuscript and it is used in one of the main available demos for tumour section E of patient S0. Specifically, the tested build and release of hg38 is the one distributed by 10x Genomics (for example, available here).

It is however possible that there are builds or versions of hg38 that might be incompatible with art_illumina, which is used by CHISEL in the new no-normal version to simulate sequencing reads to correct for mapping biases. So if you could please provide details of the reference genome build that generated your issues, we will investigate it further. In particular, we would be grateful if you could please test the run of art_illumina with the following command and let us know the result:

art_illumina -ss HS20 -na -i ${REF} -p -l 100 -f 0.2 -m 350 -s 20 -o test

where ${REF} is the path to your reference genome (the one provided to CHISEL no-normal version).

aboffelli commented 1 year ago

Of course. I tried two different reference genome and had the same error in both of them. this one and this one (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa)

The only change I've made in both was converting chrX to chr23 and chrY to chr24 with the following code:

zcat hg38.fa.gz | sed -e 's/chrX/chr23/' -e 's/chrY/chr24/' > hg38_new.fa

I tried running the art_illumina command that you asked. This is the stdout of the command:

    ====================ART====================
             ART_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

Warning: your simulation will not output any ALN or SAM file with your parameter settings!
                  Paired-end sequencing simulation

Total CPU time used: 246.379

The random seed for the run: 1677063070

Parameters used during run
    Read Length:    100
    Genome masking 'N' cutoff frequency:    1 in 100
    Fold Coverage:            0.2X
    Mean Fragment Length:     350
    Standard Deviation:       20
    Profile Type:             Combined
    ID Tag:                   

Quality Profile(s)
    First Read:   HiSeq 2000 Length 100 R1 (built-in profile) 
    First Read:   HiSeq 2000 Length 100 R2 (built-in profile) 

Output files

  FASTQ Sequence Files:
     the 1st reads: test1.fq
     the 2nd reads: test2.fq

And these are the output files created:

(chisel) boffelli@fedora:test_art > head test*
==> test1.fq <==
@Ē%Ŏ,a;+�c�-50/1
9�̡(��H���3A�8^Q�I�7
                   3T���� =X��������{$*��GB3<�K�B��#�EH���R~!,�
                                                               IDZ�E��C��
+
?CCDFFDFFHHHCGJBGDI+GJIJEIGJI>CJJJ<GIHDJ?CEEHAGJFJJIHEGIJIFFJGCJEDB7HDIJE?A;BDHDDDC+3DDED?CCDDBB;DCE
@Ē%Ŏ,a;+�c�-44/1
NNNNNNNNNNNNNNANNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNN
+
""""""""""""""I""""":"""""""""""""""""""""""""""""""""J"""""""""""""""""""""""""""D"""""""""""""""""
@Ē%Ŏ,a;+�c�-38/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

==> test2.fq <==
@Ē%Ŏ,a;+�c�-50/2
NNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNN
+
""""""""""""""I"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""3"""""""""""""""""""""""
@Ē%Ŏ,a;+�c�-44/2
4��˽�M���Eҵ�&�����?*=��0��F��1�==C��^��
                                       ·��WD��<OǒS;]T�WE�3A-�`?R��6*�PY�;���~��*��R�6�~��
+
@CCD+D+DHH?BAJGGGJICJIFJHJCJ)EII<GJ=IJFDGGGH;IJFIFFEJJEFJHDJIJDJA?&J'C@DFEJHDHB;BFDADEDDCHA<DD+(D+DD
@Ē%Ŏ,a;+�c�-38/2
����DO� �{`��+Sܫ��T��)3+�C,O�#�%X���S1ܪ$T*TJ�<CL���˯��*7<�W�A_9BO�^�:�V";�C(*+�]J
                                                                                 ��PO

For some reason it creates weird binary files. The same was happening in the CHISEL process, and the error would come when it tried to read these simulated files.

I hope you can figure it out!

Really nice software apart from this small problem!

simozacca commented 1 year ago

The names of the chromosomes in the reference genome cannot be simply changed because it will not match the previously generated indexes, dictionaries, etc. Could you please try to run the same command but with an unchanged reference genome?

If you need to change the names of the chromosomes you will need to re-generate all the bwa and samtools indexes at least, using for example the commands


bwa index hg38.fa
samtools faidx hg38.fa
samtools dict hg38.fa > hg38.dict
aboffelli commented 1 year ago

Of course, I am aware of that. All my indexes were generated after changing the nomenclature. All steps previously mentioned were performed with the right indexes.