Closed by aboffelli 1 year ago
Hi again,
I figured it out, and apparently the problem was that I was using the hg38 reference genome. Using hg19 works fine. The problem seemed to be in art_illumina, which would generate a problematic simulated fastq file.
All the best, Arthur
Thank you for your feedback and for investigating the issue. CHISEL actually supports hg38, as hg38 was the reference genome build used for the analysis of patient S0 in the related manuscript and it is used in one of the main available demos for tumour section E of patient S0. Specifically, the tested build and release of hg38 is the one distributed by 10x Genomics (for example, available here).
It is, however, possible that some builds or versions of hg38 are incompatible with art_illumina, which CHISEL uses in the new no-normal version to simulate sequencing reads and correct for mapping biases. If you could provide details of the reference genome build that caused your issue, we will investigate it further. In particular, we would be grateful if you could test art_illumina with the following command and let us know the result:
art_illumina -ss HS20 -na -i ${REF} -p -l 100 -f 0.2 -m 350 -s 20 -o test
where ${REF} is the path to your reference genome (the one provided to CHISEL no-normal version).
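One quick thing to rule out before running the command above is that ${REF} points at something other than a plain-text FASTA (for example, a file that is still compressed). The following is only a sketch of such a check, and nothing in this thread confirms that this was the cause. The helper name looks_like_plain_fasta is hypothetical, and the check relies on head -c, which is widely available but not guaranteed on every system.

```shell
# Hypothetical helper: succeed only if the first byte of the file is '>',
# i.e. the file looks like an uncompressed, plain-text FASTA.
looks_like_plain_fasta() {
    [ "$(LC_ALL=C head -c 1 "$1")" = ">" ]
}

# Usage, with REF set as for the art_illumina command above:
# looks_like_plain_fasta "${REF}" || echo "REF does not start with a FASTA header" >&2
```

A gzip-compressed file starts with the bytes 0x1f 0x8b, so it would fail this check.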
Of course. I tried two different reference genomes and got the same error with both of them: this one and this one (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa)
The only change I've made in both was converting chrX to chr23 and chrY to chr24 with the following code:
zcat hg38.fa.gz | sed -e 's/chrX/chr23/' -e 's/chrY/chr24/' > hg38_new.fa
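Note that the unanchored patterns in the sed command above rewrite any line containing chrX or chrY, so headers of alternate or unplaced contigs whose names merely start with chrX or chrY (if the build contains any) would be renamed as well. A more conservative variant, given here only as a sketch, touches just the header lines whose sequence name is exactly chrX or chrY. The function name rename_sex_chromosomes is hypothetical, and the sketch assumes a sed that supports -E (GNU and BSD sed both do).

```shell
# Hypothetical helper: rename only the primary chrX/chrY records.
# Reads FASTA on stdin and writes FASTA on stdout.
rename_sex_chromosomes() {
    # The name must be followed by whitespace or the end of the line,
    # so headers such as ">chrX_something" are left unchanged.
    sed -E \
        -e 's/^>chrX([[:space:]]|$)/>chr23\1/' \
        -e 's/^>chrY([[:space:]]|$)/>chr24\1/'
}

# Usage, with the file names from the command above:
# zcat hg38.fa.gz | rename_sex_chromosomes > hg38_new.fa
```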
I tried running the art_illumina command that you asked. This is the stdout of the command:
====================ART====================
ART_Illumina (2008-2016)
Q Version 2.5.8 (June 6, 2016)
Contact: Weichun Huang <whduke@gmail.com>
-------------------------------------------
Warning: your simulation will not output any ALN or SAM file with your parameter settings!
Paired-end sequencing simulation
Total CPU time used: 246.379
The random seed for the run: 1677063070
Parameters used during run
Read Length: 100
Genome masking 'N' cutoff frequency: 1 in 100
Fold Coverage: 0.2X
Mean Fragment Length: 350
Standard Deviation: 20
Profile Type: Combined
ID Tag:
Quality Profile(s)
First Read: HiSeq 2000 Length 100 R1 (built-in profile)
First Read: HiSeq 2000 Length 100 R2 (built-in profile)
Output files
FASTQ Sequence Files:
the 1st reads: test1.fq
the 2nd reads: test2.fq
And these are the output files created:
(chisel) boffelli@fedora:test_art > head test*
==> test1.fq <==
@Ē%Ŏ,a;+�c�-50/1
9�̡(��H���3A�8^Q�I�7
3T���� =X��������{$*��GB3<�K�B��#�EH���R~!,�
IDZ�E��C��
+
?CCDFFDFFHHHCGJBGDI+GJIJEIGJI>CJJJ<GIHDJ?CEEHAGJFJJIHEGIJIFFJGCJEDB7HDIJE?A;BDHDDDC+3DDED?CCDDBB;DCE
@Ē%Ŏ,a;+�c�-44/1
NNNNNNNNNNNNNNANNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNN
+
""""""""""""""I""""":"""""""""""""""""""""""""""""""""J"""""""""""""""""""""""""""D"""""""""""""""""
@Ē%Ŏ,a;+�c�-38/1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
==> test2.fq <==
@Ē%Ŏ,a;+�c�-50/2
NNNNNNNNNNNNNNCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNN
+
""""""""""""""I"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""3"""""""""""""""""""""""
@Ē%Ŏ,a;+�c�-44/2
4��˽�M���Eҵ�&�����?*=��0��F��1�==C��^��
·��WD��<OǒS;]T�WE�3A-�`?R��6*�PY�;���~��*��R�6�~��
+
@CCD+D+DHH?BAJGGGJICJIFJHJCJ)EII<GJ=IJFDGGGH;IJFIFFEJJEFJHDJIJDJA?&J'C@DFEJHDHB;BFDADEDDCHA<DD+(D+DD
@Ē%Ŏ,a;+�c�-38/2
����DO� �{`��+Sܫ��T��)3+�C,O�#�%X���S1ܪ$T*TJ�<CL���˯��*7<�W�A_9BO�^�:�V";�C(*+�]J
��PO
For some reason it creates weird binary files. The same was happening in the CHISEL process, and the error would come when it tried to read these simulated files.
I hope you can figure it out!
Really nice software apart from this small problem!
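Corruption like that shown in test1.fq and test2.fq can also be detected without inspecting the files by eye. The sketch below counts FASTQ records whose sequence line contains anything other than A, C, G, T or N. The helper name count_bad_fastq_records is hypothetical, and the check assumes uppercase bases and the usual four-lines-per-record layout. Binary data containing stray newlines breaks that layout, so the number is approximate, but any non-zero count indicates a problem.

```shell
# Hypothetical helper: print the number of records whose sequence line
# (line 2 of each 4-line FASTQ record) is not made up solely of A, C, G, T, N.
count_bad_fastq_records() {
    # LC_ALL=C makes awk treat the input as raw bytes rather than multibyte text.
    LC_ALL=C awk 'NR % 4 == 2 && $0 !~ /^[ACGTN]+$/ { bad++ } END { print bad + 0 }' "$1"
}

# Usage, with the output files of the test command:
# count_bad_fastq_records test1.fq
# count_bad_fastq_records test2.fq
```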
The names of the chromosomes in the reference genome cannot simply be changed, because they will no longer match the previously generated indexes, dictionaries, etc. Could you please try to run the same command with an unchanged reference genome?
If you need to change the names of the chromosomes, you will need to re-generate at least all the bwa and samtools indexes, using for example the commands
bwa index hg38.fa
samtools faidx hg38.fa
samtools dict hg38.fa > hg38.dict
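After regenerating the indexes, one way to confirm that they reflect the renamed chromosomes is to compare the sequence names in the FASTA with the first column of the .fai file written by samtools faidx. This is a minimal sketch, not part of CHISEL. The helper name check_fai_matches_fasta is hypothetical, and it assumes the index sits next to the FASTA as <fasta>.fai unless a second argument is given.

```shell
# Hypothetical helper: compare sequence names in a FASTA with those in its .fai index.
# Prints nothing and returns 0 when both lists are identical and in the same order;
# otherwise prints the diff and returns a non-zero status.
check_fai_matches_fasta() {
    fasta=$1
    fai=${2:-$1.fai}
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 2
    # Sequence name = header text after '>' up to the first whitespace.
    grep '^>' "$fasta" | sed -e 's/^>//' -e 's/[[:space:]].*$//' > "$tmp_a"
    # The first tab-separated column of a .fai file is the sequence name.
    cut -f 1 "$fai" > "$tmp_b"
    diff "$tmp_a" "$tmp_b"
    status=$?
    rm -f "$tmp_a" "$tmp_b"
    return $status
}

# Usage, with the file name from the commands above:
# check_fai_matches_fasta hg38.fa && echo "index names match"
```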
Of course, I am aware of that. All my indexes were generated after changing the nomenclature. All steps previously mentioned were performed with the right indexes.
Hello,
I am trying to run chisel on whole genome sequencing data, but I keep getting this error in the alignment of simulated sequencing reads step:
The steps I did so far were:
The chisel command that I am using is this:
Where my bam file looks like this:
and my phases.tsv file looks like this:
This is the complete stdout of the command:
The demo complete-nonormal works perfectly. I would appreciate it if someone could help me understand what I am doing wrong. Please let me know if you need more info!
Thank you in advance!
Arthur